Introduction
This is essentially a summary of the book, Apache Cassandra Hands-On Training: Level One.
It’s a fairly basic workbook that takes you through Cassandra from a high-level perspective. You can download a VM from the book’s site that gives you a Cassandra instance, and the book walks you through several exercises where you use Cassandra’s most basic functionality.
I found the book very helpful, and it is much smaller than it looks: it really walks you through the exercises in baby steps, and there is quite a lot of material that you can easily skip over. You'll see what I mean if you take a look through the pages.
What is Cassandra?
So, you all probably have a pretty good idea of what Cassandra is, whether you’ve used it in your work or not. It’s an open source NoSQL database that boasts scalability, resiliency, fast reads, and fast writes.
Cassandra allows for linear scalability by distributing your database data across different servers. If you want to scale up, you just scale out, by adding another server. Simples.
Cassandra will continue operating even if some servers go down. This is because the same data can be written to more than one server in a redundant fashion. If a server goes down then another server which hosts the same data can step in and pick up the slack.
Cassandra boasts lightning-fast writes. Cassandra reads are far slower than Cassandra writes, but are far faster than MySQL reads when dealing with massive amounts of data and large numbers of concurrent users. The details behind Cassandra reads and writes are beyond the scope of this introductory article, but more information can be found at http://www.mikeperham.com/2010/03/17/cassandra-internals-reading/ .
Architecture
As mentioned above, with Cassandra your database is spread out across multiple servers. These servers are called nodes, and together the nodes form a cluster. Within a cluster there are no master or slave nodes. Each node is completely equal and has the same functionality as every other node in the cluster. Think of it like the Hydra, or the Russian Mafia.
There are two abstractions that Cassandra utilises to knit the cluster together: Snitch and Gossip. Snitch is how nodes in a cluster know about the location of each other (IP address, data centre, rack, etc.). Gossip is how nodes in a cluster communicate with each other. Gossip takes place every second, with each node communicating with up to three other nodes, exchanging information about itself and all of the other nodes that it knows about from previous gossip exchanges. While Snitch lets nodes know about the topology of the cluster, Gossip lets nodes know about the state of the nodes in the cluster: for example, whether each node is up or down.
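As a rough illustration, a single gossip round can be sketched like this (the node names and state values are made up, and the one-second cadence and real message format are simplified away):

```python
import random

# Toy gossip round: each node merges its view of the cluster with up to
# three randomly chosen peers, so state spreads epidemically.
random.seed(1)
views = {n: {n: "up"} for n in ["A", "B", "C", "D", "E"]}  # each node knows only itself

def gossip_round():
    for node in list(views):
        peers = random.sample([n for n in views if n != node], k=3)
        for peer in peers:
            merged = {**views[node], **views[peer]}  # exchange state both ways
            views[node] = merged
            views[peer] = dict(merged)

gossip_round()
print(min(len(v) for v in views.values()))  # every node now knows at least 4 nodes
```

After even one round, every node has actively contacted three peers, so no node's view of the cluster stays smaller than four entries; a couple more rounds and knowledge of all five nodes has almost certainly spread everywhere.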
How does data get distributed in Cassandra? If we have multiple nodes, which data gets sent to which node? Well, for every row of data that is to be written to a Cassandra instance, the row is passed through a hash function (called the partitioner), and then sent to a particular node depending on the hash value that is output by the function. Each node is assigned a range of values when it is added to the cluster, and if a row’s hash value falls within that range then the row goes in that node.
For example, take the following Cassandra table:
home_id   | datetime            | event     | code_used
----------+---------------------+-----------+----------
H01033638 | 2014-05-21 07:55:58 | alarm set | 2121
H01545551 | 2014-05-21 08:30:14 | alarm set | 8889
H00999943 | 2014-05-21 09:05:54 | alarm set | 1245
If we assume that home_id is the partition key (the key by which we will assign the row to a node), then each home_id value will be passed into the partitioner and the partitioner will spit out a value between -2^63 and 2^63 - 1. Say, for example, that passing the first home_id H01033638 to the partitioner returns a value of -24887391024765843411. If we have two nodes in our cluster, and Node1 is assigned the range -2^63 to 0 while Node2 is assigned the range 1 to 2^63 - 1, then to which node will the first row of this table be assigned? Answers on a postcard please.
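The token assignment just described can be sketched in Python. This is a toy stand-in: real Cassandra uses the Murmur3 partitioner, whereas here an MD5-based hash is used purely for illustration, and the two node ranges mirror the example above:

```python
import hashlib

# Toy partitioner: hash a partition key into a signed 64-bit token,
# then find the node whose range contains that token.
def token(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    # Shift an unsigned 64-bit value into the range [-2**63, 2**63 - 1].
    return int.from_bytes(digest[:8], "big") - 2**63

# Each node owns one contiguous token range (the pre-vnodes model).
NODE_RANGES = {
    "Node1": range(-2**63, 1),   # -2**63 .. 0
    "Node2": range(1, 2**63),    #  1 .. 2**63 - 1
}

def owner(partition_key: str) -> str:
    t = token(partition_key)
    return next(node for node, r in NODE_RANGES.items() if t in r)

for home_id in ["H01033638", "H01545551", "H00999943"]:
    print(home_id, "->", owner(home_id))
```

Note that `range` membership checks in Python are computed in constant time, so even these astronomically large ranges are cheap to test against.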
As mentioned previously, resiliency is key to Cassandra's appeal. Resiliency is achieved by storing data on more than one node. When setting up a database (which we will go into in more detail later), we have the option of specifying a replication factor. If we specify a replication factor of 3, then our row will be written to the node that owns it (determined by the partitioner and token ranges, as discussed above), plus two other nodes in the cluster. If the node that owns the data suffers an unexpected breakdown, one of the other two nodes can take ownership of the newly orphaned data. Note that each of the other two nodes starts off with a range of data which it owns, plus the first node's data which it simply stores. If the first node goes down, the cluster becomes smaller by one node, and the ranges for the remaining nodes become larger by some amount.
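Replica placement can be sketched as follows. This is a hypothetical four-node ring, and the walk-the-ring placement is a simplification of Cassandra's actual replication strategies:

```python
# Hypothetical four-node ring: with a replication factor of 3, a row is
# stored on the node that owns its token plus the next two nodes walking
# clockwise around the ring.
NODES = ["Node1", "Node2", "Node3", "Node4"]

def replicas(owner_index: int, replication_factor: int = 3):
    """Return the owning node plus the next RF - 1 nodes in ring order."""
    return [NODES[(owner_index + i) % len(NODES)] for i in range(replication_factor)]

print(replicas(0))  # → ['Node1', 'Node2', 'Node3']
print(replicas(3))  # → ['Node4', 'Node1', 'Node2']  (wraps around the ring)
```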
So, I now have a confession to make. I've been lying to you! By default, from Cassandra 2.0 each node is not responsible for a single range of hash values. It is responsible for many, by being split up into virtual nodes. The advantage of virtual nodes is that it becomes easier to add a new node to a cluster while maintaining an even distribution of load across all the nodes. In the past, a cluster was expanded by doubling its size, with each existing node having its range simply cut in half; it was considered too difficult to dynamically compute new ranges for each node. Now, using virtual nodes, we can add just one server to the cluster, and it will assume its 'fair share' of the load by taking on some of the virtual nodes that were previously assigned to existing servers. A related advantage of virtual nodes is that it is easier to assign different total ranges to different nodes in the cluster. For example, you may want to assign a wider range to a more powerful server and a narrower range to a less powerful one.
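Here is a rough sketch of why vnodes make expansion easier, with made-up numbers (256 vnodes split across two nodes, and a new node taking over roughly a third of them):

```python
import random
from collections import Counter

# Toy model: two physical nodes each own 128 vnodes (small token ranges).
random.seed(42)
vnodes = {f"v{i}": f"Node{(i % 2) + 1}" for i in range(256)}

# A new node "steals" roughly a third of the vnodes, picked from across the
# whole ring, instead of forcing a wholesale re-split of every node's range.
for v in random.sample(sorted(vnodes), k=len(vnodes) // 3):
    vnodes[v] = "Node3"

print(Counter(vnodes.values())["Node3"])  # → 85
```

Because the stolen vnodes are scattered around the ring, the remaining load stays roughly even across the old nodes; no existing node's range had to be recomputed wholesale.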
Using Cassandra
Communicating with Cassandra
To communicate with Cassandra we use its own query language CQL, which is very similar to SQL. We can talk to our cluster via the CQL shell (cqlsh) at the command line, or via a driver program. The former comes in very handy while doing development work.
Here is an example CQL query. What do you think it is doing?
SELECT home_id, datetime, event, code_used FROM activity;
It is important to note that while CQL is similar to SQL, it lacks some of SQL’s features. This is a weakness of NoSQL databases in general. While they are more scalable and resilient than relational databases, relational databases provide for richer queries.
The Structure of a Cassandra Database
A Cassandra database is defined by a keyspace. There may be many keyspaces in a cluster, but each one represents a distinct database. A keyspace can contain tables, in which our data is stored.
When creating a keyspace we need to give Cassandra the desired name for the keyspace, as well as the desired replication factor and the desired class. The class defines whether the cluster can be spread across one or more data centres (the NetworkTopologyStrategy class) or one only (SimpleStrategy). For example, the below keyspace will be stored in a cluster which is spread across two data centres, dc1 and dc2. We can specify replication per datacentre. In this example we have chosen a replication factor of three in dc1 and two in dc2.
CREATE KEYSPACE vehicle_tracker WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
When messing about with a local Cassandra instance you should set up keyspaces with SimpleStrategy and a replication factor of 1:
CREATE KEYSPACE vehicle_tracker WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
Creating a Table
In order to create a table Cassandra needs to know which keyspace the table will belong to. We get into a keyspace by using the USE command. E.g.
USE vehicle_tracker;
We’re now in the vehicle_tracker keyspace and any table we create will belong to this keyspace. When we create a table we can define the columns of the table. Remember the table that we looked at earlier:
home_id   | datetime            | event     | code_used
----------+---------------------+-----------+----------
H01033638 | 2014-05-21 07:55:58 | alarm set | 2121
H01545551 | 2014-05-21 08:30:14 | alarm set | 8889
H00999943 | 2014-05-21 09:05:54 | alarm set | 1245
We can define this table as follows:
CREATE TABLE activity (
    home_id text,
    datetime timestamp,
    event text,
    code_used text,
    PRIMARY KEY (home_id)
);
We have defined the column names as well as the data types that are allowed in the columns. We have also defined the primary key. As in relational databases, the primary key is a value which uniquely identifies each row in a table. In this example, home_id is unique and can serve as the primary key. However, if home_id were not unique, it could be combined with another value in the row to create a compound primary key:
CREATE TABLE activity (
    home_id text,
    datetime timestamp,
    event text,
    code_used text,
    PRIMARY KEY (home_id, datetime)
);
The first column in the primary key is the partition key which, as mentioned in a previous section, is the value that the partitioner hashes in order to determine which node the row's data is assigned to.
Any columns in the primary key other than the partition key are clustering columns. So, if in our above example home_id is not unique, all rows which have the same value for home_id will find themselves in the same partition. Within this partition, rows will then be ordered by the values of the clustering columns.
How Data is Stored in Cassandra
When defining our tables in CQL, our data is divided into rows. Internally, however, Cassandra does not store the data in rows; it stores it in partitions. Take the following table, which has the primary key (home_id, datetime):
home_id   | datetime            | event            | code_used
----------+---------------------+------------------+----------
H01474777 | 2014-05-22 07:44:13 | alarm set        | 5599
H01474777 | 2014-05-21 18:30:33 | alarm turned off | 5599
H01474777 | 2014-05-21 07:32:16 | alarm set        | 5599
While this table has three rows, since each row has the same partition key (the first column of the primary key) the data is stored in a single partition within Cassandra:
H01474777
    2014-05-22 07:44:13 : alarm set, 5599
    2014-05-21 18:30:33 : alarm turned off, 5599
    2014-05-21 07:32:16 : alarm set, 5599
You can see that the data in the partition is ordered by the datetime value, since datetime is the clustering column.
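The partition grouping and clustering order can be mimicked in a few lines of Python (illustrative only; rows here end up sorted ascending, whereas the layout above lists the newest first):

```python
from itertools import groupby

# Rows copied from the table above: (partition key, clustering column, ...).
rows = [
    ("H01474777", "2014-05-22 07:44:13", "alarm set", "5599"),
    ("H01474777", "2014-05-21 18:30:33", "alarm turned off", "5599"),
    ("H01474777", "2014-05-21 07:32:16", "alarm set", "5599"),
]

# Group by partition key, ordering within the group by the clustering column.
rows.sort(key=lambda r: (r[0], r[1]))
partitions = {k: [r[1:] for r in g] for k, g in groupby(rows, key=lambda r: r[0])}

print(list(partitions))            # → ['H01474777']  (one partition, three rows)
print(partitions["H01474777"][0])  # → ('2014-05-21 07:32:16', 'alarm set', '5599')
```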
How Data is Stored on Disk
When we write data to a table in Cassandra, it is written to a commit log on disk and to memory. The in-memory data structure is called a memtable. When the memtable for a particular table is full, it is flushed to disk as an SSTable. The SSTable can be thought of as a snapshot of a table, containing a relatively up to date version of its data. Note that SSTables are immutable: when you edit the table via CQL commands, the changes go into new memtables and are flushed as new SSTables, and Cassandra reconciles the different versions at read time. The commit log, on the other hand, is a record of all writes, kept so that they can be replayed sequentially if a node ever goes down and has to be brought back up from scratch.
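The write path can be sketched like this (a minimal toy model; the names and the tiny flush threshold are made up for illustration):

```python
# Toy write path: every write is appended to a commit log (durability) and to
# an in-memory memtable; a full memtable is flushed to "disk" as an SSTable.
commit_log = []          # durable, append-only record of every write
memtable = {}            # fast in-memory store
sstables = []            # immutable on-disk snapshots
MEMTABLE_LIMIT = 2       # flush threshold, kept tiny for demonstration

def write(key, value):
    commit_log.append((key, value))
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(dict(memtable))  # flush a snapshot and start fresh
        memtable.clear()

write("H01", "alarm set")
write("H02", "alarm set")      # second write fills the memtable, triggering a flush
write("H03", "alarm off")
print(len(sstables), memtable)  # → 1 {'H03': 'alarm off'}
```

Notice that the commit log keeps all three writes even after the flush; that replay record is what lets a crashed node rebuild its memtable from scratch.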
Modelling Data
Before we build a data model it is important to understand the limitations of NoSQL distributed databases compared to relational databases. The most important limitation is that SQL joins are not possible. Since the data you are joining may be spread across several different machines, joins would be extremely slow. Therefore, you must think in advance about the queries you expect to make, and model your data so that all the data for any one query is contained in a single table. While the data in a single table may be spread across multiple machines, it is indexed by partition key, which is all-important.
This leads us to a further limitation which is that WHERE clauses in Cassandra should reference a partition key. This is because our data is indexed by partition key, and since the data is distributed it would be extremely inefficient to reference data that isn’t indexed. For example, using the previous table which has the primary key (home_id, datetime), the following query would work:
SELECT * FROM activity WHERE home_id = 'H01474777';
This query would also work, since we can reference the clustering column if we also reference the partition key:
SELECT * FROM activity WHERE home_id = 'H01474777' AND datetime > '2014-05-22 00:00:00';
This query, however, would blow up since we are referencing a column that isn’t part of the primary key (even though we are still referencing the partition key):
SELECT * FROM activity WHERE home_id = 'H01474777' AND code_used = '5599';
There is an exception to the above point re: WHERE clauses only referencing values in the primary key. By creating a secondary index on a non-primary-key column, we can reference values in columns outside the primary key. However, querying using secondary indexes is slow: the query has to examine a special secondary index table on every node in the cluster.
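A toy sketch of the difference (the node layout and data are made up): a partition-key lookup can be routed straight to the owning node, while a secondary-index style query has to consult every node.

```python
# Hypothetical two-node cluster: node -> {home_id: [(datetime, code_used)]}.
cluster = {
    "Node1": {"H01474777": [("2014-05-21 07:32:16", "5599")]},
    "Node2": {"H01033638": [("2014-05-21 07:55:58", "2121")]},
}

def query_by_partition_key(home_id):
    # In real Cassandra the partitioner's hash points at the one owning node.
    for node, data in cluster.items():
        if home_id in data:
            return node, data[home_id]

def query_by_secondary_index(code_used):
    # A non-key column is not partition-indexed: every node must be examined.
    hits = []
    for node, data in cluster.items():
        for home_id, events in data.items():
            hits += [(home_id, t) for t, c in events if c == code_used]
    return hits

print(query_by_secondary_index("5599"))  # → [('H01474777', '2014-05-21 07:32:16')]
```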
A final thing to consider about modelling data in Cassandra is this: what if a partition can grow infinitely? For example, imagine a table that stores data about a vehicle and its GPS coordinates at a particular point in time:
vehicle_id | date       | time     | latitude          | longitude
-----------+------------+----------+-------------------+-------------------
ME100AAS   | 2014-05-19 | 09:16:54 | 51.57278669999999 | -0.129122700000039
ME100AAS   | 2014-05-19 | 09:59:03 | 51.57278666389424 | -0.129122700046457
ME100AAS   | 2014-05-22 | 09:33:04 | 51.67881669132745 | -0.129122701111134
Say the above table's partition key is vehicle_id. We could potentially track the coordinates of one vehicle for a very long time. In that case the partition storing rows for that vehicle_id could grow extremely large. There is a limit to the number of rows that can be stored in a partition in Cassandra, and there is a limit to the amount of data that can fit on disk on one node (all of the data for a partition must live on one node). While these numbers are very large, they are finite. Therefore it is wise to guard against partitions that have the potential to become extremely large by using a composite partition key. For example, the partition key for the above table could be vehicle_id plus date. The values in the two columns are combined and fed into the partitioner.
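The idea can be sketched with the same toy hash used earlier (real Cassandra uses Murmur3; the MD5-based token here is purely illustrative):

```python
import hashlib

# Composite partition key: vehicle_id and date are combined before hashing,
# so each day's readings for a vehicle land in their own bounded partition.
def composite_token(vehicle_id: str, date: str) -> int:
    key = f"{vehicle_id}:{date}"  # both columns feed the partitioner together
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") - 2**63

t1 = composite_token("ME100AAS", "2014-05-19")
t2 = composite_token("ME100AAS", "2014-05-22")
print(t1 != t2)  # different days hash to different tokens, hence different partitions
```

With this scheme no single partition ever holds more than one day of readings for one vehicle, so partition growth is bounded no matter how long the vehicle is tracked.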
Conclusion
This concludes our high level overview of Cassandra. If you are interested to learn more, I would recommend that you dive into the book that this article is derived from. It contains many exercises that you can work through, which will give you hands-on experience with issuing CQL commands and manipulating Cassandra databases. The book also contains recommendations for other resources that you can use to continue your education in all things Cassandra.