The ancient Greeks told the story of Cassandra, the daughter of King Priam and Queen Hecuba of Troy. She was one of the most beautiful women of her generation. When offered the gifts of a prophetess by the Greek god Apollo, she quickly accepted, but when she later spurned his amorous advances, Apollo cursed her to always know the truth and never be believed by any to whom she spoke it. Thanks to her gift of prophecy, Cassandra foresaw the trap presented by the Trojan horse, but thanks to her curse of disbelief, no one in Troy would listen to her warnings. They brought the horse within the city walls, and unwittingly invited the Greek soldiers hidden therein into the city, which led to Troy’s fall. Cassandra was taken as a war prize back to Greece by Agamemnon, where she again foresaw the future: his (and her) death. She was again disbelieved, and, sure enough, both he and she were killed.
Modern computer science geeks tell the story of Cassandra a little differently, as Apache Cassandra, another of the “NoSQL” databases—and a popular one at that—in use at a variety of well-known Internet-based companies (YouTube, Netflix and others), and presumably one whose reports are actually taken at face value. (Rumor has it that Cassandra is a pun on another famous prophetess, the Oracle of Delphi.)
To the developer, Cassandra the software can be just as confusing as Cassandra the Trojan. It’s “an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable” (source: “Cassandra: The Definitive Guide,” O’Reilly Media, 2010, p. 14).
Sometimes I think the Greek myths make more sense than my industry does.
Breaking all that down, we see that:
More relevant to this discussion, Cassandra has been gaining momentum within the developer community as a worthwhile tool to have in the toolbox, so it seemed like a good idea to turn our collective columnar gaze upon a column-oriented database. (Pun intended.)
Cassandra is not a relational data store, despite its use of the term “column-oriented.” In fact, it doesn’t really look anything at all like a relational database. Instead of storing a schema, for example, that guarantees the various rows of data in the table are all alike, Cassandra stores “column families” in “keyspaces.” A keyspace is really just an administrative isolation barrier, in much the same way that relational database instances are separated from one another on the same server, but a column family is a completely different beast. Each column family is made up of “rows” identified by a key, but within a row, any number of name/value pairs (columns) can be present, and each row can contain entirely different data elements from the other rows within the column family.
In practical terms, let’s suppose we’re using Cassandra to store a collection of people. Within the keyspace “Earth,” we’ll have a column family called “People,” which in turn has rows that look like this:
ColumnName:"Identifier", ColumnValue: <image>
ColumnName:"Title", ColumnValue:"Rock Star"
As you can see, each row contains conceptually similar data, but not all rows will have the same data (though if the variance grows too large, it might get confusing for developers to use). Storing pets in here, for example, would likely create too much chaos. This is why any nontrivial application will likely use dozens or hundreds of different column families.
By the way, I’m lying (slightly) to you when I say that a row is made up of name/value pairs; it’s actually made up of name/value/timestamp triplets, but the Cassandra docs make it pretty clear that the timestamp part of the triplet is only for conflict detection and is never to be used as part of your application logic. Most Cassandra articles essentially tell new Cassandra developers to ignore it.
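If it helps to see the shape of that data model in code, here is a minimal in-memory sketch in Python. It is not how Cassandra is implemented, and none of these names come from Cassandra itself: the `ColumnFamily` class, the row keys and the “Ted” data are all hypothetical, and the rule that the newest timestamp wins is a simplifying assumption about how the timestamp settles conflicting writes.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy model: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def set(self, row_key, column, value, timestamp):
        # Assumption for illustration: the write with the newest timestamp wins.
        current = self.rows[row_key].get(column)
        if current is None or timestamp >= current[1]:
            self.rows[row_key][column] = (value, timestamp)

    def get(self, row_key):
        # Hand back plain name/value pairs; the timestamp stays out of
        # application logic, as the Cassandra docs recommend.
        row = self.rows.get(row_key, {})
        return {name: value for name, (value, _) in row.items()}


# A keyspace is modeled as nothing more than a named group of column families.
earth = {"People": ColumnFamily("People")}
people = earth["People"]

# Two rows in the same column family, each with a different set of columns.
people.set("row1", "FirstName", "Ted", timestamp=1)
people.set("row1", "Title", "Consultant", timestamp=1)
people.set("row2", "Title", "Rock Star", timestamp=1)

# An older write does not replace a newer one.
people.set("row2", "Title", "Roadie", timestamp=0)

print(people.get("row1"))
print(people.get("row2"))
```

Notice that nothing forces “row1” and “row2” to share columns, which is exactly the flexibility (and the potential for chaos) described above.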
This all makes more sense once you see it in action, so let’s get Cassandra running.
Before you can do anything with Cassandra, you have to get it installed, and therein lies the first hurdle: Cassandra is, as advertised, an open source project, and like many open source projects, it’s not written in a Microsoft .NET Framework language. Instead, Cassandra is written in Java, and as such requires a relatively modern Java runtime to be installed on your machine in order to execute. Cassandra runs fine with Java 6 (and, in fact, most of the blog posts on the subject suggest it), but should run just as well if not a touch faster with the most recently released Java 7.
(If you’ve never installed Java on your machine before, just plug “Java Runtime Environment 6 (or 7) download” into your search engine of choice and pull down the desired installer for either 32- or 64-bit Windows, depending on your target OS. About the only other thing you’ll need to do is set an environment variable called JAVA_HOME to point to the Java Runtime Environment (JRE) install directory—under a default installation, this will be in C:\Program Files\Java\jre6—and put the JRE’s “bin” subdirectory on the PATH if it’s not already.)
Next, pull down the Cassandra binaries from the Cassandra homepage. Unfortunately for us Windows folks, it’s only available as a .tar.gz file, which, out of the box, Windows isn’t sure what to do with. Dozens of tools are available to unarchive a .tar.gz file, including the command-line “gunzip” and “tar” utilities in Cygwin, if you want to start practicing some Unix-Fu on a Windows box. Dump the contents of the Cassandra download into a convenient directory, such as C:\Prg\apache-cassandra-1.1.0 (which is the latest version, as I write this). Then, as is common with Java projects, you need to create an environment variable that points to the root of the Cassandra install directory, so create a CASSANDRA_HOME environment variable that points to C:\Prg\apache-cassandra-1.1.0 (in my case).
If you’re a little aghast at the primitive conditions here, remember that Java projects like to work on multiple platforms (which means we have to use mechanisms that are common to all platforms, and yeah, environment variables are everywhere, even on Android). The positive side of this is that if you ever work with Cassandra on a non-Windows platform, you’ll be doing the same setup steps: get Java, get Cassandra, unarchive and set environment variables. Unfortunately, it means that our tooling isn’t quite as fancy and GUI-based as we might otherwise be used to.
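Because so much of this setup rides on two environment variables, a quick sanity check can save some head-scratching. The following Python sketch is only an illustration: the `missing_settings` function is a hypothetical helper, not part of Cassandra or Java, and it only verifies that JAVA_HOME and CASSANDRA_HOME are set and point to directories that exist.

```python
import os

# The two variables the setup steps above ask you to create.
REQUIRED_VARS = ("JAVA_HOME", "CASSANDRA_HOME")


def missing_settings(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    for name in REQUIRED_VARS:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


if __name__ == "__main__":
    # Prints nothing when both variables look reasonable.
    for problem in missing_settings(os.environ):
        print(problem)
```

The same check works unchanged on a non-Windows box, which is the upside of sticking to mechanisms that exist everywhere.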
Speaking of which, firing up Cassandra means hopping on over to the Cassandra install directory and kicking off the batch file “cassandra.bat” found in the “bin” subdirectory. Launch that as “cassandra -f” (the “-f” causes it to run in the foreground), and you should see something like Figure 1.
Figure 1 Installing Cassandra with the Cassandra.bat File
By default, Cassandra is configured to dump data and commit logs into the “var” directory off the root of your filesystem, which Java interprets as C:\. This is more Unix-ism, and is easily configured differently in the “conf/cassandra.yaml” configuration file.
(Convenience note: A company called DataStax Inc. offers an all-in-one installer containing both the Cassandra server and JRE, as well as an HTML-based operation center product, available as a free download. If you’re having difficulties getting it all set up, you might try that instead.)
A running Cassandra server is expecting incoming connections on port 9160 and uses port 7199 for its Java Management Extensions monitoring, which is Java’s rough equivalent to Windows Management Instrumentation. Both ports will, eventually, want to be accessible to client applications and Cassandra monitoring utilities, respectively.
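One rough way to confirm the server is actually listening is to try opening a TCP connection to each port. This Python sketch assumes Cassandra is running on the local machine with the default ports mentioned above; `port_is_open` and `check_cassandra` are hypothetical helper names, and a successful connection only shows that something is listening there, not that it’s Cassandra.

```python
import socket

# Default ports described above: client connections and JMX monitoring.
CASSANDRA_PORTS = {"client": 9160, "jmx": 7199}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_cassandra(host="localhost"):
    """Map each named port to whether it accepted a connection."""
    return {name: port_is_open(host, port) for name, port in CASSANDRA_PORTS.items()}


if __name__ == "__main__":
    for name, is_open in check_cassandra().items():
        print(f"{name}: {'open' if is_open else 'closed'}")
```

If either port reports closed while the server window shows no errors, a local firewall is a likely suspect.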
Once Cassandra is up and running on your box, we can connect to the running instance using the Cassandra command-line interface, launched by running “cassandra-cli.bat,” again from the Cassandra “bin” directory (see Figure 2).
Figure 2 Connecting to a Running Cassandra Instance
To create a keyspace, use “create keyspace TestKS;” (the name must be unique, and note that every command in the command-line interface ends with a semicolon), and to create a column family within that keyspace, first type “use <keyspace>;” then “create column family <name>;”. No other schema definition is required. The column family is a collection of name/value pairs from then on, remember.
To insert data into the column family, use the “set” command, which requires the name of the column family into which you insert (“TestCF”), the key to use for this row (“TestKey”), the column within the column family to use as the name for this value (“column”) and the value to store there (“value”). However, because Cassandra stores data as binary values, you have to tell Cassandra to interpret the row key, column name and column value as ASCII values using the built-in “ascii” function. Note that the string arguments need single quotes. This means the whole “set” looks like this:
set TestCF[ascii('TestKey')][ascii('column')]=ascii('value');
Retrieving that data is basically the same exercise using the “get” command (again with single quotes, or you’ll get a syntax error), like this:
get TestCF[ascii('TestKey')];
This will return with something like this:
(column=636f6c756d6e, value=76616c7565, timestamp=1338798419726000)
This demonstrates that Cassandra does, indeed, speak gibberish (at least, to us humans; if you look carefully, those hexadecimal strings are the ASCII codes for “column” and “value,” respectively).
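If you’d rather not decode that by hand, a few lines of Python will do it. This is just a sketch using the values from the output above; the function names are hypothetical, and treating the timestamp as microseconds since the Unix epoch is an assumption based on its magnitude.

```python
from datetime import datetime, timedelta, timezone


def decode_hex_ascii(hex_text):
    """Turn a hex string such as '636f6c756d6e' back into ASCII text."""
    return bytes.fromhex(hex_text).decode("ascii")


def decode_timestamp(microseconds):
    """Interpret the timestamp as microseconds since the Unix epoch (assumed)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=microseconds)


print(decode_hex_ascii("636f6c756d6e"))  # column
print(decode_hex_ascii("76616c7565"))    # value
print(decode_timestamp(1338798419726000).isoformat())
```

Remember that the timestamp is decoded here purely out of curiosity; it still has no business in your application logic.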
We’re out of time, and Cassandra has only been installed. Specifically, a single-node Cassandra cluster is up and running, and nothing has been done to program against it yet. Fortunately, the hardest part of getting started with Cassandra has been completed. In the next installment, I’ll start using .NET libraries to talk to Cassandra, get it to store some data from the .NET applications, pull it back, and then show how to set up a three-node cluster and get it up and running.
For now, though, happy coding!
Ted Neward is an architectural consultant with Neudesic LLC. He has written more than 100 articles and authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He is an F# MVP and noted Java expert, and speaks at both Java and .NET conferences around the world. He consults and mentors regularly—reach him at firstname.lastname@example.org if you’re interested in having him come work with your team. He blogs at blogs.tedneward.com and can be followed on Twitter at Twitter.com/tedneward.
Thanks to the following technical expert for reviewing this article: Kelly Sommers