Riak Presentation at NYC NoSQL, October 5, 2009
What is Riak? A lot of buzzwords being tossed around today are applicable to Riak. Riak is a document-oriented database. Riak is a decentralized key-value store, with the standard get, put, and delete operations. Riak is a distributed, fault-tolerant storage solution. Riak is open source and NoSQL. Riak uses HTTP, JSON, REST. Riak is scalable.
Delving a bit deeper, Riak is a storage solution strongly influenced by Amazon's Dynamo; Eric Brewer's CAP Theorem; the Web, in general; and the Basho team's experience in deploying production network environments. We started building Riak in the Fall of 2007 to host two production applications for Basho, which have been up and running on Riak for most of that time. Our pre-Basho experience influenced the design goals we aimed for, and those goals have proven successful in production. We have now open-sourced Riak for the world to use.
Understanding why Riak is so powerful will require learning a bit of background theory. First: Amazon's Dynamo.
The Dynamo paper defined three terms to describe the behavior of a distributed storage system: N, R, and W. Put simply, N is the number of replicas of each value to store. R is the number of those replicas required to perform a read operation. W is the number of those replicas needed to perform a write operation.
Riak's goal is to expose N, R, and W to the application logic. Exposing these terms allows Riak to adapt to the requirements of individual applications.
Riak exposes the N-value setting on a per-bucket basis. For example, all of the objects in the “artist” bucket would have the same N-value, but that N-value might differ from the N-value for the “album” bucket.
The way in which the locations of the N replicas of a value are chosen has to do with consistent hashing. When a request for a key is made, Riak uses a consistent hashing algorithm to map the requested key to a 160-bit integer. As nodes join the Riak cluster, they claim portions of the 160-bit integer space. The node that has claimed the partition surrounding the hash of the request key is the location of the first replica. The rest of the N replicas are stored on the nodes that have claimed the next N-1 partitions.
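The partition-claiming scheme described above can be sketched in a few lines of Python. This is a toy model, not Riak's implementation: the ring layout, the round-robin claiming, and the node names are all assumptions made for illustration, though SHA-1 (which produces 160-bit values) matches the 160-bit integer space the talk describes.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 160

def key_hash(bucket, key):
    # Map a bucket/key pair onto the 160-bit ring (SHA-1 yields 160 bits).
    return int(hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest(), 16)

def preference_list(partitions, bucket, key, n):
    """partitions: sorted list of (claim_point, node) pairs covering the ring.
    Returns the n nodes holding the key's replicas: the node whose partition
    surrounds the hash, plus the nodes claiming the next n-1 partitions."""
    h = key_hash(bucket, key)
    points = [p for p, _ in partitions]
    i = bisect_left(points, h) % len(partitions)
    return [partitions[(i + j) % len(partitions)][1] for j in range(n)]

# A toy 8-partition ring claimed round-robin by 4 nodes.
nodes = ["node0", "node1", "node2", "node3"]
ring = [((p + 1) * RING_SIZE // 8, nodes[p % 4]) for p in range(8)]

replicas = preference_list(ring, "artist", "REM", 3)
print(replicas)  # three distinct nodes, the same every time for this key
```

Because the mapping is deterministic, any node can recompute this list for any key, which is exactly what lets every node coordinate any request.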
This consistent hashing algorithm is very important: it's what enables any Riak node to drive any request. Since every node can compute exactly which other nodes need to be contacted to fulfill a request, any node can act as the organizer for any client. There is no master node. There is no single point of failure.
Riak exposes the R-value on a per-request basis. Every time you ask Riak for a value, you can use a different R-value.
R specifies the number of nodes that need to reply with success before Riak gives the requesting client a successful reply. Riak still tries to read all N replicas of the value, but as soon as R successes come in, the value is passed back to the client.
Riak also exposes the W-value on a per-request basis. W specifies the number of nodes that need to reply with success before Riak gives the requesting client a successful reply. Riak still tries to write all N replicas of the value, but as soon as W successes come in, the success of the write is reported to the client.
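The quorum behavior just described, for both reads and writes, can be sketched like this. This is a simplified sequential model (a real coordinator contacts the replicas in parallel); the store contents and node names are made up for illustration.

```python
def quorum(request_replica, replicas, required):
    """Issue the request to all N replicas, but report success to the
    client as soon as `required` of them succeed (R for reads, W for writes)."""
    successes = 0
    results = []
    for node in replicas:  # a real coordinator would contact these concurrently
        ok, value = request_replica(node)
        if ok:
            successes += 1
            results.append(value)
            if successes >= required:
                return True, results  # enough replies; don't wait for the rest
    return False, results

# Toy replicas: node1 is down, the other two respond with version zero.
store = {"node0": "v0", "node1": None, "node2": "v0"}
def read(node):
    return (store[node] is not None, store[node])

ok, values = quorum(read, ["node0", "node1", "node2"], required=2)
print(ok, values)  # True ['v0', 'v0'] -- the one failed node is tolerated
```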
Giving the client application the power to specify R and W at request time means that an application can decide, per request, exactly how many node outages it can tolerate. It's very simple: for any request, N-R (for a read) or N-W (for a write) nodes may be down, yet the cluster will still appear completely available.
So, in the examples we've been using, with N=3 and R=W=2, we can have 3-2=1 node down/unreachable/laggy in the cluster, yet the cluster will still appear completely available for our requests.
To stress the point, we could even ratchet the N-value up to 10, and if we still used an R or W of 2, we could have 8 nodes down yet still observe a completely available cluster. Both reads and writes would be accepted without any objection.
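The arithmetic from the last two paragraphs is simple enough to state as a one-line function (the function name is mine, not Riak's):

```python
def max_unavailable(n, r_or_w):
    # With N replicas and a quorum of R (reads) or W (writes),
    # up to N - R (or N - W) replica nodes can be down per request.
    return n - r_or_w

assert max_unavailable(3, 2) == 1   # the N=3, R=W=2 example
assert max_unavailable(10, 2) == 8  # the N=10 example above
print("both examples check out")
```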
Riak exposes the ability to describe node failure allowances through N/R/W because it's a good way to tweak CAP behavior.
If you're familiar with Eric Brewer's CAP Theorem, you know that there are three aspects to consider when reasoning about data storage: the Consistency of the data, the Availability of the store, and the tolerance the store has to network Partitions. If you've kept up with the research, you also know that it has been proven that it is not possible to “have” all three of these. No real-world data store can serve completely consistent data while being 100% available and handling disconnected networks.
Since you can't have all three, most data storage systems “pick two”. Riak recognizes that, not only is it possible to choose some of each, but that different applications may find it beneficial to choose different amounts of each. It therefore gives application developers the power to choose themselves.
You're likely to choose lots of availability and partition-tolerance, though. You're writing an application that's going to run on real-world hardware, and you want it to be up and ready any time a user wants to use it, right? Riak is structured to encourage this (we wanted our apps up all the time too).
This means you're likely to consider trading off consistency. There are many hints about how you can guarantee certain kinds of consistency (like read-your-writes) in the Dynamo paper, so I recommend reviewing that document.
It's easiest to demonstrate how a tradeoff in consistency might be handled. Let's say we have a cluster that is fully connected, and serving version zero of a document.
Suddenly a network split happens. Nodes 0 and 2 are in New York, Nodes 1 and 3 are in Los Angeles, and the transcontinental link breaks down. How does each half of the cluster operate?
If you've set your N/R/W appropriately, both halves of the split will, in fact, continue serving version zero of the document, just as before. Clients will be unaware of the split.
Now suppose that a client changes the document in the New York half of the split (you did specify with N/R/W that this should be allowed, right?). This client has introduced some inconsistency.
It is now the case that clients connecting to the New York half of the split will get version one of the document, while clients connecting to the Los Angeles half of the split will continue to get version zero of the document. The two sets of clients see inconsistent versions of the document.
Now suppose that the transcontinental link heals itself and the cluster becomes fully connected again. What should Riak do with the two different versions of the document?
In this case, Riak's use of vector clocks can automatically determine that version one is the correct version to keep around. Vector clocks are an extension of Lamport's logical clocks. Rather than being a wall-clock-based timestamp, these are “clocks” structured in such a way that descendency, or successorship, can be determined by simple comparison. Every time a value is stored in Riak, its vector clock is incremented, so when the cluster heals, Riak can determine which versions to keep. In this case, Riak will find that this document's version one is the successor to its version zero, and Riak can safely discard version zero. Once version zero is discarded, the data in the cluster is once again fully consistent.
Things get a little more interesting if, while the network is still split, clients update the document on both sides of the split. Now when the cluster reconnects, the vector clocks indicate that neither version is the successor to the other version.
Riak is unable to determine which version should “win” in this case, so just like its desire to bring N/R/W to the application level, Riak also brings the ability to resolve this conflict to the application. Rather than implementing an arbitrary rule for which value wins, like many other systems, Riak will return both values to the client, indicating a conflict, and ask the application to resolve the situation.
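Both outcomes, automatic resolution and a reported conflict, fall out of one comparison on the clocks. Here is a minimal dict-of-counters sketch of that comparison; Riak's actual vector-clock encoding differs, and the client names are invented for the example.

```python
def descends(a, b):
    """True if clock `a` is a successor of (or equal to) clock `b`:
    every counter in b is matched or exceeded in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def resolve(a, b):
    if descends(a, b):
        return "a wins"
    if descends(b, a):
        return "b wins"
    return "conflict"  # concurrent updates: hand both values to the app

v0 = {"client_x": 1}
v1 = {"client_x": 2}                     # New York update: succeeds v0
v1_la = {"client_x": 1, "client_y": 1}   # Los Angeles update during the split

print(resolve(v1, v0))     # "a wins": v1 descends from v0, so v0 is discarded
print(resolve(v1, v1_la))  # "conflict": neither descends from the other
```

The first call models the healed-cluster case where one side never wrote; the second models concurrent writes on both sides of the split, where Riak returns both values and asks the application to choose.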
Of course, if you know that what you really want is a simple last-write-wins policy, Riak has a simple flag for enabling that behavior (see the allow_mult bucket property).
After all of that theory, how about a few code samples to demonstrate how easy it is to interact with Riak?
Since Riak is written in Erlang, let's start with an Erlang example. The first line connects our client to the Riak cluster. The second line creates a new object (a.k.a. “document”, “key/value pair”). The third line stores the new object in Riak. The fourth line fetches the object back from Riak. The final two lines change the value of our object and store the changed value back into the cluster.
If you don't want to use Erlang, Riak also ships with libraries for Python...
...Riak also ships with libraries for Ruby...
...Riak also ships with libraries for Java...
...Riak also ships with libraries for PHP...
...in fact, since all of these libraries are communicating with Riak over standard, RESTful HTTP, it's easy to use Riak from any system that supports basic HTTP — even the command-line tool curl.
So it's easy to get your data in and out of Riak, but how are you going to run queries across many objects at once? There's NoSQL here, right? What do you say to a little Map/Reduce, Bob?
Riak's Map/Reduce has much in common with other Map/Reduce systems. Map functions are run on the node where the data is stored, increasing data locality while also distributing the computation across the cluster.
The most notable place where Riak's Map/Reduce differs from other solutions is that Riak does not blindly run map functions over all of the values in a bucket. Instead, Riak expects the client to provide a list of object keys on which the map should be run. Map functions may produce more keys for later map phases to process, but the set of keys for the phase must always be specified.
Of course, you can hand as large of a list of keys to Map/Reduce as you like, so if you want to run your map function on all of the values in a bucket, just include all of the keys in the bucket with the Map/Reduce request. (The list_keys or list_bucket client functions can be helpful here.)
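The key-list-driven shape of a Riak Map/Reduce query can be modeled in miniature. This is a single-process sketch of the data flow only (in real Riak, each map runs on the node storing that object); the store contents are invented for the example.

```python
def run_mapreduce(store, keys, map_fn, reduce_fn):
    """Run map_fn on each named object (in Riak, on the node storing it),
    then fold all the mapped results together with reduce_fn."""
    mapped = []
    for bucket, key in keys:  # only the listed keys are mapped -- no bucket scan
        mapped.extend(map_fn(store[(bucket, key)]))
    return reduce_fn(mapped)

store = {
    ("album", "Murmur"): {"tracks": 12},
    ("album", "Reckoning"): {"tracks": 10},
}
total = run_mapreduce(
    store,
    keys=[("album", "Murmur"), ("album", "Reckoning")],
    map_fn=lambda obj: [obj["tracks"]],
    reduce_fn=sum,
)
print(total)  # 22
```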
The differences in Riak's Map/Reduce when compared to that of other systems are due to our strong desire to support links. The first question raised when converting from an RDBMS world is, “How do I describe the relationships among my data?” We have found that one of the best answers to that question is “links.”
For example, if you wanted to describe a relationship between an “artist” record and many “album” records, you would create links to the album records in the artist record. Similarly, you might create links in each of the album records to “track” records for the tracks on that album.
Once these links are in place, and you have described to Riak how to extract these links from your objects, you gain access to a new Map/Reduce phase syntax. In the example here, you can see that our simple syntax allows us to say, “Start at the artist REM, then walk to all of the albums that artist is linked to, then walk to all of the tracks those albums are linked to, then extract the names of those tracks.” The result of this query will be the names of all of the tracks REM has ever produced.
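The artist-to-album-to-track walk reads naturally as repeated link expansion. Here is a small in-memory sketch of that traversal; the dataset, the tagging-by-bucket convention, and the `walk` helper are all invented for illustration and are not Riak's link syntax.

```python
def walk(store, links, start, *steps):
    """Follow link phases: starting from `start`, expand each object's
    links whose tag matches the current step, then look up the final objects."""
    frontier = [start]
    for tag in steps:
        frontier = [dst for src in frontier
                    for dst in links.get(src, []) if dst[0] == tag]
    return [store[key] for key in frontier]

# Hypothetical dataset: one artist, two albums, one track each.
links = {
    ("artist", "REM"): [("album", "Murmur"), ("album", "Reckoning")],
    ("album", "Murmur"): [("track", "Radio Free Europe")],
    ("album", "Reckoning"): [("track", "So. Central Rain")],
}
store = {
    ("track", "Radio Free Europe"): "Radio Free Europe",
    ("track", "So. Central Rain"): "So. Central Rain",
}

names = walk(store, links, ("artist", "REM"), "album", "track")
print(names)  # ['Radio Free Europe', 'So. Central Rain']
```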
We found this link-walking to be so useful that we even created a URL syntax for it. At the bottom you can see the same link walk expressed as a URL, and performing an HTTP GET on that URL will return to you all of the track objects that the earlier example mapped across.
There is still much more to learn about Riak, including how to build pluggable backends, how to use the eventing system, how we monitor Riak clusters, and how we do inter-cluster replication. Those will have to wait for another presentation.
If you're interested in learning more in the meantime, though, please visit http://riak.basho.com/, where you can read the documentation and download Riak to experiment with yourself. There is also a mailing list at email@example.com, where many current users are eager to discuss your questions.