August 2011 Archives

I did a talk at the Austin NoSQL group about hosting MongoDB in EC2, and I completely forgot to post anything here on it. I will correct that now! Slides here.

At Famigo, we house all of our valuable data in MongoDB and we also serve all requests from Amazon EC2 instances. We've devoted many mental CPU cycles to finding the right architecture for our data in the cloud, focusing on 3 main factors: cost, reliability, and performance.

If you're hosting anything in the cloud, you must keep one scary question in mind: what if this node disappears? The node could disappear because of an availability zone outage (wuzzup, EC2?), an actual hardware failure (remember, it's still hosted somewhere on something), or a werewolf attack on a datacenter. Regardless of the cause, it's safe to assume that this will happen at some point.

Since our data is crucial to our business and hard downtime is not an option, we want our data replicated across multiple nodes. Not only that, the ideal scenario would have automatic failover, so that if one node dies, another can take its place without any human involvement. In addition, it'd be great if a flaky node automagically rejoined the replication when it comes back up.

MongoDB replica sets satisfy all of these requirements, and they're quite easy to set up. Replica sets are like asynchronous master/slave replication on steroids. A replica set has one primary node and at least one secondary node. Each node has a special oplog collection, which is an ordered list of the writes performed on the data. All writes occur first on the primary node (and are recorded in its oplog). Secondary nodes follow the primary node's oplog and replay its new entries, so all data changes are applied quickly and in the right order.

One cool feature of replica sets is automatic failover. When a primary node goes down, an election is held and a new primary node is elected within a couple of seconds. Yep, within seconds; there are no electoral college shenanigans here. Another great feature is automatic recovery. When a node falls behind, it catches up by iterating through the primary node's oplog.
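To make the oplog idea concrete, here's a toy sketch in Python. It is not how MongoDB is implemented; the Node class and its write and sync_from methods are made up for illustration, and a list position stands in for the real oplog timestamps. It only shows why an ordered log makes catching up simple: a node that fell behind replays whatever entries it's missing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy replica-set member: a key/value store plus an ordered oplog."""
    name: str
    data: dict = field(default_factory=dict)
    oplog: list = field(default_factory=list)  # ordered (key, value) writes

    def write(self, key, value):
        """Apply a write locally and record it in this node's oplog."""
        self.data[key] = value
        self.oplog.append((key, value))

    def sync_from(self, primary):
        """Replay every primary oplog entry this node has not applied yet."""
        missing = primary.oplog[len(self.oplog):]
        for key, value in missing:
            self.write(key, value)
        return len(missing)


primary = Node("primary")
secondary = Node("secondary")

primary.write("user:1", "alice")
secondary.sync_from(primary)  # secondary is now current

# Pretend the secondary goes offline here and misses two writes.
primary.write("user:2", "bob")
primary.write("user:1", "alice v2")

replayed = secondary.sync_from(primary)  # the secondary catches up
print(replayed, secondary.data == primary.data)  # prints: 2 True
```

Because the entries are replayed in the order the primary recorded them, the secondary ends up with the later value for user:1, not the earlier one.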

Replica sets sound awesome, but they can be a little bit complicated for the following reason: a set must contain at least 3 nodes. Why? The primary node is determined by voting, and winning takes a majority of the set, so you want an odd number of nodes to break any ties and to leave a majority standing when one node dies. Thankfully for our bank account, there are special, lightweight nodes called arbiters that don't actually store any data, but exist solely to vote. While you do need 3 nodes, you don't need 3 full, high-performance nodes.
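Here's a small sketch of that voting arithmetic, plus the shape of a three-member configuration with one arbiter. The host names and the set name rs0 are hypothetical, and the two helper functions are mine, not part of any MongoDB API. The arbiterOnly field is the flag that marks a vote-only member in the replica set configuration document.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of the set."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, members_down: int) -> bool:
    """True if the surviving members still form a majority."""
    return voting_members - members_down >= majority(voting_members)


# Hypothetical hosts; arbiterOnly marks the member that only votes.
replica_set_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db1.example.com:27017"},
        {"_id": 1, "host": "db2.example.com:27017"},
        {"_id": 2, "host": "arbiter.example.com:27017", "arbiterOnly": True},
    ],
}

voters = len(replica_set_config["members"])
print(can_elect_primary(2, 1))       # two nodes, one down: False
print(can_elect_primary(voters, 1))  # three nodes, one down: True
```

With only two nodes, losing either one leaves a single vote out of two, which is not a majority. The cheap third voter is what lets the survivor win an election.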

Replica Set Performance
Replica sets sound as if they could be slow, since all writes must occur on one particular node first, then filter down to every other node. Also, by default, you can't even read from non-primary nodes. So, that means all writes AND all reads must occur on this one node that's also orchestrating all of the replication. Heavens to Betsy, this sounds terrible!

Thankfully, there is a setting that allows reads on your secondary nodes. This setting is rs.slaveOk(), and you can enter it in your mongo client at the command line; it applies to the connection you run it on. You can't guarantee that the data from these reads is completely up-to-date with the canonical dataset on your primary node, but practically speaking, it's good enough. I've found this to be a very worthwhile trade-off, and so we always set rs.slaveOk().
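If you're connecting from application code instead of the shell, the same trade-off is usually expressed in the connection string. The sketch below only builds such a string with Python's standard library; it does not connect to anything. The hosts and set name are hypothetical, and I'm assuming a driver that understands the replicaSet and readPreference URI options, which are newer than rs.slaveOk().

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, set_name, read_preference="secondaryPreferred"):
    """Build a mongodb:// URI that asks the driver to prefer secondaries for reads."""
    options = urlencode({"replicaSet": set_name, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/?{options}"


# Hypothetical hosts and set name.
uri = replica_set_uri(["db1.example.com:27017", "db2.example.com:27017"], "rs0")
print(uri)
```

The same caveat applies as with rs.slaveOk(): reads routed to a secondary may lag slightly behind the primary.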

Thus, with just a few keystrokes, a replica set allows you to scale your reads across many nodes. We scalability nerds should find that pretty exciting.

Node Performance
What are the ideal specs on each node in the replica set? At the very least, each node needs a 64-bit processor. This is because MongoDB uses memory-mapped files for performance, so every byte of data needs an address in the process's address space, and a 32-bit instance runs out of addresses at roughly 2.5 GB of data.
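If memory-mapped files are new to you, here's a minimal sketch using Python's mmap module. It has nothing MongoDB-specific in it; the file name and the 4 KB size are made up. It maps a file into the process's address space, writes to it as if it were memory, and then reads the bytes back from disk. A 32-bit process has only 2**32 addresses (4 GB) in total, and some of those go to the OS and the program itself, which is where the ceiling on mappable data comes from.

```python
import mmap
import os
import sys
import tempfile

# A 64-bit Python build reports a maxsize above 2**32.
is_64_bit = sys.maxsize > 2**32

# Create a small file to stand in for a data file (hypothetical name and size).
path = os.path.join(tempfile.mkdtemp(), "toy.datafile")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Map the file into the process's address space and write through memory.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:
        mapped[0:5] = b"hello"
        mapped.flush()  # ask the OS to push the dirty page to disk

# Read the same bytes back through ordinary file I/O.
with open(path, "rb") as f:
    on_disk = f.read(5)

print(is_64_bit, on_disk)
```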

With regard to RAM, the best rule of thumb I've seen is that a node should be able to fit its working set into memory for best performance. That's not necessarily your whole database. If you're watching a tool like top, it will look like Mongo is using a boatload of memory. This is largely an illusion: much of that figure is memory-mapped data cached by the operating system, which can reclaim it when something else needs the RAM.

If each node needs a 64-bit processor and (probably) a lot of RAM, this is starting to sound expensive. Thankfully, there's a shortcut here. Remember, if your node is an arbiter and only useful for voting, it does NOT need a 64-bit processor, nor does it need a lot of RAM. An arbiter can be the tiniest instance you can possibly get away with. If you have a Casio watch that you bought for $3.50 at a garage sale, I'd seriously consider using that as your arbiter.

On Amazon's EC2 service, I've found that a Large instance works best for each primary or secondary node. These are 64-bit instances with around 8 GB of RAM. Currently, you can spin one of those up on-demand for $0.34 an hour. Sadly, these are Amazon's cheapest 64-bit instances.

One cool thing about Rackspace Cloud instances is that they're all 64-bit. For a primary or secondary node, you can choose the cheapest instance with enough RAM. If you're looking for something equivalent to the EC2 Large instance, it's currently $0.48 an hour.

Yes, they are expensive, but you don't have to devote these instances solely to MongoDB. They can also host your API, web server, etc.

For an arbiter, I recommend finding the smallest instance you possibly can. I use an EC2 micro instance here, which you can get for free under certain conditions or spin up on-demand for $0.02 an hour. The Rackspace equivalent is $0.015 an hour.

I should note that if you're willing to prepay for these instances, the price goes down significantly. Also, if you have incriminating photos with which to blackmail executives at Rackspace or Amazon, the price goes down further still.

The end result here is that you get your data replicated across multiple nodes, automated failover and recovery, and scaled reads on highly-performant machines for less than $1 an hour. Not too shabby, MongoDB!
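Here's the back-of-the-envelope math behind that number. The prices are the on-demand figures quoted above; the layout of two data-bearing nodes plus one arbiter is my assumption of the smallest sensible set, and the hourly_cost helper is just for illustration.

```python
from decimal import Decimal

# Hourly on-demand prices quoted in this post.
PRICES = {
    "ec2": {"data_node": Decimal("0.34"), "arbiter": Decimal("0.02")},
    "rackspace": {"data_node": Decimal("0.48"), "arbiter": Decimal("0.015")},
}


def hourly_cost(provider, data_nodes=2, arbiters=1):
    """Total hourly price for a set of data-bearing nodes plus arbiters."""
    price = PRICES[provider]
    return price["data_node"] * data_nodes + price["arbiter"] * arbiters


for provider in PRICES:
    print(provider, hourly_cost(provider))
# prints:
# ec2 0.70
# rackspace 0.975
```

Both layouts come in under $1 an hour, before any prepay discount.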

About the Author

The Art of Delightful Software is written by Cody Powell. I'm currently Director of Engineering at TUNE here in Seattle. Before that, I worked on Amazon Video. Before that, I was CTO at Famigo, a venture-funded startup that helped families find and manage mobile content.
