January 2012 Archives

Things Get Weird at Scale

| No TrackBacks

Something scary happened on Saturday. At Famigo, we have several different monitoring systems for our production environment. At about 3 AM, they all collectively went nuts. I happened to be up then because my son had a coughing fit, so I checked the site, verified that nothing weird was happening, and chalked it up to code gremlins.

When I woke up a few hours later, I saw roughly 38 gazillion more alerts. Based on the amount of load we were seeing, you'd think Barack Obama had interrupted all network programming to give a Famigo plug (note to Barack: you should do this). Still, I didn't actually see many users on the site, just a lot of load. Also, we're load balanced in production, and each traffic-handling node was under heavy load; it wasn't just one machine.

Maybe It's Not a Problem?
Saturday mornings are a key time for us at Famigo because we send out a report to each activated user, showing what apps their kids played along with some personalized app recommendations. When I saw the load, I immediately thought something was wrong with the email generation process. Maybe we dropped an index somewhere and the queries were suddenly taking a long time? I resolved not to worry any more about this until the emails were done.

Here's the problem: the emails wouldn't finish. They were going so slowly due to the load that we'd be sending these emails for days. I now began to worry more. I canceled our email task and began troubleshooting in earnest.

Bizarre Facts Emerge
As I SSHed into our various boxes, I noticed, via a top command, that it was the Python process serving our website and API that was consuming 100% of the CPUs. That was intriguing. I restarted all of the usual suspects (MongoDB, lighttpd, our web app), only to see things quickly begin to degrade again. Within a few minutes, Python was once again consuming 100% of the CPU.

Like most web apps, we don't really do much work that's CPU-bound. In fact, one of the laws of performance I've learned is that if you do have work that's CPU-intensive, always do it in the background. And yet, we were clearly taxing the CPU. Maybe we accidentally pushed a commit that attempts to generate pi to 1 million digits every time someone made a web request?

It seemed clear that the issue was with our code. Here's another weird thing: we hadn't changed much lately. I went through all of the commits for the past 2 days, and it was all pretty boring stuff. Just to verify, I pointed my dev instance to our production database (don't try this at home, kids) and began to actually navigate through these recent changes. Like I initially thought, there was nothing earth-shattering; it was all pretty standard web stuff.

At this point, I began to think I was hallucinating. I hadn't seen Girl Interrupted, but I imagined that Winona Ryder got committed in that movie because of the mental strain of debugging in production.

A-ha! (Or, How to Solve Problems Through Random Mouse Clicks)
I didn't know what to do. I just began clicking around our site on my dev instance, monitoring page load time in Chrome. Everything looked just fine on my dev instance at first, and then I began to notice one trivial view taking a little bit longer than it should've. The page itself should've rendered in a second or so, and it was taking closer to 1.5 seconds. The more I reloaded, the longer it took. This was particularly interesting because this view was our application view (here's an example). We can render over 30,000 apps with that view, so while the logic is very simple, it's constantly being rendered because of all the traffic.

Now, we get to the funny part. As I mentioned, we analyze tens of thousands of apps and, if you knew the right app slug, you could actually render any app, even dirty ones, with that view. While we didn't advertise this fact, you could get to stuff like famigo.com/app/super-sexy-sex-time/. Those are clearly not the apps we want families to see, even progressive European families. So, I had recently put in just a bit of work to keep sex apps from rendering. Every time someone requested an app, we'd check to make sure it was in the set of allowed, non-sex apps before we rendered anything.

It's really just one line of code that does the no-sex-app check. The no-sex-app check wasn't being done in the database; I was basically saying, in Python, 'raise 404 if app not in good_apps'. That's so simple! On my development instance, it worked fine. It didn't slow down our unit tests. However, when I wrote a quick bash script to request that URL 10 times simultaneously, things began to explode. In production, when we regularly have 25+ visitors requesting that URL at all times, everything truly burst into flames. It turns out that this particular view was CPU-bound, I just didn't know it until it encountered some scale.

Scale Drives You Mental
I think this is a fascinating bug. It's very simple logic that would work fine if we had 3 or 5 people on the site at any moment. With 10x that traffic, it was catastrophic to the rest of the platform. Imagine how crazy this gets at 100x or 10,000x that traffic. That's what makes scalability fun: gigantic issues at scale often come from very innocuous code. At least for me, the root cause is never what I expect.

How do we prevent this from happening in the future? I'm not entirely sure. It's not really something that fits into a unit test or an integration test. As a stopgap solution, we wrote a Python decorator that wraps all of our views and logs how long they took to render. Based on that, we can calculate how long a view should take to render, and alert ourselves if the render time is outside a reasonable span of time. It's not perfect, but it's a start. Anybody have a better idea?

I am on the record as being a MongoDB fan, admirer, and devotee. I never quite felt the same way about Redis, though. My friends would talk excitedly about Redis and I'd say, "But I have a perfectly good key value store in memcached and a perfectly good document oriented database in MongoDB. Between those two, I will solve all hard problems, excluding global warming!" Slowly, though, I began to run into harder problems.

A Harder Problem
At Famigo, we do many, many distinct, complex queries when it comes to recommending apps for families (eg, give me the top 1000 puzzle games for young adults that are free on the Amazon App Store, sorted by user rating). Doing all these queries on demand proved to be a little bit slow (average query time is about a second), so I decided to cache the results of each distinct query for 8 hours. That's slightly more complex, but it wasn't like I was writing an Erlang compiler in Visual Basic.

Initial Approaches
Take 1 of cache implementation: Use memcached for the cache. Ten minutes later, curse memcached for not having an ordered datatype.

Take 2 of cache implementation: Use MongoDB for the cache. Many minutes later, celebrate success (prematurely).

What did our cached query results look like in MongoDB? Each document in the cache had a cache key (eg, most-popular-puzzle-games-for-young-adults), an expiration date, an ordinal, and a reference to the application document that we wanted to render.

Warning Signs
There were already hints that I was doing it wrong. Case in point: I had to manage all of the cache expiration myself. In MongoDB, you can specify a maximum number of documents that a collection can store (which I was doing; I specified a max of 500k docs), but that's not at all the same thing as caching these results for exactly 8 hours. Speaking of which: hey 10gen, we want TTL collections!

Another sign I was doing it wrong: I had to do a lot of index tuning to make my interactions with the MongoDB cache fast. Every time I checked the cache, I had to specify the cache key, expiration date, and sort by the ordinal; for that to be fast, all of those needed to be covered by an index. While the index sped up my finds, it slowed down my inserts. I had a hell of a time finding the right balance.

Unfortunately, I'm not yet done listing the signs that I wasn't doing it right. You can't delete from a MongoDB capped collection. That's no problem if you're just collecting logs, but from time to time, we must invalidate our cache. Since I couldn't delete these documents from the cache, I had to add another column that stored an Active status, which also required an index, since we had to query by it every time.

How Did It Work? (Spoiler Alert: Not So Great.)
We ended up running in production on my MongoDB app query cache for a month or two. It was definitely faster than performing all of the complex queries in real time (~300ms instead of 1s), but there was a new delay when we had to add results to the cache (~200ms). As both app data and users scaled up by an order of magnitude, it was clear that this would just burst into flames at some point.

A New Solution Emerges!
I decided to try something new. I knew that Redis had a sorted set datatype, so I started to play with that. Rather than cache these app query results in MongoDB, I created a sorted set of app ids for each query. I let Redis handle all of the cache expiration business by setting a TTL value for each key. When I wanted to pull from the cache, I did so, then did a find in MongoDB using the $in operator with all of the app ids, then I reordered that in Python based on the app ordering in Redis. I knew it wasn't as pretty, but was it effective?

For my first test, I merely timed how long it took to add a few hundred results to my Redis-backed cache. That was regularly around 200ms; it was now down to 1 or 2ms. Impressive... but then that should be fast. I refused to be impressed until I started pulling from the cache.

Was it faster to pull the app ids from Redis, use that to pull the documents from MongoDB, then use Python to reorder everything? Actually, yes. Thus far, getting from the cache takes 1/3 of the time that it did before. Meanwhile, adding to the cache is essentially free.

How Not to Do MongoDB, or Any Other Datastore
It turns out that, technically, I was correct. I could use MongoDB as a key-value store for caching, much like I could use my Mazda 3 as an amphibious assault vehicle. In practice, neither would be optimized for those use cases.

A key part of determining your architecture is understanding the strengths and weaknesses of your technology choices. The primary strength of MongoDB is how it allows you to simplify and decouple your data modeling via document-orientation. What about Redis? Its primary strength is how it enables very fast access to a few key data structures, like sets and dictionaries. With both of those stated, it becomes clear the situations in which you can combine MongoDB and Redis to build delightful software.

About the Author

The Art of Delightful Software is written by Cody Powell. I'm currently Director of Engineering at TUNE here in Seattle. Before that, I worked on Amazon Video. Before that, I was CTO at Famigo, a venture-funded startup that helped families find and manage mobile content.

Twitter: @codypo
Github: codypo
LinkedIn: codypo's profile
Email: firstname + firstname lastname dot com