Last week, Famigo was featured in both TechCrunch and VentureBeat. I knew about these in advance, and I also knew that while both would drive a lot of traffic, the traffic wouldn't stick around for long. We needed to scale up, but we didn't need to scale up for long. Add to that the fact that we wanted to scale up temporarily without spending any money and you have a fun little optimization problem!
Before I made any changes, I did a simple load test just to see what we were starting with. I installed httperf, and then directed 20 GETs a second for a minute to the index of our site. The good news is that I didn't kill production entirely. The bad news is that half the requests timed out, and the average response time for all requests across the load test (9+ secs) was unacceptable.
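A load test like this is easy to script so it can be rerun after every change. What follows is a minimal sketch rather than the exact command I ran: the helper name `httperf_args`, the placeholder host, and the 5 second timeout are all assumptions. It relies on httperf's default of one GET per connection, so 20 connections a second for 60 seconds works out to 1,200 requests.

```python
import shlex


def httperf_args(server, uri="/", rate=20, duration_secs=60,
                 timeout_secs=5, port=80):
    """Build an httperf command line for a fixed-rate GET test.

    httperf makes one call per connection by default, so `rate`
    connections per second is also `rate` GETs per second.
    The timeout value is an assumption, not a number from the post.
    """
    num_conns = rate * duration_secs  # total requests for the whole run
    return [
        "httperf",
        "--server", server,
        "--port", str(port),
        "--uri", uri,
        "--rate", str(rate),
        "--num-conns", str(num_conns),
        "--timeout", str(timeout_secs),
    ]


if __name__ == "__main__":
    # Placeholder host; point this at a machine you are allowed to hammer.
    print(shlex.join(httperf_args("www.example.com")))
```

Passing the returned list to `subprocess.run` would run the test, and bumping `rate` is all it takes to try the bigger numbers.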
I didn't get too upset here; I expected us to do poorly. After all, we'd spent all of our development time in actually building something people want, not on scaling up for thousands of imaginary users.
On a side note, I had absolutely no idea how much traffic to expect from our PR blitz. Regardless of that, I knew we had to stay up. PR is a long, expensive process involving many, many parties, and a company doesn't get many chances with these outlets or their readers. In order for us to truly benefit from all of this visibility, we had to stay up. As a result, when I was load testing, I used gigantic numbers to ensure we were absolutely prepared.
True App Performance
In my initial load test with httperf, I noticed something interesting: the response time for the initial batch of GETs wasn't even that good. The initial problem wasn't one of scalability, but one of plain old web performance. Yay, that's considerably less terrifying!
After a few rounds of profiling our Django requests, I determined that the main bottleneck was with a database query. The queries required in rendering the index of famigo.com are preposterously complex. Why is that? It's because we allow our users to sort and filter apps by every data point in the known universe.
Before I began rewriting any of those insanely tough queries, I enabled profiling in MongoDB and ran the queries again. This showed me something wonderful: the problem wasn't with the queries, but with a lack of indexes. (Edit: we weren't lacking indexes entirely, but we had changed the query recently and we were now lacking two important indexes.) Hooray for easy solutions! I created the indexes, reran the queries, and saw the immediate improvement. We were now rendering our site index in 1.5 seconds or less, compared to 3 seconds before.
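To make the "is it the query or the index?" check concrete, here is a rough sketch of sifting through profiler output. It is a set of assumptions, not the script I used: the field names (`millis`, `planSummary`, `docsExamined`, `nreturned`) follow what recent MongoDB versions write to `system.profile`, while the function name, the thresholds, and the sample entries are made up for illustration.

```python
def unindexed_ops(profile_entries, min_millis=100, scan_ratio=10):
    """Return slow profiler entries that look like they lack an index.

    An entry is suspicious if it did a full collection scan, or if it
    examined far more documents than it returned. Both thresholds are
    arbitrary starting points.
    """
    suspects = []
    for entry in profile_entries:
        if entry.get("millis", 0) < min_millis:
            continue  # fast enough, ignore
        plan = entry.get("planSummary", "")
        examined = entry.get("docsExamined", 0)
        returned = max(entry.get("nreturned", 0), 1)  # avoid dividing by 0
        if plan.startswith("COLLSCAN") or examined / returned >= scan_ratio:
            suspects.append(entry)
    # Slowest first, since those are the ones worth indexing first.
    return sorted(suspects, key=lambda e: e.get("millis", 0), reverse=True)


# Hypothetical entries, shaped like documents from db["system.profile"].
# With pymongo, profiling could be switched on beforehand with something
# like db.command("profile", 1, slowms=100).
sample = [
    {"ns": "mydb.apps", "millis": 2900, "planSummary": "COLLSCAN",
     "docsExamined": 50000, "nreturned": 20},
    {"ns": "mydb.apps", "millis": 140, "planSummary": "IXSCAN { rating: -1 }",
     "docsExamined": 20, "nreturned": 20},
    {"ns": "mydb.users", "millis": 3, "planSummary": "COLLSCAN",
     "docsExamined": 10, "nreturned": 10},
]

for op in unindexed_ops(sample):
    print(op["ns"], op["millis"], op["planSummary"])
```

Anything this flags is a candidate for an index on the fields being filtered and sorted, which pymongo can create with `create_index` on the collection.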
HTTP Requests and Page Size
My next step was to use YSlow to further analyze our site's performance. Side note: YSlow is awesome, and it singlehandedly redeems Yahoo as a company in my eyes.
I learned a few surprising facts about our site in YSlow. First, holy moly, we send a LOT of data over on each request; initially, this was over 1 MB. Now, if you're on a cable modem or DSL, you probably wouldn't notice the transfer time. Someone on 3G most certainly would, though. There's a scalability aspect to this as well. The larger our response is, the longer it takes our webserver to respond to a request, and thus the fewer requests we can actually respond to.
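Some back-of-the-envelope arithmetic shows why page weight matters. The link speeds below are assumptions picked for illustration (10 Mbit/s for a home connection, 1 Mbit/s for 3G), and the calculation ignores latency and connection setup, so treat the output as a rough lower bound.

```python
def transfer_seconds(page_bytes, link_mbps):
    """Seconds to push page_bytes through a link of link_mbps megabits/s.

    Ignores latency, TCP slow start and parallel connections.
    """
    return page_bytes * 8 / (link_mbps * 1_000_000)


PAGE_BYTES = 1_000_000  # roughly the 1 MB page weight YSlow reported

# Link speeds are assumed, not measured.
for label, mbps in [("cable/DSL at 10 Mbit/s", 10), ("3G at 1 Mbit/s", 1)]:
    print(f"{label}: {transfer_seconds(PAGE_BYTES, mbps):.1f} s")
```

The same arithmetic applies on the server side: every byte has to be written out before that connection is free to serve someone else.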
Now that the site was responding quickly and with a minimum amount of data, I set to work on our cache. We use memcached to cache the results of certain expensive, out-of-process operations, like complicated queries and requests to certain APIs. Normally we cache all of this for an hour. That's usually an okay trade-off: it keeps the data on our site reasonably fresh as you navigate through it, and requests only take a little while when the cache has expired and needs to be refilled.
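The pattern is the usual look-in-the-cache-first one. Here is a minimal sketch, assuming a cache object with `get(key)` and `set(key, value, timeout)` methods like Django's cache API. `DictCache` is a hypothetical in-process stand-in so the example runs without memcached, and the names `cached` and `ONE_HOUR` are mine.

```python
import time

ONE_HOUR = 60 * 60  # seconds


class DictCache:
    """Tiny in-process stand-in for memcached, for illustration only."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # entry is stale, drop it
            return default
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, time.monotonic() + timeout)


def cached(cache, key, compute, timeout=ONE_HOUR):
    """Return the cached value for key, computing and storing it on a miss.

    A result of None is treated as a miss, so it is never cached.
    """
    value = cache.get(key)
    if value is None:
        value = compute()  # the slow query or API call
        cache.set(key, value, timeout)
    return value


cache = DictCache()
top_apps = cached(cache, "top_apps", lambda: ["app one", "app two"])
```

Only the first caller after an expiry pays for `compute()`; everyone else within the hour gets the stored result.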
With all of these new visitors, I wasn't particularly concerned about whether the data was fresh; we just needed to respond quickly. We needed to approach this like a rapper in a gentleman's club: straight cache, homey.
I cranked our cache life to 24 hours, and then I warmed the cache up myself by clicking every dang page I could find. A random visitor would have to do something very strange in order to request something that hadn't already been cached.
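Clicking around by hand works, but the warm-up can also be scripted. This is a sketch under assumptions: the function name `warm_cache`, the base URL, and the list of paths are placeholders, and it simply issues one GET per page so the server fills its cache before real visitors arrive.

```python
from urllib.request import urlopen


def warm_cache(base_url, paths, fetch=urlopen):
    """GET each path once and report what happened.

    Returns a dict mapping path to the HTTP status, or to the exception
    if the request failed. `fetch` is injectable to make testing easy.
    """
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with fetch(url) as response:
                results[path] = response.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[path] = exc
    return results


if __name__ == "__main__":
    # Placeholder host and paths; list every page worth pre-caching.
    for path, outcome in warm_cache("http://www.example.com",
                                    ["/", "/about"]).items():
        print(path, outcome)
```

Running it again right before the articles go live keeps the 24 hour window from expiring at an awkward moment.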
All of our web requests typically go to the same EC2 instance. I began to worry: What would happen if that instance burst into flames from all of the traffic? We had identical VMs that could serve those requests, but there'd be definite downtime as A) I responded to the 'Kaboom!' alert and B) DNS changes propagated.
The clear solution was to use a load balancer to spread our web requests across those EC2 instances. Amazon makes this both simple and impossible. It's trivial to create an Elastic Load Balancer and point a subdomain (like www.famigo.com) at it. It is not at all simple to point the domain itself at the load balancer, unless you use Amazon Route 53 to create a hosted zone for your domain, then use Elastic Load Balancing to add a zone apex alias to your hosted zone. I copied that from Amazon's help; I still have no idea what it means.
Due to this, the load balancer handled requests to www.famigo.com, but famigo.com was still pointing to just one instance. I then spent the next several hours hoping journalists would direct visitors to the right URL. Fortunately, they did. For this, I thank the media as a whole.
In the end, we did not go down. The last round of load tests with httperf showed us handling 200 GETs a second indefinitely, with an average response time of about 1.2 seconds. We served thousands of requests without even a hiccup, and we did it without spending any money.