I Ain't Afraid of No Downtime: Scaling Continuous Deployment

I was recently at the DevOpsDays conference, where I got into a conversation about build automation. I mentioned how we practice continuous deployment, so we may deploy to production 20 times a day. The guy replied, "That sounds great for some tiny startup, but what would happen if you had actual users?"

Allow me to respond in 2 parts. First, ouch. Second, continuous deployment is not at odds with a great user experience or high uptime requirements.

Between our website and our API at Famigo, we handle hundreds of thousands of HTTP calls every day. We've practiced continuous deployment for 2 years. You know how many complaints we've had about a cruddy user experience due to frequent deployments? Zero. Why were these deployments essentially transparent to all of our users? That's a requirement for our build process, and so we've focused on that part as much as the actual act of building and deploying.

How Does It Work?
First, let's talk about what our production environment looks like. We have a few different VMs hosting our web app; these are all based off of the same original image. Our load balancer distributes traffic across these instances evenly. Since our website and API are both built on Django, we use virtualenv to manage all of our Python dependencies on each instance. Each instance also runs Jenkins, which does the heavy-duty work of building and deploying.

All of the important data comes from MongoDB or Redis. I point that out just to note that, with this backend, we rarely do schema migrations. Big honking ALTER TABLE statements can cause serious downtime; just ask the guy in the Oracle shirt crying into his keyboard right now.

How Do We Build?
We have one instance that's constantly polling our GitHub repo for changes. When a change is found, it pulls down the repo. Our environment dependencies are part of that repo, so we make a call to virtualenv to ensure the environment is up to date. Then we run all of our tests; there are around 900 of these. When that's done, we rsync the files over to our production directories and restart our fcgi process. We then make a call to the next instance's Jenkins remote access API to kick off a build, and the whole process starts again on that instance.
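To make those steps concrete, here's a rough sketch of what one instance's build job could look like as a Python script. This is not our actual Jenkins config. The paths, the requirements.txt file, the "app-fcgi" service name, and the "deploy" job name are all made up for illustration, and it assumes pip inside the virtualenv handles the dependency update. The one thing it does get right is the ordering: nothing touches production until the tests pass.

```python
import subprocess
from typing import Callable, List, Optional

Command = List[str]

# Hypothetical locations and names; none of these come from a real setup.
REPO_DIR = "/srv/build/app"
PROD_DIR = "/srv/www/app"
VENV_PIP = "/srv/venv/bin/pip"
VENV_PYTHON = "/srv/venv/bin/python"
FCGI_SERVICE = "app-fcgi"
JENKINS_JOB = "deploy"


def build_steps(next_instance_url: Optional[str]) -> List[Command]:
    """Return the deploy commands for one instance, in the order they run."""
    steps = [
        # 1. Pull the latest code.
        ["git", "-C", REPO_DIR, "pull", "--ff-only"],
        # 2. Bring the virtualenv in line with the dependencies in the repo.
        [VENV_PIP, "install", "-r", f"{REPO_DIR}/requirements.txt"],
        # 3. Run the test suite.
        [VENV_PYTHON, f"{REPO_DIR}/manage.py", "test"],
        # 4. Copy the build into the production directory.
        ["rsync", "-a", "--delete", f"{REPO_DIR}/", f"{PROD_DIR}/"],
        # 5. Restart the fcgi process (the only step with downtime).
        ["sudo", "service", FCGI_SERVICE, "restart"],
    ]
    if next_instance_url:
        # 6. Ask the next instance's Jenkins to start its own build.
        steps.append(
            ["curl", "-fsS", "-X", "POST",
             f"{next_instance_url}/job/{JENKINS_JOB}/build"]
        )
    return steps


def run_pipeline(
    steps: List[Command],
    run: Callable[[Command], object] = subprocess.check_call,
) -> int:
    """Run each step in order and return how many ran.

    subprocess.check_call raises on a non-zero exit code, so a failed
    test run aborts the job before rsync or the restart ever happen.
    """
    completed = 0
    for step in steps:
        run(step)
        completed += 1
    return completed
```

On the last instance in the chain you'd call build_steps(None), so the rollout stops there instead of looping forever.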

The only portion of the build process that involves any downtime is when we rsync and then restart fcgi. Those steps take maybe a second or two. Since we build and deploy one instance at a time, that second of downtime rolls from machine to machine; in other words, we never have one second of downtime for all users on all instances.
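If you want to convince yourself that the restarts never overlap, here's a toy model of the rollout. The instance count and timings are assumptions I picked for the example (3 instances, a 2 minute build, a 1 second restart), not measurements from production.

```python
from typing import List, Tuple

Window = Tuple[float, float]


def rolling_windows(
    instances: int, build_seconds: float, restart_seconds: float
) -> List[Window]:
    """Return each instance's (start, end) restart window in a sequential rollout."""
    windows = []
    clock = 0.0
    for _ in range(instances):
        # Pull, virtualenv update, and tests: the instance keeps serving.
        clock += build_seconds
        # rsync + fcgi restart: the only time this instance is down.
        windows.append((clock, clock + restart_seconds))
        # The next instance's build is only triggered after this restart.
        clock += restart_seconds
    return windows


def max_down_at_once(windows: List[Window]) -> int:
    """Sweep over start/end events and return the peak number of instances down."""
    # At equal timestamps, an end (-1) sorts before a start (+1).
    events = sorted([(s, 1) for s, _ in windows] + [(e, -1) for _, e in windows])
    down = peak = 0
    for _, delta in events:
        down += delta
        peak = max(peak, down)
    return peak


windows = rolling_windows(instances=3, build_seconds=120, restart_seconds=1)
print(windows)                    # [(120.0, 121.0), (241.0, 242.0), (362.0, 363.0)]
print(max_down_at_once(windows))  # 1
```

Because each instance only kicks off the next one after its own restart, the peak is always 1, no matter how many instances you add.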

One thing to keep in mind here is that our load balancer constantly pings our instances to ensure they're up. (After all, that's the whole point of these load balancer thingies.) If, for whatever reason, our downtime is longer than a few seconds, the load balancer will stop distributing traffic to that instance until it's back up.
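Here's a small model of that behavior. It isn't how any particular load balancer is implemented. The Pool class and the threshold of 3 consecutive failed pings are assumptions for the sake of the example; real load balancers let you configure the interval and threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pool:
    """Toy model of a load balancer's health tracking."""

    # Assumed: consecutive failed pings before an instance is pulled.
    fail_threshold: int = 3
    failures: Dict[str, int] = field(default_factory=dict)

    def record_ping(self, instance: str, ok: bool) -> None:
        """A successful ping resets the counter; a failed one increments it."""
        self.failures[instance] = 0 if ok else self.failures.get(instance, 0) + 1

    def in_rotation(self, instances: List[str]) -> List[str]:
        """Return the instances that should still receive traffic."""
        return [
            name for name in instances
            if self.failures.get(name, 0) < self.fail_threshold
        ]


instances = ["web1", "web2"]
pool = Pool()

# A one-second restart costs at most a ping or so: web1 stays in rotation.
pool.record_ping("web1", ok=False)
print(pool.in_rotation(instances))  # ['web1', 'web2']

# A longer outage crosses the threshold: traffic goes elsewhere.
pool.record_ping("web1", ok=False)
pool.record_ping("web1", ok=False)
print(pool.in_rotation(instances))  # ['web2']

# Once web1 answers again, it rejoins the pool.
pool.record_ping("web1", ok=True)
print(pool.in_rotation(instances))  # ['web1', 'web2']
```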

As you can see, you have to be a little bit lucky (unlucky, rather) to ever see downtime here. Your request has to be routed to one particular instance during its 1 second of downtime, before the load balancer has noticed that the instance is down.

Does That Downtime Even Matter?
Please break out your slide rule, as we're going to do some math. Per instance, if we do 20 deployments with 1 second of downtime for each, that's 20 seconds. There are 86400 seconds in a day. 20/86400 is, in purely mathematical terms, teensy weensy: about 0.023% of the day. (I don't know how to calculate downtime across all instances because of the load balancer and its outage detection, so I'm just sticking with one instance here.)
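If your slide rule is in the shop, the same single-instance arithmetic in Python looks like this. The function name is just something I made up for the example; the inputs are the numbers from above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def downtime_fraction(deploys_per_day: int, seconds_per_deploy: float) -> float:
    """Fraction of the day a single instance spends restarting."""
    return deploys_per_day * seconds_per_deploy / SECONDS_PER_DAY


fraction = downtime_fraction(deploys_per_day=20, seconds_per_deploy=1)
print(f"{fraction:.4%} down, {1 - fraction:.4%} up")
# 0.0231% down, 99.9769% up
```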

Now, if we were processing credit cards or something like that, 20 seconds of downtime per day due to deployments would be unacceptable. (Note: we don't do that.) On the other hand, if your traffic is largely mobile, as ours is, then 20 seconds a day is nothing. In fact, we expect far worse. The reason is that, in the land of mobile, you get in the habit of trying and retrying everything related to the network, because the coverage can be so spotty.

Continuous deployment does not necessarily mean giant swaths of downtime throughout the day. In fact, as you scale up in environment infrastructure, deployment smarts, and hopefully users, you gain tools that can make this downtime negligible. Now, back to my actual users.

About the Author

The Art of Delightful Software is written by Cody Powell. I'm currently Director of Engineering at TUNE here in Seattle. Before that, I worked on Amazon Video. Before that, I was CTO at Famigo, a venture-funded startup that helped families find and manage mobile content.

Twitter: @codypo
GitHub: codypo
LinkedIn: codypo's profile
Email: firstname + firstname lastname dot com