Production Ain't Pretty: A Case for Excessive Application Logging

| 6 Comments | No TrackBacks
Your application is elegant and well designed.  It's covered by unit tests, the domain is effectively modeled, and your key algorithms are scalable and optimized.  
Jules Winnfield.jpg
You know the methods like Jules Winfield knows Ezekiel 25:17.  You may even be looking forward to pushing it to production, thinking, "Hey, we got it right for once!"  However, unless you have littered your beautiful codebase with thousands of hacked-up log lines for the most trivial operations, you don't stand a chance in hell of supporting this app.

What's the biggest difference between how developers and QA use a system versus how actual production users use the system?  It's simple: developer and QA know what the hell they're doing.  Production users are going to click every button at the wrong time, enter bizarre and incorrect information into forms, and fail to read any and all messaging from the system.  Not only will these barbarians be savaging your system, but there are thousands of them and they expect you to help them.

You can certainly try to support these users and fix the bugs they report without any logs.  After all, you have their data, you have your app, and you can just reproduce their errors manually.  That's nice thinking, but you're wrong.  Remember, you know how to use your system.  You're not going to load 87,000 records into your grid, or try to delete 87,000 records at once, or type an 87,000 word stream-of-consciousness diatribe about gluten-free products into a memo field.  Good luck reproducing that big ball of insanity via the user's data alone.

What are some of the benefits of excessive logging across your application?

First, you know exactly what users were doing when something blew up.  This is always much more helpful than a user-submitted bug report; the users submitting these bug reports have an imperfect knowledge of the system and they're not going to remember exactly what they were doing when catastrophe struck.  You can reproduce erroneous conditions exactly if you've been logging like a fiend, though.

Second, it's much easier to determine if a production issue is an actual bug or if it's some random hiccup in your production environment.  Databases crash, routers get rebooted, and third-party APIs go down for updates.  Maybe your production environment only experienced this hiccup for 1.5 seconds, but it could be the precise period where the CEO was hitting the Save button.  Good luck figuring that out without logs.

Third, voluminous, time-stamped logs can give you great insight into production performance.  If you write software like the rest of us, the load your system is under in the development and QA environment is nothing at all like the load it's under in production.  Once under that load, trivial things can slow down... a lot.  Think about a wooly mammoth, stuck in tar, dragging a Winnebago behind him, and know that'll be some random part of your system that you never expected.  If you collect time-stamped logs like my great aunt collects stray cats, then it's a cinch to find these culprits and then begin to tune them appropriately.  
Ziggy.jpg

I think we all agree now that logs are tremendously important. 
 If not, I encourage you to head over to check out the latest Ziggy comic and to leave the grown-up talk to the rest of us.  What exactly do we want to log then?

First, log every time your system goes out of process.  Thus, log all database, filesystem, OS, API, and web service calls.  Since you typically don't control these dependencies, you need to know exactly what you sent to them and what they sent back in order to troubleshoot any weirdness.

Second, log all user input.  It's hard to overestimate the variance you'll see when you start asking your users for information.   Don't be s
danaykroydofficial.jpg
hocked when you see that people are filling in the First Name prompt with 'hi my name is anthony and i am in the market for that crystal skull vodka that dan akroyd sells on the internet'.  Real users can submit strange, strange stuff.

Third, log the input and output for any non-trivial algorithms in your code.  Sure, you've tested this logic to Hades and back, but you'll still find some weird edge cases in production.  For example, one of my beautiful scheduling algorithms one time blew up in production because of a weirdness with Argentinian timezones.  I certainly didn't anticipate that.

Adding all of this logging into your code is certainly going to ugly it up.  It's a good kind of ugliness though, an ugliness that shows preparation for any and all crazy ass questions you'll have to field once you have real users.  And this ugliness definitely beats the alternative, which is slowly being driven insane by trying to fix bugs you can't reproduce.  Think of it as logging as self defense.

No TrackBacks

TrackBack URL: http://www.codypowell.com/mt/mt-tb.cgi/9

6 Comments

If you work in a corporate environment, one thing to keep in mind when logging is that you may be logging personal data about your customers. This can cause you problems if its comes to the attention of legal or compliance departments whose job it is to make sure the corporation protects such data from unauthorized access.

Caveat: I did that once in a system and didn't realise how much all the logging slowed the performance.
On another system I know the logging ate up 300gb of disk space in 18 months.

Bravo. It's our world. We do not need to suffer like CSI techs wondering who did what where when with what. We create the world.

Is logging expensive. Yes. Is it cheaper not to log. No.

Granted, people can log the wrong things in the wrong way and provide no monitors or simple access and they can just hope disk space grows magically but I find that such people never choose to log in the first place.

Remember that logging can also adversely affect performance. I generally do the following:

DEBUG: General "in and out" flow. e.g: "Called this function" or "Returned this value". Effectively a trace of program flow through system. If this is on in production your app will have terrible performance and log files will fill up your disk.

INFO: Generally describes who/what made request and what end result was. Can include calls to external systems. Same as above but not as extreme. Still keep this off in production.

WARN: Unexpected or undesired outcomes which are recoverable. Keep these on in production but use sparingly.

ERROR: Unrecoverable error. Requires immediate attention. Keep on in production. Log trace to be able to understand what caused error.

FATAL: I don't use it. It seems superfluous.

But what's important is to have the ability to change these log levels quickly and easily while the application is running or with little to no downtime. If WARN or ERRORS start appearing suddenly, then crank the log level down to get more detail.

I am selling the domain and site www.microscopiccolitis.org, please visit the site for information. I thought you might be interested.

About the Author

The Art of Delightful Software is written by Cody Powell. I'm a dev manager at Amazon where I work on the Instant Video, Mobile Clients team. Before that, I was CTO at Famigo, a venture-funded startup that helps families find and manage mobile content.

Are you interested in building great mobile apps for Amazon Instant Video? Email me!

Twitter: @codypo
LinkedIn: codypo's profile
Personal blog: Goulash
Email: firstname + firstname lastname dot com

Categories