January 2010 Archives

Your application is elegant and well designed.  It's covered by unit tests, the domain is effectively modeled, and your key algorithms are scalable and optimized.  
Jules Winnfield.jpg
You know the methods like Jules Winfield knows Ezekiel 25:17.  You may even be looking forward to pushing it to production, thinking, "Hey, we got it right for once!"  However, unless you have littered your beautiful codebase with thousands of hacked-up log lines for the most trivial operations, you don't stand a chance in hell of supporting this app.

What's the biggest difference between how developers and QA use a system versus how actual production users use the system?  It's simple: developer and QA know what the hell they're doing.  Production users are going to click every button at the wrong time, enter bizarre and incorrect information into forms, and fail to read any and all messaging from the system.  Not only will these barbarians be savaging your system, but there are thousands of them and they expect you to help them.

You can certainly try to support these users and fix the bugs they report without any logs.  After all, you have their data, you have your app, and you can just reproduce their errors manually.  That's nice thinking, but you're wrong.  Remember, you know how to use your system.  You're not going to load 87,000 records into your grid, or try to delete 87,000 records at once, or type an 87,000 word stream-of-consciousness diatribe about gluten-free products into a memo field.  Good luck reproducing that big ball of insanity via the user's data alone.

What are some of the benefits of excessive logging across your application?

First, you know exactly what users were doing when something blew up.  This is always much more helpful than a user-submitted bug report; the users submitting these bug reports have an imperfect knowledge of the system and they're not going to remember exactly what they were doing when catastrophe struck.  You can reproduce erroneous conditions exactly if you've been logging like a fiend, though.

Second, it's much easier to determine if a production issue is an actual bug or if it's some random hiccup in your production environment.  Databases crash, routers get rebooted, and third-party APIs go down for updates.  Maybe your production environment only experienced this hiccup for 1.5 seconds, but it could be the precise period where the CEO was hitting the Save button.  Good luck figuring that out without logs.

Third, voluminous, time-stamped logs can give you great insight into production performance.  If you write software like the rest of us, the load your system is under in the development and QA environment is nothing at all like the load it's under in production.  Once under that load, trivial things can slow down... a lot.  Think about a wooly mammoth, stuck in tar, dragging a Winnebago behind him, and know that'll be some random part of your system that you never expected.  If you collect time-stamped logs like my great aunt collects stray cats, then it's a cinch to find these culprits and then begin to tune them appropriately.  

I think we all agree now that logs are tremendously important. 
 If not, I encourage you to head over to check out the latest Ziggy comic and to leave the grown-up talk to the rest of us.  What exactly do we want to log then?

First, log every time your system goes out of process.  Thus, log all database, filesystem, OS, API, and web service calls.  Since you typically don't control these dependencies, you need to know exactly what you sent to them and what they sent back in order to troubleshoot any weirdness.

Second, log all user input.  It's hard to overestimate the variance you'll see when you start asking your users for information.   Don't be s
hocked when you see that people are filling in the First Name prompt with 'hi my name is anthony and i am in the market for that crystal skull vodka that dan akroyd sells on the internet'.  Real users can submit strange, strange stuff.

Third, log the input and output for any non-trivial algorithms in your code.  Sure, you've tested this logic to Hades and back, but you'll still find some weird edge cases in production.  For example, one of my beautiful scheduling algorithms one time blew up in production because of a weirdness with Argentinian timezones.  I certainly didn't anticipate that.

Adding all of this logging into your code is certainly going to ugly it up.  It's a good kind of ugliness though, an ugliness that shows preparation for any and all crazy ass questions you'll have to field once you have real users.  And this ugliness definitely beats the alternative, which is slowly being driven insane by trying to fix bugs you can't reproduce.  Think of it as logging as self defense.
You can learn a lot about great functions by looking at horrible functions.  What's one of the antipatterns we see in horrible functions all the time?  How about this: the return type of the function is completely insane.  Yeah, I'm looking at you, Mr. Hash of Hashes!  This is a particularly huge problem when it comes to calling APIs whose code you can't actually read.

Last time, I talked about how the function name is the best spot to communicate the intent of a function.  Return type also comes in very handy there, because it's the proof that the function does what it claims to do.

Imagine you see a function called CalculateSubtotal in an API.  You think to yourself, "A-ha, this calculates a subtotal!"  And then you look a little closer at this function and you see that it's void, it doesn't actually return anything.  Wait, calculate what for the who-now?  Intent destroyed, dogs and cats are living together, and chaos reigns supreme!

Returning void from the function is usually too simple, but perhaps an even bigger problem is returning something too complex.  Imagine now that CalculateSubtotal returns an array of guids.  What in the world do those values correspond to?  They correspond to something, I realize, and perhaps I'm supposed to do something with them, but what and how?  Are these guids identifiers for something in the database?  Maybe they're lookup codes for errors in the logs?  Maybe they're just really beautiful guids that you thought I, a fellow data type enthusiast, would enjoy looking at?

What's the only way to actually see what those complex return-types from a closed API correspond to?  You have to run the code through a debugger.  When you do that, you're delivering a kick to the groin to the stated purpose of a good function: to encapsulate complexity.  Complexity is no longer encapsulated when I, the caller of the function, must first analyze your code to see what in the hell you are doing.  (I am harping on this because I am probably the world's worst offender here.)

There are many times when you will have to return a hash of hashes, or something equally complex, from your function.  What do you do then?  You should create a class.  It'll be an incredibly simple class that a monkey could write between glasses of Tang.  At the very least though, the OrderIDsToLineItemIDs class (or something along those lines) makes much more sense to the rest of Planet Earth than the monstrous, beastly hash of hash of hash of hashes you may be thinking about.

I love to listen to great developers talk about how they write software. If you like the same thing, then you should check out Coders at Work by Peter Seibel. Peter rounded up some amazing folks from the software development world to interview, like Jamie Zawinski, Guy Steele, Ken Thompson, Peter Norvig, and the ultimate shot caller himself, Donald Knuth.

I continue to find must-read passages in this book, like this selection from Peter's interview with Guy Steele. Guy Steele, of course, did hugely important work on Emacs, Scheme, Java, and Common Lisp. I love how he correlates the edge cases of fairy tales and software.

Guy Steele: And I think that the other problem is that people like to focus on the main thing they have in mind and not worry about the edge cases or the screw cases or things that are unlikely to happen. And yet it is precisely in those cases where people are most likely to disagree what the right thing to do is.

Sometimes I'll quiz a student, "What should happen in this case?" "Well, obviously it should do this." And immediately someone else will jump in and say, "No, no, it should do that." And those are exactly the things that you need to nail down in a programming specification of some process.

I think it's not an accident that we often use the imagery of magic to describe programming. We speak of computing wizards and we think of things happening by magic or automagically. And I think that's because being able to get a machine to do what you want is the closest thing we've got in technology to adolescent wish-fulfillment.

And if you look at the fairy tales, people want to be able to just think their minds what they want, waive their hands, and it happens. And of course the fairy tales are full of cautionary tales where you forgot to cover the edge case and then something bad happens.

Peter Seibel: Fantasia and the perils of recursion, of course.

GS: Fantasia and recursion, yes. Or, "I wish I was the richest man in the country" - well, that makes everybody else extremely poor and you're the same as you were before. That kind of thing happens in fairy tales because people forget that there's more than one way to do something. And if you just think about your main wish and don't think about the details, that leaves a lot not tied down.

PS: So the lesson from fairy tales is that the Gandalfs of the world got there by hard labor, learning the incantations, and there's no shortcut to that?

GS: Yeah.

I've never seen anyone make an analogy between these two bodies of work before, but we often do find ourselves creating software as if we were in the a fairy tale, paying little attention to what could go wrong. The entire book is loaded with great, insightful stuff like this. I encourage everyone to check it out.

There are a lot of ways to tell that a function is going to be less than great.  Perhaps the most obvious sign is the name of that function.  It's actually a lot like real life: if your name were Bedwetter McStupidpants, you're probably destined for a life of crime, insanity, and gum disease.

Here's an example: I spent several months working in a system where the critical function in the system was called... DoStuff2.  Unfortunately, I am being serious; it was a large, critical program for the energy industry and pretty much every use case you could ever draw up ran right through this bleak disaster of a function called DoStuff2.

We'll go through the specifics of just how horrible DoStuff2 was/is, but right now, I'm only looking at the name.  I don't even know where to start on the questions here.  What does DoStuff2 actually do?  And what is this stuff that it's doing this mysterious action to?  And, for the love of God, what happened to DoStuff1?

This is completely obvious, but a function name is your first and best chance to communicate the intent of your function.  If it's not clear from the name what your function does, no one will use it, the function rots, and it's quickly headed down the road of Bedwetter McStupidpants.

The whole idea of a function is to encapsulate complexity, right?  Start with that in your function name.  I don't care if, in the act of performing this function, it uses a worker thread or calls a web service or launches a nuclear strike against the people of the Arctic.  In the function name, I only want to know what it will do for me.

Let's say your function calculates a subtotal.  What should we name that?  Gee, I don't know, perhaps CalculateSubtotal?

Look at the naming convention there: it's verb then noun, and both words are as specific as possible.  It's like a command your dog would understand, if it had a slightly better grasp on vocabulary.  "Rover, eat food!  Rover, go potty!  Rover, calculate subtotal!" If you ever shouted, "Rover, do stuff!" he'd probably pee on your shoes out of spite.

The nice thing about a specific verb-noun naming convention for your functions is that the naming never gets out of control.  You never get monstrosities like PerformCalculation (every calculation in the universe?), SubtotalGenerationInit (just init?  does it ever finish?), PleaseGiveMeAubtotalOrIWillHuntYouDown-AndMakeYouWearAGorillaSuitToWork (definite style points for that one, but slightly lengthy).

Let's conclude.  Your function name should be short, meaningful, and specific; if it's not that, come up with a better name or learn to live with a horrible function.

About the Author

The Art of Delightful Software is written by Cody Powell. I'm currently Director of Engineering at TUNE here in Seattle. Before that, I worked on Amazon Video. Before that, I was CTO at Famigo, a venture-funded startup that helped families find and manage mobile content.

Twitter: @codypo
Github: codypo
LinkedIn: codypo's profile
Email: firstname + firstname lastname dot com