The Power Of Paranoia

Fun things happen when your code lives on the Internet. Take this one component we support and maintain at TUNE. It's old and brittle, but it's a low priority problem since it's usually not on fire and only a handful of folks use it. (When you're working at scale, you must sometimes embrace that mindset.)

One day last month, this codebase went from a lovable oddball to a legitimate problem when it began alerting constantly due to failed health checks. We have thousands of instances, but this particular component drove more PagerDuty alerts than any other system company-wide over the past 4 weeks. No one knew why.

Now, this system has squawked at us from time to time in the past. Our usual approach is to patch it up and move on. I'm one of the chief perpetrators of this approach so I guess I'm obligated to defend it here. My reasoning always was, it's this weird little thing with very limited usage; there's practically no way I'll make it worse! This approach usually worked fine: the alarms would silence and we could focus on more valuable problems for the next few months.

I tried this same approach again in early January, and this time, it didn't work. I mean, it mostly worked; after my change, the alarms went from 20 a day to 2 a day. That was still just enough to cross the threshold of annoyance for our oncall engineers, though.

The Problem Was, We Didn't Know the Problem
I dug into the nginx logs to understand all the other requests that were being fired during these alerting moments and I soon realized something interesting about the logs. There were a lot of them! So many logs, in fact, that it was hard to glean anything via my usual cat / grep / awk / sort / uniq toolkit.
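For the curious, that pipeline can be sketched in a few lines of Python. This is a rough, illustrative equivalent of the cat / grep / awk / sort / uniq approach: tally the most-requested paths in an nginx access log. The regex assumes nginx's default "combined" log format; adjust it if your `log_format` directive differs.

```python
import re
from collections import Counter

# Matches the quoted request line of a combined-format access log entry,
# e.g. "GET /api/foo HTTP/1.1", and captures the request path.
REQUEST_RE = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD|OPTIONS) (?P<path>\S+)')

def top_paths(log_lines, n=10):
    """Return the n most frequently requested paths as (path, count) pairs."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts.most_common(n)
```

You'd feed it an open file handle, e.g. `top_paths(open("/var/log/nginx/access.log"))`. It works fine for a day of logs; it falls over at the "gigabytes of evidence" scale, which is exactly the problem.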

When gigabytes of evidence conflict with your own poorly-understood models of a system, it's time to reassess. That's what I did. Rather than slap another band-aid on this codebase, I spent a couple of days getting its access and error logs into our Elasticsearch + Logstash + Kibana stack. (Side note: the ELK stack's development environment is convoluted and insane. Once you finally get it into a state where you can actually debug things, you might as well go nuts and figure out how to parse lots of different logs. The end results are impressive and informative so it's probably worth the frustration.)
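If you're attempting the same migration, a minimal Logstash pipeline for nginx access logs looks something like the sketch below. The file path and Elasticsearch host are placeholders, and exact option names vary by Logstash version; the useful bit is that nginx's default log format matches Logstash's built-in COMBINEDAPACHELOG grok pattern, so you don't have to write the parsing regex yourself.

```
input {
  file { path => "/var/log/nginx/access.log" }
}
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```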

Once I was able to visualize request load and latency in Kibana, I realized that absolutely none of us understood how this codebase was being used or the load it was under. If you'd told me, prior to this, that this system was taking 2,000 requests a day, I would've believed you. In fact, I would've said, "Good job, little codebase, for handling that!" And I would've had no idea what I was talking about, because the logs showed 2 million requests a day. The crazy part isn't the load; the crazy part is that our expectations were orders of magnitude off and had been for years.

Once I was able to see that traffic, I realized a few things. First, this system was way more important to our customers than I thought and had been for a while. Second, it had been stretched way beyond its limits. Third, the thinking that led us into this conundrum had probably caused us to undervalue other components.

They're Out to Get You
If I had to draw a single conclusion from all this, it's that ignorance is not an operational strategy. Don't let a lack of clanging alarms or an absence of outraged bug reports lull you into a false sense of security. If you want to write important code that performs meaningful tasks for your customers, you can't deploy it and hope for the best. You'll never spot your problems when you're operating via blind optimism.

If ignorance isn't an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.

With properly-directed, well-intentioned paranoia, you can spot your operational problems before they become catastrophes. Focus the paranoia on your process, and ensure that delivering metrics, alarms, and runbooks is required for every major push. Focus the paranoia on your metrics and review them regularly to understand where system performance is headed. Focus the paranoia on your outages and work hard to understand, then address, the root causes of your problems. Focus the paranoia on finding the old, janky stuff that doesn't stand up to the new level of rigor.

Once you've successfully changed your mindset and are running your systems like the Internet is out to get you, you've suddenly got a fascinating new lens to view your software through. You can think systematically about quality and scale. You don't have to rely on band-aids for the problem of the week. You can sleep well, knowing that 3 AM alarms will be few and legitimate. Ironically enough, you work your way to worry-free software by harnessing the power of paranoia.

Trust the Anecdote

When I worked at Amazon, I came to dread question mark emails. Every couple of months, someone would email Jeff Bezos with a streaming video problem. He'd add a single '?' in the body and initiate a long string of increasingly-frustrated email forwards that would somehow end in my inbox. All other activities halted until my team and I knew precisely what caused this single anecdote, how we'd prevent this problem going forward, and had summed all that up into an extensively-reviewed email response. Stressful times.

After I'd gone through this experience several times, I got snippy with my boss. I said, "This is a dumb use of our time. We have metrics that show we succeed on playback 99.95% of the time. We're chasing edge cases here." (Note: I would not say that directly to Jeff Bezos.) My boss's reply was both deep and deeply confusing: "When your data disagrees with your anecdote, trust the anecdote." No explanation, just an unsatisfying impression of Confucius.

Fast forward 2 years. I'd left Amazon to join TUNE for a lot of good reasons, question mark emails being one of them. Somehow, I found myself in a familiar situation.

One day, customers overwhelmed our overseas support team around 2 AM Pacific with reports that they couldn't log into one specific platform. Here's the weird thing: we have gobs and gobs of monitors, alerts, and performance data. All of them showed that everything was just fine at that time. Latency was low, we had plenty of capacity, and there were no large-scale networking issues. "LGTM. Uh, ask them about their wifi connection?" was our shaky answer.

After the second day of these reports, I realized that I'd been presented with an anecdote that disagreed with my data. We started digging, iterating through each graph on each dependency of the platform. EC2 CPU utilization and iowait were fine, the load balancer was distributing traffic correctly, aggregate db insertion time was within acceptable variance, memcached was memcached-ing, no sun spots, Godzilla hadn't attacked US East, and so forth. We had gone through most of this data already, but as we kept digging, we eventually found a graph that pointed to a new bottleneck in the form of large-scale but short-lived MySQL contention. We cross-referenced a specific instance of contention with previous data. We saw the latency there too when we sampled at a higher rate during that brief period of contention. Eureka! We had identified a major problem and could now go solve it.

Sometimes when you trust the anecdote, you come to find that the anecdote was flat out wrong or points to a problem you can't solve. For example, one of the video streaming issues was ultimately because someone's satellite internet connection at their Mexican beach house was too slow to sustain 1080p. Sorry, man! I don't rule space or Mexico.

Sometimes though, the anecdote allows you to see your data with new eyes. These anecdotes lead you to tools that you've always had at your disposal and never realized, tools that allow you to identify and attack* these problems before the next swarm of angry emails. That's why you trust the anecdote.

* Identify and Attack is the name of my Youtube self defense course. 

Android Studio's Killer Feature

When Google announced Android Studio, I was highly intrigued. I've spent years of my life in Eclipse, and my secret hope has always been that sometime before the heat death of the universe, we'll get a better IDE for Android development. (I'm not here to drop a truth bomb on Eclipse's flaws; hating Eclipse is like hating gravity at this point.)

As an experiment, I began using Android Studio exclusively on my side projects at home. I've grown to like it quite a bit: it has a nice layout editor, I like the refactoring support, and I have fewer issues with IDE stability. However, the part of Android Studio that I like the most is its integration with Gradle, the build tool. I did not expect this.

In my previous big Android projects, I've always had two separate build processes. There's the build process within Eclipse itself, used for compiling and execution as an engineer writes code. Then there's this separate build process, usually driven by ant, to create build artifacts; these builds are often part of a continuous integration pipeline and used for testing/submission. (Sure, you could simplify by mandating that everyone builds everything all the time with ant. Here's the problem with that: Eclipse is right there! It's taunting us with its fast, easy builds.)

Having two build processes is problematic because they need to generate the same binary. That's not easy; every time you tweak your classpath or a dependency in Eclipse, you need to make the equivalent change in your build.xml. I've never found a good way to automate this, or a forcing function for synchronizing these build processes manually. As your project gets bigger and you take on more dependencies, entropy does its thing and inevitably the build processes begin to diverge. This leads to a lot of problems. QA finds a mysterious crash in the APK created by ant, which no one can reproduce in Eclipse. We change the classpath in Eclipse, then ant fails and no one realizes it until QA threatens us with pointy sticks. Working through these problems is a waste of everyone's time.

Android Studio has a great solution, which is that it uses Gradle, a first-class build system, for builds within the IDE. Need to create an APK for QA? Just use the exact same Gradle command on the command line. All of your app's dependencies and all steps within the build process are thus made explicit via your Gradle config file. There is one and only one build to rule them all.
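For illustration, a minimal build.gradle of that era might look like the following. The plugin version, package name, and dependency are made up for the example, not taken from any real project; the point is that the IDE and the command line both read this one file.

```groovy
// One build definition for both the IDE and the command line / CI.
buildscript {
    repositories { mavenCentral() }
    dependencies { classpath 'com.android.tools.build:gradle:0.12.+' }
}
apply plugin: 'com.android.application'

android {
    compileSdkVersion 19
    defaultConfig {
        applicationId "com.example.myapp"  // hypothetical package name
        minSdkVersion 14
        targetSdkVersion 19
    }
}

dependencies {
    compile 'com.squareup.okhttp:okhttp:1.6.0'  // illustrative dependency
}
```

Android Studio builds from this file, and `./gradlew assembleDebug` in CI builds from the exact same definition, so the classpath can't silently diverge the way it does between Eclipse and build.xml.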

Could you do the same thing with Eclipse? Maybe. There is a Gradle plugin for Eclipse; I spent a few hours of quality time with this and made very little progress. I can't imagine integrating that plugin across a large engineering team without a lot of cursing and adult beverages.

I think this choice in tooling is pretty indicative of the difference between Eclipse and Android Studio. With Eclipse, you can do anything, provided you get the configuration right. The odds of this occurring straight-away are low. In Android Studio, the defaults are much smarter, obviating the need for much of the configuration. It's convention over configuration all over again.

I have written many design documents in my day. In most cases, the documented design had only a passing relation to the final implementation and rapidly became a fossil on our wiki. In the best case scenario, the document closely matched the implementation and I took great pains to update it with each iteration upon that codebase. I then left the company, and they promptly ditched my entire solution regardless of the awesome doc. In short, my track record with these documents is not exactly stellar. Regardless, I still think design documents are a good idea for a couple of reasons.

The first, most obvious reason that design documents are helpful is that they're a forcing function for actually thinking deeply about software design. I am not advocating a thousand pages of sequence diagrams here. Simply laying out the components and their dependencies is helpful: it informs the sequencing of the work, forces you to think about the trade-offs involved, and allows you to form initial ideas on the size and scope of the problem. This sort of documentation won't answer all the hard questions about this particular design, but it is helpful in calling out many of those questions.

The second, sneakier yet more important reason that design documents are helpful is that they give you an opportunity to analyze an ambiguous problem and propose an elegant solution. It's hard to do that well, and it turns out this can be important for your career. Think of it this way: if you work in a growing organization, there's a bunch of good code being written. Consequently, if you want to differentiate yourself and ensure great opportunities come your way, it may take more than good code.

There are plenty of ways to differentiate yourself (mentoring, championing best practices, officiating the office ping pong tournament). I'd argue that writing and presenting effectively on software design is one of the rarest and thus one of the best. Additionally, the hardest and most important problems often have an element of design. These problems aren't just distributed randomly. If you want to work on them, it benefits you greatly to have a strong record of proposing designs and seeing them through implementation to delivery.

At this point, I have clearly convinced you (odds of this actually being the case: 48%) and you are firing up your text editor to begin your journey to Documentation City (odds of this actually being the case: 0.48%). How do you get started? My suggestion is to start small, with respect to the size of the document, scope of the problem, and amount of review. As you look at more problems in this way, you'll iterate on your approach and find a style that works well for your team and its problems.

Donut Driven Development

We had a problem: our continuous integration environment had turned into a continuous breakage environment.  Around half the time, our CI builds would fail, either due to compilation issues or failed unit tests.  Each broken build generated an email to the team, and due to their frequency and verbosity, we began to tune these emails out.  That was even worse, as it meant multiple broken builds could pile onto each other before anyone noticed.  No longer did we have to fix a single bad change; we had to unwind multiple bad changes.  And this wasn't just a build issue: each engineer would end up pulling down this broken code and break their own workspace.  It was like Voldemort himself designed this process to drive us all crazy.

In all fairness, it was very, very complicated.  We had dozens of engineers contributing to 20 packages, with lots of explicit and transitive dependencies across these packages.  Many times, the build would work in person A's workspace and fail in CI because a dependency had changed or because they had changed some implicit behavior that another package relied upon.  It wasn't easy, in short.

Initially, several of us tried to solve these broken builds the same way: nagging emails.  I sent plenty of these on the sanctity of the build, laying out high-minded build tenets and describing what a big cost it was to the team when our builds failed.  I put all of this onto a wiki that I made all new hires read.  I was seriously considering buying a sandwich board, writing 'THE BUILD APOCALYPSE IS NIGH!  RUN YOUR UNIT TESTS!' on it, and walking up and down the halls.  None of this accomplished very much.

Then, one night, I got desperate.  After a particularly horrendous streak of failures, I sent a team email saying, "Guys, the build is all screwed up and we need to fix it immediately.  If you fix the build, I'll buy you donuts."  Yep, I resorted to bribery.  I'm sure there were other motivational tactics available, but man, I work hard already and there's a donut shop right down the street.

You know what happened next?  Someone fixed the build within the hour.  Once I saw that, I made it a new build policy: if you fix a build that you didn't actually break, you get a dozen donuts the next morning.

Now, my donut incentive plan hasn't fixed all broken builds forever, but it has led to immediate improvement.  Roughly every week, people stay late to fix a broken build specifically so they can claim their donuts.  Contrast that to the previous scenario, where people theoretically would've stayed late to fix a build in order to save team productivity.  That happened once a year, maybe.  Science shows that the donut incentive plan is 5200% more effective in preventing broken builds.

I work with talented people who are paid well; they don't need me to buy them donuts.  And yet, once that tiny, tangible incentive was there, behavior changed dramatically.  What does all of this mean?  Why did this work better than lots of emails about how important successful builds are to team productivity?  I... don't know.  Maybe it's the fact that it's a small sign of appreciation, something which can be shared with teammates.  Maybe it's the fact that the donut incentive is visible, kinda funny, and has since become baked into team culture.  Then again, maybe people just really like donuts.

There are bigger solutions here, like simplifying the build process, and we're working on all of that.  In the meantime though, our existing process need to continue to work.  If you find your team in a similar situation, look to the donut shop.
I was having coffee with a new technical manager recently, and he asked an interesting question.  He said, "I just got assigned to lead this great team, but I don't know how to build trust with them.  How do you do that?"

I believe it's actually easy to earn trust as a manager, provided you understand a few very important things.  It's the team, not you, who contributes the key, valuable actions behind great software: writing, reviewing, and designing code.  The people on your team are way better at this than you are, and they have far more context.  As a result, your team's contributions are much more important than your personal contribution.

What's your job then?  Simply put, your job is to facilitate those key, valuable actions.  You're not doing the work; you're empowering your team to do the work.  Your fancy job title could be translated into Software Helper.  (It's a humbling realization, but we'll get through this together.)  Understanding your role and communicating it to the team is the first step to building trust.

Okay, so if you're really a Software Helper, how do you do that?  You ensure the team has the data it needs to make great software decisions: think priorities and deadlines, performance requirements, external dependencies, feature roadmap, etc.  You find people who are blocked and connect them with people who can unblock them.  You find teeny, trivial workplace improvements (e.g., Person A wants a track pad instead of a mouse) and pursue them aggressively.  You find where the bodies are buried in your team's codebase, get input from the team on how to address this technical debt, and you ensure the team gets time in the schedule to make these changes.  You find every possible opportunity to let your boss, boss's boss, and boss's boss's boss know about all the great work your team (again, not you) is doing.  You constantly ask, "How can I help?"

If you think of yourself as a capital-m Manager, then you'll obsess over the management part and begin to focus on the wrong things, like how do I get to manage a bigger team, how do I manage higher-profile projects, and how do I get fancier words into my job title.  Your goals are no longer aligned with the team's; why would the team trust someone like that?  Contrast that with thinking like a Software Helper, where your goal is to build something great, just like all of the other engineers.

Being a Software Helper is hard work, and it requires a lot of vigilance and organization.  It's not as hard (or as valuable) as building the actual software, though.  Once you realize that, you start to build trust.

Tragedy of the Common Library

Good intentions can go awry quickly in the world of software development. Imagine this common scenario: there are multiple teams building related projects at the same company. At some point, someone realizes that these teams tend to generate a lot of duplicate code; why not just create a common library that all the teams can reuse? Everyone agrees it's a brilliant plan, and so code begins to shift from each team's codebase into the common library.

Fast forward to two years later. Now everybody is using the common library to solve a lot of hard problems. That's good! At the same time, people would rather have a tickle fight with a komodo dragon than actually wade into the common code and make a significant change to the existing logic (author's note: tickle fight with a komodo dragon is not a euphemism). Why is everyone so afraid of the common code? Since a bunch of different teams have touched the common code, it's a giant mishmash of conflicting coding standards and duplicate abstractions. Even worse, all products now depend on it. The common library has become a Jenga tower, growing taller and more wobbly with each change. Everyone is now afraid to make sweeping changes in there, lest they send the tower tumbling down.

How did the common code get into this state? Well, it's because we're humans and that's what we do: make a mess of our common places.

This problem is well known in certain circles as the tragedy of the commons. Wikipedia describes the problem as "the depletion of a shared resource by individuals, acting independently and rationally according to each one's self-interest, despite their understanding that depleting the common resource is contrary to the group's long-term best interests." This theory has been applied to problems like population growth, pollution, traffic jams, and now, janked-up codebases.

So, if this is a common problem because we humans are a bunch of dumb dumbs, what's the solution here? Well, this guy Ronald Coase proposed a solution that won him the Nobel Prize in Economics, and it's actually relevant to our problem. Coase theorized that, if property rights are well defined and negotiation costs are low, then just by assigning property rights, the interested parties will negotiate their way to a solution to the negative side effects.

How would you apply that fancy book learnin' to the common library? Well, you'd start by splitting the common library up into smaller packages, with an established owner for each package. Then, you make it easy for the teams to communicate and negotiate changes to these newly split-up packages.

Let's say that Team A owns the common Logging package and Team B suddenly wants a new feature in that library. In the olden days, Team B would've just hacked this up. Since it's not their codebase, they'd move fast to get this in, leaving few comments and no tests. After all, that's what everybody does in the common library and they know there's no one to call them out on it.

That won't happen here, though, since Team A is now married to this codebase. Team A could propose something like, "You guys do the work, then we do the code review and we'll own this long term." Team B could then make a counter-offer, like "How about you make the change, and in exchange, we'll implement your Feature Request X in the service client library that we own?"

You've now gone from a massive codebase with zero owners to a bunch of small codebases, each with a motivated owner driven to ensure quality throughout the lifetime of that code.

As it turns out, ownership and fast, easy communication solve lots of software problems. Let's do more of these things!

A few other people have written on this topic before me. Good thinking, folks!

It's Not Refactoring, It's Untangling

Recently, I was catching up with a former colleague.  He mentioned a service that I wrote years ago, and how it has since become known as the Career Killer.  Basically everyone who touched the Career Killer ended up leaving the company.  If the company wanted to have > 0 developers, the only solution at this point was to take a few months and refactor this service completely.

I have two things to say about this.  First, that code was at 85% unit test coverage when I left so don't go blaming me.  Second, this huge refactoring?  It's not going to work.

Every codebase has at least one component that is widely hated and feared.  It does too much, it has too many states, too many other entities call it.  When it comes time to pay down technical debt, you should definitely focus on this component.  However, if you have an incomplete understanding of this component and you stop everything to completely rewrite it, your odds of success are low.  That component, as scary and complex as it appears, is actually way more scary and complex than you think. 

How do you think that component got into this unfortunate shape?  Is it because the company hired a nincompoop and let him run wild in the codebase for years?  Or is it because the component was originally a sound abstraction, but its scope of responsibilities had grown over the years due to changing requirements?  (For the sake of my ego, I'm hoping the Career Killer is the latter.)  In all likelihood, this component arrived at its current, scary state via smart people with good intentions.  You know what you are right now?  Smart people with good intentions.  If you proceed with a big refactor, you'll trade one form of technical debt for another.

In order to truly pay this debt down, you need to untangle the complexity around the problem.  You need to spend time looking at all the clients calling this component.  You need to spend time talking with your colleagues, learning more about the component's history and how it's used.  You need to make a few simplifying changes around the periphery of the component and see what works.  Each week, you spend a little more time and untangle the problem just a little bit more.  Given a long enough timeframe, you'll eventually untangle all of the complexity and bring a teeny bit of order to the universe.

Practically speaking, what do you do here?  Rather than spending 3 full months on a complete refactor, spend 25% of your time over the next year.  It's the same time commitment either way, but with the 25% plan, you get time to analyze and plan.  You get time to untangle.

Ship It!

Several years ago, I had a job that, at the time, seemed like heaven.  We were a new team building a new product.  We were using new technology: C# 2.0 (yes, people were once excited about major releases of C#).  We were using new techniques, like scrum and test-driven development.  It was greenfield development in every possible sense, except for the one where our desks would actually be situated in a green field.

I lived in this environment for a few years.  I learned a lot about software development, technical leadership, and how to build big systems.  Ultimately though, I think I wasted those years.  Why?  We never shipped.

Something magical happens when you ship software: your decisions suddenly have consequences.  You suddenly must consider trade-offs.  Hopefully, people suddenly care.  If they don't, you suddenly must correct that.

What's the big deal about decisions and consequences?  Any fool with a text editor can write code, but only an amazing few can code and make good choices around trade-offs.  That's the most valuable skill a developer can possess: the ability to make hard decisions.  (That's actually a great way to make career choices: opt for the place that'll let you make harder decisions.)  Like riding a bike or juggling chainsaws, the only way to get good at making hard decisions is by doing it a lot.  Each time you make one of these decisions, gather data and iterate accordingly.

When I was working on that project that never shipped, I felt like I was making hard decisions.  We had big meetings and loud arguments about things that seemed important at the time.  You can bet your sweet bippy that we came to conclusions on all sorts of things.  However, since we never shipped, we never got any data about any of the choices we made around things like architecture, code coverage, implementation decisions, featureset, and user interface.  Without that data, we had no way of knowing if we had chosen correctly.  Did we get better at making hard decisions?  Without users, how could you tell?

There was an easy solution to the problem I faced at that job: ship the dang thing.  That decision wasn't up to me, so I should've done the next best thing: join a different team, one that shipped a ton of code.  Even if the codebase is worse and the product is less interesting, find a role where you ship; it's the only way you get better.

Software Karma

I make a lot of jokes at work about code review karma.  Here's the idea: each time a person volunteers to review others' code, that person builds their code review karma.  Then, when it comes time for that person's own code to be reviewed, the reviews go smoothly due to the store of code review karma.

Build karma works the same way.  When someone jumps in to help fix the build, they're accruing build karma.  Thanks to build karma, the builds that kick off from that person's own commits are far more likely to be successful.

If you have a lot of time on your hands, you can actually see software karma all over the place: QA karma by helping out your testers, refactoring karma by cleaning up the scariest bits of the code base, planning karma by leading sprint planning, etc.  The more you help, the more the software gods reward you in the future.

I see two explanations behind software karma.  The first is that there actually are software gods who sit upon Mount Codelympus and keep a running tally of all these helpful acts.  They probably keep this tally in Emacs, since it was clearly not designed for mortals.  If there is any truth to this explanation, then let's all quickly sacrifice a 'PHP for Dummies' book to appease them.

The other explanation is less exciting, but slightly more practical.  By reviewing others' code, you build a mental model around what great code looks like.  When you go to write your own code, you apply your model and reap the rewards immediately.  By fixing broken builds, you build a mental model around the build process and how it can go wrong; your own builds are far less likely to go wrong thanks to that model.  Same thing goes for helping QA, refactoring, sprint planning, and so forth.

I'm not taking sides on which explanation is correct, since I don't want to get struck by a lightning bolt or incite a plague.  All I do know is this: if you want to get better as a developer, boost your karma.

About the Author

The Art of Delightful Software is written by Cody Powell. I'm currently Director of Engineering at TUNE here in Seattle. Before that, I worked on Amazon Video. Before that, I was CTO at Famigo, a venture-funded startup that helped families find and manage mobile content.

Twitter: @codypo
Github: codypo
LinkedIn: codypo's profile
Email: firstname + firstname lastname dot com
