I sometimes have dreams where someone’s trying to shoot me. They shoot, and I run, and I don’t look back, and – lucky me! – they miss. Such is the logic of dreams.
Working in product development can sometimes feel like that. You add a feature with a big new data store, and – poof! – an extra gigabyte of heap is magically allocated. Your site gets DDoS’d, and someone else wakes up in the middle of the night to do whatever it is they do when your site gets DDoS’d. Other sites get hacked, and you smile nervously, hoping that someone else is taking care of it for your site. Bandwidth costs. Rack space. Critical firewall firmware upgrades. Ongoing infrastructure upgrades and maintenance. Things that need to happen in the dead of night, during a maintenance window. Things that require real people to move physical objects in The Real World™.
Of course you know that all these things exist, but it’s easy to forget when the bullets keep missing. This is almost entirely by design. Good fences make good neighbors, and a well-defined sandbox makes a productive engineer. When I was writing video games, we did most of our PlayStation 2 development and testing on Windows with jury-rigged controllers – only the microcode engineers needed the expensive dev stations; the rest of us were just solving platform-agnostic issues. Effective engineers are able to limit the number of variables they consider, and effective engineering organizations provide tools to make the process easy and natural.
Life as a stagehand in operations is different, of course. Your entire job can be divided into a handful of buckets:
- Internal customer support – the endless torrent of requests from people on the other side of the scrim
- Planning ahead – making sure you have enough equipment for traffic growth, new hires, new features, etc.
- Automation / tool development – building stuff that makes everyone’s lives easier (especially yours)
- Dealing with disaster – swooping in and making things right when stuff breaks
- Preventing disaster – doing the hard, thankless, invisible, essential work of anticipating and preventing problems
The last point is the key, and where most developers spend their time running from bullets. How many resources do you really need for site security? Do you really need a DRBD pair with RAID (i.e., 2x servers and 4x hard drives)? Can you push that aging server for one more year? Can you get by with the cheaper, less reliable vendor?
The answer to all these questions is, of course, “it depends.” But for the same reason that you don’t realize how good you feel every day until you get sick, no one notices how well things are working until they break. And if nothing breaks, then it’s perfectly reasonable for your boss to ask why you’ve been spending so much time on this stuff. Maybe you could have spent more time in tool development, or working through the ticket queue. After all, you could have had twice as many people and not had a better result – could you have gotten by with half as many? And it’s important to realize – this is a reasonable question. You need to be asking yourself the same question, constantly.
It’s easy to quantify and graph some things over time (e.g., build time, critical Nagios alerts, uptime, ticket count). Other things, not so much (e.g., things that didn’t break because of positive action on your part). There’s a strong temptation on everyone’s part to prioritize concrete, easily measurable, highly visible tasks. But while everyone understands the danger of bullets after they’ve been hit, the challenge is figuring out ways to communicate the importance of the background tasks, and creating a space in which you can prioritize them over their sexier cousins, without waiting for a disaster to prove your point. Sometimes you can create a narrative for a larger goal. Sometimes you have well-reasoned arguments for a particular task. Sometimes you just have a gut feeling. You can fight, or you can run.