I’ve been working in DevOps for almost exactly one year now. Different companies call their teams different things, but at TripAdvisor we divide Operations into the following buckets:
- TechOps: Live site hardware. This includes managing the data centers, racking and kickstarting servers, setting up firewalls, routers, load-balancers, networking, DNS, bandwidth, certs, and so on.
- SiteOps: Application software on the live site. This includes managing the release process, dealing with emergent issues, failovers, log transfer, and overall site reliability.
- Warehouse: Live site databases and our Hadoop / Hive cluster.
- CorpIT: Corporate helpdesk support.
- CorpIT Ops: Internal networking, video conferencing, email servers, etc.
- QA: Manual QA.
- DevOps: A little of all of the above, specifically for the developers. This includes managing the internal staging environments, development servers and databases, build system, CI, VCSs, Jira, etc.
When I first joined, DevOps was primarily focused on keeping the lights on – the team had only been formed about six months earlier, and they spent most of their time working on tickets, jumping on emergencies, and upgrading the infrastructure to stay one step ahead of the next disaster. Fast forward twelve months, and while we still devote the equivalent of about one and a half developers to tickets and emergent issues, we’re also much more focused on project work.
In the beginning, there was an ongoing debate within the team about what our mission should be. Spending most of our time fixing stuff behind the scenes? Working on big, highly visible projects? Clearly, if everything was constantly falling to pieces we’d soon be looking for new jobs. On the other hand, if all we could claim at the end of the quarter was that – due solely to our efforts – the dev environment hadn’t spontaneously achieved intelligence and enslaved humanity, we’d likewise be updating our resumes and catching up on Game of Thrones.
When you check into a hotel, you expect your key to work in the door, the sheets to be clean, and the room to be pest-free. The hotel doesn’t get any points for these, in the same way that you don’t typically notice walls that don’t bleed and a lack of ominous voices screaming “GET OUT!” These are what might quaintly be termed “bare minimum requirements.”
Likewise, any job has a set of non-negotiable criteria for non-failure. Operations, in particular, is tasked with keeping the trains running on time – if the site (or in our case, the developer environment) goes down, then nothing else you’ve done matters.
There comes a point, though, at which what you’re doing is Good Enough™. The rapidly increasing incremental effort to slide a diminishing fraction of a percentage point up an asymptotic curve is almost never the best use of your time (unless you’re in aerospace, in which case I and everyone I love are very grateful for your pathological attention to detail) – code needs to ship, and wisdom is knowing when enough is enough.
For us, the first step was to assess the situation. We started collecting metrics across a wide surface area: build breakages and their causes, automated email, build times, causes for merge failures, commits, and so on. Some of this turned out to be useful, some not, but it was fairly easy to capture once it became a priority. We also sent questionnaires out to the developers, trying to figure out their biggest sources of frustration.
Over the next ten months we worked to improve the development environment for remote engineers, to reduce the flood of email (still too much, but significantly less than it was), and to dramatically reduce build time. We reduced build breakages with new tools that enabled a small but important cultural shift, and made significant security and automation improvements to our infrastructure. We continued to track old metrics (“hey, why did the build time spike 2 minutes as of revision X?”), and added many new ones. Not everything worked out as well as we’d hoped, and any TripAdvisor engineers reading this will know that things are far from perfect, but they’re in a much better state than they were twelve months ago.
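To make the build-time question above a little more concrete, here is a minimal sketch of how a spike could be flagged automatically. It is not our actual tooling: the function name, the revision labels, the sample timings, and the window and threshold values are all made up for illustration. It simply compares each build against the median of the few builds before it.

```python
from statistics import median


def find_build_time_spikes(builds, window=5, threshold_seconds=120):
    """Flag builds that are much slower than the builds just before them.

    `builds` is a list of (revision, build_seconds) tuples in commit order.
    A build is flagged when it exceeds the median of the previous `window`
    builds by at least `threshold_seconds`. Both defaults are illustrative
    assumptions, not values from any real system.
    """
    spikes = []
    for i in range(window, len(builds)):
        revision, seconds = builds[i]
        # Median of the preceding window is less sensitive to one-off outliers
        # than a mean would be.
        baseline = median(s for _, s in builds[i - window:i])
        if seconds - baseline >= threshold_seconds:
            spikes.append((revision, seconds, baseline))
    return spikes


# Hypothetical sample data: (revision, build time in seconds).
builds = [
    ("r100", 600), ("r101", 610), ("r102", 595), ("r103", 605),
    ("r104", 600), ("r105", 730), ("r106", 735),
]

for revision, seconds, baseline in find_build_time_spikes(builds):
    print(f"{revision}: {seconds}s vs. baseline {baseline}s")
```

The first flagged revision is the natural place to start looking, since later builds tend to inherit whatever slowed that one down.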
The pendulum swings. We’d started out with too much focus on dealing with emergencies, and not enough forward-thinking projects. Over the past six months we may have moved a bit more than we should have toward project work. Next quarter there’ll be the big headquarters move, and then…? We’ll see.
As for me, I feel a bit like I’ve moved to a different part of town. Everything’s in the same place, but old hangouts are far away, and opportunities for new experiences are all around. Sometimes this means thinking deep thoughts about risk management, developer needs, and long-range project planning; other times it means hunkering down in the server room and shivering in the HVAC while replacing hard drives, memory sticks, and controller batteries. Or waking up to a 3 am page and digging through ancient csh and Python scripts to figure out why an automated database refresh failed.
No one likes being woken up in the middle of the night, and we’d all prefer that things Just Worked™. But they don’t, and when they break, someone needs to jump out of bed and take a look. Looking back on the past year, it’s been a hell of an education. I’m lucky to work with some pretty uniformly amazing and nice people – thanks, everyone, for easing my way.
While the SiteOps team is responsible for the overall release process, release engineers are sprinkled in teams throughout the organization. This is one factor that helps avoid an “us vs. them” adversarial relationship between Operations and the various other engineering teams.