Operations curriculum

There’s a big difference between a master mechanic and the guy at Jiffy Lube who knows how to change your fluids. The same kind of gap separates assembly language, C, C++, Java, and Lisp programmers; command line and GUI users; kernel programmers and Linux users. I’ve been using various flavors of *nix for the better part of 30 years, but as I look forward to the intensity of an operations role, I realize I’ve never really gotten beyond the level of advanced beginner.

I need to come up to speed – FAST – and so, I’ve started to plan out my curriculum. First come the books – they’re the easiest piece to define, get recommendations for, and prepare ahead of time. My boss recommended the following, which puts them at the top of the list:

These are high-level books, more about getting into the right frame of mind and understanding strategic goals than the nuts-and-bolts of tracking down a replication error or mitigating a DDoS. Which is fine – I absolutely need this. But I also need to improve my low-level knowledge. Administering my own site, I learned to install OSes, Tomcat, and Apache, to handle basic database administration, and so on – but the stakes were comparatively low, and I could get by with the bare minimum. It’s time to start leveling up.

I’m planning on working my way through a couple of resources. First, I’ve picked up a copy of the UNIX and Linux System Administration Handbook. At 1300+ pages it’s a little daunting, and I don’t expect to remember how to do everything – but I do expect a) to remember what can be done, and b) to know where to find it.

Next, I’ll be going through the following websites:

Naturally, the most important part of my education will be the doing. Along these lines, one of my colleagues gave me some interesting advice. When he first joined the team, he would try to shadow-debug live-site issues in parallel with the engineers who were actually working on them. Once an issue was resolved, he’d examine the actual solution, and compare it with what he’d come up with. That’s a pretty hardcore way to get a jump on real-world experience. We’ll see how things go.

Lastly, I’ll be putting together a set of notes on what I learn. When I first joined TripAdvisor, I created and maintained a document on our internal Twiki that described everything that had taken me hours to learn, but could have been explained in a couple of sentences. Though it’s since been superseded (and gotten a bit out of date), this page ended up being used as an onboarding document for years. Whether it’s just for myself, or more generally useful, putting together this kind of cheat sheet is super helpful. I don’t want to make the same mistake twice, or have to look the same thing up twice, and if I can help others avoid the same loss of time, so much the better.

This is, of course, just the first draft of a plan that will likely not survive contact with experience. How can I improve it? I’d love to hear your suggestions!

2 thoughts on “Operations curriculum”

  1. This is probably kind of obvious, but during my first months in TA operations I had a hard time understanding what each server is for, who owns it, and what could break if I did the thing I thought I should do. Searching through the old ticketmonkey tickets and change management requests helped me a lot to understand the historical background of many of our systems.

  2. In an ops role, it’s vital to be able to iterate over plausible hypotheses about problems as rapidly as possible. :o)

    “Linux System Programming” by Robert Love gives a great flavour of what Linux can do for you. It’s like the classic Stevens book “Advanced Programming in the Unix Environment” without all the hedging about different versions and standards, and is much more concise.

    Speaking of classic Stevens books, you can’t go far wrong with “TCP/IP Illustrated, Volume 1: The Protocols” by Stevens (and now Fall in the second edition).

    If you can absorb a reasonable amount of those two (or at least know where to look), then handling the esoterica of distributed systems running on top of networked Unix boxes should become at least a little less daunting.

    Having a systems- & protocol-level map of how stuff works makes getting from “it’s broken” to “here’s why” much easier.

    On the architecture side of things, apart from what you have already in Web Operations, http://aosabook.org/ is a really amazing resource. Lots of the papers are relevant to the ops world either in describing distributed architectures or giving design background and conceptual overviews of particular software packages.

    I echo your colleague’s advice on shadowing live issues. In the same vein, does your new team already write (preferably blameless!) postmortems of serious outages? Reading them is a really useful way to get a handle on the kinds of failures you can expect to deal with, the strengths and weaknesses of the existing systems, architecture, etc.

    Other ideas and choices so far look good.

    I’ve always enjoyed your writing on teams and management, and I’m looking forward to following your writing in “my own back yard” of ops even more. There are also some (somewhat?) ops-specific team and management issues I expect you’ll run into, so I look forward to hearing your thoughts on those too. :o)

    Enjoy the intellectual firehose, and best of luck!
