Startup operations priorities

One of the things about being in operations is that it introduces you to a whole host of concerns, features, and disciplines that you’d otherwise take completely for granted. In the same way that you rarely think about the work that goes into maintaining the sidewalk outside your house, or the massive international infrastructure devoted to transporting your new iPhone from the factory loading dock to your house, most of what happens in ops is invisible until it breaks.

At startups, though, operations is almost always an also-ran – when you’re trying to iterate into a viable business strategy, expending effort to get to five 9’s and high availability is generally a poor use of time. Too little attention, of course, can be disastrous, and developing tooling early can be a huge accelerator. That got me thinking: what are the areas in which big companies invest and startups typically ignore?

Quick productivity wins

If you aren’t thinking about your tool chain, you can miss out on a lot of easy wins. Some of these are particularly useful to add at the beginning, as they’ll have the additional benefit of enforcing positive cultural norms.

  • Static analysis / lint

Setting your compiler to treat warnings as errors is easiest to do before you have tens of thousands of warnings. Same with static analysis and lint tools. Putting this in early, and creating a culture of fixing them at the point of insertion, can remove a whole class of errors from your code base. Don’t believe me? Believe him:

The most important thing I have done as a programmer in recent years is to aggressively pursue static code analysis. Even more valuable than the hundreds of serious bugs I have prevented with it is the change in mindset about the way I view software reliability and code quality.

…I feel the success that we have had with code analysis has been clear enough that I will say plainly it is irresponsible to not use it.

– John Carmack, In Depth: Static Code Analysis
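Python has no compiler warnings in the C sense, but the same "warnings are errors" habit can be approximated at runtime with the standard `warnings` module. The sketch below is only an analogue of the compiler setting described above; `legacy_price` and `call_strictly` are hypothetical names made up for illustration.

```python
import warnings


def legacy_price(cents: int) -> float:
    """Hypothetical deprecated helper, used here only to trigger a warning."""
    warnings.warn("legacy_price is deprecated", DeprecationWarning, stacklevel=2)
    return cents / 100


def call_strictly(func, *args):
    """Call func with every warning promoted to an exception."""
    with warnings.catch_warnings():
        # "error" turns any warning raised inside this block into an exception.
        warnings.simplefilter("error")
        return func(*args)
```

Running a test suite with `python -W error` applies the same policy to the whole process, which is much easier to switch on while the warning count is still zero.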

  • Continuous Integration

It takes a couple of minutes to set up a Jenkins server and get it to trigger a new build on every commit. Minutes now can save a lot of downstream debugging time.

  • Project management software

At TripAdvisor, we recently made the switch over to Jira to track all projects, tickets, and tasks. This is another tool that takes almost no time to set up[1], and is relatively inexpensive. If kanban is more your style, or if you’re into free, there’s always Trello.

  • Internal documentation

Confluence, Twiki, whatever – I recommend standardizing on one method of internal documentation early. Confluence is slick, but as long as the tool meets your needs, it should be fine.

Revision control

It’s the 21st century, and you should be using git, Subversion, or Mercurial. ‘Nuff said.

Security

Most engineers aren’t security experts and, even with good intentions, aren’t constantly keeping up with the latest threats, required OS and software upgrades, etc. It’s one thing to pay attention to the Hacker News front page and hope to notice when a key piece of software turns out to be vulnerable (e.g., OpenSSL and Heartbleed) – it’s another to get daily security alerts, and prioritize them to keep your users and business safe. One good place to start is here.

Backups

It’s sad that we even have to be having this conversation, but some of you aren’t backing up your critical data. Code repos, databases, email, corporate documents, and so on – all of this needs to be stored someplace safe and secure. Your hard drives could crash, someone could steal your computers, you could get hit with ransomware, your office could burn to the ground… The possibilities are endless, so make sure you’re backing up your data responsibly.
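As a rough illustration of "backing up responsibly", here is a minimal Python sketch that archives a directory and records a checksum, so you can later confirm the archive hasn’t been corrupted. The function names, the tar.gz format, and the SHA-256 choice are my assumptions, not a prescribed process. It also assumes the archive is written outside the directory being backed up.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_backup(source_dir: Path, archive_path: Path) -> str:
    """Write source_dir into a gzip-compressed tar archive and return its checksum."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=source_dir.name)
    return sha256_of(archive_path)


def verify_backup(archive_path: Path, expected_sha256: str) -> bool:
    """Check that the archive matches its recorded checksum and can be listed."""
    if sha256_of(archive_path) != expected_sha256:
        return False
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return len(tar.getnames()) > 0
```

A checksum only proves the archive is intact. It still needs to be copied somewhere other than the machine (or office) it came from, and a restore should be tried every so often.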

Disaster recovery plan

How would you recover from a database crash? What would you do if you were hit with a DDoS? Or had a catastrophically bad release? What if your entire site went down due to a cut fiber optic cable, or a bandwidth provider going bankrupt, or an AWS region going down (as has happened, multiple times)? How fast can you switch your DNS? Maybe the answer is to develop an N+1 solution. Or maybe that’s too expensive, and you’re willing to live with the risk. Either way, these are conversations you should have, document, and revisit regularly.

Monitoring

You need to start monitoring your site and traffic right from the start. First off, you won’t know if the site goes down, has blips in service, or starts to see significantly higher load after a release, unless you’re tracking uptime and performance. Secondly, this data creates a set of historical trends that you can use to identify problems, as well as for capacity planning. If you know that you typically have three times the traffic between Thanksgiving and Christmas as over the summer, then you can plan ahead for additional servers before you get swamped. Thirdly, this can give you valuable insights into user behavior.
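To make the capacity-planning point concrete, here’s a back-of-the-envelope sketch in Python. Only the 3x seasonal multiplier comes from the example above. The function name, the 25% headroom, and the traffic and per-server numbers are invented for illustration; your own monitoring data would supply the real ones.

```python
import math


def servers_needed(baseline_rps: float, seasonal_multiplier: float,
                   rps_per_server: float, headroom: float = 0.25) -> int:
    """Estimate the server count for a seasonal peak, with spare headroom."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    # Expected peak load, padded so the fleet is not running at 100%.
    peak_rps = baseline_rps * seasonal_multiplier * (1 + headroom)
    return math.ceil(peak_rps / rps_per_server)


# Hypothetical numbers: 200 requests/sec over the summer, 3x in the
# holiday season, and 150 requests/sec per server.
print(servers_needed(200, 3, 150))  # prints 5
```

The arithmetic is trivial. What matters is that the inputs come from historical trends you can only have if you started collecting them early.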

Automation / runbooks

When everything feels like it’s falling to pieces, it’s easy to treat every task as a one-off, get it done, and move on to the next catastrophe. This can make sense in real emergencies. During normal times, though, you almost always find yourself repeating tasks or needing to explain them to a teammate, and then a consistent, reproducible process is crucial, no matter how small your company.

Can you do a build in one step? How about a release? How many steps does it take to provision, set up, and deploy a new server? Are your servers cattle or pets? Are you dependent on cron jobs, or are you putting everything into Jenkins? How hard would it be for someone else to pick up your tasks? In an emergency, you need to do whatever’s necessary to dig yourself out of the hole – but when the crisis is over, you automate. If you can’t automate certain steps (e.g., if you need to interact with a GUI), create a runbook. All common activities, no matter how involved, should be repeatable and – even if done manually – reduced to a set of steps in a well-defined script. Sometimes human intervention will be required, but having a script to follow will dramatically reduce human error and time spent. This can also be done in preparation for certain kinds of crises – you might not know why it will happen, but rest assured that one day your site will go down, and having a script to follow in the midst of disaster will be hugely valuable.
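One way to picture a runbook that mixes automated and manual steps is the small Python sketch below. It’s an illustration under my own assumptions, not a reference to any particular tool: `Step` and `run_runbook` are hypothetical names, and a manual step is simply one with no `action` attached, which the operator has to confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    # None marks a manual step (e.g. clicking through a GUI).
    action: Optional[Callable[[], None]] = None


def run_runbook(steps: List[Step], confirm=input, log=print) -> int:
    """Run steps in order and return how many were completed."""
    for number, step in enumerate(steps, start=1):
        log(f"[{number}/{len(steps)}] {step.description}")
        if step.action is not None:
            step.action()  # automated step: just run it
        else:
            answer = confirm("Done? [y/N] ")
            if answer.strip().lower() != "y":
                log("Stopping: manual step not confirmed.")
                return number - 1
    return len(steps)
```

Even when every step is manual, having the steps written down in order, with a record of where you stopped, is most of the value. Automating a step later just means filling in its `action`.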

Metrics

If you’re on the business side, you probably care a lot about top-line revenue and bottom-line profit, traffic numbers, cost of user acquisition vs. lifetime value, and so on. As an engineer, you probably care a lot about live site errors, failed automated jobs, time to complete a build, critical pages, and Nagios alerts. Everyone has a set of metrics they care about, and gathering and surfacing a set of previously opaque metrics is frequently a terrifying experience. Our servers take how long to start up? Our page load time is what? Our revenue on Tuesday after the release was how much lower? Defining, tracking, and surfacing key metrics (along with historical data for context) is critically important. The business side gets this, but engineering frequently needs a nudge to think through and track their own numbers.

Scalability

Yes, you need to worry about scalability, but in the beginning it’s number ten on a five-item list. It’s not that you should aggressively and intentionally choose architectures and technologies that will need to be thrown away within the year, but you shouldn’t be worrying about your millionth user from day one. Don’t provision hundreds of server instances, or architect a massive sharded database to handle the billions of rows you’ll have one day. Set up a strictly enforced interface for accessing data, but don’t worry about putting the data into a service tier until you need to. You may need to completely rewrite your code base at some point (oh, Twitter), but hopefully this will be one of those “good problems to have.”
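A “strictly enforced interface for accessing data” might look something like the following Python sketch. All of the names (`UserStore`, `InMemoryUserStore`, `rename_user`) are hypothetical. The idea is only that application code talks to the interface, so the storage behind it can later move to a database or a service tier without touching callers.

```python
from typing import Dict, Optional, Protocol


class UserStore(Protocol):
    """The only way application code is allowed to reach user data."""

    def get(self, user_id: str) -> Optional[dict]: ...

    def put(self, user_id: str, record: dict) -> None: ...


class InMemoryUserStore:
    """Day-one implementation; a database-backed class can replace it later."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def get(self, user_id: str) -> Optional[dict]:
        row = self._rows.get(user_id)
        # Return a copy so callers cannot mutate stored data behind our back.
        return dict(row) if row is not None else None

    def put(self, user_id: str, record: dict) -> None:
        self._rows[user_id] = dict(record)


def rename_user(store: UserStore, user_id: str, new_name: str) -> bool:
    """Application logic depends on the interface, not on the storage."""
    record = store.get(user_id)
    if record is None:
        return False
    record["name"] = new_name
    store.put(user_id, record)
    return True
```

Nothing here scales, and that’s the point: the interface is cheap to set up now, and it keeps the expensive decision deferrable.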

Non-operations priorities

And, because I can’t resist, here are a couple more things you should be thinking about early.

  • QA consistency

This might include a list of key browsers, platforms, and detailed flows, a test schedule, and a set of automated tests (unit, integration, UI, smoke tests, etc.). If you don’t have a plan, then your QA will be haphazard and miss major pieces of functionality on browsers and platforms your developers don’t use.

  • A/B testing infrastructure

One of the most important early features you can develop is the ability to easily differentiate users, separate them into different buckets (or “slices”), and identify behavior based on slice. Whether it’s that “orange buttons have 10% greater conversion than green buttons”, or “version 2 of landing page X reduces revenue by Y% but increases time on site by Z minutes,” you need to be able to run tests as early and often as possible. The easier you make it and the faster you can surface the data, the quicker you’ll be able to iterate.
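A common way to do the bucketing is to hash a stable user ID, so the same user always lands in the same slice without storing any assignment. The Python sketch below is one possible approach, not a description of any particular A/B framework. The function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def assign_slice(user_id: str, experiment: str, slices: List[str]) -> str:
    """Deterministically map a user to one slice of an experiment."""
    if not slices:
        raise ValueError("slices must not be empty")
    # Mixing in the experiment name lets the same user land in different
    # slices across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(slices)
    return slices[index]


def conversion_by_slice(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the conversion rate per slice from (slice, converted) pairs."""
    seen: Counter = Counter()
    converted: Counter = Counter()
    for name, did_convert in events:
        seen[name] += 1
        if did_convert:
            converted[name] += 1
    return {name: converted[name] / seen[name] for name in seen}
```

With something like `assign_slice(user_id, "button-color", ["orange", "green"])` in the request path and the slice name logged alongside each conversion event, the “orange vs. green” question becomes a simple aggregation.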

  • SEO

Thinking about SEO early is pretty important. First, you don’t want to do something stupid and get penalized. Second, starting good SEO habits early will remove a potentially pernicious source of friction in adoption.

  • Recruiting

Recruiting is sales, and sales needs advertisements, cold calls, networking, follow-up, and follow-through. You can’t let the advertisements lapse, or let candidates (or recruiters!) grow cold. Consistency here is key.

A final note to the weekend hackers

So all of this is great, but… So what? You aren’t in a startup, you’re just fooling around at home with your own projects. Well, why not set yourself up with Jira to track your personal goals? Why not set up Jenkins to do regular builds? Or Confluence to document your processes?

I know, I know, because you don’t have to. The same way you don’t have to use source control or make backups, right? Oh yeah… The point is, maintaining this software at home is a great way to gain familiarity with the tools, and all are either FOSS or dirt cheap (Jira is the most expensive, at $10-20 for 10 user licenses). You don’t have to go whole hog and start maintaining your own DNS servers[2], but there’s no reason to intentionally avoid quick wins at home.


[1] Of course, you can spend as much time as you’d like tweaking and tuning Jira. But I use it at home (really!), and it was a snap to get up and running.
[2] One of my friends has been administering his own DNS servers at home for over a decade, and you can be pretty sure he knows the ins and outs of how DNS works, common issues, etc.

4 thoughts on “Startup operations priorities”

  1. I agree with everything. I am starting to use more and more of these practices in my small pet projects too (for example, I love Travis and AppVeyor for CI on Linux and Windows).

    However I am not sure about: “It’s the 21st century, and you should be using git, Subversion, or Mercurial. ‘Nuff said.”

    Subversion seems a bit out of place :)

  2. Very nice post again. \m/
    But I think if you are using an Atlassian product such as JIRA (a very powerful tool for issue tracking), I would suggest using Bamboo for continuous integration in place of Jenkins. Bamboo is more advanced than other continuous integration tools such as Jenkins and Hudson.

  3. By the way, you didn’t mention database management and change management, which are very important in an organisation and for developers too.
