Why big companies slow down, and what to do about it

Every company optimizes for something. Sometimes this is an external measure – price, quality, security, customer service, etc. Sometimes it’s internal – hiring, project/risk/change management, and so on. Of course every company wants to do everything perfectly, but when you look at how they make decisions, there’s usually a central organizing theory.

At TripAdvisor, the mantra has always been “speed wins.” The idea being that if you can get a feature out, test it, and iterate faster than the competition, then you’ll always be ten steps ahead of them. This was true when the company had ten people, and true now with 2500. So as we’ve grown, we’ve naturally asked the question, “how can we avoid slowing down? how can we move even faster?” We ask this all the time – at the coffee machine, in 1:1s, at management offsites, at the company meeting. Sure, we care about quality, usability, customer support, all that other stuff – but we figure we’ll get there by being fast.

The bigger you get, of course, the harder this becomes. Consider how things change as an engineering organization grows:

  • Solo

When it’s just you, the codebase is small, you choose the technologies, you know where everything is, and you never have to worry about communicating your knowledge or stepping on someone else’s toes. There are no meetings, no emails, and no distractions.

  • Small

At this stage, the codebase is still small enough and uniform enough that it’s possible for engineers to know all major systems deeply. There generally isn’t much legacy code, and if good decisions were made early on, technical debt will be limited. Everyone knows everyone else’s responsibilities and abilities, so it’s easy to get help when you run into problems. Individuals need to coordinate to some degree, but collisions and redundant work are fairly rare. Most email is personally relevant, and the team is small enough that individual workspace preferences can typically be accommodated. Having other people around can be intensely motivating, but it can also be distracting. Architectural decisions are made by an alpha nerd, in a hallway conversation, or at the relatively rare formal meeting.

  • Medium

At some point, having one big group of engineers is just too much, and developers get divided into teams. These teams can be project-based, technology-based, or divided between logical areas of a single product. Now you have groups that are working in the same codebase, and may be able to dip into each other’s code, but they’ll be less efficient and more likely to break things when they do. The engineers still all know each other (the company isn’t that big – yet), and generally know what other people do. Email volume increases, and a growing percentage of the email is no longer personally relevant. More people are involved in technical discussions, and meetings between members of different teams are increasingly common. Some teams may choose to use different technologies (sometimes for good reasons, sometimes not), which increases the learning curve when other engineers try to understand, debug, and/or enhance their code.

  • Large

When a company starts getting big, you move from knowing everyone, to knowing some people only by their email addresses, to not knowing other people at all. Teams proliferate, people move between teams, and it’s hard to keep track of who’s doing what, who has responsibility for what, and whom to contact when there’s a problem. The codebase is too big for any one person to know completely, and old-timers’ memories of why things work the way they do might be dangerously out of date. Technical debt piles up – all of the bad decisions of the past slowly exacting a tax on every project. Management overhead increases significantly, with the org chart stratifying and filling up with team leads, tech managers, directors, and VPs. A large percentage of your email is now completely irrelevant, and time is spent creating filtering programs, or going through mark-and-sweep email reduction campaigns. An embarrassing amount of time is spent in meetings – 1:1s, code design reviews, interviews, feedback sessions, technical training, team meetings, stand ups, etc.

  • Huge

When a company is huge, you no longer know all of the teams, what they do, who’s on them, what technologies they use, or where they’re physically located. There’s no longer a single architect, or a single guiding vision behind the technology – different divisions might use completely different technologies, frameworks, server topologies, databases, hosting solutions, methodologies, etc. Changes in parts of the codebase the team doesn’t even know about can cause catastrophic failures. Politics plays an increasing role in the technical decision-making process. Because you don’t know people, it’s harder to trust their judgment, so it takes longer to get to decisions. Meetings, email, and reporting requirements take an increasingly large percentage of everyone’s time. Individual workspaces become smaller, louder, and generally less developer-friendly.

Here are some problems that get worse as a company grows:

  • Technical debt increases
  • Developers spend more time working in legacy code
  • Developers spend more time working in unfamiliar code
  • Unintended side effects of code changes become more frequent, and potentially more dangerous
  • Not everyone knows each other, so it’s harder to get support
  • People don’t know all the teams
  • Technologies proliferate
  • Irrelevant email volume increases
  • The number of meetings increases
  • Even if the number of bugs per developer remains constant (which is unlikely), the absolute number will continue to climb – catastrophic failures become more likely
  • Even if the ratio of senior to junior developers remains constant (which is unlikely), the absolute number of junior programmers will grow faster than the number of senior programmers

These are tough problems to solve, and it’s easy to feel fatalistic about some of them. But what if your company were optimizing for speed? What would you do if maintaining speed were the key cultural touchstone at your company, the foundation that would drive usability, quality, revenue, and even employee happiness?

If speed were really your goal, you wouldn’t just hope that it would happen on its own. You wouldn’t just try to communicate a set of cultural expectations. You would set up the organization to support it.

  • Hiring

It goes without saying that you would need to maintain a high bar in terms of technical talent. Mediocrity exacts its own punishing tax on an organization, and avoiding that is almost always far more important than adding capacity. But hiring well is a basic minimum requirement for speed, not a solution in itself. Having great people is a necessary precondition, but you won’t move fast if they spend most of their time fighting against your processes and culture.

  • Standardize on technologies

Allowing individuals and teams to use whatever languages and technologies they want – with no oversight – creates an entirely avoidable set of issues when people try to dive into alien code. Do you really need a combination of bash, csh, perl, multiple versions of python, and ruby scripts? Is there a reason why some people have to be on Windows, others on Macs, still others on a variety of subtly different Linux distros? Standardizing on one IDE, one major coding language, one scripting language, one database, one OS, one VCS, and so on will save you time in tooling, documentation, chasing down obscure bugs, training, and maintenance.

When taken to extremes, this can become bureaucratic and unnecessarily restrictive. The point of standardization isn’t to straitjacket your engineers into The One True Approach, it’s to speed things up by reducing the complexity of the development environment. If writing an internal tool in a rapid development framework like Rails is going to be significantly faster than doing the same thing in C++ or Java (which it almost certainly will be), then there needs to be a way for an engineer to make this case (while also taking into account knowledge transfer, maintenance costs, etc.). But then do you also need to introduce Django, Clojure, Meteor, etc.? You need to create a system that’s flexible enough to allow the introduction of new tools, while simultaneously protecting itself from undisciplined proliferation.

  • Tours of duty

Giving people the opportunity to move between teams (either permanently or for limited swaps) will dramatically decrease the impact of a larger codebase, use of different technologies, and not knowing whom to ask for help. If team X has someone who’s been on team Y, then that developer will be able to help with design questions, do code reviews, troubleshoot bugs, and make introductions when X needs to dip into Y’s codebase.

Tours of duty should ideally be three to twelve months long. Too short a period and the engineer won’t have time to integrate into the team, learn how their system works, understand their codebase, and get to net positive productivity. This is incredibly demoralizing, and leads to an unfair negative impression of the engineer’s performance on the host team. It takes time for someone to come up to speed, and in the short run, both teams are going to move slower. Tours of duty have to be seen as investments, not quick wins.

  • Modularization of codebase

A legacy codebase is frequently a mass of spaghetti code that exacts a tax on every project. The older a codebase gets, the more hidden assumptions lie in wait, and the more you need to know to make even mundane changes safely. Automated tests and documentation can help mitigate this to some degree, but the most common case is that the code you’re modifying was written by some long-gone engineer who, while brilliant, wasn’t particularly skilled in this particular coding language, didn’t feel the need to document, and harbored a deep-seated resentment toward whitespace and multi-letter variable names.

Breaking your codebase into multiple pieces with a well-defined dependency graph is a tempting option, but it can easily balloon into a wholesale rewrite, and rewriting an entire codebase from scratch has a well-deserved reputation for being a one-way ticket into the valley of the shadow of 90% (because that’s where you’ll be, for years). If you do need to rewrite, the best you can do is to attack some well-defined section of the code, with the criteria that 1) rewriting only that one section will have significant benefits, 2) the timeline is limited, 3) scope creep is aggressively beaten back, and 4) the project can stand on its own.
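Short of a rewrite, the incremental version of this is to make the module boundaries and the dependency graph explicit, and let the build enforce them. As a minimal sketch – assuming Java 9+ and its module system, with module and package names invented purely for illustration, not taken from any real codebase:

    // reviews-core/src/main/java/module-info.java
    // A hypothetical "reviews" module carved out of a larger codebase.
    module com.example.reviews.core {
        // The only packages other teams can compile against.
        exports com.example.reviews.api;

        // Dependencies are declared explicitly, so the dependency graph is
        // visible and enforced by the compiler rather than living in
        // someone's head (or in a wiki page from 2009).
        requires com.example.common.model;
        requires java.sql;
    }

Anything not exported stays internal to the module, which keeps other teams from quietly depending on implementation details you’ll want to change later.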

And then, there’s Amazon. A couple of years back, Steve Yegge wrote an entertaining, brutal, fascinating blog post about Amazon that made the following key point:

So one day Jeff Bezos issued a mandate. He’s doing that all the time, of course, and people scramble like ants being pounded with a rubber mallet whenever it happens. But on one occasion — back around 2002 I think, plus or minus a year — he issued a mandate that was so out there, so huge and eye-bulgingly ponderous, that it made all of his other mandates look like unsolicited peer bonuses.

His Big Mandate went something along these lines:

  1. All teams will henceforth expose their data and functionality through service interfaces.
  2. Teams must communicate with each other through these interfaces.
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter. Bezos doesn’t care.
  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn’t do this will be fired.
  7. Thank you; have a nice day!

Yegge hypothesized that Bezos was already envisioning cloud services, and who knows? Maybe he was. But when he artificially created this complete separation between teams, forcing each team to work completely independently, only communicating with other teams’ code via well-defined, documented APIs, each team’s output was suddenly no different from an online service or open source library. He had effectively fractured his huge organization into many small organizations that were able to work independently.
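To make the “service interfaces only” rule concrete, here’s a minimal sketch of what calling another team’s functionality might look like under it – going through their published interface instead of reading their data store directly. The team, endpoint, and class names are hypothetical, and the sketch assumes a plain HTTP/JSON interface and Java 11+ for the built-in HttpClient; it illustrates the pattern, not anyone’s actual API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Hypothetical client a "search" team might use to talk to an "inventory" team.
    public class HotelAvailabilityClient {
        private final HttpClient http = HttpClient.newHttpClient();
        private final String baseUrl; // e.g. "https://inventory.internal.example.com"

        public HotelAvailabilityClient(String baseUrl) {
            this.baseUrl = baseUrl;
        }

        // Instead of a direct read of the inventory team's data store, we go through
        // their published interface. If they change their schema, we don't care, as
        // long as the interface contract holds.
        public String availabilityJson(long hotelId) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "/api/v1/hotels/" + hotelId + "/availability"))
                    .header("Accept", "application/json")
                    .GET()
                    .build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException("inventory service returned " + response.statusCode());
            }
            return response.body();
        }
    }

The HTTP plumbing isn’t the point – the point is that the only thing the two teams share is the interface, which is exactly what lets them move independently.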

There’s an up-front cost to doing this, of course – you lose some per-engineer productivity to the overhead of the additional requirements. But that beats the coordination cost of N engineers working in the same massive codebase and trying not to step on each other’s toes – a cost that grows much faster than N. [If any of my friends at Amazon are reading this, I’d love to hear your perspective!]

  • Better workspaces

I’ve covered this elsewhere, but decades of research clearly specify the kind of workplace environment that optimizes developer productivity. Smaller organizations tend to have a more flexible culture in this regard. Once a company gets to a certain size, it has to start standardizing rules about who gets offices, windows, bigger cubes, etc. Productivity decisions end up being made based on architectural trends, visual design, and magical thinking. An organization that wants to move fast uses data, not gut checks, when designing this crucial element.

In some cases, culture can be more powerful than environment. The Google Dublin offices have an open plan (which research shows is actively inimical to productivity), but the development organization has inoculated itself by creating a “library rules” culture: if you need to talk to someone, you (quietly) let them know, and find a conference room. That was almost certainly not the architect’s original intent, but the engineers didn’t allow a pessimal environment to control their destiny.

  • Create a culture of relevant email, rare meetings, and non-interruption

If you let people set up automated jobs that spam the entire development organization with irrelevant messages, this will become the norm, and accepted. If people are encouraged to schedule meetings to discuss topics that could just as easily be handled over email, then that’s what will happen. If it’s common practice to communicate via instant messages, then your engineers will be far more likely to work in a constant state of interruption, never getting into flow.

  • Invest in the long-term health of your codebase

Technical debt kills productivity. Investing time getting rid of dead code and bad decisions pays dividends later on in speed and developer happiness.

Improving the tooling can have a dramatic effect on productivity. Whether it’s obviating, automating, or optimizing common tasks, the less time developers have to spend on repetitive work, or waiting for tools to finish running, the more likely they are to stay in flow, stay in their seats, stay happy, and get more done. Setting up continuous integration will help you find bugs faster. Automating your configuration with Puppet, Chef, etc. will enforce baseline assumptions about the development environment. Faster builds will shorten the code-compile-test cycle.

Automated tests to guarantee a routine’s contract are the opposite of technical debt. They’re formalized, enforced documentation. They may take more time to build up front, but they prevent errors from happening downstream, when they’ll take more time to fix.
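As a minimal sketch of what “formalized, enforced documentation” can look like – assuming JUnit 5 on the classpath, with a made-up PriceFormatter routine included inline so the snippet stands on its own:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertThrows;

    import java.util.Locale;

    import org.junit.jupiter.api.Test;

    class PriceFormatterTest {

        // Hypothetical routine under test: render a price with two decimal places.
        static class PriceFormatter {
            static String format(double amount, String currency) {
                if (amount < 0) {
                    throw new IllegalArgumentException("price must be non-negative");
                }
                String symbol = "USD".equals(currency) ? "$" : currency + " ";
                return String.format(Locale.US, "%s%.2f", symbol, amount);
            }
        }

        // The contract, written down as an executable check instead of a comment that can rot.
        @Test
        void formatsWholeDollarAmountsWithCents() {
            assertEquals("$120.00", PriceFormatter.format(120.0, "USD"));
        }

        // Failure behavior is part of the contract too - the test documents and enforces it.
        @Test
        void rejectsNegativePrices() {
            assertThrows(IllegalArgumentException.class,
                    () -> PriceFormatter.format(-1.0, "USD"));
        }
    }

The next engineer who touches the formatting logic finds out immediately if they’ve broken the contract, rather than finding out from a customer.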

  • Social work

Look for opportunities to get people from different teams together. Introducing people, helping them to understand what other teams do, and connecting faces with email addresses will help get people over the hump of not knowing whom to contact for help. It will also improve the level of discourse. You’re less likely to be harsh over email with someone you’ve met face-to-face than with someone you know solely through your mail client. High-trust environments have fewer frictional losses, and are able to move faster.

  • Speed up the feedback cycle

The better your metrics, and the faster developers can see the results of their work, the faster and more effectively they’ll be able to move. TripAdvisor recently went from releasing once per week (with the occasional mid-week patch) to releasing twice a day.

There are, of course, a couple of teensy problems with all of this.

  • Short-term incentives

You’re almost always incentivized to sacrifice large, invisible long-term benefits for small, visible short-term gains. Yes, you could increase future productivity by retiring some technical debt, but that would take time during which you weren’t shipping well-defined, measurably successful features. You could train an engineer through a 3-6 month swap, but your team would have lower productivity during some of that time. You could ship a project in N days, or you could write some automated tests and ship it in k*N (k > 1.0), earning the wrath of your boss and business partners. “That’s ok,” you say, “we’ll just build in some extra time when working out the schedule.” Which is great, but when your project starts to slip, non-user-facing “features” like automated tests will be the first things to be cut. No one will tell you to do this – you’ll do it yourself, in an attempt to get back on track.

  • Friendly fire

It’s easy to shoot yourself in the foot. You start with the best of intentions, then end up micro-managing. If someone wants to use emacs instead of vi, or IntelliJ instead of Visual Studio, or Debian instead of CentOS, or csh instead of bash, what’s the big deal? Some people like where they’re at, and aren’t interested in tours of duty. Everyone may agree that a situation needs to be fixed, then fight all possible solutions tooth and nail. Change is risky, both technically and socially. It’s hard to find a good balance.

  • We could never do that here

One of the fundamental rules of negotiation is that you shouldn’t start by negotiating with yourself. Yet that’s exactly our default approach (“I can’t get A, so I’ll ask for 0.05*A. Meh, maybe I’ll just keep my head down and things will change on their own”). Instead of thinking about changes that would be possible, people tend to focus on why things won’t work.

Small successes buy you the credibility and good will needed to fight for bigger changes. You can let yourself get fatalistic and demoralized about all the things that you can’t change, or you can look for small, concrete, measurable wins, then iterate.

Comments

  1. One of the aspects of this that I think about often is how to justify the ROI on infrastructure projects. I can work out the numbers – let’s say we will spend 160 hours building some tool, and then it will save 2 hours per week indefinitely. That’s a savings of 50*2=100 hours in the first year from a 160-hour investment, or an ROI of 62%, give or take depending on compounding. I feel like I should be able to make a case that a 62% return is pretty good and we should do the project. Even if it takes twice as long as expected to build, it’s still a respectable return at 30%. If I had authority over the task list, I would do it. Yet I have never worked anywhere that the conversations around which projects are worth doing happen in these terms. My expectation would have been that at companies which have a culture of prioritizing projects in terms of metrics, this would be a natural conversation, but in practice it hasn’t been. What’s your experience?
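     A quick sketch of that arithmetic, with the same made-up numbers (160 hours to build, 2 hours saved per week, roughly 50 working weeks per year):

         // Back-of-the-envelope return on a hypothetical internal tooling project.
         public class ToolingRoi {
             public static void main(String[] args) {
                 double hoursToBuild = 160;
                 double hoursSavedPerWeek = 2;
                 double weeksPerYear = 50;

                 double firstYearSavings = hoursSavedPerWeek * weeksPerYear;         // 100 hours
                 double firstYearReturn = firstYearSavings / hoursToBuild;            // 0.625, i.e. ~62%
                 double returnIfBuildDoubles = firstYearSavings / (2 * hoursToBuild); // ~31%

                 System.out.printf("First-year return: %.1f%%%n", firstYearReturn * 100);
                 System.out.printf("If the build takes twice as long: %.1f%%%n", returnIfBuildDoubles * 100);
             }
         }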

    • It’s a good point – I’ve never had the experience of the decision-making process explicitly focusing on ROI. For projects with easily measured ROI (e.g., improved tooling with well-defined speed improvements), and even harder ones (e.g., reduction of time spent on bugs due to refactoring of a particularly buggy section of code), you can collect metrics, plot over time, and show pretty graphs that could make the case. For refactorings where it isn’t really possible to measure until afterwards (e.g., switching to a different language, refactoring major sections of the code to be easier to work in), I don’t think you can get to an honest answer. Likewise, refactoring the code so that it’s less painful to work in should improve developer satisfaction, reduce attrition, etc., but how do you capture that?

      Big time estimates scare engineering management, especially for projects with no intermediate milestones – and with good reason. Even when you’re hoping for a big benefit, that benefit is long-term and speculative, while the cost is immediate and real. So small projects building into larger projects are almost always the best way to sell things. In my experience, decisions are almost always made based on a combination of gut check (“do I think this will help?”), trust in the engineer/team involved, and either an emergency (e.g., Twitter’s fail whale), a slow period (end of year, time between projects), or a successful prototype done in someone’s free time.
