It’s trite, but true – when I first started at HelloShopper (né Scratch), I was overwhelmed trying to come up to speed on a new company, new team, new commute, new set of technologies, new code base, new tools, new operational infrastructure, new risks, new, well… Everything. The existing team had put together a website and mobile app, and had started to rewrite the site from scratch in React between the time I signed and started (a post for another time). The last thing on my mind was the occasional – and fairly rare – ticket to add new analytics events.
Fast forward six months, and our reporting framework had multiple serious problems. Of course, this coincided with an increased urgency in providing accurate, consistent data, and so began the process of re-engineering the plane in flight, trying to get a more comprehensive view while fixing the sins of the past. Playing catch-up meant setting up a logging infrastructure, tracking events consistently, identifying duplicate and missing data, logging all events to persistent logs, and finally ETLing those logs into a Redshift cluster.
In retrospect, the problem wasn’t that we hadn’t prioritized analytics tasks. Rather, we hadn’t been treating analytics as a first-class citizen, worthy of significant thought and a systematized approach. Instead of treating each task as a one-off, we should have designed our system up-front. This didn’t necessarily need a lot of complicated machinery, and wouldn’t necessarily have taken a lot of time – we just needed to take some time to think about the problem, set up some fairly simple structures, get our idioms right, and put the right framework in place. More concretely:
- Set up a robust server-side logging system, including log-rolling and archival
- Define a standard format for event data, and (more generally) for log lines
- Implement a standard method for reporting front-end events
- Monitor the logs and the archiving process, so that failures are caught before data is lost
(You’ll notice that I didn’t mention ETL – as long as your logs are complete and in a safe place, you can prioritize this whenever you want)
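As a rough illustration of the "standard format" bullet, here is a minimal sketch of one event per log line, serialized as JSON. The field names (`event_id`, `source`, `properties`, and so on) and the helpers `make_event` and `to_log_line` are assumptions made up for this example, not the schema we actually used.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("event_id", "timestamp", "source", "name", "properties")


def make_event(name, source, properties=None, user_id=None):
    """Build an event dict with the same shape for every caller."""
    return {
        "event_id": str(uuid.uuid4()),  # unique id so later jobs can spot duplicates
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,  # e.g. "web", "mobile", "backend"
        "name": name,
        "user_id": user_id,
        "properties": properties or {},
    }


def to_log_line(event):
    """Serialize an event as a single JSON line, rejecting incomplete events."""
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise ValueError(f"event is missing fields: {missing}")
    # json.dumps escapes newlines inside strings, so the result is always one line.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))
```

If the browser, the mobile app, and the backend all go through one builder like this, every consumer of the logs can rely on the same fields being present.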
Once we started considering the problem from an architectural standpoint, implementation wasn’t difficult – but it was painful and time-consuming to switch from the old system to the new.
So, where did we end up? Here’s what our current logging infrastructure looks like:
The client machine makes page and API requests to the server, reporting user events through a standard API call. We write everything out to log files, which are renamed, gzipped, and copied to S3 on an hourly basis. An ETL process runs a couple of minutes later, processing the uploaded data and inserting it into a set of tables in Redshift.
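The rename-and-gzip step can be sketched with Python's standard `logging.handlers.TimedRotatingFileHandler`, which accepts custom `namer` and `rotator` hooks. This is an illustration under assumptions rather than our actual setup: `upload_to_archive` is a hypothetical placeholder for the copy to S3, and the logger name `events` is made up.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def upload_to_archive(path):
    """Hypothetical placeholder: copy `path` to S3 or other durable storage."""


def gzip_rotator(source, dest):
    """Compress the rolled-over file to `dest`, then remove the original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)
    upload_to_archive(dest)


def build_event_logger(log_path):
    """Return a logger that rolls its file every hour and gzips the old one."""
    handler = logging.handlers.TimedRotatingFileHandler(
        log_path, when="H", interval=1, utc=True
    )
    handler.namer = lambda name: name + ".gz"  # rolled files get a .gz suffix
    handler.rotator = gzip_rotator
    handler.setFormatter(logging.Formatter("%(message)s"))  # one raw line per event
    logger = logging.getLogger("events")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

One caveat with this handler: it only checks whether a rollover is due when a record is written, so a quiet server will not roll its file exactly on the hour, and it is not safe for several processes writing to the same file.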
There are a number of advantages to this system:
- Simple, easy to understand, consistent flow
- The logs are the source of truth
- The ETL process can be enhanced, or bugs fixed, then re-run on old logs
- All processing is handled offline
And two significant disadvantages:
- At present, there’s a lag of up to an hour between events and reporting; this means that we can’t use this for monitoring (especially post-release), or for diagnosing emergent issues
- If a backend server goes down, we lose all logs that haven’t been archived (this is due to our use of AWS auto-scaling, which terminates and re-provisions servers that become unresponsive)
(Both of these problems, of course, can be at least partially mitigated by reducing the log-rolling/archiving time, and this is part of the plan)
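The "re-run on old logs" advantage is easiest to keep when the extract step is a pure function of the archived files. Here is a minimal sketch, assuming gzipped files with one JSON object per line and a unique `event_id` field on each event; both the field name and the functions `parse_log_file` and `extract_events` are assumptions for illustration. The load into Redshift is left out.

```python
import gzip
import json


def parse_log_file(path):
    """Yield one event dict per valid JSON line in a gzipped log file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines; the raw log still has them
            if isinstance(event, dict):
                yield event


def extract_events(paths):
    """Return events from all files, keeping the first copy of each event_id."""
    seen = set()
    events = []
    for path in sorted(paths):  # fixed order, so reruns give identical output
        for event in parse_log_file(path):
            event_id = event.get("event_id")  # assumed unique per event
            if event_id is None or event_id in seen:
                continue
            seen.add(event_id)
            events.append(event)
    return events
```

Because the output depends only on the files passed in, fixing a bug in the parsing and running it again over the same archive yields a corrected, duplicate-free result.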
We experimented with an ELK stack (Elasticsearch, Logstash, Kibana), which avoided both the reporting lag and the loss of unarchived logs – log lines were (almost) instantaneously transferred into an Elasticsearch instance, giving us the ability to quickly search for errors, monitor the site post-release, and avoid data loss in the case of a server going offline. Which all sounds great, except that we didn’t have any Elasticsearch expertise on the team, weren’t comfortable using Kibana, and were bitten several times by operational issues.
We also sent event data through Segment (and still do, to some degree, though we’ve been slowly phasing that out). It was an easy way to get things set up, but in some ways it was also the cause of later woes: it made it easy for us to avoid thinking through our requirements and setting up an appropriate code and data infrastructure. It also created a fair amount of confusion – we were sending events directly to Segment from both the browser and the backend, and sometimes ended up with duplicate or inconsistent data. Because it was streaming, there was no persistent source of truth. And, of course, we couldn’t go back and re-ETL the original log data, since it hadn’t been captured consistently.
As I look back on the past year, I think of this as perhaps my biggest mistake. It’s one of the things that’s completely off your radar when you’re at a big company – unless it’s your specific responsibility, logs get logged, aggregated, ETL’d, and made available in the analytics/monitoring platform as part of the background operations magic. Likewise, when you’re starting from scratch, it’s easy to overlook until it’s a problem – at which point it’s almost certain to cost you with interest.