Welcome to the first post in “Station Wagon Full of Tapes”. In it, I write about my love-hate relationship with testing code, and about collecting metrics from every server call.
Whether we are building a small module of utility functions or a chunky stateful service, we all expect there to be tests associated with the code. Unit tests, integration tests: there are many flavors. I, for one, do not enjoy writing tests. At the start of my career I didn’t see their value either, but after a while I started to appreciate why they are needed. As with anything in our industry, there is a spectrum of beliefs. I’ve had the pleasure of working with strong engineers across that spectrum: some believe tests are useless, while others love them so much that they build their test suites before they do anything else.

Interestingly enough, across the many teams and companies I’ve worked for, testing is generally the first thing cut when the timeline is tight. And somehow it is the thing people have to be reminded to write, over and over again, during code reviews; I wonder why. When the task at hand is refactoring existing functionality I am unfamiliar with, existing unit tests have helped massively by guiding me through the code. I guess it is one of those things we appreciate only when we actually need it. Still, I find myself questioning the value of the tests I write, sometimes catching overfitted solutions that merely trick the machine into saying the code works. End users of your services, or your product’s customers, don’t care about your test suite either, although I understand that it is a companion to your product code and gives a sense of security for the long run. This is not a post about me complaining about writing tests (maybe a future one will be); it is about a different companion I’ve found in addition to testing, one that I enjoy writing.
Cloud Native Era
Recently, the switch from monolithic structures to microservice architectures created a need for increased monitoring capabilities. We build our services with strong test suites, but at the end of the day what I want to know, and what my customers care about the most, is how those services actually perform out in the wild. In the wild meaning in a Docker container shipped to some cloud region running managed Kubernetes that I don’t control or care about (unless, of course, you own your service deployments and container orchestration, in which case my thoughts and prayers are with you). Sorry, I digress. The idea of monitoring is not new, but with cloud-native services becoming the default way to go, it is more widely practiced and the open source tooling keeps growing: OpenTelemetry and Prometheus, to name a couple. Using these tools, it is easier than ever to start monitoring your applications. Some companies even have the luxury of building a team around observability and instrumentation (more about this, and how things are done at Dropbox, later).
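To show how low the barrier to entry has become, here is a minimal sketch of tracing a request handler with the OpenTelemetry Python SDK. It assumes the opentelemetry-sdk package is installed, and the handle_request function and its attributes are made up for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints spans to stdout; in a real service you would
# point the exporter at your collector or vendor backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_request(user_id: str) -> dict:
    # Each call produces a span carrying its duration plus any attributes we attach.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.id", user_id)
        return {"user": user_id, "status": "ok"}  # the real work goes here

if __name__ == "__main__":
    handle_request("12345")
```

Swap the console exporter for one pointed at your collector, and the same spans land in whatever backend your team already uses.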
Making Decisions with Data
I’ve been working as part of the growth organization at Dropbox for the past year. Before I started at the company, I had read articles about how working on a growth team might differ from a more traditional product team, but I didn’t know I would end up catching the measurement fever. There are many ways to decide whether a certain feature has a positive, neutral, or negative impact on your product. The most common one I’ve encountered in my career has been getting feedback directly from customers, if possible (you don’t have the luxury of asking how each feature is working when it is deployed to many). Some teams go with their gut feeling and do a bit of user research. Some teams don’t measure or ask their customers for feedback at all, and hope all is well. Bold.
That’s not the way we approach things at growth. To call a feature successful or not, there has to be statistically significant data supporting either side of the thesis. For any experiment, this is followed to a T. What that means for an engineer working on these experiments is, essentially, implementing logging for every user interaction that can be logged. A hover on a button? Better log that. A highlight on text in a tooltip? Yeah, it wouldn’t hurt to log that either. However, it doesn’t stop there. User interaction is not limited to DOM elements. Service response time is a key metric to monitor too. Today, not only do I log client-side events for experiment analysis, but I also wrap functions and service calls with measurements so that I can see the response time and result of any call once it is out “in the cloud”, used by millions. That is a sense of security and control I was never able to achieve by writing tests.
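I can’t go into internals here, but a rough sketch of that kind of wrapper, using the open source prometheus_client library as a stand-in for whatever metrics backend you have, looks something like this. The measured decorator and the fetch_account function are made up for illustration.

```python
import functools
import time

from prometheus_client import Counter, Histogram

def measured(name):
    """Wrap a function so every call records its latency and outcome."""
    latency = Histogram(f"{name}_seconds", f"Latency of {name} calls")
    results = Counter(f"{name}_results_total", f"Outcomes of {name} calls", ["status"])

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Record how long the call took and whether it succeeded.
                latency.observe(time.perf_counter() - start)
                results.labels(status=status).inc()
        return wrapper
    return decorator

@measured("fetch_account")
def fetch_account(account_id):
    # The real RPC or HTTP call to the backing service would go here.
    return {"id": account_id, "status": "active"}
```

Point your scraper at the process, and every call to fetch_account shows up with its latency distribution and its ok/error counts, no test harness required.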
Being able to visit a dashboard and see how each function is doing once it is deployed is so valuable that it gives me an actual sense of control over these services. Of course, it doesn’t have to stop there. One could implement alerts around response times at different latency percentiles: p75, p90, p95, or p99 (p99 meaning 99% of requests should be faster than the given latency, which is also the reason behind this domain name). Some strive for p999 (99.9%) for their services, and many hyperscale companies achieve similar numbers for uptime as well. When was the last time Gmail failed to send your email to a recipient? I don’t recall such an event.
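To make those definitions concrete, here is a tiny, self-contained sketch in plain Python, with made-up latency samples, that reads percentiles off a list of observed request latencies:

```python
import random

# Pretend these are observed request latencies, in milliseconds.
latencies_ms = sorted(random.uniform(5, 500) for _ in range(10_000))

def percentile(sorted_samples, p):
    """Return the latency that p percent of requests were faster than."""
    index = min(len(sorted_samples) - 1, int(len(sorted_samples) * p / 100))
    return sorted_samples[index]

for p in (75, 90, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.1f} ms")
```

An alert would then fire when, say, the p99 computed over a rolling window crosses the latency budget set for that endpoint.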
Reliability is a core component of modern applications, and there is no way to achieve it without proper monitoring.
How to Measure
In this post, I’ve mostly focused on measuring server responses, service calls, and their efficiency. Measuring doesn’t have to start there, though; client-side measurements are a great way to get insight into your application as well. Making measurement part of your team’s culture is how you reach a state where adding measurements, just like writing tests, is part of sprint planning when you start your next service. Now, when I am in an unfamiliar area of code, which happens frequently in a massive monorepo, I first add some measurements, deploy the updates, and see which functions are called most frequently and how they fare in terms of response times. I believe it is the best companion to the code I write, and its fruits are available on day one on a Grafana dashboard. The modules I write become the sturdiest versions of themselves when they are combined with proper observability, ready for action.
With so much open source tooling being built in this space, I encourage you to read more about it, and maybe experiment with it on your next service.
Of course, it is easier when the tooling to measure any code you write, whether in Python or Go, is already built. I am among the lucky few, and use Vortex. Here’s a very detailed blog post about the architecture of Vortex and how we monitor at Dropbox: Monitoring server applications with Vortex.
Thank you for reading “Station Wagon Full of Tapes”.