Phil Booth

Existing by coincidence, programming deliberately

Showing how poor performance affects user behaviour

"Bad performance hurts user engagement" is a sentiment that feels intuitively true but can be hard to sell in product conversations. The best way to persuade a non-believer is to point them at hard evidence and for that you need to do a few things with your performance data.

Firstly, don't store it in a silo. That might sound obvious but silos occur incidentally for all kinds of organisational reasons that can be hard to resolve. Conway's law is real and causes inertia. If you want to draw meaningful inferences from your data, it must be possible to link it to other events that track user behaviour.
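A minimal sketch of what that linkage can look like (the event shapes and the flow_id key here are illustrative, not the real Firefox Accounts events): the only thing that matters is that the performance event and the behavioural events carry the same identifier, so they can be joined later.

```python
import time
import uuid

# Hypothetical event emitter: in practice this might be a logging pipeline,
# a message queue or an HTTP endpoint; here it just prints the event.
def emit(event: dict) -> None:
    print(event)

# One identifier per user "flow", shared by every event in that flow.
flow_id = uuid.uuid4().hex

# Performance event, e.g. derived from timing data collected on the client.
emit({
    "type": "perf.page_load",
    "flow_id": flow_id,
    "load_time_ms": 1834,
    "time": time.time(),
})

# Behavioural event, emitted when the same user later signs in successfully.
emit({
    "type": "account.signin.success",
    "flow_id": flow_id,
    "time": time.time(),
})
```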

Consider the following graph (numbers redacted):

A chart that shows user engagement on Firefox Accounts is positively correlated with page load performance

Here, the x-axis is time and the y-axis is the count of users that successfully signed in to Firefox Accounts. The different coloured lines represent evenly-sized cohorts of users, grouped according to how quickly the initial page loaded. If web performance had no effect on user behaviour, those lines would roughly overlay each other. Instead, they show that the 10% of users who experienced the fastest initial page load were about twice as likely to sign in to their account as the 10% of users who experienced the slowest. If you're having trouble convincing others to invest engineering effort in improving performance, graphs like this can help. And they're only possible if you can directly link performance data to user events.
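The shape of the calculation behind a chart like that is roughly as follows. Field names are illustrative, and in practice it would be a query over warehouse tables rather than a Python loop:

```python
from collections import defaultdict
from statistics import quantiles

# Joined records, one per user flow: initial page load time plus whether
# that flow went on to sign in. These values are made up.
records = [
    {"day": "2016-10-03", "load_time_ms": 900, "signed_in": True},
    {"day": "2016-10-03", "load_time_ms": 4200, "signed_in": False},
    {"day": "2016-10-04", "load_time_ms": 1600, "signed_in": True},
]

# Decile boundaries across all load times: 9 cut points, 10 evenly-sized cohorts.
cuts = quantiles([r["load_time_ms"] for r in records], n=10)

def cohort(load_time_ms: float) -> int:
    # 0 is the fastest 10% of flows, 9 is the slowest 10%.
    return sum(load_time_ms > cut for cut in cuts)

# Successful sign-ins per day, per cohort -- one line on the chart per cohort.
signins = defaultdict(int)
for r in records:
    if r["signed_in"]:
        signins[(r["day"], cohort(r["load_time_ms"]))] += 1

print(dict(signins))
```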

Secondly, be careful if you're using a time-series database like Graphite. These are often configured to gradually expire data, which is good for storage costs but bad for precision. Expiry works by rolling your metrics up into progressively coarser-grained averages as they get older, which smooths away detail and can produce strange jumps at the bucket boundaries when rendered in a chart. People looking at those jumps might mistake them for genuine performance changes.
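To make the precision problem concrete, here's a toy example of what that kind of roll-up does. The numbers are invented and this has nothing to do with any particular retention configuration; the point is the shape:

```python
# The same series, averaged into progressively coarser buckets, loses its shape.
raw = [100] * 50 + [2000] * 10  # a short burst of very slow page loads

def rollup(series, bucket_size):
    # Average each fixed-size bucket, the way retention policies typically do.
    # Assumes the series length divides evenly into buckets.
    return [
        sum(series[i:i + bucket_size]) / bucket_size
        for i in range(0, len(series), bucket_size)
    ]

print(max(raw))              # 2000 -- the slowdown is obvious in the raw data
print(max(rollup(raw, 10)))  # 2000 -- still visible at this granularity
print(max(rollup(raw, 30)))  # ~733 -- the burst has been averaged away
```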

Instead of a time-series database, consider storing your data somewhere that can handle it en masse. We use Redshift for this, maintaining parallel datasets that trade history length for sample rate. We have an unsampled set that contains 3 months of history, a 50%-sampled set that goes back 6 months and a 10%-sampled set that includes 2 years. After the initial sampling is performed for each set, we keep all of the data until it expires. Data is automatically popped from the end of each set as we push it at the start. Limiting the size of the sets keeps query times reasonable. In practice, we rarely need to consult the sampled histories but it's good to have them available for occasional longer-term analysis and maintaining them in parallel means we can always compare like with like. Medians and percentiles are computed on the fly inside queries, which can be scheduled to run at regular intervals in the background so we don't have to wait for them every time we want to view a chart.
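A minimal sketch of how that kind of consistent, ingest-time sampling decision can work (the dataset names and the hashing scheme here are purely illustrative):

```python
import hashlib

# Each dataset pairs a sample rate with a retention window, mirroring the
# sets described above.
DATASETS = [
    ("events_3m_full", 1.0),   # unsampled, 3 months of history
    ("events_6m_50pc", 0.5),   # 50% sample, 6 months of history
    ("events_2y_10pc", 0.1),   # 10% sample, 2 years of history
]

def sample_key(flow_id: str) -> float:
    # Hash the flow id to a stable number in [0, 1), so the same flow is
    # always included or excluded: sample once, then keep everything.
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def destinations(flow_id: str) -> list[str]:
    key = sample_key(flow_id)
    return [name for name, rate in DATASETS if key < rate]

print(destinations("some-flow-id"))
```

Making the decision deterministic per flow means the sampled sets stay consistent with each other, so comparing like with like remains possible across all three.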

Finally, produce multiple visualisations of your data. For instance, the chart I showed earlier has a partner that maps the percentile boundaries to concrete timings:

A chart that maps performance percentiles to concrete timing values

This is useful, but it's a bit annoying to look at a separate chart when you want to know the timing value. So we have another chart that tries to combine the work of both into a single view:

A chart that shows the percent success in concrete timing bands

Here, each band covers a one-second range of load times and the y-axis is the percentage of users in each band that successfully signed in to their account. It tells the same story as before, with some extra spikiness due to the interaction between weekend behaviour and performance. Backing up an argument from one chart with evidence from another makes our case more compelling.
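Again, a rough sketch of the shape of that calculation, with illustrative field names and data:

```python
from collections import defaultdict

# Joined records: initial page load time plus whether the flow signed in.
records = [
    {"load_time_ms": 900, "signed_in": True},
    {"load_time_ms": 2300, "signed_in": False},
    {"load_time_ms": 2600, "signed_in": True},
]

BAND_MS = 1000  # each band covers one second of load time

def band(load_time_ms: float) -> int:
    return int(load_time_ms // BAND_MS)  # 0 -> 0-1s, 1 -> 1-2s, 2 -> 2-3s, ...

totals, successes = defaultdict(int), defaultdict(int)
for r in records:
    b = band(r["load_time_ms"])
    totals[b] += 1
    successes[b] += r["signed_in"]

# Percent of users in each band who went on to sign in -- the y-axis above.
success_rate = {b: 100 * successes[b] / totals[b] for b in sorted(totals)}
print(success_rate)
```

Counting the totals per band instead of the success rates gives the distribution chart described next.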

Having multiple visualisations also helps make our dashboards useful for other purposes, like keeping an eye on performance regressions and tracking how we're doing over time. Here are the same 1-second timing bands rendered to show how many people are in each:

A chart that shows the distribution of users within concrete timing bands

Ideally, we want to see users moving from the red bands at the top of the chart towards the green bands at the bottom. If the opposite occurs, we're doing something wrong and we know how that affects user behaviour.