milesbxf.net

On Call Rotations People Actually Want to Join


A version of this post first appeared in the book 97 Things Every SRE Should Know. Check it out - there are some wonderful essays in there!

Think about the first time you joined on-call. Were you excited? Why not?

Our industry seems to have accepted that on-call is a necessary evil. Thousands of devs and SREs put themselves through misery and burnout because it’s “part of the job”.

Does it have to be that way? At Monzo, on-call is popular, and when some spaces in our on-call rotation opened up we filled them from a waiting list. But why is creating a good on-call experience and popular rotation so important to us, and how did we get to this point?

On-callers are human. This is what makes on-call so powerful; when safety systems, resilient architecture, and automated remediation stop working, no machine even comes close to matching the capability of a human to react and adapt to a novel failure in a complex system. Unlike machines, though, humans cannot withstand 24/7 uptime or sustained 100% CPU usage. Burnout sucks: it sucks for the people around the person burning out, it sucks for the company losing a smart and capable engineer, but most of all it sucks for the person themselves. Effective on-call is also humane on-call.

Firstly, people should be adequately incentivized. Many on-callers aren’t paid at all [1]; it is only fair that on-callers are compensated for the added burden, responsibility, stress and disruption to normal life that the pager imposes. People are also motivated by the opportunity to progress technically and learn much more about the systems they work on. At Monzo, we encourage and reward this by including on-call behaviours in our progression framework [2].

It’s also important to reduce the pain of being on-call. Reducing the frequency of getting paged is the obvious place to start; whilst we’ll never practically achieve 100% reliability (or there wouldn’t be a need for on-call at all!), we can at least reduce the number of noisy alerts and failures that could have been dealt with by automation. At Monzo we treat every page as an exceptional circumstance; if no action is required, then we tweak thresholds or even delete alerts.
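To make the "tweak thresholds or even delete alerts" review concrete, here is a minimal sketch of how a team might go through its page history and flag alerts that rarely needed any action. It is purely illustrative and not a description of Monzo's tooling: the `Page` record, the `noisy_alerts` function, the alert names and the cut-off values are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert: str             # name of the alert that fired (hypothetical field)
    action_required: bool  # did the on-caller actually have to do anything?


def noisy_alerts(pages, max_actionable_ratio=0.5, min_pages=3):
    """Return (alert, total pages, actionable ratio) for alerts that rarely needed action."""
    counts = defaultdict(lambda: [0, 0])  # alert -> [total pages, actionable pages]
    for page in pages:
        counts[page.alert][0] += 1
        if page.action_required:
            counts[page.alert][1] += 1

    flagged = []
    for alert, (total, actionable) in counts.items():
        ratio = actionable / total
        # Only flag alerts with enough history and a low share of actionable pages
        if total >= min_pages and ratio <= max_actionable_ratio:
            flagged.append((alert, total, ratio))
    # Least actionable first; ties broken by the most pages
    return sorted(flagged, key=lambda item: (item[2], -item[1]))


# Made-up page history, for illustration only
history = [
    Page("disk-usage-warning", False),
    Page("disk-usage-warning", False),
    Page("disk-usage-warning", False),
    Page("payments-error-rate", True),
    Page("payments-error-rate", True),
    Page("payments-error-rate", False),
    Page("payments-error-rate", True),
]

for alert, total, ratio in noisy_alerts(history):
    print(f"{alert}: {total} pages, {ratio:.0%} needed action")
```

With this made-up history, only `disk-usage-warning` is flagged: it paged three times and never needed action, so it is a candidate for a threshold change or deletion. The cut-offs are arbitrary starting points, and a real review would still rely on the on-callers' judgement about each alert.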

A good on-call experience should start from the moment someone joins the rotation. A common experience is to be thrown in the deep end and expected to handle things alone; at Monzo, every on-caller has a few months of shadowing a more experienced engineer, so that they can practice incident response and gain context with fewer expectations.

This type of culture shift doesn’t happen overnight, and requires constant effort and frequent iterations. The best people to ask for ideas for improving on-call are on-callers themselves. We hold frequent retrospectives and reflect on ways we can make on-call better. Sometimes it’s nice to have a forum just to vent, but giving people agency and a chance to make things better is a powerful way to improve on-caller wellbeing.

You can, and should, improve on-call. In an age where our systems are getting ever more complex yet critical, we’re increasingly reliant on humans to step in when the automation fails. Building happy, healthy on-call rotations is a superpower and one that you, too, can gain by taking the time and effort to incentivize people, reduce the pain points, and iterate rapidly.


  1. https://oncall.netlify.app/articles/2019-02/on-call-survey-2019

  2. https://progression.monzo.com/engineering/backend