Since Google published its SRE book in 2016, a lot of self-respecting organisations have rightly recognised SLIs and SLOs as extremely effective tools for managing the reliability of their important services. A quick recap of the definitions:
An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided. Latency and availability are good examples of SLIs.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. For example, an SLO for a service could be that 99% of its requests are answered within 400 ms.
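To make the two definitions concrete, here is a minimal sketch in Python. The function names, the sample data, and the choice to count a request of exactly 400 ms as "within" the threshold are my own assumptions for illustration, not something taken from the SRE book.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 400.0) -> float:
    """SLI: fraction of requests answered within threshold_ms.

    Assumption: a request taking exactly threshold_ms counts as good.
    """
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, target: float = 0.99) -> bool:
    """SLO: the target the SLI has to reach (99% in the example above)."""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical measurement window: 995 fast requests and 5 slow ones.
    window = [120.0] * 995 + [900.0] * 5
    sli = latency_sli(window)
    print(f"SLI = {sli:.3f}, SLO met: {meets_slo(sli)}")
```

The SLI is the measurement and the SLO is the line you draw on it. Everything else in this post is about where you draw that line and how many of them you draw.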
Google explicitly positioned SLIs and SLOs as a way to balance the time developers spend on innovation against the time they spend on reliability, but I have seen this lesson lost on many of the people I have interacted with. Fundamentally, it does not register with most of us that we are hired to create business value, not to pursue engineering goals for their own sake. Patrick McKenzie has a lovely essay about this which I think should be required reading for every programmer.
The lesson that I hope everybody took away from the SRE book is that SLIs and SLOs should establish the threshold at which developers stop their primary task of creating value and spend time improving the reliability of their systems. Instead, what I have seen happen is that well-intentioned managers take this amazing tool and turn it into a referendum on the engineering excellence of their organisation. I call this ‘Reliability Theatre’.
These are the most common examples of it that I have seen, though I am sure the list is not exhaustive.
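One common way to make that threshold concrete is an error budget, the idea the SRE book pairs with SLOs: the SLO implies how many bad requests you can afford in a window, and reliability work takes priority only once that allowance is spent. The sketch below is my own illustration under that assumption; the function names and the simple "budget exhausted" rule are hypothetical, and a real policy would be agreed with the business.

```python
def error_budget_remaining(total_requests: int, bad_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The SLO allows (1 - target) of all requests to be bad.
    allowed_bad = (1.0 - slo_target) * total_requests
    return 1.0 - bad_requests / allowed_bad


def should_prioritise_reliability(budget_remaining: float) -> bool:
    """Hypothetical policy: switch to reliability work once the budget is gone."""
    return budget_remaining <= 0.0


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99% SLO.
    for bad in (2_500, 12_000):
        remaining = error_budget_remaining(1_000_000, bad, 0.99)
        print(f"bad={bad}: budget remaining {remaining:.0%}, "
              f"prioritise reliability: {should_prioritise_reliability(remaining)}")
```

While budget remains, developers keep building. The point of the rule is that nobody has to argue about it.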
Setting unrealistic SLOs disconnected from your customers
It's very likely that, as a team, you will end up owning a wide assortment of services. Not all of them are equally critical to your business. There are the ones that bring the big money in, and then there are the ones for which your users might give you a lot of latitude. Instead of taking that latitude, managers often set unrealistic SLOs for their teams. The reliability version of the death march is to set a target of reducing the latency SLO of a legacy service to a respectable level when not a single customer of that service has that expectation.
You should instead focus on creating a feedback channel with your customers, wherever they might be, and on understanding their expectations.
Setting way too many SLIs
A lot of teams turn every metric they possibly can into an SLI, preferring quantity over quality. Instead of focusing on what their users need them to measure, they try to turn everything that can go wrong into an SLI. In my opinion this eventually just leads to alert fatigue and harms responder well-being.
I think the focus should instead be on distilling what your users care about into two or three meaningful SLIs. The chapter “Implementing Service Level Objectives” in the second Google book in the SRE series has some extremely useful pointers to help you do that. When it comes to SLIs, like many things, less is more.
Overdoing operational reviews
Now don’t get me wrong, I don’t think SLIs and SLOs are a set-and-forget activity. Nor do I think that management taking an interest in the operational health of teams is a bad idea. Where I do draw the line is when overenthusiastic managers, in an attempt to display their operational prowess, convince their teams that they have to use every operational review feature that PagerDuty offers.
Operational reviews are an important part of a reliability culture, but like everything else they should be applied in proportion: the frequency and number of reviews you hold should match the criticality of your service.
These are the ways that I have seen well-meaning teams mess up managing SLIs and SLOs. The solution is simple. Go back to the basics and strip the theatre away from reliability. Focus on your customers and your business. Figure out the minimum expectation they have of you. Use that as the basis for your SLIs and SLOs. Free your developers to innovate.
Rinse. Repeat.