Availability: SLOs, SLIs, and Error Budgets
January Theme: Availability
Last week we talked about availability and how it’s measured here. Check it out if you missed it!
Today we are going to talk about Service Level Objectives, Service Level Indicators, and Error Budgets.
It would be nice if every application were available 100% of the time, but balanced software engineers understand that perfection is not a realistic goal. As with everything in software, there are trade-offs, and folks need to understand that the trade-offs to get to 100% availability probably aren’t worth it.
“When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change and invest in responding faster to failure.”
-Marianne Belotti (Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones))
1. SLO (Service Level Objective)
An SLO is a target or a goal that defines the desired level of availability. It can also be used to measure latency, throughput, or error rates of a service. The SLO is typically expressed as a percentage (ex. 99.9% uptime) and represents the level of service the team aims to provide for users.
Purpose: SLOs are used to define what acceptable service availability looks like, guiding both the development and operations of the system.
Example: An SLO might be that an application should be 99.95% available
2. SLI (Service Level Indicator)
An SLI is a specific metric used to measure the performance of a service against the defined SLO. It is an objective measurement of how well a system is performing in relation to the desired characteristics. If a SLO is the goal, the SLI is how you measure it.
Purpose: SLIs provide the data needed to evaluate whether the service is meeting its SLOs. They are often collected continuously and are used as a way to track service health over time.
Example: The SLI for the availability of an API could be the percentage of requests that are served vs. the number of overall requests.
3. Error Budget
“This is the amount of temporary service degradation that’s deemed acceptable for users. So long as this error budget isn’t exceeded, then riskier deployments – which are those more likely to break the service – might be fine to proceed with. However, once the error budget is used up, pause all deployments that are considered risky.”
-Gergely Orosz (The Software Engineer’s Guidebook)
An Error Budget is the amount of failure allowed for a SLO. It is the difference between perfect availability (100%) and the service's SLO target. The error budget defines the maximum amount of downtime or failures a service can have while still meeting the SLO.
Purpose: The error budget allows for some level of failure, acknowledging that it’s not always realistic to achieve 100% availability. Teams can use the error budget to make decisions about whether to focus on improving availability or shipping new features.
Example: If an SLO is 99.5% uptime, then the error budget is the remaining 0.5% of downtime. This might translate into a certain number of minutes of downtime per month or year. If the system has used up too much of the error budget, the team might prioritize availability work over new feature development.
How They Work Together:
SLIs are the metrics that reflect system performance.
SLOs are the goals or targets for these metrics.
Error Budgets quantify how much deviation from the SLO is acceptable before action is needed.
In practice, teams monitor SLIs to determine whether they are meeting their SLOs. If they aren't, they can track the consumption of the error budget and decide whether to prioritize availability improvements or new features based on how much of the budget has been consumed. If too much of the error budget is consumed, the team might pause feature development to focus on improving the service's availability.
Example:
Let’s say you run a web service with the following definitions:
SLO: The service must be available 99.9% of the time.
SLI: Percentage of time that the service was available within a month.
Error Budget: If the SLO is 99.9%, the error budget is 0.1% of downtime (meaning the service can tolerate up to 0.1% of the time as downtime, which translates to around 43 minutes per month).
If the service exceeds this error budget (such as an unexpected outage), the team might focus on resolving the performance issues instead of adding new features.
This framework helps teams manage availability in a balanced way, ensuring they can meet user expectations while still evolving the system.
How do you define SLOs?
That sounds great, but how do you determine what your SLO should be? It’s generally not a great idea to use historical availability, because that can end up being more strict than necessary if you’ve been really lucky with few outages in the past.
There are a few things to consider for your application:
If you have multiple applications to consider, start with one instead of trying to define SLOs for every application you own. You can add more later but getting the first one is the hardest part!
Think about who your user is and what they will expect from the application in terms of speed and availability.
Consider the critical paths and common ways that users interact with your application. That can help you determine what is the most important to keep available to users.
The Deep Dive
Want to dive deeper into SLOs, SLIs, and Error Budgets? One of the best resources to check out is The Google SRE Workook, particularly this section on implementing SLOs.
Thank you!
If you’ve made it this far, then thank you! I appreciate you taking the time to read this. Here’s a silly web comic I made this week:
Have comments or questions about this newsletter? Or just want to chat? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.
Thank you for subscribing. If you like this newsletter, please tell your friends about it :)
Great post! Has some really useful information!