Availability: Dependency Management
January Theme: Availability
Last week we talked about SLOs, SLIs, and Error Budgets. Check it out if you missed it!
Today we are going to talk about dependency management for availability.
Dependency Management
Sometimes the availability challenge lies in managing the dependencies that your service relies on.
“You’re only as available as the sum of your dependencies”
-The Calculus of Service Availability
Dependencies are the other services, systems, or components that your service needs to function. Think of your service like a chain, with each link representing a different dependency—whether that’s a database, a third-party API, or even an internal system that processes requests. If any of those links break, your service might fail or experience disruptions.
In this week’s edition, we’ll explore how dependency management affects service availability, and introduce a concept known as the calculus of service availability—a way to calculate and plan for uptime, taking into account the reliability of all your dependencies.
The Dependency Dilemma
Your service doesn’t operate in isolation. It might rely on:
Third-party APIs: Many modern applications depend on external services for things like payment processing, sending emails, or geolocation data.
Databases: Your service may need to interact with databases that store user data, product catalogs, or other critical information.
Microservices: In a microservices architecture, different parts of your system may depend on each other. For instance, the payment system might depend on an inventory system to confirm whether an item is in stock before processing a transaction.
Cloud services: Hosting, file storage, or machine learning services provided by cloud providers (e.g., AWS, Azure, Google Cloud) are often essential dependencies.
Each of these services has its own availability characteristics. What happens if one of them experiences an outage or performs poorly? How does that impact your service’s overall availability?
The Calculus of Service Availability
This is where the calculus of service availability comes in. It’s a way of calculating what the overall availability of your service can be by factoring in the availability of each individual dependency.
Let’s say your service relies on two key dependencies, a database and a third-party API. If these services are down, your service will likely be down too.
But how do we calculate the downtime we can anticipate from these dependencies? Here’s a simplified way to think about it:
Database Availability = 99.95%
API Availability = 99.90%
To get a rough idea of your service’s overall maximum availability, we multiply the availability of each dependency.
This means the maximum availability of your service and based on these two dependencies would be 99.90%, or about 8 hours of downtime per year. This is just a rough estimate, but it illustrates the basic principle: the more dependencies you have, the lower your overall availability becomes, especially if those dependencies aren’t highly reliable.
Getting your 9’s
This means that if you want a highly available service, your dependencies need to be even more robust.
In The Calculus of Service Availability, there is a rule called “The Rule of the Extra 9”.
“If your service aims to offer 99.99 percent availability, then all of your critical dependencies must be significantly more than 99.99 percent available”
-The Calculus of Service Availability
As a general rule, each of your critical dependencies will need one extra 9 than what you want your service to provide. That means that if you’re planning for 99.99% (4-9s) availability, every critical dependency will need 99.999% (5-9s) availability, or one more 9 than your planned availability.
Managing Dependencies to Improve Availability
So, how do you manage dependencies effectively to ensure your service stays available? Here are a few strategies:
1. Redundancy
One of the most effective ways to mitigate the impact of dependency failures is through redundancy. This means having backup systems or services that can take over if the primary service fails. For example:
Database Replication: You could have a secondary database in another location that takes over if the primary one goes down.
API Failovers: If you rely on a third-party API, you might be able to switch to an alternative API provider if the primary one becomes unavailable.
2. Monitoring and Alerts
Keeping an eye on the health of your dependencies is crucial. Use monitoring tools to track the availability of each dependency and set up alerts when they are experiencing issues. This helps you quickly identify and respond to problems before they impact your users.
3. Graceful Degradation
Sometimes, it’s not feasible to have 100% uptime for every dependency. In such cases, you can design your service to degrade gracefully. This means your service continues to function, but with reduced functionality, when a dependency fails. For example:
If a payment gateway is down, you could let users browse your store but disable checkout until the payment system is back online.
If a geolocation service is unavailable, users might still be able to search your app but without location-specific results.
4. Choosing Reliable Dependencies
It’s tempting to rely on cheaper or less well-known services, but if those services are unreliable, it can seriously hurt your availability. Prioritize dependencies that have high reliability and a proven track record of uptime. Sometimes paying a little extra for a premium service can make a huge difference in the long run.
5. Fail Fast and Retry Logic
If a dependency is temporarily unavailable, implement retry logic in your system. This means your service will try again after a short delay before giving up entirely. However, don’t rely solely on retries. You should also have mechanisms to fail fast—meaning you don’t wait indefinitely for a service to come back online. If retries don’t work, it’s better to report an error to the user and let them know there’s an issue.
6. Get to know who you’re depending on
This is critical if you are working somewhere with microservices. If you are depending on other services from within your organization, make a map of what each of your dependencies are and what team they belong to. Then, make sure you get to know at least one person from each of those teams. This will give you a person to contact if you notice issues with that dependency.
The Deep Dive
Want to dive deeper into dependency management? I encourage you to check out The Calculus of Service Availability from ACM Queue!
Thank you!
If you’ve made it this far, then thank you! I appreciate you taking the time to read this.
Next week we will be wrapping up the availability topic for January by talking about incident investigation! This topic is close to my heart, having formerly worked in occupational health and safety.
Have an interesting incident story? If you’d like to be quoted in next week’s edition, reach out to me with your stories at brittany@balancedengineer.com!
Here’s a silly web comic I made this week:
Have comments or questions about this newsletter? Or just want to chat? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.
Thank you for subscribing. If you like this newsletter, please tell your friends about it :)