Availability: How is it measured?
This edition dives into software availability, its measurement, and why it's crucial for success.
Welcome and Introduction to The Balanced Engineer
Welcome to The Balanced Engineer, your resource for navigating the evolving world of software engineering. Each month, we’ll explore a key theme that’s relevant to generalist software engineers—whether you're fine-tuning your backend skills, mastering frontend development, or stepping into the world of leadership. My goal is to give you a high-level understanding of each topic with concise summaries and curated resources to help you dive deeper if you’re curious to learn more.
Every issue will highlight essential concepts, offer practical tips, and point you toward articles, tools, and talks to expand your expertise. Whether you’re looking to level up your current skill set or explore a new area of the stack, The Balanced Engineer is here to help you grow, stay balanced, and thrive in your engineering career. Let’s dive in!
January Theme: Availability
When it comes to software, availability is everything. But what does it mean to ensure your application is “available”? How can you measure it? Let’s dive into the world of availability and explore why it’s crucial to the success of your app.
What Exactly is Availability?
In simple terms, availability refers to the percentage of time your application is up and running—ready for use whenever your users need it. Imagine this: you open an app or a webpage, and you expect it to load instantly without issues. When your app is available, that’s exactly what happens.
On the flip side, if your users encounter downtime or delays, your app’s availability is compromised. It’s about making sure your service is reliable and responsive, no matter when or where it's accessed.
How Do You Measure Availability?
Measuring availability isn’t one-size-fits-all—it depends on your app’s goals, scale, and the needs of your users. Here are some of the most common ways to measure it:
Uptime Percentage:
The most straightforward way to measure availability is by tracking uptime. This simply refers to the amount of time your application is functioning properly. For example, if your service is operational for 99 hours out of every 100, your availability is 99%. The higher the percentage, the better the user experience.
Service-Level Agreements (SLAs):
For some applications, especially those that businesses rely on to run other critical software, SLAs become important. An SLA is a formal agreement that guarantees a certain level of availability. These contracts often set specific uptime goals, ensuring that the service provider is held accountable for any downtime. For instance, cloud services or hosting platforms typically provide SLAs to reassure clients that their systems will be up and running within defined parameters.
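To make the uptime math concrete, here is a minimal Python sketch. The function names `availability_percent` and `meets_sla` are my own, made up for illustration, and the 99.9% target is just an example figure rather than any particular provider's SLA.

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Return the share of `period` the service was up, as a percentage."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period length")
    uptime = period - downtime
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (uptime / period)


def meets_sla(measured_percent: float, target_percent: float) -> bool:
    """Check a measured availability against a (hypothetical) SLA target."""
    return measured_percent >= target_percent


if __name__ == "__main__":
    # The example from the text: up for 99 hours out of every 100.
    measured = availability_percent(timedelta(hours=100), timedelta(hours=1))
    print(f"Measured availability: {measured:.2f}%")
    print(f"Meets a 99.9% target: {meets_sla(measured, 99.9)}")
```

In practice the downtime figure would come from your monitoring system, and what counts as "down" (errors, slow responses, partial outages) is a decision you have to make for your own app.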
The Different '9's of Availability and What They Mean for Downtime
When we talk about high availability, you might hear terms like “99% availability” or “99.99% availability,” but what do these numbers actually represent in terms of downtime? The higher the number of 9’s, the less downtime your service can have. Here’s a breakdown of what the different “9’s” mean for total downtime in a given year:
99% Availability – Also called “two nines”
With 99% availability, your application can experience 3.65 days (87.6 hours) of downtime per year. While this may sound acceptable for some less critical apps, it’s a lot of time lost for most businesses.
99.9% Availability – “Three nines”
At this level, your downtime is reduced to 8.76 hours annually. This is often the baseline for many consumer-facing apps and businesses that want a more stable experience for users.
99.99% Availability – “Four nines”
With 99.99% availability, you can only afford about 52.56 minutes of downtime per year. This is a common standard for critical systems, like financial apps or enterprise-level software, where even a small amount of downtime can cause significant disruption.
99.999% Availability – “Five nines”
At 99.999%, downtime is reduced to just 5.26 minutes per year. This level of availability is typically required for mission-critical systems in industries like healthcare, finance, and telecommunications.
99.9999% Availability – “Six nines”
For the most demanding applications, 99.9999% availability allows for just 31.5 seconds of downtime annually. Achieving this level of availability is rare and often only required for extremely high-stakes systems, like those in aerospace or military applications.
The more “nines” you have, the less downtime your users experience, but achieving higher availability comes with increasing complexity, cost, and technical demands. It’s crucial to balance your uptime goals with the resources and infrastructure you have in place.
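The downtime figures for each level of “nines” come from simple arithmetic: multiply the length of the year by the fraction of time you are allowed to be unavailable. Here is a short Python sketch of that calculation. It assumes a 365-day year, and `downtime_budget` is a name I made up for illustration.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: a non-leap year


def downtime_budget(availability_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return how much downtime fits in `period` at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        budget = downtime_budget(target)
        minutes = budget.total_seconds() / 60
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Passing a different `period`, such as `timedelta(days=30)`, gives the budget for a month instead, which is handy when your targets are tracked monthly rather than yearly.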
Why Does Availability Matter?
In today’s digital landscape, downtime isn’t just inconvenient—it can be costly. Whether you’re running an e-commerce site, a SaaS platform, or a customer service app, availability directly impacts user satisfaction, revenue, and brand reputation. A single minute of downtime can lead to missed sales opportunities, frustrated users, and a drop in trust.
I have a strong memory of working at Nike and making a bad change. That change resulted in customers being unable to buy shoes on the Nike.com website for around five minutes.
Five minutes doesn't sound long, and luckily we had feature flags in place that kept the outage that short. But I remember my tech lead standing behind me while we waited for the change to be switched off. He was counting up the minutes of the outage and multiplying them by the thousands of dollars per minute we were likely losing as a result.
If your website makes money, outages can cause significant losses, either through the revenue lost directly or through the damage to your brand when customers perceive your availability as poor. That makes availability well worth optimizing for!
How Can You Improve Your Application's Availability?
Now that you understand what availability is, how it's measured, and the impact of the “9's,” we will spend the rest of January focusing on techniques to improve your application’s availability.
The Deep Dive
Want to dive deeper into understanding the measurement of software availability? I recommend spending some time learning about the CAP Theorem. The CAP Theorem states that a distributed data store can guarantee at most 2 of the following 3 properties at the same time:
Consistency: Every read returns the most recent write, so all users see the same data
Availability: Every request receives a response, so all users can access the data when they want to
Partition Tolerance: The system keeps working despite the network failures between nodes that are bound to happen in distributed systems
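To get a feel for the trade-off between the properties listed above, here is a toy Python model. It is not how any real database works. The `Replica` class and its `prefer_consistency` flag are hypothetical names used only to show the choice a node faces once it is cut off from its peers.

```python
class ServiceUnavailable(Exception):
    """Raised when a replica refuses to answer rather than risk stale data."""


class Replica:
    """Toy replica that holds one value and may be cut off from its peers."""

    def __init__(self, prefer_consistency: bool) -> None:
        self.prefer_consistency = prefer_consistency
        self.value = "v1"
        self.partitioned = False  # True while isolated from other replicas

    def read(self) -> str:
        if self.partitioned and self.prefer_consistency:
            # Consistency first: refuse to answer, which gives up availability.
            raise ServiceUnavailable("cannot confirm the latest value during a partition")
        # Availability first: answer anyway, even though the value may be stale.
        return self.value


if __name__ == "__main__":
    consistent = Replica(prefer_consistency=True)
    available = Replica(prefer_consistency=False)

    # Simulate a network partition. Imagine a newer value was written elsewhere.
    consistent.partitioned = True
    available.partitioned = True

    print("Availability-first replica returns:", available.read())
    try:
        consistent.read()
    except ServiceUnavailable as error:
        print("Consistency-first replica refuses:", error)
```

The availability-first replica keeps serving requests but might hand out old data, while the consistency-first replica protects correctness at the cost of downtime during the partition.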
Here are some helpful learning resources that you can dive into, depending on how you like to learn:
Have comments or questions about this newsletter? Or just want to chat? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.
Thank you for subscribing. If you like this newsletter, please tell your friends about it :)