The Balanced Engineer Newsletter
March 17, 2025

Building Fault Tolerance into Distributed Systems

This week in The Balanced Engineer Newsletter, we explore how to design fault tolerance into your distributed systems and why failures are inevitable.

March Theme: Distributed Systems

This week in our distributed systems series, we're focusing on fault tolerance – the art of keeping systems running despite inevitable failures. In a distributed world where components are spread across multiple machines, networks, and data centers, component failure isn't just possible – it's guaranteed.

What is fault tolerance?

Fault tolerance is a system's ability to continue functioning at an acceptable quality, even when one or more of its components fail. It's not about preventing failures (that's reliability), but rather about continuing to operate when failures occur.

As Amazon's Werner Vogels famously said: "Everything fails, all the time." Accepting this reality is the first step toward building truly resilient systems.

Why Failures Are Inevitable in Distributed Systems

Distributed systems face unique failure modes:

1. Networks are hard

Networks can become segmented, causing groups of nodes to be unable to communicate with each other.

2. Hardware Failures

With thousands of servers, disk failures, memory corruption, and power issues become daily occurrences.

3. Software Bugs

Try as we might, no software engineer out there is producing completely bug-free software (unless they aren’t producing software at all).

4. Clock Drift

Distributed systems often rely on time synchronization, but physical clocks drift at different rates.

5. Resource Exhaustion

Memory leaks, file descriptor limits, and connection pool exhaustion can cause gradual degradation.

Key Strategies for Fault Tolerance

1. Redundancy

The simplest form of fault tolerance involves duplicating critical components:

  • Active-passive: Standby components take over when primary components fail

  • Active-active: All components handle workload simultaneously
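To make the active-passive idea concrete, here's a minimal failover sketch. The `Node` class and its health flag are illustrative stand-ins for real replicas and health checks, not any particular library:

```python
class Node:
    """Hypothetical replica: a name plus a health flag."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def failover_handle(nodes, request):
    """Try the primary first, then each standby in order."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this replica failed; fall through to the next standby
    raise RuntimeError("all replicas are down")


primary = Node("primary")
standby = Node("standby")
print(failover_handle([primary, standby], "req-1"))  # served by primary
primary.healthy = False
print(failover_handle([primary, standby], "req-2"))  # served by standby
```

Real failover systems add health probes, leader election, and replication lag handling on top of this basic "try the next replica" loop.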

2. Isolation

Containing failures to prevent them from cascading through the system:

  • Bulkheads: Partitioning resources to limit failure impact (like ship compartments)

  • Circuit breakers: Temporarily disabling problematic components

  • Rate limiting: Preventing resource exhaustion by limiting the number of requests allowed
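A circuit breaker can be sketched in a few dozen lines. This is a toy version, assuming a simple closed/open/half-open cycle; the thresholds and timeout values are illustrative:

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors,
    then fails fast until `reset_timeout` seconds pass, at which point one
    trial call is allowed through (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

While the circuit is open, callers get an immediate error instead of tying up threads and connections waiting on a component that is already struggling.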

3. Graceful Degradation/Fallbacks

Maintaining core functionality when resources are limited:

  • Serving cached content when databases are unavailable

  • Disabling non-critical features during partial outages

  • Implementing tiered service levels
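The first bullet above, serving cached content when the database is unavailable, can be sketched like this. The `fetch_from_db` function and the cache dict are hypothetical stand-ins for your real data layer:

```python
cache = {"user:42": {"name": "Ada"}}  # last known good values


def fetch_from_db(key):
    """Stand-in for a real database call; here it simulates an outage."""
    raise TimeoutError("database unavailable")


def get_with_fallback(key):
    try:
        value = fetch_from_db(key)
        cache[key] = value              # refresh the cache on success
        return value, "fresh"
    except (TimeoutError, ConnectionError):
        if key in cache:
            return cache[key], "stale"  # degrade gracefully
        raise                           # no fallback available


value, freshness = get_with_fallback("user:42")
```

Returning the freshness label alongside the value lets upstream code decide whether stale data is acceptable for this request, which is the essence of tiered service levels.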

4. Retry Strategies

Smart retry mechanisms can overcome transient failures:

  • Exponential backoff: Increasing wait time between retries

  • Jitter: Adding randomness to retry intervals to prevent thundering herds, where a flood of simultaneous retries hits a recovering service all at once and takes it down a second time

  • Idempotent operations: Ensuring operations can be safely repeated
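Backoff and jitter combine naturally in one helper. This sketch uses the "full jitter" variant, sleeping a random amount between zero and the capped exponential delay; the parameter defaults are illustrative:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry `fn` on failure, sleeping a random amount between 0 and
    min(cap, base_delay * 2**attempt) before each retry (full jitter)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)


# Hypothetical flaky call that succeeds on its third attempt:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_with_backoff(flaky, base_delay=0.01)  # "ok" after 3 attempts
```

Note that this only makes sense when `fn` is idempotent, per the last bullet above; retrying a non-idempotent operation (say, a payment) can turn one failure into a duplicate charge.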

5. State Management

Preserving and recovering state during failures:

  • Checkpointing progress

  • Persistent queues
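Checkpointing can be sketched as "write progress to durable storage after each unit of work, and resume from the last checkpoint on startup." This example uses a JSON file with an atomic rename so a crash mid-write never leaves a corrupt checkpoint; the file name and state shape are made up for illustration:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it into place,
    so readers always see either the old checkpoint or the new one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default  # first run: no checkpoint yet


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "progress.json")
state = load_checkpoint(path, {"last_processed": 0})
for item in range(state["last_processed"], 10):
    # ... process item here ...
    state["last_processed"] = item + 1
    save_checkpoint(path, state)
```

If the worker crashes partway through, restarting it re-reads the checkpoint and picks up from the first unprocessed item instead of starting over.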

Fault Tolerance Architectures

1. Primary-Backup Systems

A primary node handles all requests, with one or more backup nodes ready to take over:

  • Examples: MySQL with replicas, traditional active-standby setups

  • Challenge: Split-brain scenarios when coordination fails

2. Quorum-Based Systems

Operations succeed when acknowledged by a majority of nodes:

  • Examples: DynamoDB or CosmosDB

  • Tradeoff: Tunable consistency vs. availability
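The quorum rule is that with N replicas, requiring W write acknowledgments and R read acknowledgments such that W + R > N guarantees every read quorum overlaps the latest write quorum. A toy sketch, with each replica modeled as a plain dict:

```python
def quorum_write(replicas, key, value, w):
    """Write to every replica; succeed only if at least `w` acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value
            acks += 1
        except Exception:
            continue  # a down replica just doesn't ack
    return acks >= w


def quorum_read(replicas, key, r):
    """Read from replicas; succeed only with at least `r` responses."""
    votes = [replica[key] for replica in replicas if key in replica]
    if len(votes) < r:
        raise RuntimeError("read quorum not reached")
    # A real system attaches versions and returns the newest value;
    # here the replicas agree, so any vote will do.
    return votes[0]


N = 3
replicas = [dict() for _ in range(N)]
W, R = 2, 2  # W + R = 4 > N = 3, so reads always intersect writes
assert quorum_write(replicas, "k", "v1", W)
print(quorum_read(replicas, "k", R))  # "v1"
```

Tuning W and R is exactly the consistency-vs-availability tradeoff in the bullet above: lowering R makes reads more available but risks returning stale data.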

3. Leaderless Architectures

All nodes can accept writes, with reconciliation happening after network issues:

  • Examples: Content Delivery Networks (CDNs)

  • Benefit: High availability during network issues

Testing for Fault Tolerance

You can't claim fault tolerance without testing failure scenarios:

1. Chaos Engineering

Deliberately introducing failures to verify system resilience:

  • Netflix's Chaos Monkey: Randomly terminates production instances

  • GameDays: Simulated failure events to train teams

2. Fault Injection Testing

Systematically testing specific failure modes:

  • Network latency and packet loss

  • Process termination

  • Disk I/O errors
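One lightweight way to exercise these failure modes in tests is a fault-injecting wrapper around a dependency call. This is a sketch, not a real chaos tool; the probabilities and the injected `ConnectionError` are arbitrary choices:

```python
import random
import time


def inject_faults(fn, error_rate=0.2, max_extra_latency=0.5, rng=random):
    """Wrap `fn` so calls randomly gain latency or raise a simulated
    network error, letting tests exercise timeout and retry paths."""
    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(0, max_extra_latency))  # injected latency
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")    # injected failure
        return fn(*args, **kwargs)
    return wrapped


# Hypothetical usage: wrap a client call in a test environment.
flaky_fetch = inject_faults(lambda: "payload", error_rate=0.2)
```

Pointing a service's clients at wrapped dependencies in a staging environment is a cheap first step before graduating to full chaos experiments in production.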

3. Disaster Recovery Drills

Regular exercises to verify recovery procedures:

  • Database failover testing

  • Zone evacuation

  • Full region failover

The Deep Dive

The Deep Dive for this week is this lovely Netflix Tech Blog article about fault tolerance in their systems, suggested by Rich Burroughs over on Bluesky.

Final Thoughts

Building fault-tolerant systems isn't just a technical challenge—it's a mindset shift. Instead of asking "How can we prevent failures?" ask "How can we design systems that embrace failures as normal events?"

The most sophisticated fault tolerance mechanisms are useless if they haven't been tested under realistic conditions. Make failure testing a regular part of your development and operations cycles.

Next week, we'll explore geographic distributions and disaster recovery for distributed systems.

What's the most interesting system failure you've encountered, and how did you address it? Share your war stories – I'd love to hear from you!

Thank you!

If you made it this far, then thank you! I finally get to share one of the fun things coming out soon: my first post on the GitHub Blog, titled How GitHub engineers learn new codebases. Check it out and let me know what you think!

Here’s a silly web comic I made this week:

A normalized curve showing coolness of a social media site over time. The peak seems to occur when ads are introduced. On the left, increasing in coolness, is Mastodon and Bluesky. On the right in descending order of "totally sucks" we have TikTok, then Instagram, then X (the app formerly known as Twitter), and Facebook at the very very bottom

Have comments or questions about this newsletter? Or just want to be internet friends? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.

Thank you for subscribing. If you like this newsletter, it would be incredibly helpful to tell your friends about it! :)
