The Balanced Engineer Newsletter
March 17, 2025

Building Fault Tolerance into Distributed Systems

This week in The Balanced Engineer Newsletter, we explore how to design fault tolerance into your distributed systems and why failures are inevitable.

March Theme: Distributed Systems

This week in our distributed systems series, we're focusing on fault tolerance – the art of keeping systems running despite inevitable failures. In a distributed world where components are spread across multiple machines, networks, and data centers, component failure isn't just possible – it's guaranteed.

What is fault tolerance?

Fault tolerance is a system's ability to continue functioning at an acceptable quality, even when one or more of its components fail. It's not about preventing failures (that's reliability), but rather about continuing to operate when failures occur.

As Amazon's Werner Vogels famously said: "Everything fails, all the time." Accepting this reality is the first step toward building truly resilient systems.

Why Failures Are Inevitable in Distributed Systems

Distributed systems face unique failure modes:

1. Networks are hard

Networks can become segmented, causing groups of nodes to be unable to communicate with each other.

2. Hardware Failures

With thousands of servers, disk failures, memory corruption, and power issues become daily occurrences.

3. Software Bugs

Try as we might, no software engineer out there is producing completely bug-free software (unless they aren’t producing software at all).

4. Clock Drift

Distributed systems often rely on time synchronization, but physical clocks drift at different rates.

5. Resource Exhaustion

Memory leaks, file descriptor limits, and connection pool exhaustion can cause gradual degradation.

Key Strategies for Fault Tolerance

1. Redundancy

The simplest form of fault tolerance involves duplicating critical components:

  • Active-passive: Standby components take over when primary components fail

  • Active-active: All components handle workload simultaneously
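To make the active-passive idea concrete, here's a minimal failover sketch. The `Node` class and its health flag are illustrative stand-ins for real replicas and health checks, not any particular library:

```python
class Node:
    """Hypothetical replica: a name plus a health flag."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def failover_handle(nodes, request):
    """Try the primary first, then each standby in order."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this replica failed; fall through to the next standby
    raise RuntimeError("all replicas are down")


primary = Node("primary")
standby = Node("standby")
print(failover_handle([primary, standby], "req-1"))  # served by primary
primary.healthy = False
print(failover_handle([primary, standby], "req-2"))  # served by standby
```

Real failover systems add health probes, leader election, and replication lag handling on top of this basic "try the next replica" loop.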

2. Isolation

Containing failures to prevent them from cascading through the system:

  • Bulkheads: Partitioning resources to limit failure impact (like ship compartments)

  • Circuit breakers: Temporarily disabling problematic components

  • Rate limiting: Preventing resource exhaustion by limiting the number of requests allowed
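A circuit breaker can be sketched in a few dozen lines. This is a toy version, assuming a simple closed/open/half-open cycle; the thresholds and timeout values are illustrative:

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors,
    then fails fast until `reset_timeout` seconds pass, at which point one
    trial call is allowed through (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

While the circuit is open, callers get an immediate error instead of tying up threads and connections waiting on a component that is already struggling.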

3. Graceful Degradation/Fallbacks

Maintaining core functionality when resources are limited:

  • Serving cached content when databases are unavailable

  • Disabling non-critical features during partial outages

  • Implementing tiered service levels
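The first bullet above, serving cached content when the database is unavailable, can be sketched like this. The `fetch_from_db` function and the cache dict are hypothetical stand-ins for your real data layer:

```python
cache = {"user:42": {"name": "Ada"}}  # last known good values


def fetch_from_db(key):
    """Stand-in for a real database call; here it simulates an outage."""
    raise TimeoutError("database unavailable")


def get_with_fallback(key):
    try:
        value = fetch_from_db(key)
        cache[key] = value              # refresh the cache on success
        return value, "fresh"
    except (TimeoutError, ConnectionError):
        if key in cache:
            return cache[key], "stale"  # degrade gracefully
        raise                           # no fallback available


value, freshness = get_with_fallback("user:42")
```

Returning the freshness label alongside the value lets upstream code decide whether stale data is acceptable for this request, which is the essence of tiered service levels.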

4. Retry Strategies

Smart retry mechanisms can overcome transient failures:

  • Exponential backoff: Increasing wait time between retries

  • Jitter: Adding randomness to retry intervals to prevent thundering herds, where a flood of simultaneous retries hits a recovering service all at once and takes it down a second time

  • Idempotent operations: Ensuring operations can be safely repeated
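Backoff and jitter combine naturally in one helper. This sketch uses the "full jitter" variant, sleeping a random amount between zero and the capped exponential delay; the parameter defaults are illustrative:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry `fn` on failure, sleeping a random amount between 0 and
    min(cap, base_delay * 2**attempt) before each retry (full jitter)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)


# Hypothetical flaky call that succeeds on its third attempt:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_with_backoff(flaky, base_delay=0.01)  # "ok" after 3 attempts
```

Note that this only makes sense when `fn` is idempotent, per the last bullet above; retrying a non-idempotent operation (say, a payment) can turn one failure into a duplicate charge.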

5. State Management

Preserving and recovering state during failures:

  • Checkpointing progress

  • Persistent queues
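Checkpointing can be sketched as "write progress to durable storage after each unit of work, and resume from the last checkpoint on startup." This example uses a JSON file with an atomic rename so a crash mid-write never leaves a corrupt checkpoint; the file name and state shape are made up for illustration:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it into place,
    so readers always see either the old checkpoint or the new one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default  # first run: no checkpoint yet


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "progress.json")
state = load_checkpoint(path, {"last_processed": 0})
for item in range(state["last_processed"], 10):
    # ... process item here ...
    state["last_processed"] = item + 1
    save_checkpoint(path, state)
```

If the worker crashes partway through, restarting it re-reads the checkpoint and picks up from the first unprocessed item instead of starting over.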

Fault Tolerance Architectures

1. Primary-Backup Systems

A primary node handles all requests, with one or more backup nodes ready to take over:

  • Examples: MySQL with replicas, traditional active-standby setups

  • Challenge: Split-brain scenarios when coordination fails

2. Quorum-Based Systems

Operations succeed when acknowledged by a majority of nodes:

  • Examples: DynamoDB or CosmosDB

  • Tradeoff: Tunable consistency vs. availability
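The quorum rule is that with N replicas, requiring W write acknowledgments and R read acknowledgments such that W + R > N guarantees every read quorum overlaps the latest write quorum. A toy sketch, with each replica modeled as a plain dict:

```python
def quorum_write(replicas, key, value, w):
    """Write to every replica; succeed only if at least `w` acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value
            acks += 1
        except Exception:
            continue  # a down replica just doesn't ack
    return acks >= w


def quorum_read(replicas, key, r):
    """Read from replicas; succeed only with at least `r` responses."""
    votes = [replica[key] for replica in replicas if key in replica]
    if len(votes) < r:
        raise RuntimeError("read quorum not reached")
    # A real system attaches versions and returns the newest value;
    # here the replicas agree, so any vote will do.
    return votes[0]


N = 3
replicas = [dict() for _ in range(N)]
W, R = 2, 2  # W + R = 4 > N = 3, so reads always intersect writes
assert quorum_write(replicas, "k", "v1", W)
print(quorum_read(replicas, "k", R))  # "v1"
```

Tuning W and R is exactly the consistency-vs-availability tradeoff in the bullet above: lowering R makes reads more available but risks returning stale data.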

3. Leaderless Architectures

All nodes can accept writes, with reconciliation happening after network issues:

  • Examples: Content Delivery Networks (CDNs)

  • Benefit: High availability during network issues

Testing for Fault Tolerance

You can't claim fault tolerance without testing failure scenarios:

1. Chaos Engineering

Deliberately introducing failures to verify system resilience:

  • Netflix's Chaos Monkey: Randomly terminates production instances

  • GameDays: Simulated failure events to train teams

2. Fault Injection Testing

Systematically testing specific failure modes:

  • Network latency and packet loss

  • Process termination

  • Disk I/O errors
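One lightweight way to exercise these failure modes in tests is a fault-injecting wrapper around a dependency call. This is a sketch, not a real chaos tool; the probabilities and the injected `ConnectionError` are arbitrary choices:

```python
import random
import time


def inject_faults(fn, error_rate=0.2, max_extra_latency=0.5, rng=random):
    """Wrap `fn` so calls randomly gain latency or raise a simulated
    network error, letting tests exercise timeout and retry paths."""
    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(0, max_extra_latency))  # injected latency
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")    # injected failure
        return fn(*args, **kwargs)
    return wrapped


# Hypothetical usage: wrap a client call in a test environment.
flaky_fetch = inject_faults(lambda: "payload", error_rate=0.2)
```

Pointing a service's clients at wrapped dependencies in a staging environment is a cheap first step before graduating to full chaos experiments in production.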

3. Disaster Recovery Drills

Regular exercises to verify recovery procedures:

  • Database failover testing

  • Zone evacuation

  • Full region failover

The Deep Dive

The Deep Dive for this week is this lovely Netflix Tech Blog article about fault tolerance in their systems, suggested by Rich Burroughs over on Bluesky.

Final Thoughts

Building fault-tolerant systems isn't just a technical challenge—it's a mindset shift. Instead of asking "How can we prevent failures?" ask "How can we design systems that embrace failures as normal events?"

The most sophisticated fault tolerance mechanisms are useless if they haven't been tested under realistic conditions. Make failure testing a regular part of your development and operations cycles.

Next week, we'll explore geographic distributions and disaster recovery for distributed systems.

What's the most interesting system failure you've encountered, and how did you address it? Share your war stories – I'd love to hear from you!

Thank you!

If you made it this far, then thank you! I finally get to share one of the fun things coming out soon: my first post on the GitHub Blog, titled How GitHub engineers learn new codebases. Check it out and let me know what you think!

Here’s a silly web comic I made this week:

A normalized curve showing coolness of a social media site over time. The peak seems to occur when ads are introduced. On the left, increasing in coolness, is Mastodon and Bluesky. On the right in descending order of "totally sucks" we have TikTok, then Instagram, then X (the app formerly known as Twitter), and Facebook at the very very bottom

Have comments or questions about this newsletter? Or just want to be internet friends? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.

Thank you for subscribing. If you like this newsletter, it would be incredibly helpful to tell your friends about it! :)
