The Balanced Engineer Newsletter

March 24, 2025

Geographic distribution and disaster recovery in distributed systems

This edition dives into multi-region strategies for geographic distribution and disaster recovery.

March Theme: Distributed Systems

In our next installment of this month's distributed systems series, we're tackling geographic distribution and disaster recovery. As software increasingly becomes the backbone of global businesses, spreading your system across multiple geographic regions isn't just a performance optimization—it's essential for business continuity.

Why go Multi-Region?

Spreading your distributed system across multiple regions around the globe serves a few critical purposes.

1. Improved user experience

One of the best ways to reduce latency for users loading your content is to serve that content from as close to them as possible. Content delivery networks (CDNs) cache content at locations around the globe, which can noticeably cut load times for users who are far from your origin servers.
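To put rough numbers on why distance matters, here is a minimal sketch that estimates the best-case round trip over fiber. It assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and ignores routing, queuing, and processing delays, so real latencies will be higher. The function name and the example distances are made up for illustration.

```python
# Approximate speed of light in optical fiber (about 2/3 of c in a vacuum).
FIBER_KM_PER_MS = 200.0  # 200,000 km/s is 200 km per millisecond


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring routing and processing."""
    return 2 * distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    # Illustrative distances, not measurements of any real network path.
    examples = [("same metro", 50), ("cross-continent", 4000), ("intercontinental", 10000)]
    for label, km in examples:
        print(f"{label}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

A page load usually needs several round trips, so this floor gets multiplied quickly when users are far from the content.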

2. Regulatory compliance

Some jurisdictions have data sovereignty laws and regulations that require data to be stored in specific geographic regions. Even where no law or regulation requires it, some customers may prefer having their data stored in a region of their choosing over having it all kept in a single country.

3. Disaster preparedness

Running in multiple data centers across different regions improves reliability during catastrophic failures, whether they come from natural disasters, a failure of regional infrastructure, or human error that takes out an entire region.

Levels of Geographic Distribution

Multi-region setups exist on a spectrum:

1. Single region, single availability zone

  • What it is: A single physical data center within a single metro area

  • Protection against: Nothing. This is the most vulnerable state to be in

  • Limitations: Anything impacting this data center will impact the entirety of your service, which is… not great!

2. Single region, multiple availability zones

  • What it is: Multiple physically isolated data centers within a single metro area

  • Protection against: Failures that can take down an entire data center, such as power issues or networking problems

  • Limitations: Still vulnerable to regional disasters or widespread outages, such as an area-wide power failure

3. Multiple region, active-passive

  • What it is: Primary region handles all traffic with a secondary region on standby to take over traffic if something happens to the primary

  • Protection against: Regional disasters or outages

  • Limitations: Slower recovery time, and potentially outdated data in the backup region depending on how replication occurs

4. Multiple region, active-active

  • What it is: Multiple regions which all serve traffic simultaneously

  • Protection against: Nearly all failure scenarios

  • Limitations: Data synchronization is hard, and keeping data consistent across regions can be a problem

5. Global distribution with edge computing

  • What it is: Core services in multiple regions with edge locations for content delivery (CDNs)

  • Protection against: Nearly all failure scenarios, with the added benefit of the best performance for users

  • Limitations: Most complex and most costly
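As a rough illustration of the active-passive level above, here is a minimal sketch of a failover controller that sends traffic to the secondary region once the primary fails several health checks in a row. The class name, the region labels, and the threshold of three are all hypothetical; a real setup would lean on your DNS or load balancer's health checking instead.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverController:
    """Routes traffic to the primary until it fails several checks in a row."""

    primary: str
    secondary: str
    failure_threshold: int = 3  # hypothetical value; tune it against your RTO
    consecutive_failures: int = field(init=False, default=0)
    active: str = field(init=False, default="")

    def __post_init__(self) -> None:
        self.active = self.primary

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one health check result and return the region serving traffic."""
        if self.active != self.primary:
            # Already failed over; failing back is left as a deliberate manual step.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.secondary
        return self.active


if __name__ == "__main__":
    controller = FailoverController(primary="region-a", secondary="region-b")
    for healthy in [True, False, False, False]:
        serving = controller.record_health_check(healthy)
    print(serving)  # region-b, after three failed checks in a row
```

Requiring consecutive failures trades a slightly slower failover for fewer false alarms from a single dropped health check.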

Challenges from Geographic Distribution

While going multi-region has significant benefits, it can also be complicated and expensive. There are a few architecture decisions that need to be made when distributing data across regions:

1. Replication strategies

  • Synchronous replication: Writes are confirmed only once they are saved in multiple regions. This gives you strong consistency at the cost of higher write latency. If accurate data matters more than fast writes, synchronous replication might be the best move.

  • Asynchronous replication: Writes are confirmed locally, then propagated to other regions. This gives you lower latency and higher availability, but also a higher potential for data loss if a region fails before its writes have propagated. If speed matters more than accuracy, asynchronous replication might be the best decision.
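The difference between the two is easiest to see in a toy model. This is a minimal in-memory sketch, not how any particular database does it: the Region and ReplicatedStore classes are hypothetical, and "replication" is just copying a value into another dictionary.

```python
from collections import deque


class Region:
    """A region holding its own copy of the data (hypothetical toy model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = {}


class ReplicatedStore:
    """Toy key-value store with one primary region and several replicas."""

    def __init__(self, primary: Region, replicas: list, synchronous: bool) -> None:
        self.primary = primary
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = deque()  # writes acknowledged but not yet replicated

    def write(self, key: str, value: str) -> None:
        """Returning from this method stands in for acknowledging the write."""
        self.primary.data[key] = value
        if self.synchronous:
            # Synchronous: every replica has the write before we acknowledge it.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Asynchronous: acknowledge now and replicate later.
            self.pending.append((key, value))

    def replicate_pending(self) -> None:
        """Background step that ships queued writes to the replicas in order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


if __name__ == "__main__":
    primary, replica = Region("primary"), Region("replica")
    store = ReplicatedStore(primary, [replica], synchronous=False)
    store.write("order-1", "paid")
    print(replica.data.get("order-1"))  # None: lost if the primary died right now
    store.replicate_pending()
    print(replica.data.get("order-1"))  # paid
```

Anything still sitting in the pending queue when the primary region fails is the data loss that asynchronous replication risks. In the synchronous case the queue stays empty, but every write waits on the slowest replica.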

2. Consistency models

  • Strong consistency: Every read reflects all writes that were confirmed before it

  • Eventual consistency: Given enough time, all replicas will receive all writes, but there is a window of time during which a read may return stale data

3. Data locality and sharding

  • Geo-sharding: Partition data by geographic relevance (ex. data for customers in Asia is located there)

  • Data locality: Keeping data closest to where it’s most frequently accessed (similar to geo-sharding, but potentially different based on access patterns)

  • Follow-the-sun: Moving primary replicas to match current business hours (keeps latency lowest wherever most of your users are active at that time of day)
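Here is a minimal sketch of what geo-sharding and follow-the-sun routing decisions could look like in code. The country-to-region mapping, the region names, and the UTC hour windows are all made up for illustration; real boundaries depend on where your users and your legal obligations actually are.

```python
from datetime import datetime, timezone

# Hypothetical mapping from customer country code to the region storing their data.
HOME_REGION_BY_COUNTRY = {
    "JP": "asia",
    "IN": "asia",
    "DE": "europe",
    "FR": "europe",
    "US": "americas",
    "BR": "americas",
}
# Assumed fallback. With data residency rules in play, you may prefer to
# raise an error here rather than guess.
DEFAULT_REGION = "americas"


def home_region(country_code: str) -> str:
    """Geo-sharding: choose the shard by where the customer is located."""
    return HOME_REGION_BY_COUNTRY.get(country_code.upper(), DEFAULT_REGION)


def follow_the_sun_primary(now_utc: datetime) -> str:
    """Follow-the-sun: choose the primary by which region is in business hours.

    The hour boundaries are rough, made-up windows in UTC.
    """
    hour = now_utc.hour
    if hour < 8:
        return "asia"
    if hour < 16:
        return "europe"
    return "americas"


if __name__ == "__main__":
    print(home_region("jp"))
    print(follow_the_sun_primary(datetime.now(timezone.utc)))
```

Geo-sharding is a fixed property of the data, while follow-the-sun changes with the clock, so the second function needs a safe way to hand off the primary role at each boundary.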

Disaster Recovery Planning

There are a few things to consider when planning for disaster recovery, and to be clear, not planning for disaster recovery is still a method of planning for disaster recovery… It’s just planning to fail when a disaster eventually hits!

1. Recovery objectives

  • Recovery Time Objective (RTO): Maximum acceptable downtime. Examples:

    • Seconds: Critical financial systems

    • Minutes: Consumer-facing applications

    • Hours: Internal business applications

  • Recovery Point Objective (RPO): Maximum acceptable data loss, usually measured as a window of time (for example, the last five minutes of writes). Examples:

    • Zero: Financial transactions

    • Some: Social media posts

    • Most: Analytics data
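Once you have objectives written down, checking a setup against them is simple arithmetic. This is a minimal sketch with hypothetical names: worst-case data loss is approximated by the time between backups (or the replication lag), and worst-case downtime by the recovery time you measured in your last test.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def meets_objectives(
    objectives: RecoveryObjectives,
    worst_case_data_loss: timedelta,
    measured_recovery_time: timedelta,
) -> dict:
    """Compare what a setup delivers against the stated objectives."""
    return {
        "rpo_met": worst_case_data_loss <= objectives.rpo,
        "rto_met": measured_recovery_time <= objectives.rto,
    }


if __name__ == "__main__":
    # Illustrative numbers: nightly backups and a three hour restore,
    # checked against a four hour RTO and a one hour RPO.
    objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
    result = meets_objectives(
        objectives,
        worst_case_data_loss=timedelta(hours=24),
        measured_recovery_time=timedelta(hours=3),
    )
    print(result)  # {'rpo_met': False, 'rto_met': True}
```

In that example the restore is fast enough, but nightly backups can lose up to a day of data, so the RPO is what forces a change in strategy.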

2. Disaster recovery strategies

  • Backup and restore: Periodically back up data with the ability to manually recover it if needed

  • Pilot light: Have a minimal standby environment that can be rapidly scaled up

  • Warm standby: Scaled-down but fully functional copy of production

  • Hot standby: Full production replica ready to take over instantly

  • Multi-site active-active: No failover step needed; traffic automatically reroutes to the healthy sites

3. Disaster recovery tests

Your disaster recovery plan isn’t worth much unless it’s tested regularly, so that it can be put into action quickly when it’s needed. That can be done through the following types of tests:

  • Tabletop exercises: Teams get together and walk through failure scenarios and how to handle them.

  • Functional testing: Validate individual recovery components work as expected.

  • Full failover tests: Simulate actual disasters and recovery.

  • Chaos engineering: Continuously introduce failures to your environment to build resilience to them.
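As one example of functional testing, here is a minimal sketch that validates a single recovery component: it takes a backup, restores it, and confirms the restored data matches the original by comparing checksums. The helper names are hypothetical and the "backup" is just JSON bytes in memory; a real test would exercise your actual backup tooling and storage.

```python
import hashlib
import json


def checksum(records: dict) -> str:
    """Stable SHA-256 digest of a JSON-serializable dataset."""
    encoded = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def take_backup(records: dict) -> bytes:
    """Stand-in for a real backup job: serialize the data to bytes."""
    return json.dumps(records, sort_keys=True).encode("utf-8")


def restore_backup(backup: bytes) -> dict:
    """Stand-in for a real restore job: rebuild the data from the backup."""
    return json.loads(backup.decode("utf-8"))


def backup_restores_cleanly(records: dict) -> bool:
    """Functional test: a restored backup must match the original data."""
    restored = restore_backup(take_backup(records))
    return checksum(restored) == checksum(records)


if __name__ == "__main__":
    sample = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}
    print(backup_restores_cleanly(sample))  # True
```

Running a check like this on a schedule catches the classic failure mode of backups that are taken faithfully but can't actually be restored.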

Final Thoughts

With modern cloud systems being adopted by companies at every level, disaster recovery is easier to plan for than ever, and it can be critical to business continuity.

Start with your actual business requirements. What does your business actually need for its RTO and RPO? Answering that will keep you from over-engineering or under-protecting your systems.

Thank you!

If you made it this far, then thank you! We are close to wrapping up March (how did that happen??) and I’m still working out what I’ll be focusing on in April. In the meantime, stay tuned for the final installment in the distributed systems series next week!

Here’s a silly web comic I made this week:

A line graph comparing the amount of time that I spent overthinking a message vs how important the conversation is. At not very important, I start out at more overthinking than I should. It quickly goes up to "A concerning amount" of overthinking at kind of and extremely important.

Have comments or questions about this newsletter? Or just want to be internet friends? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.

Thank you for subscribing. It would be incredibly helpful if you tell your friends about this newsletter if you like it! :)

