Geographic distribution and disaster recovery in distributed systems

                March 24, 2025

                This edition dives into multi-region strategies for geographic distribution and disaster recovery.

            March Theme: Distributed Systems
In our next installment of this month's distributed systems series, we're tackling geographic distribution and disaster recovery. As software increasingly becomes the backbone of global businesses, spreading your system across multiple geographic regions isn't just a performance optimization—it's essential for business continuity.
Why go Multi-Region?
Spreading your distributed system across multiple regions serves a few critical purposes.
1. Improved user experience
One of the best ways to reduce latency for users loading your content is to house the content as close to them as possible. Content delivery networks (CDNs) with locations near users around the globe can noticeably cut load times.
2. Regulatory compliance
Some jurisdictions have data sovereignty laws and regulations that require data to be stored in specific geographic regions. Even where no law or regulation requires it, some customers may still prefer regional storage over keeping all their data in a single country.
3. Disaster preparedness
Running in multiple data centers across different regions can be critical for reliability during catastrophic failures, whether they come from natural disasters, a failure of regional infrastructure, or human errors that affect an entire region.
Levels of Geographic Distribution
Multi-region setups exist on a spectrum:
1. Single region, single availability zone
What it is: A single physical data center within a single metro area
Protection against: Nothing. This is the most vulnerable state to be in
Limitations: Anything impacting this data center will impact the entirety of your service, which is… not great!
2. Single region, multiple availability zones
What it is: Multiple physically isolated data centers within a single metro area
Protection against: Failures that can take down an entire data center, such as power issues or networking problems
Limitations: Still vulnerable to regional disasters or widespread outages, such as an area-wide power failure
3. Multiple regions, active-passive
What it is: Primary region handles all traffic with a secondary region on standby to take over traffic if something happens to the primary
Protection against: Regional disasters or outages
Limitations: Slow recovery time, potentially outdated data in backup regions depending on how replication occurs
4. Multiple regions, active-active
What it is: Multiple regions which all serve traffic simultaneously
Protection against: Nearly all failure scenarios
Limitations: Data synchronization is hard, potential issues with consistency of data
5. Global distribution with edge computing
What it is: Core services in multiple regions with edge locations for content delivery (CDNs)
Protection against: Nearly all failure scenarios, with cached content served close to users for the best performance
Limitations: Most complex and most costly
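To make the difference between levels 3 and 4 concrete, here is a minimal routing sketch. The `Region` class, the region names, and the round-robin choice are all assumptions made up for illustration; real setups usually do this with DNS or a global load balancer rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str            # hypothetical region label
    healthy: bool = True


def route_active_passive(primary: Region, standby: Region) -> str:
    """Send all traffic to the primary; use the standby only if the primary is down."""
    if primary.healthy:
        return primary.name
    if standby.healthy:
        return standby.name
    raise RuntimeError("no healthy region available")


def route_active_active(regions: list[Region], request_id: int) -> str:
    """Spread traffic across every healthy region (round-robin by request id)."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return healthy[request_id % len(healthy)].name


if __name__ == "__main__":
    east, west = Region("us-east"), Region("eu-west")
    print(route_active_passive(east, west))   # us-east
    east.healthy = False                      # simulate a regional outage
    print(route_active_passive(east, west))   # eu-west
    print(route_active_active([east, west], request_id=7))  # eu-west
```

In the active-passive case the standby does nothing until the primary fails, while in the active-active case a failed region simply drops out of the rotation.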
Challenges from Geographic Distribution
While going multi-region has significant benefits, it can also be complicated and expensive. There are a few architecture decisions that need to be made when distributing data across regions:
1. Replication strategies
Synchronous replication: Writes are confirmed only once they are saved in multiple regions. This gives strong consistency at the cost of higher write latency. If accurate data matters more than fast writes, synchronous replication might be the best move.
Asynchronous replication: Writes are confirmed locally, then propagated to other regions. This gives lower latency and higher availability, but also a higher potential for data loss if the local region fails before a write propagates. If speed matters more than accuracy, asynchronous replication might be the best decision.
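The two write paths can be sketched in a few lines. This is an in-memory toy, not a real database: `Replica`, `Primary`, and `flush` are hypothetical names, and the "network" is just a method call, so it only illustrates when the acknowledgement happens relative to replication.

```python
from collections import deque


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value


class Primary(Replica):
    def __init__(self, name: str, followers: list[Replica]) -> None:
        super().__init__(name)
        self.followers = followers
        self.pending: deque[tuple[str, str]] = deque()  # writes not yet shipped

    def write_sync(self, key: str, value: str) -> str:
        """Acknowledge only after every follower has the write."""
        self.apply(key, value)
        for follower in self.followers:
            follower.apply(key, value)  # in reality: a cross-region round trip
        return "ack"

    def write_async(self, key: str, value: str) -> str:
        """Acknowledge right after the local write; replicate later."""
        self.apply(key, value)
        self.pending.append((key, value))
        return "ack"

    def flush(self) -> None:
        """Ship queued writes to followers (stands in for background replication)."""
        while self.pending:
            key, value = self.pending.popleft()
            for follower in self.followers:
                follower.apply(key, value)


if __name__ == "__main__":
    eu = Replica("eu-west")
    us = Primary("us-east", followers=[eu])
    us.write_async("cart", "3 items")
    print(eu.data.get("cart"))  # None: acknowledged, but not yet replicated
    us.flush()
    print(eu.data.get("cart"))  # 3 items
```

If the primary's region were lost between `write_async` and `flush`, the acknowledged write would be gone, which is the data-loss window that asynchronous replication accepts in exchange for speed.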
2. Consistency models
Strong consistency: Every read reflects all previously acknowledged writes, no matter which replica serves it
Eventual consistency: Given enough time, all replicas will receive all writes, but there is a small window of time where a read may return stale data
3. Data locality and sharding
Geo-sharding: Partition data by geographic relevance (ex. data for customers in Asia is located there)
Data locality: Keeping data closest to where it’s most frequently accessed (similar to geo-sharding, but potentially different based on access patterns)
Follow-the-sun: Moving primary replicas to match current business hours (keeps the lowest latency where most of your users are currently active)
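Here is a rough sketch of how geo-sharding and follow-the-sun could look as lookup functions. The continent-to-region mapping, the region names, and the eight-hour UTC windows are invented for the example; a real system would base them on where its customers and traffic actually are.

```python
from datetime import datetime, timezone

# Hypothetical mapping from a customer's continent to the region holding their data.
REGION_BY_CONTINENT = {
    "asia": "ap-southeast",
    "europe": "eu-west",
    "north_america": "us-east",
}
DEFAULT_REGION = "us-east"

# Assumed follow-the-sun schedule: which region hosts the primary at each UTC hour.
PRIMARY_BY_UTC_HOUR = [
    (range(0, 8), "ap-southeast"),
    (range(8, 16), "eu-west"),
    (range(16, 24), "us-east"),
]


def shard_for(continent: str) -> str:
    """Geo-sharding: pick the region that stores a customer's data."""
    return REGION_BY_CONTINENT.get(continent.lower(), DEFAULT_REGION)


def primary_region_at(now: datetime) -> str:
    """Follow-the-sun: pick the primary region for a timezone-aware timestamp."""
    hour = now.astimezone(timezone.utc).hour
    for hours, region in PRIMARY_BY_UTC_HOUR:
        if hour in hours:
            return region
    raise ValueError(f"no primary configured for hour {hour}")


if __name__ == "__main__":
    print(shard_for("Asia"))  # ap-southeast
    print(primary_region_at(datetime(2025, 3, 24, 9, 0, tzinfo=timezone.utc)))  # eu-west
```

Geo-sharding is a static decision based on who owns the data, while follow-the-sun changes the answer over the course of the day.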
Disaster Recovery Planning
There are a few things to consider when planning for disaster recovery, and to be clear, not planning for disaster recovery is still a method of planning for disaster recovery… It’s just planning to fail when a disaster eventually hits!
1. Recovery objectives
Recovery Time Objective (RTO): Maximum acceptable downtime to prepare for. Examples:
Seconds: Critical financial systems
Minutes: Consumer-facing applications
Hours: Internal business applications
Recovery Point Objective (RPO): Maximum acceptable data loss. Examples:
Zero: Financial transactions
Some: Social media posts
Most: Analytics data
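One way to make RTO and RPO actionable is to compare them against what a given setup can actually deliver. The sketch below assumes a simplified model where the worst-case data loss equals the time between copies to the backup region; `RecoveryPlan` and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    name: str
    replication_interval: timedelta  # how often data reaches the backup region
    estimated_recovery: timedelta    # how long it takes to restore service


def worst_case_data_loss(plan: RecoveryPlan) -> timedelta:
    """A disaster just before the next copy loses up to one full interval of writes."""
    return plan.replication_interval


def meets_objectives(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> bool:
    """True when recovery is fast enough (RTO) and data loss is small enough (RPO)."""
    return plan.estimated_recovery <= rto and worst_case_data_loss(plan) <= rpo


if __name__ == "__main__":
    rto, rpo = timedelta(hours=1), timedelta(minutes=15)
    nightly = RecoveryPlan("nightly backup", timedelta(hours=24), timedelta(hours=4))
    standby = RecoveryPlan("hot standby", timedelta(seconds=5), timedelta(minutes=1))
    print(meets_objectives(nightly, rto, rpo))  # False
    print(meets_objectives(standby, rto, rpo))  # True
```

The nightly backup fails both checks here: up to a day of writes could be lost, and restoring takes longer than the hour of downtime the business can tolerate.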
2. Disaster recovery strategies
Backup and restore: Periodically back up data with the ability to manually recover it if needed
Pilot light: Have a minimal standby environment that can be rapidly scaled
Warm standby: Scaled-down but fully functional copy of production
Hot standby: Full production replica ready to take over instantly
Multi-site active-active: No failover needed, traffic automatically reroutes
3. Disaster recovery tests
Your disaster recovery plan isn’t worth much unless it’s tested regularly, so that it can be put into action quickly when it’s needed. That can be done through the following types of tests:
Tabletop exercises: Teams get together and walk through failure scenarios and how to handle them.
Functional testing: Validate individual recovery components work as expected.
Full failover tests: Simulate actual disasters and recovery.
Chaos engineering: Continuously introduce failures to your environment to build resilience to them.
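As a toy illustration of a failover test in the spirit of chaos engineering, the sketch below takes one random region offline and checks that traffic is still served by another. `Cluster` and `failover_drill` are made-up names and the cluster is simulated in memory; real chaos tooling injects failures into actual infrastructure.

```python
import random


class Cluster:
    def __init__(self, regions: list[str]) -> None:
        self.up = {region: True for region in regions}

    def fail(self, region: str) -> None:
        self.up[region] = False

    def recover(self, region: str) -> None:
        self.up[region] = True

    def serve(self) -> str:
        """Return the first healthy region, or raise if everything is down."""
        for region, healthy in self.up.items():
            if healthy:
                return region
        raise RuntimeError("total outage")


def failover_drill(cluster: Cluster, rng: random.Random) -> tuple[str, str]:
    """Take down one random region, confirm traffic is still served, then restore it."""
    victim = rng.choice(sorted(cluster.up))
    cluster.fail(victim)
    try:
        served_by = cluster.serve()
    finally:
        cluster.recover(victim)  # always undo the injected failure
    return victim, served_by


if __name__ == "__main__":
    cluster = Cluster(["us-east", "eu-west", "ap-southeast"])
    victim, served_by = failover_drill(cluster, random.Random())
    print(f"took down {victim}, traffic served by {served_by}")
```

Running the same drill against a single-region cluster raises `RuntimeError`, which is exactly the kind of gap these tests are meant to surface before a real disaster does.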
Final Thoughts
With modern cloud systems being adopted by companies at every level, disaster recovery is easier to plan for than ever, and it can be critical to business continuity.
Start with your actual business requirements. What does your business actually need for its RTO and RPO? Answering that will keep you from over-engineering or under-protecting your systems.
Thank you!
If you made it this far, then thank you! We are close to wrapping up March (how did that happen??) and I’m still working out what I’ll be focusing on in April. In the meantime, stay tuned for the final installment in the distributed systems series next week!

            Have comments or questions about this newsletter? Or just want to be internet friends? Send me an email at brittany@balancedengineer.com or reach out on Bluesky or LinkedIn.
