Availability in System Design
Learn Availability in System Design with real-world examples. This guide explains High Availability, Single Point of Failure (SPOF), Redundancy, Failover, Multi-Region deployment, Disaster Recovery, SLA, SLO, Load Balancing, and cloud architecture patterns used by Amazon, Netflix, Uber, and Google.
Introduction
Imagine you open:
- Amazon to buy a product
- Netflix to watch a movie
- Uber to book a ride
- Your Banking App to transfer money
What if the application shows:
❌ Service Unavailable
Please try again later.
Even a few minutes of downtime can result in:
- Millions of dollars in revenue loss
- Customer dissatisfaction
- Business reputation damage
- Regulatory penalties
This is why Availability is one of the most important Non-Functional Requirements in System Design.
Large companies spend millions of dollars every year to ensure their applications remain available 24×7×365.
Learning Objectives
After completing this article, you will understand:
- What is Availability?
- High Availability (HA)
- Downtime Calculation
- Single Point of Failure (SPOF)
- Redundancy
- Failover
- Multi-AZ & Multi-Region
- Disaster Recovery
- SLA, SLO & SLI
- Real-World Examples
- Best Practices
What is Availability?
Availability is the percentage of time a system is operational and accessible to users.
Formula:
Availability = Uptime / (Uptime + Downtime)
Example:
System Running
↓
99.99%
↓
Users Can Access Application
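As a minimal sketch, the formula can be written as a small Python function. The 30-day month and the 43 minutes of downtime below are hypothetical numbers chosen only for illustration.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical month: 30 days (720 hours) with 43 minutes of downtime.
downtime = 43 / 60
monthly = availability(720 - downtime, downtime)
print(f"Availability this month: {monthly:.3f}%")
```

The same function works for any unit of time, as long as uptime and downtime use the same one.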
Why Availability Matters
Suppose an online banking application is unavailable.
Users cannot:
- Transfer money
- Check balances
- Pay bills
- Apply for loans
Business impact:
- Financial loss
- Customer frustration
- Regulatory violations
- Brand damage
Availability Example
flowchart TD
A[Users]
B[Banking Application]
C[(Database)]
A --> B
B --> C
If the application server crashes:
Users
↓
Application Down
↓
No Banking Services
Downtime vs Availability
| Availability | Downtime per Year |
|---|---|
| 99% | ~3.65 Days |
| 99.9% | ~8.76 Hours |
| 99.99% | ~52.6 Minutes |
| 99.999% | ~5.26 Minutes |
Each additional nine cuts the allowed downtime by a factor of ten, so higher availability requires significantly more engineering effort.
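The downtime figures in the table follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (leap years are ignored):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```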
Real-World Availability Targets
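The figures below are commonly cited targets; actual commitments vary by service and contract.
| Organization or Sector | Typical Target |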
|---|---|
| Banking | 99.99% |
| Healthcare | 99.99% |
| Netflix | 99.99%+ |
| AWS Services | 99.99%+ |
| Google Cloud | 99.99%+ |
Single Point of Failure (SPOF)
A Single Point of Failure is any component whose failure causes the entire system to stop.
Bad Design
flowchart TD
A[Users]
B[Application]
C[(Database)]
A --> B
B --> C
If the application server or the database fails:
Entire system becomes unavailable.
Removing SPOF
flowchart TD
A[Users]
B[Load Balancer]
C[App Server 1]
D[App Server 2]
E[App Server 3]
F[(Primary Database)]
A --> B
B --> C
B --> D
B --> E
C --> F
D --> F
E --> F
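Now one app server can fail while the system continues serving users. Note that the load balancer and the single primary database in this diagram are still single points of failure, so they need redundancy as well.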
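The effect of redundancy can be estimated with simple probability. This is a rough sketch that assumes component failures are independent, which real outages often are not; the availability values for `app` and `db` are hypothetical.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """At least one of `copies` identical, independent replicas must be up."""
    return 1 - (1 - availability) ** copies


app = 0.99   # hypothetical availability of one app server
db = 0.999   # hypothetical availability of the single database

single = serial(app, db)
redundant = serial(parallel(app, 3), db)
print(f"one app server:    {single:.4%}")
print(f"three app servers: {redundant:.4%}")
```

Under these assumptions, adding app servers helps a great deal, but the overall figure can never exceed the availability of the single database behind them.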
High Availability (HA)
High Availability means the application continues working even if some components fail.
Characteristics:
- Multiple servers
- Automatic failover
- Redundant databases
- Health checks
- Load balancing
Load Balancer
flowchart TD
A[Users]
B[Application Load Balancer]
C[Spring Boot 1]
D[Spring Boot 2]
E[Spring Boot 3]
A --> B
B --> C
B --> D
B --> E
Responsibilities:
- Distribute traffic
- Detect unhealthy instances
- Route traffic only to healthy servers
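The three responsibilities above can be sketched as a tiny in-memory round-robin balancer. The class name `RoundRobinBalancer` and the server names are hypothetical; a real load balancer such as an AWS Application Load Balancer does this at the network level with its own health probes.

```python
from itertools import count


class RoundRobinBalancer:
    """Distributes requests in turn across servers currently marked healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._counter = count()

    def mark_unhealthy(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_healthy(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self) -> str:
        # Only healthy servers are candidates for the next request.
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["spring-boot-1", "spring-boot-2", "spring-boot-3"])
print([lb.next_server() for _ in range(3)])
lb.mark_unhealthy("spring-boot-2")
print([lb.next_server() for _ in range(4)])
```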
Server Failure Example
Normal State
flowchart TD
A[Users]
B[Load Balancer]
C[App 1]
D[App 2]
A --> B
B --> C
B --> D
App 1 crashes.
flowchart TD
A[Users]
B[Load Balancer]
D[Healthy App]
A --> B
B --> D
Users continue using the application.
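How the load balancer decides that App 1 has crashed is usually a matter of repeated health probes. The sketch below assumes a simple rule, three consecutive failed probes, which is an illustrative threshold and not a setting taken from any particular product.

```python
class HealthTracker:
    """Marks a server unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = {}

    def record(self, server: str, probe_ok: bool) -> None:
        if probe_ok:
            self.failures[server] = 0  # any success resets the streak
        else:
            self.failures[server] = self.failures.get(server, 0) + 1

    def is_healthy(self, server: str) -> bool:
        return self.failures.get(server, 0) < self.threshold


tracker = HealthTracker(threshold=3)
for ok in (True, False, False, False):  # App 1 stops answering probes
    tracker.record("app-1", ok)
tracker.record("app-2", True)

in_rotation = [s for s in ("app-1", "app-2") if tracker.is_healthy(s)]
print(in_rotation)
```

Requiring several consecutive failures avoids removing a server because of one slow response.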
Database High Availability
Single Database
flowchart LR
A[Application]
B[(Database)]
A --> B
Database failure = Entire application down.
Primary + Standby Database
flowchart TD
A[Applications]
B[(Primary DB)]
C[(Standby DB)]
A --> B
B --> C
If Primary fails:
Standby becomes Primary automatically.
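The promotion step can be illustrated with a small in-memory model. `DbNode` and `DbCluster` are hypothetical names for this sketch; managed services also have to handle replication lag and fencing of the old primary, which are ignored here.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str
    role: str  # "primary" or "standby"
    alive: bool = True


class DbCluster:
    def __init__(self, primary: DbNode, standby: DbNode):
        self.nodes = [primary, standby]

    def primary(self) -> DbNode:
        return next(n for n in self.nodes if n.role == "primary")

    def check_and_failover(self) -> bool:
        """Promote the standby if the primary is down. Returns True on failover."""
        current = self.primary()
        if current.alive:
            return False
        standby = next(
            (n for n in self.nodes if n.role == "standby" and n.alive), None
        )
        if standby is None:
            raise RuntimeError("no healthy standby to promote")
        current.role, standby.role = "standby", "primary"
        return True


cluster = DbCluster(DbNode("db-a", "primary"), DbNode("db-b", "standby"))
cluster.nodes[0].alive = False          # simulate a primary failure
failed_over = cluster.check_and_failover()
print(failed_over, cluster.primary().name)
```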
AWS Example
- Amazon RDS Multi-AZ
Multi Availability Zone (Multi-AZ)
flowchart TD
A[Application Load Balancer]
B[Availability Zone A]
C[Availability Zone B]
D[App 1]
E[App 2]
F[(Primary DB)]
G[(Standby DB)]
A --> B
A --> C
B --> D
C --> E
D --> F
E --> F
F --> G
Benefits:
- Data center failure protection
- Automatic failover
- High Availability
Multi-Region Deployment
Large companies deploy across multiple AWS Regions.
flowchart LR
A[Global Users]
B[US East]
C[US West]
D[Europe]
A --> B
A --> C
A --> D
Benefits:
- Regional outage protection
- Faster response time
- Global availability
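One way to picture how multi-region routing gives both outage protection and faster responses is a latency-based choice among healthy regions. The function `pick_region` and the latency numbers are assumptions made for this sketch; in practice this decision is typically made by DNS or a global traffic manager.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Choose the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies seen by a user located in Europe.
latencies = {"us-east": 95.0, "us-west": 150.0, "europe": 20.0}

normal = pick_region(latencies, healthy={"us-east", "us-west", "europe"})
during_outage = pick_region(latencies, healthy={"us-east", "us-west"})
print(normal, during_outage)
```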
Failover
Failover means switching traffic automatically when a component fails.
flowchart LR
A[Primary Server]
B[Failure]
C[Secondary Server]
A --> B
B --> C
Users experience minimal downtime.
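From the caller's side, failover can be as simple as trying the next endpoint when the first one is unreachable. In this sketch `primary` and `secondary` are stand-in functions, not real network calls.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[Callable[[], str]]) -> str:
    """Try each endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error


def primary() -> str:
    raise ConnectionError("primary server is down")


def secondary() -> str:
    return "handled by secondary"


print(call_with_failover([primary, secondary]))
```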
Active-Active Architecture
flowchart TD
A[Users]
B[Load Balancer]
C[App 1]
D[App 2]
A --> B
B --> C
B --> D
Both servers serve requests simultaneously.
Advantages
- Better utilization
- Better scalability
- High Availability
Active-Passive Architecture
flowchart TD
A[Primary Server]
B[Standby Server]
A --> B
Only one server handles traffic.
Standby activates during failures.
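A common way for the standby to know when to activate is a heartbeat timeout. The sketch below uses explicit timestamps so it runs deterministically; the class `StandbyServer` and the 5-second timeout are hypothetical.

```python
class StandbyServer:
    """Passive node that promotes itself when the primary's heartbeats stop."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        # Activate once no heartbeat has arrived within the timeout window.
        if not self.active and now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active


standby = StandbyServer(timeout_s=5.0)
standby.on_heartbeat(now=0.0)
standby.on_heartbeat(now=4.0)    # last heartbeat before the primary fails
print(standby.tick(now=8.0))     # still within the timeout
print(standby.tick(now=10.0))    # timeout exceeded, standby takes over
```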
Disaster Recovery
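Suppose an AWS Region becomes unavailable. If data is replicated to a secondary region, traffic can be redirected there while the primary region recovers.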
flowchart LR
A[Primary Region]
B[Database Replication]
C[Secondary Region]
A --> B
B --> C
Disaster Recovery protects against:
- Natural disasters
- Region outages
- Data center failures
SLA, SLO & SLI
| Term | Meaning |
|---|---|
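| SLA | Service Level Agreement: the availability commitment made to customers, often with penalties if it is missed |
| SLO | Service Level Objective: the internal target the team aims for, usually stricter than the SLA |
| SLI | Service Level Indicator: the metric that is actually measured, such as the percentage of successful requests |
Example
SLA
99.99% Uptime
SLI
Measured Availability = 99.992%
Here the measured SLI (99.992%) is above the SLA (99.99%), so the commitment is met.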
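One common way to compute an availability SLI is the share of successful requests. The request counts below are hypothetical and chosen to reproduce the 99.992% figure in the example.

```python
def availability_sli(successful: int, total: int) -> float:
    """Percentage of requests that were served successfully."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return 100.0 * successful / total


SLA_TARGET = 99.99  # percent, from the example above

sli = availability_sli(successful=999_920, total=1_000_000)
print(f"SLI = {sli:.3f}%, meets SLA: {sli >= SLA_TARGET}")
```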
Real-Time Banking Architecture
flowchart TD
A[Customers]
B[AWS WAF]
C[Load Balancer]
D[API Gateway]
E[Spring Boot Services]
F[Redis]
G[(Amazon RDS)]
H[Kafka]
I[Notification Service]
A --> B
B --> C
C --> D
D --> E
E --> F
E --> G
E --> H
H --> I
Real-World Example — Netflix
Netflix uses:
- Multiple AWS Regions
- Thousands of EC2 instances
- Load Balancers
- Auto Scaling
- Chaos Engineering
- CDN
If one server fails, users continue watching movies.
Real-World Example — Amazon
Amazon uses:
- Stateless Services
- Multiple Availability Zones
- Database Replication
- Auto Scaling
- Distributed Caching
Customers continue shopping even during infrastructure failures.
Real-World Example — Banking
When you transfer money:
Mobile App
↓
API Gateway
↓
Payment Service
↓
Core Banking
↓
Database
↓
Notification
If one payment server crashes, the load balancer routes traffic to another server and transaction processing continues.
Monitoring Availability
Monitor:
- Uptime
- Health Checks
- Error Rate
- CPU
- Memory
- Database Health
- API Latency
- Load Balancer Status
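Monitoring becomes actionable when each metric is compared against a threshold. This is a minimal sketch; the metric names and limits in `THRESHOLDS` are hypothetical and would normally live in the alerting rules of a monitoring tool.

```python
# Hypothetical alert limits; tune these for the actual system.
THRESHOLDS = {
    "error_rate_pct": 1.0,
    "cpu_pct": 85.0,
    "memory_pct": 90.0,
    "p99_latency_ms": 500.0,
}


def breached(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


sample = {
    "error_rate_pct": 2.5,
    "cpu_pct": 40.0,
    "memory_pct": 95.0,
    "p99_latency_ms": 120.0,
}
alerts = breached(sample, THRESHOLDS)
print(alerts)
```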
Tools
- Amazon CloudWatch
- Datadog
- Grafana
- Prometheus
- Splunk
Common Causes of Downtime
- Server Crash
- Database Failure
- Network Outage
- Memory Leak
- Disk Full
- Configuration Error
- Deployment Failure
- DNS Issues
- Cloud Region Failure
Best Practices
- Remove every Single Point of Failure.
- Use multiple application instances.
- Deploy across multiple Availability Zones.
- Enable automatic health checks.
- Use Multi-AZ databases.
- Implement graceful failover.
- Design stateless services.
- Monitor system health continuously.
- Perform disaster recovery testing.
- Automate infrastructure recovery.
Common Interview Questions
What is Availability?
Availability is the percentage of time a system remains operational and accessible to users.
What is High Availability?
High Availability (HA) is the ability of a system to continue serving users even when one or more components fail.
What is a Single Point of Failure?
A Single Point of Failure (SPOF) is any component whose failure causes the entire application to become unavailable.
What is the difference between Active-Active and Active-Passive?
| Active-Active | Active-Passive |
|---|---|
| All servers handle traffic | Only one server handles traffic |
| Better utilization | Simpler failover |
| Higher scalability | Lower infrastructure utilization |
Why do banks deploy applications across multiple Availability Zones?
To ensure services remain available even if an entire Availability Zone, which consists of one or more data centers, experiences an outage.
Summary
In this article, we explored Availability, one of the most important concepts in System Design.
We covered:
- Availability fundamentals
- Downtime calculation
- High Availability
- Single Point of Failure
- Redundancy
- Load Balancing
- Database failover
- Multi-AZ deployment
- Multi-Region architecture
- Disaster Recovery
- SLA, SLO, and SLI
- Real-world banking, Amazon, and Netflix examples
- Best practices
Designing for high availability means expecting failures and building systems that continue operating despite them. Modern cloud-native architectures achieve this through redundancy, failover, health checks, distributed deployments, and continuous monitoring.