Availability in System Design

Learn Availability in System Design with real-world examples. This guide explains High Availability, Single Point of Failure (SPOF), Redundancy, Failover, Multi-Region deployment, Disaster Recovery, SLA, SLO, Load Balancing, and cloud architecture patterns used by Amazon, Netflix, Uber, and Google.

Introduction

Imagine you open:

Amazon to buy a product
Netflix to watch a movie
Uber to book a ride
Your Banking App to transfer money

What if the application shows:

❌ Service Unavailable

Please try again later.

Even a few minutes of downtime can result in:

Millions of dollars in revenue loss
Customer dissatisfaction
Business reputation damage
Regulatory penalties

This is why Availability is one of the most important Non-Functional Requirements in System Design.

Large companies spend millions of dollars every year to ensure their applications remain available 24×7×365.

Learning Objectives

After completing this article, you will understand:

What is Availability?
High Availability (HA)
Downtime Calculation
Single Point of Failure (SPOF)
Redundancy
Failover
Multi-AZ & Multi-Region
Disaster Recovery
SLA, SLO & SLI
Real-World Examples
Best Practices

What is Availability?

Availability is the percentage of time a system is operational and accessible to users.

Formula:

Availability = Uptime / (Uptime + Downtime)

Example:

System Runs 99.99% of the Time

↓

Availability = 99.99%

↓

Users Can Access the Application for All but About 52 Minutes per Year
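To make the formula concrete, here is a minimal sketch that applies it to one month of operation. The 30-day month and the 43 minutes of downtime are made-up numbers chosen for illustration, not measurements from any real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical 30-day month (720 hours) with 43 minutes of downtime.
downtime = 43 / 60
month = availability(720 - downtime, downtime)
print(f"{month:.4%}")
```

With these numbers the result lands just above 99.9%, which shows how little downtime a "three nines" month can tolerate.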

Why Availability Matters

Suppose an online banking application is unavailable.

Users cannot:

Transfer money
Check balances
Pay bills
Apply for loans

Business impact:

Financial loss
Customer frustration
Regulatory violations
Brand damage

Availability Example

flowchart TD
    A[Users]
    B[Banking Application]
    C[(Database)]

    A --> B
    B --> C

If the application server crashes:

Users

↓

Application Down

↓

No Banking Services

Downtime vs Availability

Availability	Downtime per Year
99%	~3.65 Days
99.9%	~8.8 Hours
99.99%	~52 Minutes
99.999%	~5 Minutes

Higher availability requires significantly more engineering effort.
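The downtime figures in the table follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (leap years ignored):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why every step up is so much harder to achieve.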

Real-World Availability Targets

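Company or Industry	Typical Target
Banking	99.99%
Healthcare	99.99%
Netflix	99.99%+
AWS Services	99.99%+
Google Cloud	99.99%+

These are illustrative targets. Actual commitments vary by service, tier, and contract.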

Single Point of Failure (SPOF)

A Single Point of Failure is any component whose failure causes the entire system to stop.

Bad Design

flowchart TD
    A[Users]
    B[Application]
    C[(Database)]

    A --> B
    B --> C

If the application or the database fails, the entire system becomes unavailable.

Removing SPOF

flowchart TD
    A[Users]
    B[Load Balancer]

    C[App Server 1]
    D[App Server 2]
    E[App Server 3]

    F[(Primary Database)]

    A --> B

    B --> C
    B --> D
    B --> E

    C --> F
    D --> F
    E --> F

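Now any one application server can fail while the system continues serving users. Note that the single primary database, and the load balancer itself, are still Single Points of Failure in this diagram. Database redundancy is covered under Database High Availability.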
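The effect of redundancy can be estimated with a little arithmetic. The sketch below assumes that components fail independently, which real systems only approximate, and uses made-up availability figures (99% per app server, 99.9% for the database).

```python
from math import prod


def parallel(availabilities):
    """Redundant components: the group fails only if every member fails."""
    return 1 - prod(1 - a for a in availabilities)


def serial(availabilities):
    """Chained components: the group works only if every member works."""
    return prod(availabilities)


# Three redundant app servers (hypothetical 99% each) in front of
# one database (hypothetical 99.9%).
app_tier = parallel([0.99, 0.99, 0.99])
system = serial([app_tier, 0.999])
print(f"app tier: {app_tier:.6f}, whole system: {system:.6f}")
```

Under these assumptions the app tier becomes very reliable, but the overall number is capped by the single database in the chain. That is the usual sign that a Single Point of Failure remains.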

High Availability (HA)

High Availability means the application continues working even if some components fail.

Characteristics:

Multiple servers
Automatic failover
Redundant databases
Health checks
Load balancing

Load Balancer

flowchart TD
    A[Users]

    B[Application Load Balancer]

    C[Spring Boot 1]

    D[Spring Boot 2]

    E[Spring Boot 3]

    A --> B

    B --> C
    B --> D
    B --> E

Responsibilities:

Distribute traffic
Detect unhealthy instances
Route traffic only to healthy servers
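The three responsibilities above can be sketched in a few lines. This is a toy round-robin balancer for illustration only, not how any particular product is implemented. The server names are placeholders, and in practice a periodic health check would call `mark_unhealthy`.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy load balancer: round-robin over servers currently marked healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._turn = count()

    def mark_unhealthy(self, server):
        self.healthy.discard(server)

    def mark_healthy(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Only healthy servers are candidates for the next request.
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._turn) % len(candidates)]


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [lb.next_server() for _ in range(3)]
lb.mark_unhealthy("app-2")  # e.g. after a failed health check
after = [lb.next_server() for _ in range(4)]
print(before, after)
```

Once `app-2` is marked unhealthy, it no longer receives traffic, and the remaining servers share the load.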

Server Failure Example

Normal State

flowchart TD
    A[Users]
    B[Load Balancer]

    C[App 1]
    D[App 2]

    A --> B
    B --> C
    B --> D

App 1 crashes.

flowchart TD
    A[Users]
    B[Load Balancer]

    D[Healthy App]

    A --> B
    B --> D

Users continue using the application.

Database High Availability

Single Database

flowchart LR
    A[Application]

    B[(Database)]

    A --> B

Database failure = Entire application down.

Primary + Standby Database

flowchart TD
    A[Applications]

    B[(Primary DB)]

    C[(Standby DB)]

    A --> B

    B --> C

If the Primary fails, the Standby becomes the Primary automatically.

AWS Example

Amazon RDS Multi-AZ

Multi Availability Zone (Multi-AZ)

flowchart TD
    A[Application Load Balancer]

    B[Availability Zone A]

    C[Availability Zone B]

    D[App 1]

    E[App 2]

    F[(Primary DB)]

    G[(Standby DB)]

    A --> B
    A --> C

    B --> D
    C --> E

    D --> F
    E --> F
    F --> G

Benefits:

Data center failure protection
Automatic failover
High Availability

Multi-Region Deployment

Large companies deploy across multiple AWS Regions.

flowchart LR
    A[Global Users]

    B[US East]

    C[US West]

    D[Europe]

    A --> B
    A --> C
    A --> D

Benefits:

Regional outage protection
Faster response time
Global availability
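One way to picture multi-region routing is "send each user to the fastest region that is currently healthy". The sketch below is a simplified illustration. The region labels mirror the diagram, and the latency numbers are invented.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies as seen by one user.
latencies = {"us-east": 20, "us-west": 70, "europe": 110}

normal = pick_region(latencies, healthy={"us-east", "us-west", "europe"})
during_outage = pick_region(latencies, healthy={"us-west", "europe"})
print(normal, during_outage)
```

The same rule covers both benefits: users normally get the nearest region, and a regional outage simply removes that region from the candidates.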

Failover

Failover means switching traffic automatically when a component fails.

flowchart LR
    A[Primary Server]

    B[Failure]

    C[Secondary Server]

    A --> B
    B --> C

Users experience minimal downtime.
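At the code level, failover often looks like "try the primary, and if it is unreachable, try the next one". Here is a minimal sketch under that assumption. The `primary` and `secondary` functions are stand-ins for real network calls.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[Callable[[], T]]) -> T:
    """Call each endpoint in order and return the first successful result."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error


def primary():
    # Stand-in for a server that is currently down.
    raise ConnectionError("primary is down")


def secondary():
    return "served by secondary"


result = call_with_failover([primary, secondary])
print(result)
```

The caller never sees the primary's failure. It only notices a slightly slower response, which is the "minimal downtime" described above.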

Active-Active Architecture

flowchart TD
    A[Users]

    B[Load Balancer]

    C[App 1]

    D[App 2]

    A --> B

    B --> C
    B --> D

Both servers serve requests simultaneously.

Advantages

Better utilization
Better scalability
High Availability

Active-Passive Architecture

flowchart TD
    A[Primary Server]

    B[Standby Server]

    A --> B

Only one server handles traffic.

Standby activates during failures.
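A common way to decide when the standby should take over is to count missed heartbeats from the primary. The sketch below assumes a threshold of three consecutive misses. Both the threshold and the node names are illustrative choices.

```python
class ActivePassivePair:
    """Toy active-passive pair: promote the standby after missed heartbeats."""

    def __init__(self, primary: str, standby: str, max_missed: int = 3):
        self.active = primary
        self.standby = standby
        self.max_missed = max_missed
        self.missed = 0

    def record_heartbeat(self, received: bool) -> str:
        """Record one heartbeat interval and return the node serving traffic."""
        if received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed and self.standby is not None:
                # Promote the standby; no further standby remains afterwards.
                self.active, self.standby = self.standby, None
                self.missed = 0
        return self.active


pair = ActivePassivePair("primary", "standby")
timeline = [pair.record_heartbeat(ok) for ok in (True, False, False, False)]
print(timeline)
```

Requiring several consecutive misses avoids promoting the standby because of a single dropped heartbeat.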

Disaster Recovery

Suppose an AWS Region becomes unavailable.

flowchart LR
    A[Primary Region]

    B[Database Replication]

    C[Secondary Region]

    A --> B
    B --> C

Disaster Recovery protects against:

Natural disasters
Region outages
Data center failures

SLA, SLO & SLI

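Term	Stands For	Meaning
SLA	Service Level Agreement	Commitment made to customers, often with penalties or credits if it is missed
SLO	Service Level Objective	Internal target the team aims to meet, usually at least as strict as the SLA
SLI	Service Level Indicator	Actual measured value, such as the observed availability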

Example

SLA

99.99% Uptime

SLI

Measured Availability = 99.992%
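The comparison in this example can be sketched as code. The request counts below are invented so that the measured SLI comes out at 99.992%, matching the example. A request-based SLI is only one possible way to measure availability.

```python
def sli_report(target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured, request-based availability SLI against a target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    return {
        "sli": sli,
        "target_met": sli >= target,
        # Failures the target tolerates over this many requests
        # (sometimes called the error budget).
        "allowed_failures": (1 - target) * total_requests,
    }


# Hypothetical month: 1,000,000 requests, 80 of them failed.
report = sli_report(target=0.9999, total_requests=1_000_000, failed_requests=80)
print(report)
```

Here the measured value is above the 99.99% target, so the commitment is met with some room to spare.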

Real-Time Banking Architecture

flowchart TD
    A[Customers]

    B[AWS WAF]

    C[Load Balancer]

    D[API Gateway]

    E[Spring Boot Services]

    F[Redis]

    G[(Amazon RDS)]

    H[Kafka]

    I[Notification Service]

    A --> B
    B --> C
    C --> D
    D --> E

    E --> F
    E --> G
    E --> H

    H --> I

Real-World Example — Netflix

Netflix uses:

Multiple AWS Regions
Thousands of EC2 instances
Load Balancers
Auto Scaling
Chaos Engineering
CDN

If one server fails, users continue watching movies.

Real-World Example — Amazon

Amazon uses:

Stateless Services
Multiple Availability Zones
Database Replication
Auto Scaling
Distributed Caching

Customers continue shopping even during infrastructure failures.

Real-World Example — Banking

When you transfer money:

Mobile App

↓

API Gateway

↓

Payment Service

↓

Core Banking

↓

Database

↓

Notification

If one payment server crashes, the load balancer routes traffic to another server, and transaction processing continues.

Monitoring Availability

Monitor:

Uptime
Health Checks
Error Rate
CPU
Memory
Database Health
API Latency
Load Balancer Status

Tools

Amazon CloudWatch
Datadog
Grafana
Prometheus
Splunk
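Whichever tool is used, the individual signals usually get rolled up into a single health status that a load balancer or dashboard can read. A minimal sketch of that idea follows. The component names and check functions are placeholders, not part of any tool listed above.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each component check and roll the results up into one status."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "UP" if check() else "DOWN"
        except Exception:
            # A check that raises is treated the same as a failed check.
            components[name] = "DOWN"
    overall = "UP" if all(s == "UP" for s in components.values()) else "DOWN"
    return {"status": overall, "components": components}


def database_check() -> bool:
    # Stand-in for a real connectivity probe that currently fails.
    raise TimeoutError("database did not respond")


report = health_report({
    "api": lambda: True,
    "cache": lambda: True,
    "database": database_check,
})
print(report)
```

A single failing dependency marks the whole instance as DOWN here. Real systems often distinguish critical from optional dependencies, which this sketch does not attempt.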

Common Causes of Downtime

Server Crash
Database Failure
Network Outage
Memory Leak
Disk Full
Configuration Error
Deployment Failure
DNS Issues
Cloud Region Failure

Best Practices

Remove every Single Point of Failure.
Use multiple application instances.
Deploy across multiple Availability Zones.
Enable automatic health checks.
Use Multi-AZ databases.
Implement graceful failover.
Design stateless services.
Monitor system health continuously.
Perform disaster recovery testing.
Automate infrastructure recovery.

Common Interview Questions

What is Availability?

Availability is the percentage of time a system remains operational and accessible to users.

What is High Availability?

High Availability (HA) is the ability of a system to continue serving users even when one or more components fail.

What is a Single Point of Failure?

A Single Point of Failure (SPOF) is any component whose failure causes the entire application to become unavailable.

What is the difference between Active-Active and Active-Passive?

Active-Active	Active-Passive
All servers handle traffic	Only one server handles traffic
Better utilization	Simpler failover
Higher scalability	Lower infrastructure utilization

Why do banks deploy applications across multiple Availability Zones?

To ensure services remain available even if an entire data center experiences an outage.

Summary

In this article, we explored Availability, one of the most important concepts in System Design.

We covered:

Availability fundamentals
Downtime calculation
High Availability
Single Point of Failure
Redundancy
Load Balancing
Database failover
Multi-AZ deployment
Multi-Region architecture
Disaster Recovery
SLA, SLO, and SLI
Real-world banking, Amazon, and Netflix examples
Best practices

Designing for high availability means expecting failures and building systems that continue operating despite them. Modern cloud-native architectures achieve this through redundancy, failover, health checks, distributed deployments, and continuous monitoring.
