
Availability in System Design

Learn Availability in System Design with real-world examples. This guide explains High Availability, Single Point of Failure (SPOF), Redundancy, Failover, Multi-Region deployment, Disaster Recovery, SLA, SLO, Load Balancing, and cloud architecture patterns used by Amazon, Netflix, Uber, and Google.


Introduction

Imagine you open:

  • Amazon to buy a product
  • Netflix to watch a movie
  • Uber to book a ride
  • Your Banking App to transfer money

What if the application shows:

❌ Service Unavailable

Please try again later.

Even a few minutes of downtime can result in:

  • Millions of dollars in revenue loss
  • Customer dissatisfaction
  • Business reputation damage
  • Regulatory penalties

This is why Availability is one of the most important Non-Functional Requirements in System Design.

Large companies spend millions of dollars every year to keep their applications available around the clock, every day of the year.


Learning Objectives

After completing this article, you will understand:

  • What is Availability?
  • High Availability (HA)
  • Downtime Calculation
  • Single Point of Failure (SPOF)
  • Redundancy
  • Failover
  • Multi-AZ & Multi-Region
  • Disaster Recovery
  • SLA, SLO & SLI
  • Real-World Examples
  • Best Practices

What is Availability?

Availability is the percentage of time a system is operational and accessible to users.

Formula:

Availability (%) = Uptime / (Uptime + Downtime) × 100

Example:

System Running

↓

99.99%

↓

Users Can Access Application
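
The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `availability_percent` and the sample figures (a 720-hour month with one hour of downtime) are illustrative assumptions, not values from a real system.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


# Assumed example: a 30-day month (720 hours) with 1 hour of downtime.
print(round(availability_percent(719, 1), 3))  # 99.861
```

One hour of downtime in a month already drops the system below "three nines" (99.9%).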

Why Availability Matters

Suppose an online banking application is unavailable.

Users cannot:

  • Transfer money
  • Check balances
  • Pay bills
  • Apply for loans

Business impact:

  • Financial loss
  • Customer frustration
  • Regulatory violations
  • Brand damage

Availability Example

flowchart TD
    A[Users]
    B[Banking Application]
    C[(Database)]

    A --> B
    B --> C

If the application server crashes:

Users

↓

Application Down

↓

No Banking Services

Downtime vs Availability

| Availability | Downtime per Year |
|--------------|-------------------|
| 99%          | ~3.65 days        |
| 99.9%        | ~8.76 hours       |
| 99.99%       | ~52.6 minutes     |
| 99.999%      | ~5.26 minutes     |

Each additional "nine" cuts the allowed downtime by a factor of ten, so higher availability requires significantly more engineering effort.
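
The downtime figures follow directly from the availability percentage. The sketch below assumes a 365-day year (8,760 hours) and ignores leap years; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year the system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```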


Real-World Availability Targets

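| Sector or Provider | Typical Target |
|--------------------|----------------|
| Banking            | 99.99%         |
| Healthcare         | 99.99%         |
| Netflix            | 99.99%+        |
| AWS Services       | 99.99%+        |
| Google Cloud       | 99.99%+        |

These are commonly cited targets. Actual commitments vary by service and by contract.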

Single Point of Failure (SPOF)

A Single Point of Failure is any component whose failure causes the entire system to stop.

Bad Design

flowchart TD
    A[Users]
    B[Application]
    C[(Database)]

    A --> B
    B --> C

If the application server or the database fails, the entire system becomes unavailable.


Removing SPOF

flowchart TD
    A[Users]
    B[Load Balancer]

    C[App Server 1]
    D[App Server 2]
    E[App Server 3]

    F[(Primary Database)]

    A --> B

    B --> C
    B --> D
    B --> E

    C --> F
    D --> F
    E --> F

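Now any one application server can fail while the system continues serving users. The load balancer and the single primary database remain single points of failure in this design until they are made redundant as well.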
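
The effect of redundancy can be estimated with two standard formulas: components in series multiply their availabilities, while redundant components in parallel fail only when all of them fail. The sketch below assumes failures are independent, and the per-component availabilities are made-up figures for illustration.

```python
from math import prod


def parallel(availabilities):
    """Redundant components: the group is down only if every member is down."""
    return 1 - prod(1 - a for a in availabilities)


def series(availabilities):
    """Chained components: the path is up only if every member is up."""
    return prod(availabilities)


# Assumed availabilities, for illustration only.
single_server = 0.99
load_balancer = 0.9999
database = 0.999

app_tier = parallel([single_server] * 3)          # three redundant app servers
system = series([load_balancer, app_tier, database])

print(f"{app_tier:.6f}")  # 0.999999
print(f"{system:.4f}")    # 0.9989
```

Under these assumptions the three-server tier is far more available than any single server, but the whole system is still capped by its least available non-redundant component, here the database.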


High Availability (HA)

High Availability means the application continues working even if some components fail.

Characteristics:

  • Multiple servers
  • Automatic failover
  • Redundant databases
  • Health checks
  • Load balancing

Load Balancer

flowchart TD
    A[Users]

    B[Application Load Balancer]

    C[Spring Boot 1]

    D[Spring Boot 2]

    E[Spring Boot 3]

    A --> B

    B --> C
    B --> D
    B --> E

Responsibilities:

  • Distribute traffic
  • Detect unhealthy instances
  • Route traffic only to healthy servers
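
These three responsibilities can be sketched as a toy round-robin balancer. This is not how any particular product is implemented; the class name, method names, and backend names are hypothetical.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy load balancer: rotates requests across healthy backends only."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark(self, backend, is_healthy):
        """Record the result of a health check for one backend."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend in rotation."""
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark("app-2", False)                # a health check reported app-2 as down
print([lb.pick() for _ in range(4)])   # ['app-1', 'app-3', 'app-1', 'app-3']
```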

Server Failure Example

Normal State

flowchart TD
    A[Users]
    B[Load Balancer]

    C[App 1]
    D[App 2]

    A --> B
    B --> C
    B --> D

App 1 crashes.

flowchart TD
    A[Users]
    B[Load Balancer]

    D[Healthy App]

    A --> B
    B --> D

Users continue using the application.
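
For the load balancer to stop routing to App 1, it first has to decide that App 1 is unhealthy. A common approach is to require several consecutive failed probes before removing an instance, and several consecutive successes before adding it back, so that one slow response does not cause flapping. The sketch below illustrates that idea; the class name and the threshold values are assumptions.

```python
class HealthTracker:
    """Flip health state only after enough consecutive contradicting probes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with current state

    def record(self, probe_ok):
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [False, False, True, False, False, False, True, True]
print([tracker.record(ok) for ok in probes])
# [True, True, True, True, True, False, False, True]
```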


Database High Availability

Single Database

flowchart LR
    A[Application]

    B[(Database)]

    A --> B

Database failure = Entire application down.


Primary + Standby Database

flowchart TD
    A[Applications]

    B[(Primary DB)]

    C[(Standby DB)]

    A --> B

    B --> C

If Primary fails:

Standby becomes Primary automatically.

AWS Example

  • Amazon RDS Multi-AZ
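
The promotion step can be sketched as follows. This is a simplified illustration of the primary/standby idea, not a description of how Amazon RDS performs failover internally; the class, function, and node names are hypothetical.

```python
class DatabaseNode:
    def __init__(self, name, role, reachable=True):
        self.name = name
        self.role = role            # "primary" or "standby"
        self.reachable = reachable


def failover_if_needed(nodes):
    """Return the node that should receive writes, promoting a standby if needed."""
    primary = next(n for n in nodes if n.role == "primary")
    if primary.reachable:
        return primary
    standby = next((n for n in nodes if n.role == "standby" and n.reachable), None)
    if standby is None:
        raise RuntimeError("primary is down and no reachable standby exists")
    # Swap roles: the standby takes over, the failed primary is demoted.
    primary.role, standby.role = "standby", "primary"
    return standby


cluster = [DatabaseNode("db-a", "primary"), DatabaseNode("db-b", "standby")]
cluster[0].reachable = False              # simulate a primary failure
print(failover_if_needed(cluster).name)   # db-b
```

In a real deployment the application usually connects through a stable endpoint name, so the role swap is invisible to application code apart from a brief interruption.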

Multi Availability Zone (Multi-AZ)

flowchart TD
    A[Application Load Balancer]

    B[Availability Zone A]

    C[Availability Zone B]

    D[App 1]

    E[App 2]

    F[(Primary DB)]

    G[(Standby DB)]

    A --> B
    A --> C

    B --> D
    C --> E

    D --> F
    E --> F
    F -->|Replication| G

Benefits:

  • Data center failure protection
  • Automatic failover
  • High Availability
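
A simple way to reason about Multi-AZ placement is to check whether enough instances would survive the loss of any one zone. The sketch below does that for a mapping of instance names to zone names; all names and the `min_instances` parameter are illustrative assumptions.

```python
from collections import Counter


def survives_single_az_loss(instance_zones, min_instances=1):
    """True if at least min_instances remain after losing any single zone."""
    per_zone = Counter(instance_zones.values())
    if not per_zone:
        return False
    total = sum(per_zone.values())
    return all(total - n >= min_instances for n in per_zone.values())


spread = {"app-1": "az-a", "app-2": "az-b"}
stacked = {"app-1": "az-a", "app-2": "az-a"}
print(survives_single_az_loss(spread))   # True
print(survives_single_az_loss(stacked))  # False
```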

Multi-Region Deployment

Large companies deploy across multiple AWS Regions.

flowchart LR
    A[Global Users]

    B[US East]

    C[US West]

    D[Europe]

    A --> B
    A --> C
    A --> D

Benefits:

  • Regional outage protection
  • Faster response time
  • Global availability
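
Both benefits, outage protection and faster response time, come from routing each user to the closest region that is currently healthy. The sketch below shows that decision in isolation; the region labels and latency numbers are made up, and real systems typically make this choice at the DNS or global load balancer layer.

```python
def choose_region(latency_ms, healthy_regions):
    """Pick the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy_regions}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Assumed latencies from one user's location, in milliseconds.
latency = {"us-east": 20, "us-west": 80, "europe": 110}

print(choose_region(latency, {"us-east", "us-west", "europe"}))  # us-east
print(choose_region(latency, {"us-west", "europe"}))             # us-west
```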

Failover

Failover means switching traffic automatically when a component fails.

flowchart LR
    A[Primary Server]

    B[Failure]

    C[Secondary Server]

    A --> B
    B --> C

Users experience minimal downtime.
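
Failover can also happen on the caller's side: try the primary, and fall through to the secondary when the connection fails. The sketch below uses plain Python callables to stand in for servers; the function names are hypothetical.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error


def primary_server(request):
    raise ConnectionError("primary is down")  # simulated failure


def secondary_server(request):
    return f"secondary handled {request}"


print(call_with_failover([primary_server, secondary_server], "GET /balance"))
# secondary handled GET /balance
```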


Active-Active Architecture

flowchart TD
    A[Users]

    B[Load Balancer]

    C[App 1]

    D[App 2]

    A --> B

    B --> C
    B --> D

Both servers serve requests simultaneously.

Advantages

  • Better utilization
  • Better scalability
  • High Availability
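
Active-Active only delivers high availability if the surviving servers can absorb the load of a failed one. A rough capacity check, assuming load is spread evenly and all servers are the same size, is sketched below; the function name is illustrative.

```python
def max_safe_utilization(nodes, tolerated_failures=1):
    """Highest average utilization at which survivors can absorb the lost load."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


print(max_safe_utilization(2))            # 0.5
print(round(max_safe_utilization(4), 2))  # 0.75
```

With two servers, each should run at no more than about half its capacity; adding servers raises the safe utilization.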

Active-Passive Architecture

flowchart TD
    A[Primary Server]

    B[Standby Server]

    A --> B

Only one server handles traffic.

Standby activates during failures.


Disaster Recovery

Suppose an entire AWS Region becomes unavailable. If data is continuously replicated to a secondary Region, traffic can be redirected there.

flowchart LR
    A[Primary Region]

    B[Database Replication]

    C[Secondary Region]

    A --> B
    B --> C

Disaster Recovery protects against:

  • Natural disasters
  • Region outages
  • Data center failures
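
Because cross-region replication is usually asynchronous, the secondary Region may be slightly behind the primary at the moment of failure. The sketch below estimates that gap from two timestamps; the function name, the timestamps, and the one-minute tolerance are assumptions for illustration.

```python
from datetime import datetime, timedelta


def potential_data_loss(last_write_at, last_replicated_at):
    """Time window of writes that had not yet reached the secondary region."""
    return max(last_write_at - last_replicated_at, timedelta(0))


loss = potential_data_loss(
    datetime(2024, 1, 1, 12, 0, 30),  # last write accepted by the primary
    datetime(2024, 1, 1, 12, 0, 0),   # last write confirmed on the secondary
)
print(loss)                           # 0:00:30
print(loss <= timedelta(minutes=1))   # True, within the assumed tolerance
```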

SLA, SLO & SLI

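| Term | Stands for              | Meaning                                                               |
|------|-------------------------|-----------------------------------------------------------------------|
| SLA  | Service Level Agreement | A commitment made to customers, often with penalties if it is missed  |
| SLO  | Service Level Objective | An internal target the team aims to meet                              |
| SLI  | Service Level Indicator | The metric that is actually measured                                  |

Example

SLA

99.99% Uptime

SLI

Measured Availability = 99.992%

Here the measured SLI (99.992%) is above the SLA commitment (99.99%), so the agreement is met.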
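
One common way to obtain an availability SLI is the ratio of successful requests to total requests. The sketch below computes it and compares it with a target; the request counts are invented so that the result matches the 99.992% figure in the example.

```python
def measured_availability(successful_requests, total_requests):
    """Availability SLI as the percentage of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * successful_requests / total_requests


def meets_target(sli_percent, target_percent):
    """True when the measured SLI is at or above the target."""
    return sli_percent >= target_percent


# Assumed counts: 80 failed requests out of one million.
sli = measured_availability(999_920, 1_000_000)
print(round(sli, 3))             # 99.992
print(meets_target(sli, 99.99))  # True
```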

Real-Time Banking Architecture

flowchart TD
    A[Customers]

    B[AWS WAF]

    C[Load Balancer]

    D[API Gateway]

    E[Spring Boot Services]

    F[Redis]

    G[(Amazon RDS)]

    H[Kafka]

    I[Notification Service]

    A --> B
    B --> C
    C --> D
    D --> E

    E --> F
    E --> G
    E --> H

    H --> I

Real-World Example — Netflix

Netflix uses:

  • Multiple AWS Regions
  • Thousands of EC2 instances
  • Load Balancers
  • Auto Scaling
  • Chaos Engineering
  • CDN

If one server fails, users continue watching movies.


Real-World Example — Amazon

Amazon uses:

  • Stateless Services
  • Multiple Availability Zones
  • Database Replication
  • Auto Scaling
  • Distributed Caching

Customers continue shopping even during infrastructure failures.


Real-World Example — Banking

When you transfer money:

Mobile App

↓

API Gateway

↓

Payment Service

↓

Core Banking

↓

Database

↓

Notification

If one payment server crashes, the Load Balancer routes traffic to another server and transaction processing continues.


Monitoring Availability

Monitor:

  • Uptime
  • Health Checks
  • Error Rate
  • CPU
  • Memory
  • Database Health
  • API Latency
  • Load Balancer Status
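
Two of these signals, error rate and API latency, are easy to derive from raw request data. The sketch below treats HTTP 5xx responses as errors and uses the nearest-rank method for percentiles; both choices, and the sample data, are assumptions.

```python
import math


def error_rate(status_codes):
    """Fraction of responses that were server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


statuses = [200, 200, 500, 503]
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(error_rate(statuses))          # 0.5
print(percentile(latencies_ms, 95))  # 100
```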

Tools

  • Amazon CloudWatch
  • Datadog
  • Grafana
  • Prometheus
  • Splunk

Common Causes of Downtime

  • Server Crash
  • Database Failure
  • Network Outage
  • Memory Leak
  • Disk Full
  • Configuration Error
  • Deployment Failure
  • DNS Issues
  • Cloud Region Failure

Best Practices

  • Remove every Single Point of Failure.
  • Use multiple application instances.
  • Deploy across multiple Availability Zones.
  • Enable automatic health checks.
  • Use Multi-AZ databases.
  • Implement graceful failover.
  • Design stateless services.
  • Monitor system health continuously.
  • Perform disaster recovery testing.
  • Automate infrastructure recovery.

Common Interview Questions

What is Availability?

Availability is the percentage of time a system remains operational and accessible to users.


What is High Availability?

High Availability (HA) is the ability of a system to continue serving users even when one or more components fail.


What is a Single Point of Failure?

A Single Point of Failure (SPOF) is any component whose failure causes the entire application to become unavailable.


What is the difference between Active-Active and Active-Passive?

| Active-Active              | Active-Passive                   |
|----------------------------|----------------------------------|
| All servers handle traffic | Only one server handles traffic  |
| Better utilization         | Simpler failover                 |
| Higher scalability         | Lower infrastructure utilization |

Why do banks deploy applications across multiple Availability Zones?

To ensure services remain available even if an entire data center experiences an outage.


Summary

In this article, we explored Availability, one of the most important concepts in System Design.

We covered:

  • Availability fundamentals
  • Downtime calculation
  • High Availability
  • Single Point of Failure
  • Redundancy
  • Load Balancing
  • Database failover
  • Multi-AZ deployment
  • Multi-Region architecture
  • Disaster Recovery
  • SLA, SLO, and SLI
  • Real-world banking, Amazon, and Netflix examples
  • Best practices

Designing for high availability means expecting failures and building systems that continue operating despite them. Modern cloud-native architectures achieve this through redundancy, failover, health checks, distributed deployments, and continuous monitoring.

