Full Stack • Java • System Design • Cloud • AI Engineering

Replication in Distributed Systems

Learn Replication in Distributed Systems from a System Design perspective. Understand why replication is essential, replication models, leader-follower architecture, synchronous vs asynchronous replication, quorum consensus, failover, split-brain, replication lag, and real-world examples from Amazon, Netflix, Uber, Banking, Cassandra, MongoDB, PostgreSQL, and DynamoDB.


Introduction

Imagine you're building an online banking application.

Initially, everything runs on one database server.

Users
   ↓
Spring Boot
   ↓
PostgreSQL

Everything works well.

One day...

The database server crashes.

Result:

  • ❌ Login fails
  • ❌ Money transfer fails
  • ❌ Balance inquiry fails
  • ❌ ATM transactions stop

The entire banking application becomes unavailable.

How do companies like Amazon, Netflix, Google, and Uber prevent this?

The answer is Replication.

Instead of storing data on one server,

they maintain multiple copies of the same data.

If one server fails,

another server immediately takes over.


Learning Objectives

After completing this article, you'll understand:

  • What is Replication?
  • Why Replication?
  • Replication Architecture
  • Primary-Replica Model
  • Multi-Leader Replication
  • Leaderless Replication
  • Synchronous Replication
  • Asynchronous Replication
  • Quorum Consensus
  • Failover
  • Split Brain
  • Replication Lag
  • Best Practices

What is Replication?

Replication means maintaining multiple copies of the same data on different servers.

Example

Database A

↓

Database B

↓

Database C

All databases contain the same information.


Why Replication?

Without Replication

flowchart TD
    USER[Users]
    APP[Spring Boot]
    DB[(Single Database)]

    USER --> APP
    APP --> DB

Problems

  • Single Point of Failure
  • No Disaster Recovery
  • No Read Scaling

With Replication

flowchart TD
    USER[Users]

    APP[Spring Boot]

    PRIMARY[(Primary)]

    REPLICA1[(Replica 1)]

    REPLICA2[(Replica 2)]

    USER --> APP

    APP --> PRIMARY

    PRIMARY --> REPLICA1
    PRIMARY --> REPLICA2

Benefits

  • High Availability
  • Fault Tolerance
  • Read Scalability

Replication Goals

Replication provides:

  • High Availability
  • Disaster Recovery
  • Read Scaling
  • Fault Tolerance
  • Geographic Distribution
  • Backup

Primary-Replica Architecture

The most common replication model.

flowchart TD
    CLIENT[Client]

    APP[Spring Boot]

    PRIMARY[(Primary Database)]

    REPLICA1[(Replica 1)]

    REPLICA2[(Replica 2)]

    CLIENT --> APP

    APP --> PRIMARY

    PRIMARY --> REPLICA1
    PRIMARY --> REPLICA2

Only the Primary accepts writes.

Replicas receive copied data.


Read and Write Flow

flowchart LR
    CLIENT[Client]

    WRITE[Write Request]

    READ[Read Request]

    PRIMARY[(Primary)]

    REPLICA[(Replica)]

    CLIENT --> WRITE
    CLIENT --> READ

    WRITE --> PRIMARY
    READ --> REPLICA

Replication Process

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: INSERT Order
    Primary-->>Client: Success

    Primary->>Replica: Replicate Changes
    Replica-->>Primary: ACK

Synchronous Replication

Primary waits until replicas acknowledge the write.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: Update

    Primary->>Replica: Replicate

    Replica-->>Primary: ACK

    Primary-->>Client: Success

Advantages

  • Strong consistency
  • No data loss

Disadvantages

  • Higher latency

Asynchronous Replication

Primary immediately responds.

Replication happens later.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: Update

    Primary-->>Client: Success

    Primary->>Replica: Replicate Later

Advantages

  • Fast writes

Disadvantages

  • Replication Lag

Replication Lag

Primary

Product Price = $99

Replica

Still = $89

Replication hasn't completed.

This delay is called

Replication Lag.


Replication Lag Architecture

flowchart LR
    PRIMARY[(Primary)]

    WAL[Transaction Log]

    REPLICA[(Replica)]

    PRIMARY --> WAL
    WAL --> REPLICA

Read After Write Problem

User updates profile.

Immediately requests profile.

Application reads from Replica.

Replica still has old data.

User sees outdated information.


Solution

flowchart TD
    WRITE[Write Request]

    PRIMARY[(Primary)]

    READ[Immediate Read]

    REPLICA[(Replica)]

    WRITE --> PRIMARY

    PRIMARY --> READ

    PRIMARY --> REPLICA

Read from Primary immediately after writes.


Multi-Leader Replication

Instead of one Primary,

multiple databases accept writes.

flowchart LR
    DB1[(Leader 1)]

    DB2[(Leader 2)]

    DB3[(Leader 3)]

    DB1 <--> DB2
    DB2 <--> DB3
    DB1 <--> DB3

Advantages

  • Regional writes
  • Better availability

Challenges

  • Conflict resolution

Leaderless Replication

Every node accepts reads and writes.

flowchart TD
    CLIENT[Client]

    N1[(Node 1)]

    N2[(Node 2)]

    N3[(Node 3)]

    CLIENT --> N1
    CLIENT --> N2
    CLIENT --> N3

Used by

  • Cassandra
  • DynamoDB
  • Riak

Quorum Consensus

Instead of waiting for every replica,

wait for a majority.

Formula

R + W > N

Where

  • N = Total Replicas
  • R = Read Quorum
  • W = Write Quorum

Quorum Example

flowchart TD
    CLIENT[Client]

    N1[(Replica 1)]

    N2[(Replica 2)]

    N3[(Replica 3)]

    CLIENT --> N1
    CLIENT --> N2
    CLIENT --> N3

Write succeeds when the majority acknowledges.


Failover

If Primary fails,

Replica becomes Primary.

flowchart TD
    PRIMARY[(Primary)]

    REPLICA[(Replica)]

    NEWPRIMARY[(New Primary)]

    PRIMARY -. Failure .-> REPLICA

    REPLICA --> NEWPRIMARY

Automatic failover reduces downtime.


Split Brain

A dangerous scenario.

Two servers both think they are Primary.

flowchart LR
    P1[(Primary A)]

    P2[(Primary B)]

    P1 -. Network Partition .- P2

Both accept writes.

Data becomes inconsistent.


Preventing Split Brain

Solutions include

  • Leader Election
  • Quorum Consensus
  • Distributed Locks
  • ZooKeeper
  • etcd
  • Raft Consensus

Read Scaling

Replication distributes read traffic.

flowchart TD
    CLIENT[Users]

    LB[Read Load Balancer]

    R1[(Replica 1)]

    R2[(Replica 2)]

    R3[(Replica 3)]

    CLIENT --> LB

    LB --> R1
    LB --> R2
    LB --> R3

Spring Boot Architecture

flowchart TD
    CLIENT[React]

    API[Spring Boot]

    PRIMARY[(Primary PostgreSQL)]

    REPLICA1[(Replica 1)]

    REPLICA2[(Replica 2)]

    CLIENT --> API

    API --> PRIMARY
    API --> REPLICA1
    API --> REPLICA2

Writes go to Primary.

Reads go to Replicas.


PostgreSQL Replication

flowchart LR
    PRIMARY[(PostgreSQL)]

    WAL[Write Ahead Log]

    REPLICA[(Replica)]

    PRIMARY --> WAL
    WAL --> REPLICA

PostgreSQL uses WAL Streaming Replication.


MySQL Replication

flowchart LR
    PRIMARY[(MySQL)]

    BINLOG[Binary Log]

    REPLICA[(Replica)]

    PRIMARY --> BINLOG
    BINLOG --> REPLICA

MongoDB Replica Set

flowchart TD
    PRIMARY[(Primary)]

    SECONDARY1[(Secondary)]

    SECONDARY2[(Secondary)]

    PRIMARY --> SECONDARY1
    PRIMARY --> SECONDARY2

Automatic leader election occurs during failures.


Cassandra Replication

flowchart TD
    NODE1[(Node 1)]

    NODE2[(Node 2)]

    NODE3[(Node 3)]

    NODE1 --> NODE2
    NODE2 --> NODE3
    NODE3 --> NODE1

Every node stores replicated copies.

No Primary node exists.


Amazon Example

Amazon uses replication for:

  • Orders
  • Product Catalog
  • Customer Accounts
  • Inventory

Different services use different replication strategies depending on consistency requirements.


Netflix Example

Netflix replicates

  • Viewing History
  • Recommendations
  • Metadata

across multiple AWS Regions.


Uber Example

Uber replicates

  • Driver Data
  • Rider Data
  • Trip History

to provide low latency worldwide.


Banking Example

Core banking systems replicate

  • Customer Accounts
  • Transactions
  • Ledger Data

Strong consistency is preferred over availability.


Advantages

  • High Availability
  • Disaster Recovery
  • Read Scaling
  • Fault Tolerance
  • Data Redundancy
  • Better Performance

Challenges

  • Replication Lag
  • Conflict Resolution
  • Network Failures
  • Split Brain
  • Storage Cost
  • Operational Complexity

Monitoring

Monitor

  • Replication Lag
  • Replica Health
  • Failover Time
  • Read Latency
  • Write Latency
  • Network Latency
  • CPU Usage
  • Storage Usage

Tools

  • PostgreSQL pg_stat_replication
  • MySQL Performance Schema
  • MongoDB Atlas
  • Amazon CloudWatch
  • Prometheus
  • Grafana
  • Datadog

Common Mistakes

❌ Sending writes to replicas

❌ Ignoring replication lag

❌ No automatic failover

❌ Poor monitoring

❌ Reading stale data after writes

❌ No backup strategy


Best Practices

  • Keep one authoritative primary for transactional workloads.
  • Route reads to replicas whenever appropriate.
  • Monitor replication lag continuously.
  • Use automatic failover mechanisms.
  • Test disaster recovery regularly.
  • Choose synchronous or asynchronous replication based on business requirements.
  • Use quorum-based replication for distributed databases.
  • Combine replication with backups; replication is not a replacement for backups.

Common Interview Questions

What is Replication?

Replication is the process of maintaining multiple copies of the same data across different servers to improve availability, scalability, and fault tolerance.


What is the difference between Synchronous and Asynchronous Replication?

Synchronous Asynchronous
Waits for replica acknowledgment Replicates later
Strong consistency Eventual consistency
Higher latency Lower latency
Less chance of data loss Possible data loss during failures

What is Replication Lag?

Replication Lag is the delay between a successful write on the primary node and the moment when replicas receive and apply the same change.


What is Split Brain?

Split Brain occurs when multiple nodes incorrectly believe they are the primary node and accept writes independently, leading to conflicting data.


What is Quorum?

Quorum is a consensus mechanism where a read or write operation succeeds only after receiving acknowledgments from a required number of replicas, typically a majority.


Summary

Replication is one of the most important building blocks of distributed systems. By maintaining multiple synchronized copies of data, organizations can improve availability, fault tolerance, and read scalability while minimizing downtime.

In this article, we covered:

  • Replication fundamentals
  • Primary-Replica architecture
  • Synchronous vs Asynchronous Replication
  • Multi-Leader and Leaderless Replication
  • Quorum Consensus
  • Replication Lag
  • Failover
  • Split Brain
  • Spring Boot architecture
  • PostgreSQL, MySQL, MongoDB, and Cassandra examples
  • Banking, Amazon, Netflix, and Uber case studies
  • Monitoring
  • Best practices

Replication is a foundational concept for building resilient distributed systems. Combined with partitioning, sharding, load balancing, and consensus algorithms, it enables applications to remain available and scalable even when individual servers fail.


Loading likes...

Comments

Share a question, correction, or practical insight about this article.

Loading approved comments...