
Leader Election in Distributed Systems

Learn Leader Election from the ground up. Understand why leader election is needed, how distributed systems elect a leader, heartbeat mechanisms, election timeouts, quorum voting, leader failover, split-brain prevention, and how systems like Kubernetes, ZooKeeper, etcd, and Apache Kafka coordinate distributed clusters.


Introduction

Imagine you're designing an enterprise banking platform.

The application runs on multiple servers deployed across AWS.

US-East-1

↓

US-West-2

↓

Europe

Each region contains multiple application instances.

All servers can receive requests simultaneously.

Now imagine this situation:

A customer transfers $10,000.

Two different servers receive the same request.

Server A

Withdraw $10,000

Server B

Withdraw $10,000

If both servers process the request, the account is debited twice and the customer loses $20,000 instead of $10,000.

This is unacceptable.

One common way distributed systems prevent this is by electing one server as the Leader.

Only the leader performs critical operations.

All remaining servers become Followers.

This process is known as Leader Election.


Learning Objectives

By the end of this article, you'll understand:

  • What is Leader Election?
  • Why Leader Election is Required
  • Leader vs Followers
  • Active-Active vs Active-Passive
  • Leader Responsibilities
  • Follower Responsibilities
  • Heartbeats
  • Election Timeout
  • Leader Failure Detection
  • Quorum
  • Split Brain
  • Production Examples

Why Do We Need Leader Election?

Suppose you have three application servers.

flowchart TD
    C[Client]

    S1[Server 1]
    S2[Server 2]
    S3[Server 3]

    C --> S1
    C --> S2
    C --> S3

If every server modifies shared data independently, their updates can conflict.


Problems Without Leader Election

Imagine three servers updating inventory.

Inventory = 5

Server A

Sell 3

Server B

Sell 4

Server C

Sell 2

Each server reads Inventory = 5 and believes stock is available, so all three sales go through: 9 units are sold against a stock of 5.

Result

  • Negative Inventory
  • Overselling
  • Data Corruption
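
The overselling scenario can be reproduced in a few lines. This is a minimal sketch, assuming every server checks the same stale snapshot of the stock before writing; the function names `sell_uncoordinated` and `sell_via_leader` are illustrative and not taken from any real system.

```python
def sell_uncoordinated(stock, orders):
    """Every server checks the same stale snapshot, then all decrements are applied."""
    snapshot = stock                      # all servers read Inventory = 5
    accepted = [qty for qty in orders if qty <= snapshot]
    return stock - sum(accepted), accepted


def sell_via_leader(stock, orders):
    """A single leader applies orders one at a time against the current stock."""
    accepted = []
    for qty in orders:
        if qty <= stock:                  # check and update happen in one place
            stock -= qty
            accepted.append(qty)
    return stock, accepted


orders = [3, 4, 2]                        # Server A, Server B, Server C
print(sell_uncoordinated(5, orders))      # (-4, [3, 4, 2])
print(sell_via_leader(5, orders))         # (0, [3, 2])
```

With a single writer, the order for 4 units is rejected because only 2 units remain when it is processed, and the stock never goes negative.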

Real Banking Example

Customer Balance

$50,000

ATM

Withdraw $20,000

Mobile Banking

Withdraw $15,000

Internet Banking

Withdraw $30,000

If all three requests execute simultaneously against the same $50,000 balance, $65,000 is withdrawn and the final balance is incorrect.


Distributed System With Leader

flowchart TD
    CLIENT[Clients]

    LB[Load Balancer]

    LEADER[(Leader)]

    F1[(Follower 1)]

    F2[(Follower 2)]

    CLIENT --> LB

    LB --> LEADER

    LEADER --> F1
    LEADER --> F2

Only one node performs writes.

Followers replicate data.


What is Leader Election?

Leader Election is the process of selecting one node to coordinate the cluster.

At most one leader should be active at any time.

Remaining nodes become followers.


Leader Responsibilities

The Leader performs

  • Write Operations
  • Transaction Coordination
  • Data Replication
  • Heartbeats
  • Cluster Metadata Updates
  • Configuration Changes
  • Distributed Lock Management

Leader Architecture

flowchart TD
    LEADER[Leader]

    WRITE[Write Requests]

    REPLICATION[Replication]

    HEARTBEAT[Heartbeats]

    CONFIG[Cluster Configuration]

    LEADER --> WRITE
    LEADER --> REPLICATION
    LEADER --> HEARTBEAT
    LEADER --> CONFIG

Follower Responsibilities

Followers

  • Replicate Data
  • Receive Heartbeats
  • Participate in Elections
  • Become Leader if necessary
  • Optionally Serve Read Requests

Cluster Architecture

flowchart LR
    L[(Leader)]

    F1[(Follower)]

    F2[(Follower)]

    F3[(Follower)]

    L --> F1
    L --> F2
    L --> F3

Leader vs Followers

Leader              | Followers
Accepts Writes      | Replicate Data
Sends Heartbeats    | Receive Heartbeats
Coordinates Cluster | Wait for Leader
One Node            | Multiple Nodes

Client Request Flow

sequenceDiagram
    participant Client
    participant Leader
    participant Follower1
    participant Follower2

    Client->>Leader: Update Order

    Leader->>Follower1: Replicate

    Leader->>Follower2: Replicate

    Follower1-->>Leader: ACK

    Follower2-->>Leader: ACK

    Leader-->>Client: Success
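
The request flow above can be sketched with in-memory objects. This is a simplified illustration, assuming replication never fails and that the leader waits for every follower to acknowledge, as the diagram shows; the `Leader` and `Follower` classes are hypothetical.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []

    def replicate(self, entry):
        """Store the entry and acknowledge it."""
        self.log.append(entry)
        return True                       # ACK


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []

    def write(self, entry):
        """Append locally, replicate, and report success once every follower ACKs."""
        self.log.append(entry)
        acks = [f.replicate(entry) for f in self.followers]
        return "Success" if all(acks) else "Failed"


f1, f2 = Follower("Follower1"), Follower("Follower2")
leader = Leader([f1, f2])
print(leader.write("Update Order"))       # Success
print(f1.log == f2.log == leader.log)     # True
```

Consensus-based systems such as Raft typically reply to the client once a majority of nodes has stored the entry rather than waiting for all of them.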

Why Followers Cannot Accept Writes

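If followers accepted writes, the cluster would effectively have two leaders.

Leader A

Inventory = 10

Leader B

Inventory = 8

Different clients now see different values for the same item, and the two copies cannot be merged safely.

Eventually

Database Corruption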
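
One common way to keep a single writer is for a follower to forward any write it receives to the leader instead of applying it locally. The sketch below assumes that approach; the `Node` class is hypothetical, and replication back to the follower is omitted.

```python
class Node:
    def __init__(self, name, leader=None):
        self.name = name
        self.leader = leader              # None means this node is the leader
        self.inventory = 10

    def write(self, value):
        """Apply the write on the leader; followers forward instead of applying."""
        if self.leader is not None:
            return self.leader.write(value)
        self.inventory = value
        return self.name                  # name of the node that applied the write


leader = Node("Leader")
follower = Node("Follower", leader=leader)
print(follower.write(8))                  # Leader
print(leader.inventory)                   # 8
```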

Active-Passive Architecture

One active node.

Others remain passive.

flowchart LR
    CLIENT[Client]

    ACTIVE[(Leader)]

    PASSIVE1[(Follower)]

    PASSIVE2[(Follower)]

    CLIENT --> ACTIVE

    ACTIVE --> PASSIVE1
    ACTIVE --> PASSIVE2

This pattern is common in banking, where consistency matters more than write throughput.


Active-Active Architecture

Multiple servers process requests.

flowchart LR
    CLIENT[Client]

    NODE1[(Node 1)]

    NODE2[(Node 2)]

    NODE3[(Node 3)]

    CLIENT --> NODE1
    CLIENT --> NODE2
    CLIENT --> NODE3

Because every node accepts writes, this design requires sophisticated conflict resolution.


Heartbeats

How do followers know the leader is alive?

The leader periodically sends heartbeat messages to every follower.


Heartbeat Flow

sequenceDiagram
    participant Leader
    participant Follower1
    participant Follower2

    loop Every 50 ms
        Leader->>Follower1: Heartbeat
        Leader->>Follower2: Heartbeat
    end

Heartbeats are tiny messages.

Purpose

  • Verify leader is alive
  • Prevent unnecessary elections
  • Synchronize metadata
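
A heartbeat round can be sketched as follows. This is a minimal in-memory illustration with no networking; the `Follower` class, the `send_heartbeats` function, and the logical tick values are assumptions made for the example.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.leader_id = None
        self.last_heartbeat = None        # tick of the most recent heartbeat

    def on_heartbeat(self, leader_id, now):
        """Record who the leader is and when it was last heard from."""
        self.leader_id = leader_id
        self.last_heartbeat = now


def send_heartbeats(leader_id, followers, now):
    """One heartbeat round: tell every follower that the leader is alive."""
    for follower in followers:
        follower.on_heartbeat(leader_id, now)


followers = [Follower("Follower1"), Follower("Follower2")]
for tick in (0, 1, 2):                    # three heartbeat rounds
    send_heartbeats("leader-1", followers, tick)

print([f.last_heartbeat for f in followers])   # [2, 2]
```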

Heartbeat Architecture

flowchart TD
    LEADER[(Leader)]

    HB1[Heartbeat]

    HB2[Heartbeat]

    F1[(Follower)]

    F2[(Follower)]

    LEADER --> HB1
    HB1 --> F1

    LEADER --> HB2
    HB2 --> F2

Election Timeout

Each follower runs an election timer that resets whenever a heartbeat arrives.

If no heartbeat arrives before the timer expires, the follower assumes

Leader Failed

Example

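Node   | Timeout
Node A | 150 ms
Node B | 220 ms
Node C | 310 ms

Random timeouts reduce simultaneous elections: the node with the shortest timeout usually becomes a candidate first and can win before the others time out. Each timeout must be several times longer than the heartbeat interval, or followers would start elections while the leader is still healthy.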
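
Randomized timeouts might be drawn like this. The sketch assumes a uniform distribution, and the 150 to 350 ms range is an assumption chosen only to cover the example values; `election_timeout_ms` is a hypothetical helper.

```python
import random


def election_timeout_ms(rng, low=150, high=350):
    """Pick a random election timeout so nodes rarely expire at the same moment."""
    return rng.uniform(low, high)


rng = random.Random(42)                   # seeded only to make the run repeatable
timeouts = {node: election_timeout_ms(rng) for node in ("Node A", "Node B", "Node C")}

# The node with the shortest timeout is the first to become a candidate.
first_candidate = min(timeouts, key=timeouts.get)
print(timeouts)
print(first_candidate)
```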


Timeout Flow

flowchart TD
    START[Receive Heartbeat]

    WAIT[Wait]

    CHECK{Heartbeat Received?}

    RESET[Reset Timer]

    ELECTION[Start Election]

    START --> WAIT
    WAIT --> CHECK

    CHECK -->|Yes| RESET
    CHECK -->|No| ELECTION
    RESET --> WAIT
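
The timeout flow can be sketched as a small timer object on the follower. Time is passed in explicitly so the example is deterministic; the `FollowerTimer` class and its method names are hypothetical.

```python
class FollowerTimer:
    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.deadline = now_ms + timeout_ms

    def on_heartbeat(self, now_ms):
        """Reset Timer: push the deadline forward by one full timeout."""
        self.deadline = now_ms + self.timeout_ms

    def should_start_election(self, now_ms):
        """True once the timeout has elapsed without a heartbeat."""
        return now_ms >= self.deadline


timer = FollowerTimer(timeout_ms=150)
timer.on_heartbeat(now_ms=50)                    # deadline moves to 200 ms
print(timer.should_start_election(now_ms=120))   # False
print(timer.should_start_election(now_ms=200))   # True
```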

Leader Failure

Suppose

Leader crashes.

flowchart TD
    LEADER[(Leader)]

    F1[(Follower)]

    F2[(Follower)]

    F3[(Follower)]

    LEADER -. Crash .-> F1

    LEADER -. Crash .-> F2

    LEADER -. Crash .-> F3

Followers stop receiving heartbeats.

When a follower's election timeout expires, it starts an election.


Failure Detection Timeline

sequenceDiagram
    participant Leader
    participant Follower

    Leader->>Follower: Heartbeat
    Leader->>Follower: Heartbeat

    Note over Leader: Server Crash

    Note over Follower: Timeout Expires

    Follower->>Follower: Start Election

Leader Election Steps

flowchart TD
    A[Leader Failure]

    B[Heartbeat Timeout]

    C[Follower Becomes Candidate]

    D[Request Votes]

    E[Receive Majority]

    F[Become Leader]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
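
The steps above can be sketched as a single voting round. This is loosely modelled on Raft-style voting and is an assumption rather than a complete algorithm: it introduces a term number so that each node grants at most one vote per term, and it omits log comparison and networking. The `Node` class and `run_election` function are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.term = 0
        self.voted_for = None

    def request_vote(self, term, candidate):
        """Grant at most one vote per term."""
        if term > self.term:              # newer term: forget the old vote
            self.term, self.voted_for = term, None
        if term == self.term and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate, peers):
    """The candidate votes for itself, asks its peers, and wins with a majority."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1 + sum(p.request_vote(candidate.term, candidate.name) for p in peers)
    cluster_size = len(peers) + 1
    return votes >= cluster_size // 2 + 1


a, b, c = Node("A"), Node("B"), Node("C")
print(run_election(a, [b, c]))            # True
print(b.request_vote(1, "C"))             # False (B already voted for A in term 1)
```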

Quorum

Leader election requires a majority of votes.

Formula

Majority = floor(N / 2) + 1

Quorum Table

Nodes | Votes Required
3     | 2
5     | 3
7     | 4
9     | 5
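
The table can be reproduced with integer division, which rounds down. The sketch also prints how many node failures each cluster size tolerates while still reaching quorum; the `quorum` function name is illustrative.

```python
def quorum(n):
    """Smallest number of votes that is more than half of n nodes."""
    return n // 2 + 1


for n in (3, 5, 7, 9):
    print(n, quorum(n), n - quorum(n))    # nodes, votes required, failures tolerated
# 3 2 1
# 5 3 2
# 7 4 3
# 9 5 4
```

An even-sized cluster gains nothing in fault tolerance: `quorum(4)` is 3, so a 4-node cluster tolerates one failure, the same as a 3-node cluster.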

Quorum Architecture

flowchart TD
    CANDIDATE[(Candidate)]

    N1[(Node)]

    N2[(Node)]

    N3[(Node)]

    N4[(Node)]

    N5[(Node)]

    CANDIDATE --> N1
    CANDIDATE --> N2
    CANDIDATE --> N3
    CANDIDATE --> N4
    CANDIDATE --> N5

If the candidate receives 3 votes in a 5-node cluster,

it becomes the Leader.


Why Majority?

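Imagine

Five Nodes

A network partition splits them into a group of two and a group of three.

If two votes were enough, each group could elect its own leader.

Now

Two leaders exist.

Requiring a majority (3 of 5) prevents this: only one group can contain three nodes, so only one leader can be elected.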
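
The reason a majority works is that any two majorities of the same cluster share at least one node, and a node votes only once per election. A small brute-force check for a 5-node cluster, with illustrative node names:

```python
from itertools import combinations

nodes = ["A", "B", "C", "D", "E"]
majority = len(nodes) // 2 + 1            # 3

# Every possible group of 3 voters overlaps with every other group of 3.
majorities = [set(group) for group in combinations(nodes, majority)]
print(all(x & y for x, y in combinations(majorities, 2)))      # True

# Groups of only 2 voters can be disjoint, so two leaders could be elected.
pairs = [set(group) for group in combinations(nodes, 2)]
print(any(not (x & y) for x, y in combinations(pairs, 2)))     # True
```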


Split Brain

Split brain is one of the most damaging failures in a distributed system.

Two nodes both believe they are the leader at the same time.

flowchart LR
    L1[(Leader A)]

    L2[(Leader B)]

    CLIENT1[Client]

    CLIENT2[Client]

    CLIENT1 --> L1
    CLIENT2 --> L2

    L1 -. Network Partition .- L2

Both accept writes.

Data becomes inconsistent.
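
Besides majority voting, a common safeguard is to attach a monotonically increasing term (sometimes called a fencing token) to every write, so that shared storage can reject a leader that has been replaced. The sketch below assumes that technique; the `Storage` class is hypothetical.

```python
class Storage:
    """Shared store that remembers the highest leader term it has seen."""

    def __init__(self):
        self.highest_term = 0
        self.data = {}

    def write(self, term, key, value):
        if term < self.highest_term:      # request from a deposed leader
            return False
        self.highest_term = term
        self.data[key] = value
        return True


store = Storage()
print(store.write(term=1, key="inventory", value=10))   # True  (Leader A, term 1)
print(store.write(term=2, key="inventory", value=8))    # True  (Leader B, term 2)
print(store.write(term=1, key="inventory", value=7))    # False (stale Leader A)
print(store.data["inventory"])                          # 8
```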


Real World Examples

Banking

Leader coordinates

  • Money Transfers
  • Account Updates
  • Ledger Entries

Apache Kafka

Each partition has its own leader replica, which handles

  • Message Writes
  • Partition Coordination

Follower replicas copy the partition log from the leader.


Kubernetes

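Control-plane components run as multiple replicas, with one elected leader active at a time for

  • Scheduling (kube-scheduler)
  • Controllers (kube-controller-manager)

Cluster state itself is stored in etcd.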
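
Kubernetes control-plane components such as kube-scheduler and kube-controller-manager elect a leader by competing for a lease that must be renewed periodically. The sketch below imitates that idea with an in-memory `Lease` class; it is hypothetical, does not call the Kubernetes API, and the 15-unit duration is an arbitrary assumption.

```python
class Lease:
    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now):
        """Acquire or renew the lease; fails while another holder's lease is live."""
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.duration
            return True
        return False


lease = Lease(duration=15)
print(lease.try_acquire("scheduler-1", now=0))    # True  (becomes leader)
print(lease.try_acquire("scheduler-2", now=5))    # False (lease still held)
print(lease.try_acquire("scheduler-1", now=10))   # True  (renewal)
print(lease.try_acquire("scheduler-2", now=30))   # True  (lease expired, failover)
```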

ZooKeeper

Leader manages

  • Configuration
  • Locks
  • Metadata
  • Cluster Membership

etcd

Leader coordinates

  • Kubernetes State
  • Configuration
  • Distributed Locks

Advantages of Leader Election

  • Prevents conflicting writes
  • Simplifies distributed coordination
  • Supports automatic failover
  • Maintains consistency
  • Enables distributed locking
  • Foundation for consensus algorithms

Challenges

  • Leader failure
  • Election latency
  • Network partitions
  • Split brain
  • Leader bottleneck
  • Cluster reconfiguration

Summary

In this part, we learned:

  • What is Leader Election?
  • Why Leader Election is required
  • Leader and Follower architecture
  • Active-Passive vs Active-Active
  • Heartbeats
  • Election Timeout
  • Leader Failure Detection
  • Quorum
  • Split Brain
  • Real-world examples from Banking, Kafka, Kubernetes, ZooKeeper, and etcd
