
Latency and Throughput in System Design

Learn Latency and Throughput from a System Design perspective with real-world examples. This guide explains response time, throughput, bottlenecks, concurrency, queueing, performance optimization, and the techniques used by Amazon, Netflix, Uber, and banking systems.


Introduction

Imagine you open Amazon to buy a product.

You click "Buy Now".

How long should it take?

  • 50 ms ✅
  • 150 ms ✅
  • 500 ms 😐
  • 5 seconds ❌

Now imagine Amazon during Black Friday.

Millions of customers are purchasing products simultaneously.

The system must not only respond quickly but also process millions of requests every second.

This introduces two important performance metrics:

  • Latency → How fast does one request complete?
  • Throughput → How many requests can the system handle?

Every high-scale system—including banking applications, Netflix, Uber, Google Search, and payment gateways—is designed by balancing these two metrics.


Learning Objectives

After completing this article, you will understand:

  • What is Latency?
  • What is Throughput?
  • Response Time
  • Network Latency
  • Processing Latency
  • Database Latency
  • Bottlenecks
  • Concurrency
  • Queueing
  • Performance Optimization
  • Real-world Examples
  • Best Practices

What is Latency?

Latency is the time taken to complete one request.

Example

Customer clicks Login

↓

Request Sent

↓

Server Processes Request

↓

Response Returned

↓

Total latency: 250 ms

Latency answers:

"How long does one operation take?"


Latency Flow

flowchart LR
    A[Client]
    B[API Gateway]
    C[Spring Boot]
    D[(Database)]

    A --> B
    B --> C
    C --> D

Each step contributes to total latency.
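One way to see how each step contributes is to time the steps individually and add them up. The sketch below is only an illustration: the three step names mirror the diagram, and the `time.sleep` calls are hypothetical stand-ins for real work at each hop.

```python
import time

def timed_ms(step):
    """Run a step once and return how long it took, in milliseconds."""
    start = time.perf_counter()
    step()
    return (time.perf_counter() - start) * 1000

# Hypothetical steps: each sleep stands in for real work at that hop.
steps = {
    "api_gateway": lambda: time.sleep(0.005),
    "application": lambda: time.sleep(0.010),
    "database": lambda: time.sleep(0.020),
}

timings_ms = {name: timed_ms(step) for name, step in steps.items()}
total_ms = sum(timings_ms.values())

for name, ms in timings_ms.items():
    print(f"{name:<12} {ms:6.1f} ms")
print(f"{'total':<12} {total_ms:6.1f} ms")
```

In a real service the same idea is usually implemented with tracing or metrics libraries rather than hand-written timers.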


What is Throughput?

Throughput is the number of requests processed in a given period of time.

Examples

  • 500 Requests/Second
  • 20,000 Transactions/Minute
  • 5 Million Messages/Hour

Throughput answers:

"How much work can the system perform?"


Throughput Example

flowchart LR
    A[Users]
    B[Load Balancer]
    C[Application Cluster]

    A --> B
    B --> C

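If one application instance can process:

100 Requests/Second

Then, assuming the load is spread evenly and no shared dependency (such as the database) becomes a bottleneck:

10 Servers

↓

about 1000 Requests/Second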
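The arithmetic above can be written as a small capacity-planning sketch. It assumes ideal linear scaling, which real clusters rarely reach exactly; the function names are hypothetical.

```python
import math

def ideal_cluster_throughput(per_server_rps, server_count):
    """Upper bound on cluster throughput, assuming perfectly linear scaling."""
    return per_server_rps * server_count

def servers_needed(target_rps, per_server_rps):
    """Smallest number of servers that reaches the target under the same assumption."""
    return math.ceil(target_rps / per_server_rps)

print(ideal_cluster_throughput(100, 10))  # 1000
print(servers_needed(2550, 100))          # 26
```

Treat the result as a starting estimate and confirm it with a load test.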

Latency vs Throughput

| Latency | Throughput |
| --- | --- |
| Time per request | Requests processed |
| Measured in ms | Measured in requests/sec |
| Lower is better | Higher is better |
| User experience | System capacity |

Real-World Example

Imagine a supermarket.

Customer waits:

2 Minutes

↓

Checkout Completed

This is Latency.

Now imagine:

100 Customers

↓

Processed Every Minute

This is Throughput.


Types of Latency

Large enterprise systems have multiple latency components.

flowchart TD
    A[Total Latency]

    A --> B[Network]
    A --> C[Application]
    A --> D[Database]
    A --> E[External APIs]

Network Latency

Network delay occurs while data travels between client and server.

flowchart LR
    A[Browser]

    B[Internet]

    C[AWS Load Balancer]

    D[Application]

    A --> B
    B --> C
    C --> D

Typical Causes

  • Long geographic distance
  • DNS lookup
  • Slow internet
  • VPN

Application Latency

Application latency is the time spent processing the request inside the application, for example in a Spring Boot service.

flowchart LR
    A[Controller]

    B[Service]

    C[Repository]

    D[(Database)]

    A --> B
    B --> C
    C --> D

Common causes

  • Complex logic
  • Large loops
  • Reflection
  • Blocking operations

Database Latency

flowchart LR
    A[Application]

    B[(Database)]

    A --> B

Common causes

  • Missing indexes
  • Table scans
  • Slow joins
  • Locks
  • Large transactions
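The effect of a missing index can be observed in a query plan. The sketch below uses an in-memory SQLite database purely as a convenient stand-in; the `orders` table, its columns, and the index name are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

def query_plan(connection, sql, params):
    """Return SQLite's plan description (the last column of each plan row)."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

sql = "SELECT * FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))   # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))    # index lookup

print("before:", plan_before)
print("after: ", plan_after)
```

Other databases expose the same information through their own `EXPLAIN` variants.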

Total Request Time

Network (30 ms) + Application (70 ms) + Database (120 ms) = 220 ms

Performance Bottlenecks

A bottleneck limits the overall system performance.

flowchart LR
    A[Client]

    B[API]

    C[Database]

    D[Slow Query]

    A --> B
    B --> C
    C --> D

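In a sequential request path the latencies of all components add up, so the slowest component dominates overall latency, and the component with the lowest capacity caps overall throughput.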
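A rough model makes the bottleneck visible: for stages called one after another, latencies add up, while throughput is capped by the stage with the lowest capacity. The stage names and numbers below are hypothetical.

```python
# Hypothetical stages of one sequential request path.
stages = [
    {"name": "api", "latency_ms": 10, "capacity_rps": 5000},
    {"name": "service", "latency_ms": 30, "capacity_rps": 1200},
    {"name": "database", "latency_ms": 180, "capacity_rps": 300},
]

# Sequential stages: latencies add up.
end_to_end_latency_ms = sum(stage["latency_ms"] for stage in stages)

# The lowest-capacity stage limits how many requests per second get through.
bottleneck = min(stages, key=lambda stage: stage["capacity_rps"])
max_throughput_rps = bottleneck["capacity_rps"]

print(f"latency: {end_to_end_latency_ms} ms")
print(f"throughput cap: {max_throughput_rps} req/s ({bottleneck['name']})")
```

In this model, speeding up the API or the service barely helps until the database is addressed.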


Queueing

When requests arrive faster than the application can process them, they wait in a queue before processing starts.

flowchart LR
    A[Users]

    B[Queue]

    C[Application]

    A --> B
    B --> C

Long queues increase latency.
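The effect can be reproduced with a small deterministic simulation. This is a simplified sketch that assumes a single worker, first-in-first-out ordering, and a fixed service time; the function name is hypothetical.

```python
def simulate_fifo_queue(arrival_times_ms, service_time_ms):
    """Single worker, FIFO order. Returns each request's latency (wait + service)."""
    latencies = []
    worker_free_at = 0
    for arrival in arrival_times_ms:
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        finish = start + service_time_ms
        latencies.append(finish - arrival)
        worker_free_at = finish
    return latencies

# Requests arrive every 10 ms and take 8 ms each: no queue builds up.
relaxed = simulate_fifo_queue([i * 10 for i in range(5)], 8)

# Requests arrive every 5 ms but still take 8 ms each: the queue keeps growing.
overloaded = simulate_fifo_queue([i * 5 for i in range(5)], 8)

print(relaxed)     # [8, 8, 8, 8, 8]
print(overloaded)  # [8, 11, 14, 17, 20]
```

The service time never changes, yet latency keeps rising once arrivals outpace processing.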


Concurrency

Concurrency means the system handles multiple requests at the same time, for example by serving each one on a thread from a thread pool.

flowchart TD
    A[Users]

    B[Thread Pool]

    C[Application]

    A --> B
    B --> C

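Concurrency improves throughput, up to the point where a shared resource such as the CPU, the thread pool, or the database is saturated.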
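A minimal sketch of this effect, assuming an I/O-bound handler: `handle_request` is hypothetical and its sleep stands in for a database or network call. Measured numbers vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    # Hypothetical I/O-bound handler: the sleep stands in for a database call.
    time.sleep(0.05)
    return request_id

def run_requests(worker_count, request_count):
    """Process the requests on a thread pool; return (results, requests per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        results = list(pool.map(handle_request, range(request_count)))
    elapsed = time.perf_counter() - start
    return results, request_count / elapsed

results_one, rps_one = run_requests(worker_count=1, request_count=8)
results_four, rps_four = run_requests(worker_count=4, request_count=8)

print(f"1 worker:  {rps_one:.0f} req/s")
print(f"4 workers: {rps_four:.0f} req/s")
```

Each request still takes about 50 ms, so latency is unchanged; only the number of requests completed per second goes up.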


Scaling Throughput

flowchart TD
    A[Users]

    B[Load Balancer]

    C[App 1]

    D[App 2]

    E[App 3]

    A --> B

    B --> C
    B --> D
    B --> E

More servers behind the load balancer mean higher throughput, as long as shared dependencies such as the database can keep up.
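Round robin is one simple way a load balancer can spread requests across instances. The sketch below illustrates only that policy; real load balancers also handle health checks, weights, and connection state.

```python
from collections import Counter
from itertools import cycle

servers = ["App 1", "App 2", "App 3"]
next_server = cycle(servers)  # endlessly repeats App 1, App 2, App 3, ...

def route(request_id):
    """Pick the next server in rotation for this request."""
    return next(next_server)

assignments = [route(request_id) for request_id in range(9)]
load = Counter(assignments)

print(assignments[:4])  # ['App 1', 'App 2', 'App 3', 'App 1']
print(dict(load))       # {'App 1': 3, 'App 2': 3, 'App 3': 3}
```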


Caching Reduces Latency

flowchart LR
    A[Client]

    B[Application]

    C[Redis Cache]

    D[(Database)]

    A --> B
    B --> C
    C --> D

  • Without cache (database): 250 ms
  • With cache (Redis): 5 ms
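A common way to apply this is the cache-aside pattern: check the cache first and read from the database only on a miss. In the sketch below a plain dictionary stands in for Redis, and `ProductStore` and its methods are hypothetical names.

```python
class ProductStore:
    """Cache-aside sketch: a dict stands in for Redis, a method for the database."""

    def __init__(self):
        self.cache = {}
        self.db_reads = 0

    def _read_from_database(self, product_id):
        # Hypothetical slow lookup; only the number of calls is tracked here.
        self.db_reads += 1
        return {"id": product_id, "name": f"product-{product_id}"}

    def get(self, product_id):
        if product_id in self.cache:        # cache hit: no database call
            return self.cache[product_id]
        value = self._read_from_database(product_id)  # cache miss
        self.cache[product_id] = value      # populate for the next caller
        return value

store = ProductStore()
for _ in range(3):
    product = store.get(42)

print(product["name"], "database reads:", store.db_reads)
```

The sketch omits expiry and invalidation, which a production cache needs so that stale data is not served indefinitely.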

Asynchronous Processing

Long-running work should happen in the background.

flowchart LR
    A[Order Service]

    B[Kafka]

    C[Email]

    D[Analytics]

    E[Inventory]

    A --> B
    B --> C
    B --> D
    B --> E

Benefits

  • Lower latency
  • Better throughput
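The pattern can be sketched with an in-process queue and a background worker. Here `queue.Queue` is only a stand-in for a broker such as Kafka, and `place_order` and `notification_worker` are hypothetical names.

```python
import queue
import threading

events = queue.Queue()      # stand-in for a message broker topic
sent_notifications = []

def notification_worker():
    """Consume order events in the background until a None sentinel arrives."""
    while True:
        order_id = events.get()
        if order_id is None:
            events.task_done()
            break
        sent_notifications.append(f"email for order {order_id}")
        events.task_done()

def place_order(order_id):
    """Publish the event and return immediately, without waiting for the email."""
    events.put(order_id)
    return {"order_id": order_id, "status": "ACCEPTED"}

worker = threading.Thread(target=notification_worker, daemon=True)
worker.start()

response = place_order(101)
events.join()       # demo only: wait so the result can be printed
events.put(None)    # tell the worker to stop
worker.join()

print(response, sent_notifications)
```

The caller gets its response as soon as the event is queued; the slow work happens off the request path.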

CDN Reduces Latency

flowchart LR
    A[Users]

    B[CloudFront]

    C[S3]

    A --> B
    B --> C

Static content such as images is served from the edge location nearest to the user.


Real-Time Banking Example

Money Transfer

flowchart TD
    A[Customer]

    B[API Gateway]

    C[Payment Service]

    D[Fraud Service]

    E[(Database)]

    F[Kafka]

    G[Notification]

    A --> B
    B --> C
    C --> D
    D --> E
    D --> F
    F --> G

Customer receives an immediate response.

The SMS notification is sent asynchronously through Kafka.


Real-World Example — Netflix

Netflix minimizes latency with:

  • CDN (Open Connect)
  • Distributed caching
  • Regional deployments
  • Adaptive streaming
  • Load balancing

Millions of videos stream simultaneously with low buffering.


Real-World Example — Amazon

Amazon improves throughput with:

  • Horizontal scaling
  • Auto Scaling Groups
  • Read replicas
  • Redis caching
  • Event-driven architecture

Real-World Example — Uber

Ride request flow:

Ride Request

↓

Driver Matching

↓

Payment

↓

Notification

Driver matching is on the rider's critical path, so it must have low latency.

Notifications happen asynchronously.


Performance Monitoring

Monitor

  • Average Latency
  • P95 Latency
  • P99 Latency
  • Throughput (requests/sec)
  • CPU
  • Memory
  • Queue Length
  • Database Response Time

P50, P95 and P99 Latency

| Metric | Meaning |
| --- | --- |
| P50 | Median response time |
| P95 | 95% of requests complete within this time |
| P99 | 99% of requests complete within this time |

Example

  • P50: 120 ms
  • P95: 240 ms
  • P99: 650 ms

Architects pay close attention to P95 and P99, not just the average latency.
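The sketch below computes percentiles with the nearest-rank method, which is one of several common definitions; monitoring tools may interpolate and report slightly different values. The sample data is made up to show how a few slow requests affect each metric.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Made-up sample: 90 fast requests, 8 slower ones, 2 very slow ones.
latencies_ms = [100] * 90 + [200] * 8 + [2000] * 2

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"average={average} p50={p50} p95={p95} p99={p99}")
# average=146.0 p50=100 p95=200 p99=2000
```

The average of 146 ms looks healthy, while P99 shows that some users wait 2 seconds.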


Common Performance Optimization Techniques

| Technique | Benefit |
| --- | --- |
| Redis Cache | Lower latency |
| Load Balancer | Higher throughput |
| Auto Scaling | Better scalability |
| Database Indexing | Faster queries |
| CDN | Faster static content |
| Kafka | Async processing |
| Connection Pooling | Reduced DB overhead |
| Compression | Faster network transfer |

Common Mistakes

❌ Calling the database multiple times for data that a single query could return

❌ Missing indexes

❌ Loading unnecessary data

❌ Blocking API calls

❌ No caching

❌ Long database transactions

❌ Large payloads

❌ Synchronous notifications


Best Practices

  • Cache frequently accessed data.
  • Keep APIs lightweight.
  • Optimize SQL queries.
  • Add proper indexes.
  • Use asynchronous processing.
  • Scale horizontally.
  • Use CDNs for static content.
  • Monitor P95 and P99 latency.
  • Perform load testing before production.
  • Continuously identify bottlenecks.

Common Interview Questions

What is Latency?

Latency is the time taken for a single request to travel through the system and return a response.


What is Throughput?

Throughput is the number of requests or transactions a system can process within a given time period.


Can a system have low latency but poor throughput?

Yes. A system may respond quickly to a few users but fail to handle large numbers of concurrent requests.
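A back-of-the-envelope bound shows why. If each worker handles one request at a time, throughput cannot exceed the number of concurrent workers divided by the latency per request (a consequence of Little's Law). The function name and numbers below are illustrative.

```python
def max_throughput_rps(concurrent_workers, latency_ms):
    """Upper bound on requests/sec when each worker serves one request at a time."""
    return concurrent_workers * 1000 / latency_ms

# A fast but single-threaded service: 10 ms per request, one worker.
print(max_throughput_rps(1, 10))    # 100.0

# Same latency with 200 concurrent workers.
print(max_throughput_rps(200, 10))  # 20000.0
```

Both services answer in 10 ms, but only the second can absorb a large number of simultaneous users.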


How does caching improve latency?

Caching stores frequently accessed data in memory, reducing expensive database lookups and improving response time.


Why do architects monitor P95 and P99 latency?

Average latency can hide slow requests. P95 and P99 show the response time that the slowest 5% and 1% of requests exceed, which exposes tail-latency issues that affect user experience, especially under heavy load.


Summary

In this article, we explored two of the most important performance metrics in System Design:

  • Latency
  • Throughput

We covered:

  • Latency fundamentals
  • Throughput fundamentals
  • Response time
  • Network and database latency
  • Bottlenecks
  • Queueing
  • Concurrency
  • Caching
  • Asynchronous processing
  • CDN
  • Real-world examples
  • Performance monitoring
  • P95 and P99 latency
  • Best practices

Modern distributed systems achieve excellent performance by combining efficient algorithms, caching, horizontal scaling, asynchronous messaging, optimized databases, and continuous performance monitoring. Understanding the trade-off between latency and throughput is essential for designing scalable, high-performance enterprise applications.

