Latency and Throughput in System Design

Learn Latency and Throughput from a System Design perspective with real-world examples. This guide explains response time, throughput, bottlenecks, concurrency, queueing, performance optimization, and the techniques used by Amazon, Netflix, Uber, and banking systems.

Introduction

Imagine you open Amazon to buy a product.

You click "Buy Now".

How long should it take?

50 ms ✅
150 ms ✅
500 ms 😐
5 seconds ❌

Now imagine Amazon during Black Friday.

Millions of customers are purchasing products simultaneously.

The system must not only respond quickly but also process millions of requests every second.

This introduces two important performance metrics:

Latency → How fast does one request complete?
Throughput → How many requests can the system handle?

Every high-scale system—including banking applications, Netflix, Uber, Google Search, and payment gateways—is designed by balancing these two metrics.

Learning Objectives

After completing this article, you will understand:

What is Latency?
What is Throughput?
Response Time
Network Latency
Processing Latency
Database Latency
Bottlenecks
Concurrency
Queueing
Performance Optimization
Real-world Examples
Best Practices

What is Latency?

Latency is the time taken to complete one request.

Example

Customer clicks Login

↓

Request Sent

↓

Server Processes Request

↓

Response Returned

↓

250 ms

Latency answers:

"How long does one operation take?"

Latency Flow

flowchart LR
    A[Client]
    B[API Gateway]
    C[Spring Boot]
    D[(Database)]

    A --> B
    B --> C
    C --> D

Each step contributes to total latency.
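As a minimal sketch of what "time taken to complete one request" means in practice, the snippet below times a single call with a monotonic clock. The `handle_login` function is a hypothetical stand-in for a real request handler, and the 50 ms sleep is an invented figure used only to simulate work.

```python
import time

def measure_latency_ms(operation):
    """Run operation once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    result = operation()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def handle_login():
    # Hypothetical stand-in for a real handler; the sleep simulates work.
    time.sleep(0.05)
    return "ok"

result, latency_ms = measure_latency_ms(handle_login)
print(f"login returned {result!r} in {latency_ms:.1f} ms")
```

In a real service this measurement would usually come from request middleware or a metrics library rather than hand-written timing code.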

What is Throughput?

Throughput is the number of requests processed in a given period of time.

Examples

500 Requests/Second
20,000 Transactions/Minute
5 Million Messages/Hour
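Because these figures use different time windows, it can help to normalize them to a per-second rate before comparing them. The helper below is an illustrative sketch; `per_second` and `SECONDS_PER_UNIT` are names invented for this example.

```python
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}

def per_second(count, unit):
    """Convert 'count per unit' into a per-second rate."""
    return count / SECONDS_PER_UNIT[unit]

# The three example figures from the list above.
for count, unit in [(500, "second"), (20_000, "minute"), (5_000_000, "hour")]:
    rate = per_second(count, unit)
    print(f"{count:>9,} per {unit:<6} = {rate:8.1f} per second")
```

Normalized this way, 20,000 transactions per minute is roughly 333 per second, and 5 million messages per hour is roughly 1,389 per second.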

Throughput answers:

"How much work can the system perform?"

Throughput Example

flowchart LR
    A[Users]
    B[Load Balancer]
    C[Application Cluster]

    A --> B
    B --> C

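If one application instance can process:

100 Requests/Second

Then, assuming traffic is spread evenly and no shared dependency (such as the database) saturates first:

10 Servers

↓

1000 Requests/Second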
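The arithmetic above can be sketched as a small function. The `efficiency` parameter is an assumption added for this example: real clusters rarely scale perfectly, so a value below 1.0 models the capacity lost to coordination and shared dependencies. The 0.8 used below is an invented figure.

```python
def cluster_throughput(per_server_rps, servers, efficiency=1.0):
    """Estimate cluster throughput in requests per second.

    efficiency is a hypothetical scaling factor in (0, 1];
    1.0 means perfectly linear scaling.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return per_server_rps * servers * efficiency

print(cluster_throughput(100, 10))        # ideal, linear scaling
print(cluster_throughput(100, 10, 0.8))   # assumed 20% scaling loss
```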

Latency vs Throughput

Latency	Throughput
Time per request	Requests processed
Measured in ms	Requests/sec
Lower is better	Higher is better
User Experience	System Capacity

Real-World Example

Imagine a supermarket.

Customer waits:

2 Minutes

↓

Checkout Completed

This is Latency.

Now imagine:

100 Customers

↓

Processed Every Minute

This is Throughput.

Types of Latency

Large enterprise systems have multiple latency components.

flowchart TD
    A[Total Latency]

    A --> B[Network]
    A --> C[Application]
    A --> D[Database]
    A --> E[External APIs]

Network Latency

Network delay occurs while data travels between client and server.

flowchart LR
    A[Browser]

    B[Internet]

    C[AWS Load Balancer]

    D[Application]

    A --> B
    B --> C
    C --> D

Typical Causes

Long geographic distance
DNS lookup
Slow internet
VPN

Application Latency

Application latency is the time spent processing the request inside the application itself, for example in a Spring Boot service.

flowchart LR
    A[Controller]

    B[Service]

    C[Repository]

    D[(Database)]

    A --> B
    B --> C
    C --> D

Common causes

Complex logic
Large loops
Reflection
Blocking operations

Database Latency

flowchart LR
    A[Application]

    B[(Database)]

    A --> B

Common causes

Missing indexes
Table scans
Slow joins
Locks
Large transactions

Total Request Time

Network

30 ms

+

Application

70 ms

+

Database

120 ms

=

220 ms
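The sum above assumes the three steps run one after another. Under that assumption, a short sketch can also show which component deserves attention first. The figures are the ones from this section; `latency_breakdown` is a name made up for the example.

```python
components_ms = {"network": 30, "application": 70, "database": 120}

def latency_breakdown(components):
    """Return each component's share of total latency, in percent."""
    total = sum(components.values())
    return {name: 100 * ms / total for name, ms in components.items()}

total_ms = sum(components_ms.values())
print(f"total: {total_ms} ms")
for name, share in latency_breakdown(components_ms).items():
    print(f"{name:<12} {share:5.1f}%")
```

Here the database accounts for more than half of the request time, so it is the most promising place to optimize.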

Performance Bottlenecks

A bottleneck limits the overall system performance.

flowchart LR
    A[Client]

    B[API]

    C[Database]

    D[Slow Query]

    A --> B
    B --> C
    C --> D

The slowest component dominates overall latency and caps the throughput of the whole request path.
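One way to reason about a bottleneck is to compare how many requests per second each stage can sustain: the pipeline as a whole cannot go faster than its slowest stage. The capacities below are hypothetical numbers chosen for illustration, not measurements.

```python
def bottleneck(stage_capacity_rps):
    """Return (stage name, capacity) of the stage with the lowest capacity."""
    name = min(stage_capacity_rps, key=stage_capacity_rps.get)
    return name, stage_capacity_rps[name]

# Hypothetical sustained capacity of each stage, in requests per second.
stages = {"api_gateway": 5000, "application": 1200, "database": 300}

name, capacity = bottleneck(stages)
print(f"bottleneck: {name}, end-to-end throughput capped at {capacity} req/s")
```

Adding more application servers in this scenario would not help, because the database would still cap throughput.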

Queueing

When requests arrive faster than the application can process them, they wait in a queue before being handled.

flowchart LR
    A[Users]

    B[Queue]

    C[Application]

    A --> B
    B --> C

Long queues increase latency.
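To see why queues hurt latency so sharply near full load, the sketch below uses the textbook M/M/1 queueing model, where the mean time in the system is 1 / (service rate - arrival rate). This is a simplifying assumption (random arrivals, exponential service times, a single server) and not a description of any particular system; the rates are invented.

```python
def mm1_mean_response_time(arrival_rate, service_rate):
    """Mean time in system (waiting + service) for an M/M/1 queue, in seconds.

    Rates are in requests per second.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue grows without bound when arrivals >= service rate")
    return 1.0 / (service_rate - arrival_rate)

service_rate = 100  # hypothetical: the server completes 100 requests/second
for arrival_rate in (50, 80, 90, 99):
    seconds = mm1_mean_response_time(arrival_rate, service_rate)
    print(f"{arrival_rate}% utilization: {seconds * 1000:.0f} ms mean response time")
```

The takeaway is that latency rises slowly at moderate utilization and then climbs steeply as the server approaches saturation.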

Concurrency

Concurrency means multiple requests are in progress at the same time.

flowchart TD
    A[Users]

    B[Thread Pool]

    C[Application]

    A --> B
    B --> C

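Concurrency improves throughput, up to the point where a shared resource such as CPU, locks, or database connections becomes the bottleneck.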
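Little's Law ties concurrency, latency, and throughput together: the average number of requests in flight equals throughput multiplied by average latency. Rearranged, it gives a rough upper bound on throughput for a fixed amount of concurrency. The thread-pool size below is a hypothetical value.

```python
def max_throughput_rps(concurrent_requests, latency_seconds):
    """Little's Law rearranged: throughput = concurrency / latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrent_requests / latency_seconds

# Hypothetical pool of 200 worker threads, each request taking 250 ms.
print(max_throughput_rps(200, 0.250))   # requests per second

# Halving latency doubles the throughput the same pool can sustain.
print(max_throughput_rps(200, 0.125))
```

This is why reducing latency and increasing throughput often go hand in hand: faster requests free up threads sooner.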

Scaling Throughput

flowchart TD
    A[Users]

    B[Load Balancer]

    C[App 1]

    D[App 2]

    E[App 3]

    A --> B

    B --> C
    B --> D
    B --> E

More servers

↓

Higher Throughput

Caching Reduces Latency

flowchart LR
    A[Client]

    B[Application]

    C[Redis Cache]

    D[(Database)]

    A --> B
    B --> C
    C --> D

Without Cache

Database

250 ms

With Cache

Redis

5 ms
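The flow in the diagram is commonly implemented as the cache-aside pattern: check the cache first and fall back to the database only on a miss. The sketch below uses a plain dictionary as a stand-in for Redis and a hypothetical `load_from_db` callable; it illustrates the pattern and does not use a real Redis client.

```python
class CacheAside:
    """Cache-aside lookup; a dict stands in for Redis in this sketch."""

    def __init__(self, load_from_db):
        self._store = {}
        self._load_from_db = load_from_db   # hypothetical slow lookup
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:              # cache hit: skip the database
            self.hits += 1
            return self._store[key]
        self.misses += 1                    # cache miss: load and remember
        value = self._load_from_db(key)
        self._store[key] = value
        return value

def expected_latency_ms(hit_ratio, cache_ms=5, db_ms=250):
    """Average latency when every request checks the cache and misses also hit the DB."""
    return cache_ms + (1 - hit_ratio) * db_ms

products = CacheAside(lambda key: {"id": key, "name": f"product-{key}"})
products.get(1)   # miss, goes to the "database"
products.get(1)   # hit, served from the cache
print(products.hits, products.misses)
print(f"{expected_latency_ms(0.9):.0f} ms average at a 90% hit ratio")
```

With the 5 ms and 250 ms figures from this section, a 90% hit ratio brings the average down to about 30 ms. A production cache would also need expiry and invalidation, which this sketch leaves out.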

Asynchronous Processing

Long-running work should happen in the background.

flowchart LR
    A[Order Service]

    B[Kafka]

    C[Email]

    D[Analytics]

    E[Inventory]

    A --> B
    B --> C
    B --> D
    B --> E

Benefits

Lower latency
Better throughput
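The idea can be sketched with an in-process queue and a background worker thread standing in for Kafka and its consumers. All names here (`place_order`, `events`, `processed`) are hypothetical; a real system would publish to a message broker instead.

```python
import queue
import threading

events = queue.Queue()   # in-process stand-in for a broker such as Kafka
processed = []           # records what the background worker has done

def worker():
    """Consume events in the background, off the request path."""
    while True:
        event = events.get()
        # Stand-in for slow work such as sending an email.
        processed.append(f"email sent for order {event['order_id']}")
        events.task_done()

def place_order(order_id):
    """Request path: enqueue the slow work and return immediately."""
    events.put({"order_id": order_id})
    return {"order_id": order_id, "status": "accepted"}

threading.Thread(target=worker, daemon=True).start()

response = place_order(42)   # returns without waiting for the email
events.join()                # demo only: wait until the worker has caught up
print(response, processed)
```

The caller gets its response as soon as the event is enqueued, so the slow work no longer adds to request latency.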

CDN Reduces Latency

flowchart LR
    A[Users]

    B[CloudFront]

    C[S3]

    A --> B
    B --> C

Images are served from the nearest edge location.

Real-Time Banking Example

Money Transfer

flowchart TD
    A[Customer]

    B[API Gateway]

    C[Payment Service]

    D[Fraud Service]

    E[(Database)]

    F[Kafka]

    G[Notification]

    A --> B
    B --> C
    C --> D
    D --> E
    D --> F
    F --> G

The customer receives a response as soon as the transfer is processed.

The SMS notification is sent asynchronously through Kafka, so it does not add to the response time.

Real-World Example — Netflix

Netflix minimizes latency by:

CDN (Open Connect)
Distributed caching
Regional deployments
Adaptive streaming
Load balancing

Millions of videos stream simultaneously with low buffering.

Real-World Example — Amazon

Amazon improves throughput by:

Horizontal scaling
Auto Scaling Groups
Read replicas
Redis caching
Event-driven architecture

Real-World Example — Uber

Ride request flow:

Ride Request

↓

Driver Matching

↓

Payment

↓

Notification

Driver matching is on the critical path, so it must be fast: the rider is waiting for the result.

Notifications happen asynchronously.

Performance Monitoring

Monitor

Average Latency
P95 Latency
P99 Latency
Throughput (Requests/sec)
CPU
Memory
Queue Length
Database Response Time

P50, P95 and P99 Latency

Metric	Meaning
P50	Median response time
P95	95% of requests complete within this time
P99	99% of requests complete within this time

Example

P50

120 ms

P95

240 ms

P99

650 ms

Architects pay close attention to P95 and P99, not just the average latency.
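A small sketch shows why the average alone can mislead. It computes percentiles with the nearest-rank method, which is one of several common definitions, over an invented sample in which 2 of 100 requests are very slow.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]

# Invented sample: 98 fast requests (100 ms) and 2 slow ones (2000 ms).
latencies_ms = [100] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.0f} ms")
for p in (50, 95, 99):
    print(f"P{p}={percentile(latencies_ms, p)} ms")
```

In this sample the mean and P95 look healthy, while P99 exposes the slow requests that some users actually experience.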

Common Performance Optimization Techniques

Technique	Benefit
Redis Cache	Lower latency
Load Balancer	Higher throughput
Auto Scaling	Better scalability
Database Indexing	Faster queries
CDN	Faster static content
Kafka	Async processing
Connection Pooling	Reduced DB overhead
Compression	Faster network transfer

Common Mistakes

❌ Calling the database repeatedly for data that a single query could fetch

❌ Missing indexes

❌ Loading unnecessary data

❌ Blocking API calls

❌ No caching

❌ Long database transactions

❌ Large payloads

❌ Synchronous notifications

Best Practices

Cache frequently accessed data.
Keep APIs lightweight.
Optimize SQL queries.
Add proper indexes.
Use asynchronous processing.
Scale horizontally.
Use CDNs for static content.
Monitor P95 and P99 latency.
Perform load testing before production.
Continuously identify bottlenecks.

Common Interview Questions

What is Latency?

Latency is the time taken for a single request to travel through the system and return a response.

What is Throughput?

Throughput is the number of requests or transactions a system can process within a given time period.

Can a system have low latency but poor throughput?

Yes. A system may respond quickly to a few users but fail to handle large numbers of concurrent requests.

How does caching improve latency?

Caching stores frequently accessed data in memory, reducing expensive database lookups and improving response time.

Why do architects monitor P95 and P99 latency?

Average latency can hide slow requests. P95 and P99 show how the slowest 5% and 1% of requests behave, which exposes tail-latency issues that affect user experience, especially under heavy load.

Summary

In this article, we explored two of the most important performance metrics in System Design:

Latency
Throughput

We covered:

Latency fundamentals
Throughput fundamentals
Response time
Network and database latency
Bottlenecks
Queueing
Concurrency
Caching
Asynchronous processing
CDN
Real-world examples
Performance monitoring
P95 and P99 latency
Best practices

Modern distributed systems achieve excellent performance by combining efficient algorithms, caching, horizontal scaling, asynchronous messaging, optimized databases, and continuous performance monitoring. Understanding the trade-off between latency and throughput is essential for designing scalable, high-performance enterprise applications.
