Latency and Throughput in System Design
Learn Latency and Throughput from a System Design perspective with real-world examples. This guide explains response time, throughput, bottlenecks, concurrency, queueing, performance optimization, and the techniques used by Amazon, Netflix, Uber, and banking systems.
Introduction
Imagine you open Amazon to buy a product.
You click "Buy Now".
How long should it take?
- 50 ms ✅
- 150 ms ✅
- 500 ms 😐
- 5 seconds ❌
Now imagine Amazon during Black Friday.
Millions of customers are purchasing products simultaneously.
The system must not only respond quickly but also process millions of requests every second.
This introduces two important performance metrics:
- Latency → How fast does one request complete?
- Throughput → How many requests can the system handle?
Every high-scale system, including banking applications, Netflix, Uber, Google Search, and payment gateways, is designed by balancing these two metrics.
Learning Objectives
After completing this article, you will understand:
- What is Latency?
- What is Throughput?
- Response Time
- Network Latency
- Processing Latency
- Database Latency
- Bottlenecks
- Concurrency
- Queueing
- Performance Optimization
- Real-world Examples
- Best Practices
What is Latency?
Latency is the time taken to complete one request.
Example
Customer clicks Login
↓
Request Sent
↓
Server Processes Request
↓
Response Returned
↓
250 ms
Latency answers:
"How long does one operation take?"
Latency Flow
flowchart LR
A[Client]
B[API Gateway]
C[Spring Boot]
D[(Database)]
A --> B
B --> C
C --> D
Each step contributes to total latency.
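To see how each step contributes, one option is to time every stage separately and add the results. The sketch below is illustrative only: the stage names and the `time.sleep` calls are placeholders for real gateway, application, and database work, and `timed` is a hypothetical helper, not part of any framework.

```python
import time
from contextlib import contextmanager

stage_ms = {}

@contextmanager
def timed(stage):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage] = (time.perf_counter() - start) * 1000.0

def handle_request():
    # The sleeps are placeholders for real gateway, application, and database work.
    with timed("gateway"):
        time.sleep(0.005)
    with timed("application"):
        time.sleep(0.010)
    with timed("database"):
        time.sleep(0.020)
    return "OK"

handle_request()
total_ms = sum(stage_ms.values())
for stage, ms in stage_ms.items():
    print(f"{stage:<12}{ms:6.1f} ms")
print(f"{'total':<12}{total_ms:6.1f} ms")
```

The exact numbers vary from run to run, but the total is always the sum of the stage timings.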
What is Throughput?
Throughput is the number of requests processed in a given period of time.
Examples
- 500 Requests/Second
- 20,000 Transactions/Minute
- 5 Million Messages/Hour
Throughput answers:
"How much work can the system perform?"
Throughput Example
flowchart LR
A[Users]
B[Load Balancer]
C[Application Cluster]
A --> B
B --> C
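If one application instance can process:
100 Requests/Second
Then, assuming traffic is spread evenly and no shared dependency (such as the database) saturates first:
10 Servers
↓
Up to 1000 Requests/Second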
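The arithmetic above can be written as two small functions. This is a sketch under an optimistic assumption: scaling is perfectly linear, which real clusters rarely achieve once a shared dependency becomes busy. The function names are made up for illustration.

```python
def throughput_per_second(completed_requests, elapsed_seconds):
    """Throughput = completed work divided by the time it took."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed_requests / elapsed_seconds

def ideal_cluster_throughput(per_server_rps, servers):
    """Upper bound that assumes perfectly linear scaling across servers."""
    return per_server_rps * servers

# 20,000 transactions per minute expressed as requests per second.
print(round(throughput_per_second(20_000, 60), 1))
# The 10-server example: 100 requests/second each.
print(ideal_cluster_throughput(100, 10))
```

Treat the second result as a ceiling rather than a prediction; measured throughput under load testing is the number to trust.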
Latency vs Throughput
| Latency | Throughput |
|---|---|
| Time per request | Requests processed |
| Measured in ms | Requests/sec |
| Lower is better | Higher is better |
| User Experience | System Capacity |
Real-World Example
Imagine a supermarket.
Customer waits:
2 Minutes
↓
Checkout Completed
This is Latency.
Now imagine:
100 Customers
↓
Processed Every Minute
This is Throughput.
Types of Latency
Large enterprise systems have multiple latency components.
flowchart TD
A[Total Latency]
A --> B[Network]
A --> C[Application]
A --> D[Database]
A --> E[External APIs]
Network Latency
Network delay occurs while data travels between client and server.
flowchart LR
A[Browser]
B[Internet]
C[AWS Load Balancer]
D[Application]
A --> B
B --> C
C --> D
Typical Causes
- Long geographic distance
- DNS lookup
- Slow internet
- VPN
Application Latency
Application latency is the time spent processing the request inside the application itself (Spring Boot in this example).
flowchart LR
A[Controller]
B[Service]
C[Repository]
D[(Database)]
A --> B
B --> C
C --> D
Common causes
- Complex logic
- Large loops
- Reflection
- Blocking operations
Database Latency
flowchart LR
A[Application]
B[(Database)]
A --> B
Common causes
- Missing indexes
- Table scans
- Slow joins
- Locks
- Large transactions
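The first two causes, missing indexes and table scans, can be observed directly by asking the database for its query plan. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table, its columns, and the index name are invented for the example, and other databases expose the same idea through their own `EXPLAIN` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)

def query_plan(sql):
    """Return SQLite's plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column is the plan detail

QUERY = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = query_plan(QUERY)   # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(QUERY)    # with the index: a search using idx_orders_customer

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the shift from a scan to an index search is what matters.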
Total Request Time
Network
30 ms
+
Application
70 ms
+
Database
120 ms
=
220 ms
Performance Bottlenecks
A bottleneck limits the overall system performance.
flowchart LR
A[Client]
B[API]
C[Database]
D[Slow Query]
A --> B
B --> C
C --> D
In a chain of sequential steps, the slowest component contributes the most to overall latency and caps the throughput of the whole system.
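A small model makes the effect of a bottleneck concrete. Assuming the stages run one after another, their latencies add up, while sustained throughput is capped by the lowest-capacity stage. The latencies reuse the 30/70/120 ms breakdown from this article; the capacity figures are made-up numbers for illustration.

```python
def pipeline_summary(stages):
    """stages maps a stage name to (latency_ms, capacity_rps)."""
    total_latency_ms = sum(latency for latency, _ in stages.values())
    # The bottleneck is the stage that can handle the fewest requests per second.
    bottleneck = min(stages, key=lambda name: stages[name][1])
    max_throughput_rps = stages[bottleneck][1]
    return total_latency_ms, bottleneck, max_throughput_rps

stages = {
    "network": (30, 5000),
    "application": (70, 1200),
    "database": (120, 400),   # hypothetical capacity: the slow query lives here
}

total_latency_ms, bottleneck, max_throughput_rps = pipeline_summary(stages)
print(total_latency_ms, bottleneck, max_throughput_rps)   # 220 database 400
```

In this model, adding application servers would not raise throughput above 400 requests/second until the database stage is improved.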
Queueing
Requests sometimes wait before processing.
flowchart LR
A[Users]
B[Queue]
C[Application]
A --> B
B --> C
Long queues increase latency, because each request's response time includes the time it spent waiting in the queue.
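A deterministic simulation shows why. The sketch assumes a single worker, a first-in-first-out queue, evenly spaced arrivals, and a fixed service time; real traffic is burstier, so treat the numbers as illustrative only.

```python
def simulate_queue(arrival_interval_ms, service_ms, requests):
    """Single worker, FIFO queue. Returns each request's latency (wait + service) in ms."""
    latencies = []
    worker_free_at = 0
    for i in range(requests):
        arrived_at = i * arrival_interval_ms
        started_at = max(arrived_at, worker_free_at)   # wait if the worker is busy
        finished_at = started_at + service_ms
        worker_free_at = finished_at
        latencies.append(finished_at - arrived_at)
    return latencies

# Requests arrive every 10 ms in both runs; only the service time differs.
keeping_up = simulate_queue(arrival_interval_ms=10, service_ms=8, requests=100)
overloaded = simulate_queue(arrival_interval_ms=10, service_ms=12, requests=100)

print(max(keeping_up))   # 8: the worker is always free when a request arrives
print(max(overloaded))   # 210: every request waits longer than the one before it
```

Once work arrives faster than it can be served, the queue never drains and latency keeps climbing even though the service time itself has not changed.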
Concurrency
Concurrency is the ability to handle multiple requests at the same time.
flowchart TD
A[Users]
B[Thread Pool]
C[Application]
A --> B
B --> C
Concurrency improves throughput as long as no shared resource (CPU, thread pool, database connections) becomes saturated.
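The effect of a thread pool can be sketched with Python's standard `ThreadPoolExecutor`. The 50 ms sleep is a placeholder for a blocking I/O call, and `handle_request` and `run` are hypothetical names; the example assumes I/O-bound work, which is where threads help most.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    time.sleep(0.05)   # placeholder for a blocking I/O call (about 50 ms)
    return request_id

def run(requests, workers):
    """Process all requests with a fixed-size thread pool; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(handle_request, range(requests)))
    return results, time.perf_counter() - start

_, one_worker_s = run(requests=8, workers=1)
_, eight_workers_s = run(requests=8, workers=8)

print(f"1 worker:  {8 / one_worker_s:.0f} requests/second")
print(f"8 workers: {8 / eight_workers_s:.0f} requests/second")
```

Each request still takes about 50 ms, so latency is unchanged; the pool raises throughput by overlapping the waiting.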
Scaling Throughput
flowchart TD
A[Users]
B[Load Balancer]
C[App 1]
D[App 2]
E[App 3]
A --> B
B --> C
B --> D
B --> E
More servers
↓
Higher Throughput
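One simple way a load balancer can spread requests is round-robin rotation. The sketch below is a toy in-process version with invented backend names; production load balancers also handle health checks, weights, and connection draining, which are left out here.

```python
import itertools
from collections import Counter

class RoundRobinBalancer:
    """Hand out backends in a fixed rotation so each gets an equal share of requests."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = itertools.cycle(list(backends))

    def next_backend(self):
        return next(self._rotation)

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = Counter(balancer.next_backend() for _ in range(9))
print(dict(assignments))   # each of the three backends receives 3 of the 9 requests
```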
Caching Reduces Latency
flowchart LR
A[Client]
B[Application]
C[Redis Cache]
D[(Database)]
A --> B
B --> C
C --> D
Without Cache
Database
250 ms
With Cache
Redis
5 ms
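A common way to get this effect is the cache-aside pattern: check the cache first and go to the database only on a miss. In the sketch below a plain dictionary stands in for Redis, and `CacheAside` and `load_product` are hypothetical names. Expiry and invalidation are deliberately left out, although a real cache needs both.

```python
class CacheAside:
    """Check the cache first; on a miss, load from the database and remember the result."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}     # stands in for Redis in this sketch
        self.db_calls = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]        # cache hit: no database round trip
        self.db_calls += 1
        value = self._load_from_db(key)    # cache miss: pay the database cost once
        self._cache[key] = value
        return value

def load_product(product_id):
    # Hypothetical lookup; a real one would query the database.
    return {"id": product_id, "name": f"product-{product_id}"}

products = CacheAside(load_product)
for _ in range(100):
    products.get(7)
print(products.db_calls)   # 1: the other 99 reads were served from the cache
```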
Asynchronous Processing
Long-running work should happen in the background.
flowchart LR
A[Order Service]
B[Kafka]
C[Email]
D[Analytics]
E[Inventory]
A --> B
B --> C
B --> D
B --> E
Benefits
- Lower latency
- Better throughput
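The idea can be sketched without a message broker. Below, a `queue.Queue` and a background thread stand in for Kafka and a consumer service; `place_order` and `email_worker` are hypothetical names, and the 50 ms sleep is a placeholder for a slow email provider.

```python
import queue
import threading
import time

events = queue.Queue()   # stands in for a Kafka topic in this sketch
sent_emails = []

def email_worker():
    """Background consumer: handles events after the request has already returned."""
    while True:
        order_id = events.get()
        if order_id is None:        # sentinel: stop the worker
            events.task_done()
            break
        time.sleep(0.05)            # placeholder for a slow email provider call
        sent_emails.append(order_id)
        events.task_done()

def place_order(order_id):
    """Request path: publish the event and return without waiting for the email."""
    events.put(order_id)
    return {"order_id": order_id, "status": "accepted"}

worker = threading.Thread(target=email_worker, daemon=True)
worker.start()

start = time.perf_counter()
response = place_order(101)
request_ms = (time.perf_counter() - start) * 1000.0

events.join()       # only the demo waits here; a real request would not
events.put(None)
worker.join()
print(response, f"{request_ms:.2f} ms", sent_emails)
```

The request returns in a fraction of the time the email takes, because the slow work happens off the request path.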
CDN Reduces Latency
flowchart LR
A[Users]
B[CloudFront]
C[S3]
A --> B
B --> C
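Static content such as images is served from the edge location nearest to the user, so requests do not have to travel to the origin (S3 in this example).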
Real-Time Banking Example
Money Transfer
flowchart TD
A[Customer]
B[API Gateway]
C[Payment Service]
D[Fraud Service]
E[(Database)]
F[Kafka]
G[Notification]
A --> B
B --> C
C --> D
D --> E
D --> F
F --> G
The customer receives a response as soon as the transfer is processed.
The SMS notification is sent asynchronously through Kafka, so it does not add to the response time.
Real-World Example — Netflix
Netflix minimizes latency by:
- CDN (Open Connect)
- Distributed caching
- Regional deployments
- Adaptive streaming
- Load balancing
Millions of videos stream simultaneously with low buffering.
Real-World Example — Amazon
Amazon improves throughput by:
- Horizontal scaling
- Auto Scaling Groups
- Read replicas
- Redis caching
- Event-driven architecture
Real-World Example — Uber
Ride request flow:
Ride Request
↓
Driver Matching
↓
Payment
↓
Notification
Driver matching is on the critical path of the request, so it must be as fast as possible.
Notifications happen asynchronously.
Performance Monitoring
Monitor
- Average Latency
- P95 Latency
- P99 Latency
- Throughput (requests/sec)
- CPU utilization
- Memory usage
- Queue Length
- Database Response Time
P50, P95 and P99 Latency
| Metric | Meaning |
|---|---|
| P50 | Median response time |
| P95 | 95% of requests complete within this time |
| P99 | 99% of requests complete within this time |
Example
P50
120 ms
P95
240 ms
P99
650 ms
Architects pay close attention to P95 and P99, not just the average latency.
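To show why the average is not enough, the sketch below computes percentiles with the nearest-rank method, which is one of several common definitions (monitoring tools may interpolate instead). The latency samples are made up: mostly fast requests with a slow tail.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]

# 100 made-up latencies in ms: 90 fast, 8 slower, 2 very slow.
latencies_ms = [100] * 90 + [300] * 8 + [2000] * 2

average = sum(latencies_ms) / len(latencies_ms)
print("avg", average)                          # 154.0
print("p50", percentile(latencies_ms, 50))     # 100
print("p95", percentile(latencies_ms, 95))     # 300
print("p99", percentile(latencies_ms, 99))     # 2000
```

An average of 154 ms looks healthy, yet one request in a hundred takes two full seconds. Only the P99 value exposes that.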
Common Performance Optimization Techniques
| Technique | Benefit |
|---|---|
| Redis Cache | Lower latency |
| Load Balancer | Higher throughput |
| Auto Scaling | Better scalability |
| Database Indexing | Faster queries |
| CDN | Faster static content |
| Kafka | Async processing |
| Connection Pooling | Reduced DB overhead |
| Compression | Faster network transfer |
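As one example from the table, the benefit of compression is easy to check with the standard `gzip` module. The JSON payload below is invented and deliberately repetitive, so the ratio is better than typical real responses would show; compression also costs CPU time, which this sketch does not measure.

```python
import gzip
import json

# Made-up API response: 1,000 similar order records.
orders = [{"id": i, "status": "DELIVERED", "currency": "USD"} for i in range(1000)]
payload = json.dumps(orders).encode("utf-8")

compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

print(f"original:   {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes")
```

Fewer bytes on the wire means less transfer time, which matters most on slow or distant connections.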
Common Mistakes
❌ Calling the database repeatedly for data that a single query could fetch
❌ Missing indexes
❌ Loading unnecessary data
❌ Blocking API calls
❌ No caching
❌ Long database transactions
❌ Large payloads
❌ Synchronous notifications
Best Practices
- Cache frequently accessed data.
- Keep APIs lightweight.
- Optimize SQL queries.
- Add proper indexes.
- Use asynchronous processing.
- Scale horizontally.
- Use CDNs for static content.
- Monitor P95 and P99 latency.
- Perform load testing before production.
- Continuously identify bottlenecks.
Common Interview Questions
What is Latency?
Latency is the time taken for a single request to travel through the system and return a response.
What is Throughput?
Throughput is the number of requests or transactions a system can process within a given time period.
Can a system have low latency but poor throughput?
Yes. A system may respond quickly to a few users but fail to handle large numbers of concurrent requests.
How does caching improve latency?
Caching stores frequently accessed data in memory, reducing expensive database lookups and improving response time.
Why do architects monitor P95 and P99 latency?
Average latency can hide slow requests. P95 and P99 show how the slowest 5% and 1% of requests behave, which helps identify tail-latency issues that affect user experience, especially under heavy load.
Summary
In this article, we explored two of the most important performance metrics in System Design:
- Latency
- Throughput
We covered:
- Latency fundamentals
- Throughput fundamentals
- Response time
- Network and database latency
- Bottlenecks
- Queueing
- Concurrency
- Caching
- Asynchronous processing
- CDN
- Real-world examples
- Performance monitoring
- P95 and P99 latency
- Best practices
Modern distributed systems achieve excellent performance by combining efficient algorithms, caching, horizontal scaling, asynchronous messaging, optimized databases, and continuous performance monitoring. Understanding the trade-off between latency and throughput is essential for designing scalable, high-performance enterprise applications.