Database Indexing Fundamentals in System Design
Learn Database Indexing from a System Design perspective. This guide explains how database indexes work, B-Tree and Hash indexes, clustered and non-clustered indexes, composite indexes, covering indexes, query optimization, execution plans, and real-world examples from Amazon, Banking, Netflix, and Uber.
Introduction
Imagine you have a Customers table with 100 million records.
A customer logs into your banking application.
The application executes:
SELECT * FROM customers
WHERE customer_id = 100001;
How does the database find one row among 100 million rows?
Without an index,
the database may scan every row.
100 Million Rows
↓
Sequential Scan
↓
Slow Response
With an index,
the database jumps directly to the required row.
Index
↓
Locate Row
↓
Fast Response
Indexes are one of the most powerful performance optimization techniques in relational databases.
Learning Objectives
After completing this article, you will understand:
- What is an Index?
- Why Indexes are Important
- Sequential Scan vs Index Scan
- B-Tree Index
- Hash Index
- Clustered Index
- Non-Clustered Index
- Composite Index
- Covering Index
- Query Execution Plan
- Best Practices
What is a Database Index?
A Database Index is a data structure that improves the speed of retrieving rows.
Think of it like the index in a book.
Instead of reading every page,
you jump directly to the required topic.
Book Example
Without Index
flowchart TD
A["Open Book"]
B["Page 1"]
C["Page 2"]
D["..."]
E["Page 500"]
A --> B
B --> C
C --> D
D --> E
With Index
flowchart TD
A["Index"]
B["Java"]
C["Page 350"]
A --> B
B --> C
Exactly how database indexes work.
Without Index
flowchart TD
APP["Application"]
DB["Database"]
SCAN["Sequential Scan"]
ROWS["100 Million Rows"]
APP --> DB
DB --> SCAN
SCAN --> ROWS
Every row is checked.
With Index
flowchart TD
APP["Application"]
IDX["Index"]
ROW["Target Row"]
DB["Database"]
APP --> IDX
IDX --> ROW
ROW --> DB
Only a few pages are accessed.
Sequential Scan
Example
SELECT *
FROM customers
WHERE email='[email protected]';
Without index
Row 1
Row 2
Row 3
...
Row 100 Million
Time complexity
O(n)
Index Scan
With an index
Index
↓
Target Row
Time complexity
O(log n)
Much faster.
B-Tree Index
The default index type in most relational databases.
flowchart TD
ROOT["50"]
LEFT["25"]
RIGHT["75"]
ROOT --> LEFT
ROOT --> RIGHT
LEFT --> L1["10"]
LEFT --> L2["30"]
RIGHT --> R1["60"]
RIGHT --> R2["90"]
Advantages
- Fast search
- Fast sorting
- Supports range queries
Used by
- PostgreSQL
- MySQL
- Oracle
- SQL Server
B-Tree Search
Searching
75
Database traverses
Root
↓
75
↓
Found
Instead of scanning every row.
Hash Index
Hash indexes use a hash function.
flowchart LR
KEY["Customer ID = 1001"]
HASH["Hash Function"]
BUCKET["Bucket #42"]
ROW["Target Row"]
KEY --> HASH
HASH --> BUCKET
BUCKET --> ROW
Excellent for
WHERE customer_id=1001
Not suitable for
ORDER BY
BETWEEN
B-Tree vs Hash
| Feature | B-Tree | Hash |
|---|---|---|
| Equality Search | Excellent | Excellent |
| Range Queries | Yes | No |
| ORDER BY | Yes | No |
| Default Index | Yes | No |
Clustered Index
Rows are stored physically in index order.
flowchart TD
IDX["Clustered Index"]
R1["Row 1"]
R2["Row 2"]
R3["Row 3"]
R4["Row 4"]
IDX --> R1
R1 --> R2
R2 --> R3
R3 --> R4
Only one clustered index per table.
Example
Primary Key.
Non-Clustered Index
Data remains separate.
flowchart LR
IDX["Non-Clustered Index"]
PTR["Row Pointer"]
TABLE["Actual Table Data"]
IDX --> PTR
PTR --> TABLE
Multiple non-clustered indexes are allowed.
Clustered vs Non-Clustered
| Clustered | Non-Clustered |
|---|---|
| Data sorted | Pointer to data |
| One per table | Multiple allowed |
| Fast range scans | Good for lookups |
Composite Index
Index on multiple columns.
Example
CREATE INDEX idx_customer
ON customers(last_name, city);
Works well for
WHERE last_name='Smith'
AND city='Dallas';
Composite Index Flow
flowchart LR
A["Last Name"]
B["City"]
IDX["Composite Index"]
ROW["Target Row"]
A --> IDX
B --> IDX
IDX --> ROW
Covering Index
Index contains all required columns.
Query
SELECT name,email
FROM customers
WHERE customer_id=1001;
If the index already stores
- customer_id
- name
Database avoids reading the table.
Very fast.
Unique Index
Guarantees uniqueness.
CREATE UNIQUE INDEX
idx_email
ON customers(email);
Prevents duplicate emails.
Query Execution Plan
Every database decides
how to execute queries.
flowchart LR
SQL["SQL Query"]
OPT["Optimizer"]
PLAN["Execution Plan"]
EXEC["Execute"]
RESULT["Result"]
SQL --> OPT
OPT --> PLAN
PLAN --> EXEC
EXEC --> RESULT
EXPLAIN Example
EXPLAIN
SELECT *
FROM customers
WHERE customer_id=1001;
Output
Index Scan
Using idx_customer
Without index
Sequential Scan
Banking Example
Query
SELECT *
FROM accounts
WHERE account_number='123456789';
Account Number should always be indexed.
Otherwise,
customer login becomes slow.
Amazon Example
Search
Product ID
SKU
Category
Indexes exist on
- Product ID
- Seller ID
- Category
for fast searches.
Netflix Example
Indexes on
- Movie ID
- Genre
- Release Year
speed up catalog searches.
Uber Example
Indexes on
- Driver ID
- Trip ID
- Rider ID
allow fast ride lookups.
Spring Boot Architecture
flowchart LR
REACT["React"]
SPRING["Spring Boot"]
JPA["Spring Data JPA"]
TABLE["PostgreSQL Table"]
INDEX["B-Tree Index"]
REACT --> SPRING
SPRING --> JPA
JPA --> TABLE
INDEX -. "Speeds up queries" .-> TABLE
JPA queries automatically benefit from properly designed database indexes.
Good Index Candidates
Create indexes on
- Primary Keys
- Foreign Keys
- Frequently searched columns
- JOIN columns
- ORDER BY columns
- WHERE clause columns
Avoid Indexing
Avoid indexes on
- Small tables
- Frequently updated columns
- Boolean columns
- Columns with very low selectivity
- Temporary tables
Advantages
- Faster SELECT queries
- Faster JOIN operations
- Better sorting
- Better filtering
- Lower CPU usage
- Improved scalability
Disadvantages
- Increased storage
- Slower INSERT
- Slower UPDATE
- Slower DELETE
- Additional maintenance
Every index must be updated whenever indexed data changes.
Monitoring
Monitor
- Slow Queries
- Sequential Scans
- Index Usage
- Query Latency
- Lock Wait Time
- Deadlocks
- CPU Usage
- Buffer Cache Hit Ratio
Tools
- PostgreSQL EXPLAIN ANALYZE
- pg_stat_statements
- MySQL EXPLAIN
- Oracle AWR Reports
- Datadog
- Grafana
Common Mistakes
❌ Indexing every column
❌ Ignoring slow query logs
❌ Duplicate indexes
❌ Missing indexes on JOIN columns
❌ Not reviewing execution plans
❌ Keeping unused indexes
Best Practices
- Index frequently queried columns.
- Use B-Tree indexes for general-purpose queries.
- Use composite indexes carefully based on query patterns.
- Remove unused indexes periodically.
- Analyze execution plans before creating indexes.
- Monitor slow queries continuously.
- Keep indexes as small as possible.
- Rebuild fragmented indexes when necessary.
Common Interview Questions
What is a Database Index?
A database index is a data structure that allows the database to locate rows quickly without scanning the entire table.
Why are B-Tree indexes the default?
B-Tree indexes support equality searches, range queries, sorting, and ordered traversal, making them suitable for a wide variety of workloads.
What is the difference between a Clustered and a Non-Clustered Index?
A clustered index determines the physical order of rows in a table, while a non-clustered index stores pointers to the actual rows. A table typically has one clustered index but can have many non-clustered indexes.
What is a Composite Index?
A composite index contains multiple columns and is most effective when queries filter or sort using those columns in the same order as the index definition.
Why shouldn't every column be indexed?
Each additional index consumes storage and increases the cost of INSERT, UPDATE, and DELETE operations because the database must maintain every index when data changes.
Summary
Database indexes are fundamental to high-performance applications. They reduce query execution time, improve scalability, and enable efficient searching, sorting, and joining of large datasets.
In this article, we covered:
- Index fundamentals
- Sequential Scan vs Index Scan
- B-Tree
- Hash Index
- Clustered & Non-Clustered Indexes
- Composite Index
- Covering Index
- Query Execution Plans
- Banking, Amazon, Netflix, and Uber examples
- Monitoring
- Best practices
Proper indexing is one of the highest-impact optimizations in database design. Combined with efficient SQL queries, caching, and good schema design, indexes enable applications to serve millions of requests with low latency while minimizing database resource consumption.
Comments
Share a question, correction, or practical insight about this article.
Checking login status...
Loading approved comments...