Full Stack • Java • System Design • Cloud • AI Engineering

Database Indexing Fundamentals in System Design

Learn Database Indexing from a System Design perspective. This guide explains how database indexes work, B-Tree and Hash indexes, clustered and non-clustered indexes, composite indexes, covering indexes, query optimization, execution plans, and real-world examples from Amazon, Banking, Netflix, and Uber.


Introduction

Imagine you have a Customers table with 100 million records.

A customer logs into your banking application.

The application executes:

SELECT * FROM customers
WHERE customer_id = 100001;

How does the database find one row among 100 million rows?

Without an index,

the database may scan every row.

100 Million Rows

↓

Sequential Scan

↓

Slow Response

With an index,

the database jumps directly to the required row.

Index

↓

Locate Row

↓

Fast Response

Indexes are one of the most powerful performance optimization techniques in relational databases.


Learning Objectives

After completing this article, you will understand:

  • What is an Index?
  • Why Indexes are Important
  • Sequential Scan vs Index Scan
  • B-Tree Index
  • Hash Index
  • Clustered Index
  • Non-Clustered Index
  • Composite Index
  • Covering Index
  • Query Execution Plan
  • Best Practices

What is a Database Index?

A Database Index is a data structure that improves the speed of retrieving rows.

Think of it like the index in a book.

Instead of reading every page,

you jump directly to the required topic.


Book Example

Without Index

flowchart TD
    A["Open Book"]
    B["Page 1"]
    C["Page 2"]
    D["..."]
    E["Page 500"]

    A --> B
    B --> C
    C --> D
    D --> E

With Index

flowchart TD
    A["Index"]
    B["Java"]
    C["Page 350"]

    A --> B
    B --> C

Exactly how database indexes work.


Without Index

flowchart TD
    APP["Application"]
    DB["Database"]
    SCAN["Sequential Scan"]
    ROWS["100 Million Rows"]

    APP --> DB
    DB --> SCAN
    SCAN --> ROWS

Every row is checked.


With Index

flowchart TD
    APP["Application"]
    IDX["Index"]
    ROW["Target Row"]
    DB["Database"]

    APP --> IDX
    IDX --> ROW
    ROW --> DB

Only a few pages are accessed.


Sequential Scan

Example

SELECT *
FROM customers
WHERE email='[email protected]';

Without index

Row 1

Row 2

Row 3

...

Row 100 Million

Time complexity

O(n)

Index Scan

With an index

Index

↓

Target Row

Time complexity

O(log n)

Much faster.


B-Tree Index

The default index type in most relational databases.

flowchart TD
    ROOT["50"]

    LEFT["25"]
    RIGHT["75"]

    ROOT --> LEFT
    ROOT --> RIGHT

    LEFT --> L1["10"]
    LEFT --> L2["30"]

    RIGHT --> R1["60"]
    RIGHT --> R2["90"]

Advantages

  • Fast search
  • Fast sorting
  • Supports range queries

Used by

  • PostgreSQL
  • MySQL
  • Oracle
  • SQL Server

B-Tree Search

Searching

75

Database traverses

Root

↓

75

↓

Found

Instead of scanning every row.


Hash Index

Hash indexes use a hash function.

flowchart LR
    KEY["Customer ID = 1001"]
    HASH["Hash Function"]
    BUCKET["Bucket #42"]
    ROW["Target Row"]

    KEY --> HASH
    HASH --> BUCKET
    BUCKET --> ROW

Excellent for

WHERE customer_id=1001

Not suitable for

ORDER BY

BETWEEN

B-Tree vs Hash

Feature B-Tree Hash
Equality Search Excellent Excellent
Range Queries Yes No
ORDER BY Yes No
Default Index Yes No

Clustered Index

Rows are stored physically in index order.

flowchart TD
    IDX["Clustered Index"]
    R1["Row 1"]
    R2["Row 2"]
    R3["Row 3"]
    R4["Row 4"]

    IDX --> R1
    R1 --> R2
    R2 --> R3
    R3 --> R4

Only one clustered index per table.

Example

Primary Key.


Non-Clustered Index

Data remains separate.

flowchart LR
    IDX["Non-Clustered Index"]
    PTR["Row Pointer"]
    TABLE["Actual Table Data"]

    IDX --> PTR
    PTR --> TABLE

Multiple non-clustered indexes are allowed.


Clustered vs Non-Clustered

Clustered Non-Clustered
Data sorted Pointer to data
One per table Multiple allowed
Fast range scans Good for lookups

Composite Index

Index on multiple columns.

Example

CREATE INDEX idx_customer
ON customers(last_name, city);

Works well for

WHERE last_name='Smith'
AND city='Dallas';

Composite Index Flow

flowchart LR
    A["Last Name"]
    B["City"]
    IDX["Composite Index"]
    ROW["Target Row"]

    A --> IDX
    B --> IDX
    IDX --> ROW

Covering Index

Index contains all required columns.

Query

SELECT name,email
FROM customers
WHERE customer_id=1001;

If the index already stores

  • customer_id
  • name
  • email

Database avoids reading the table.

Very fast.


Unique Index

Guarantees uniqueness.

CREATE UNIQUE INDEX
idx_email
ON customers(email);

Prevents duplicate emails.


Query Execution Plan

Every database decides

how to execute queries.

flowchart LR
    SQL["SQL Query"]
    OPT["Optimizer"]
    PLAN["Execution Plan"]
    EXEC["Execute"]
    RESULT["Result"]

    SQL --> OPT
    OPT --> PLAN
    PLAN --> EXEC
    EXEC --> RESULT

EXPLAIN Example

EXPLAIN
SELECT *
FROM customers
WHERE customer_id=1001;

Output

Index Scan

Using idx_customer

Without index

Sequential Scan

Banking Example

Query

SELECT *
FROM accounts
WHERE account_number='123456789';

Account Number should always be indexed.

Otherwise,

customer login becomes slow.


Amazon Example

Search

Product ID

SKU

Category

Indexes exist on

  • Product ID
  • Seller ID
  • Category

for fast searches.


Netflix Example

Indexes on

  • Movie ID
  • Genre
  • Release Year

speed up catalog searches.


Uber Example

Indexes on

  • Driver ID
  • Trip ID
  • Rider ID

allow fast ride lookups.


Spring Boot Architecture

flowchart LR
    REACT["React"]
    SPRING["Spring Boot"]
    JPA["Spring Data JPA"]
    TABLE["PostgreSQL Table"]
    INDEX["B-Tree Index"]

    REACT --> SPRING
    SPRING --> JPA
    JPA --> TABLE
    INDEX -. "Speeds up queries" .-> TABLE

JPA queries automatically benefit from properly designed database indexes.


Good Index Candidates

Create indexes on

  • Primary Keys
  • Foreign Keys
  • Frequently searched columns
  • JOIN columns
  • ORDER BY columns
  • WHERE clause columns

Avoid Indexing

Avoid indexes on

  • Small tables
  • Frequently updated columns
  • Boolean columns
  • Columns with very low selectivity
  • Temporary tables

Advantages

  • Faster SELECT queries
  • Faster JOIN operations
  • Better sorting
  • Better filtering
  • Lower CPU usage
  • Improved scalability

Disadvantages

  • Increased storage
  • Slower INSERT
  • Slower UPDATE
  • Slower DELETE
  • Additional maintenance

Every index must be updated whenever indexed data changes.


Monitoring

Monitor

  • Slow Queries
  • Sequential Scans
  • Index Usage
  • Query Latency
  • Lock Wait Time
  • Deadlocks
  • CPU Usage
  • Buffer Cache Hit Ratio

Tools

  • PostgreSQL EXPLAIN ANALYZE
  • pg_stat_statements
  • MySQL EXPLAIN
  • Oracle AWR Reports
  • Datadog
  • Grafana

Common Mistakes

❌ Indexing every column

❌ Ignoring slow query logs

❌ Duplicate indexes

❌ Missing indexes on JOIN columns

❌ Not reviewing execution plans

❌ Keeping unused indexes


Best Practices

  • Index frequently queried columns.
  • Use B-Tree indexes for general-purpose queries.
  • Use composite indexes carefully based on query patterns.
  • Remove unused indexes periodically.
  • Analyze execution plans before creating indexes.
  • Monitor slow queries continuously.
  • Keep indexes as small as possible.
  • Rebuild fragmented indexes when necessary.

Common Interview Questions

What is a Database Index?

A database index is a data structure that allows the database to locate rows quickly without scanning the entire table.


Why are B-Tree indexes the default?

B-Tree indexes support equality searches, range queries, sorting, and ordered traversal, making them suitable for a wide variety of workloads.


What is the difference between a Clustered and a Non-Clustered Index?

A clustered index determines the physical order of rows in a table, while a non-clustered index stores pointers to the actual rows. A table typically has one clustered index but can have many non-clustered indexes.


What is a Composite Index?

A composite index contains multiple columns and is most effective when queries filter or sort using those columns in the same order as the index definition.


Why shouldn't every column be indexed?

Each additional index consumes storage and increases the cost of INSERT, UPDATE, and DELETE operations because the database must maintain every index when data changes.


Summary

Database indexes are fundamental to high-performance applications. They reduce query execution time, improve scalability, and enable efficient searching, sorting, and joining of large datasets.

In this article, we covered:

  • Index fundamentals
  • Sequential Scan vs Index Scan
  • B-Tree
  • Hash Index
  • Clustered & Non-Clustered Indexes
  • Composite Index
  • Covering Index
  • Query Execution Plans
  • Banking, Amazon, Netflix, and Uber examples
  • Monitoring
  • Best practices

Proper indexing is one of the highest-impact optimizations in database design. Combined with efficient SQL queries, caching, and good schema design, indexes enable applications to serve millions of requests with low latency while minimizing database resource consumption.


Loading likes...

Comments

Share a question, correction, or practical insight about this article.

Loading approved comments...