
Decision Trees and Random Forest Explained

Learn Decision Trees and Random Forest algorithms with real-world examples, entropy, information gain, Gini index, overfitting, ensemble learning, Python examples, advantages, limitations, and interview questions.

What You Will Learn

In this article, you'll learn:

  • What is a Decision Tree?
  • How Decision Trees work
  • Entropy and Information Gain
  • Gini Index
  • Decision Tree Training Process
  • Overfitting Problems
  • What is Random Forest?
  • Ensemble Learning
  • Real-world Examples
  • Python Examples
  • Advantages and Limitations
  • Interview Questions

Introduction

Imagine a bank wants to approve or reject a loan application.

Instead of manually reviewing every application, the bank creates rules:

Is Salary > $80,000?

    Yes → Is Credit Score > 700?

              Yes → Approve Loan

              No  → Reject Loan

    No  → Reject Loan

This decision-making process looks like a tree.

Machine Learning uses the same concept through:

Decision Trees

One of the most interpretable machine learning algorithms.


What is a Decision Tree?

A Decision Tree is a supervised machine learning algorithm that makes decisions using a tree-like structure.

It splits data into smaller groups based on conditions.

The goal is:

Reduce Uncertainty

Increase Prediction Accuracy

Real World Analogy

Think about deciding whether to play cricket.

Is it raining?

    Yes → Stay Home

    No

       ↓

Is Ground Dry?

    Yes → Play Cricket

    No → Stay Home

Humans naturally make decisions using trees.

Decision Trees automate this process.


Decision Tree Structure

flowchart TD

A[Root Node]

A --> B[Condition 1]

B --> C[Condition 2]

B --> D[Condition 3]

C --> E[Prediction]

D --> F[Prediction]

Components of a Decision Tree

Root Node

The first decision.

Example:

Salary > 80000?

Internal Node

Intermediate decision points.

Example:

Credit Score > 700?

Branch

Possible outcomes.

Example:

Yes

No

Leaf Node

Final prediction.

Example:

Approve Loan

Reject Loan

Loan Approval Example

flowchart TD

A[Salary > 80000]

A -->|Yes| B[Credit Score > 700]

A -->|No| C[Reject Loan]

B -->|Yes| D[Approve Loan]

B -->|No| E[Reject Loan]
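Written by hand, the loan tree above is just two nested conditions. The sketch below is a minimal illustration; the function name `approve_loan` is made up for this example and is not part of any library.

```python
def approve_loan(salary: float, credit_score: int) -> str:
    """Hand-written version of the loan approval tree."""
    if salary > 80_000:            # root node
        if credit_score > 700:     # internal node
            return "Approve Loan"  # leaf node
        return "Reject Loan"       # leaf node
    return "Reject Loan"           # leaf node


print(approve_loan(100_000, 750))  # Approve Loan
print(approve_loan(90_000, 650))   # Reject Loan
print(approve_loan(50_000, 800))   # Reject Loan
```

A Decision Tree algorithm learns rules of this shape from data instead of having someone write them.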

Sample Dataset

| Salary | Credit Score | Loan Approved |
|--------|--------------|---------------|
| 100000 | 750          | Yes           |
| 90000  | 720          | Yes           |
| 50000  | 650          | No            |
| 45000  | 600          | No            |
| 120000 | 800          | Yes           |
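As a rough sketch, here is how scikit-learn could fit a tree to these five rows. The new applicant at the end is a made-up value. With so little data, the learned threshold will be a midpoint between training values, not the 80000 or 700 from the diagram, and either feature alone is enough to separate the classes.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows from the sample dataset: [salary, credit_score]
X = [[100000, 750], [90000, 720], [50000, 650], [45000, 600], [120000, 800]]
y = ["Yes", "Yes", "No", "No", "Yes"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Print the learned rules as indented text
print(export_text(model, feature_names=["salary", "credit_score"]))

# Predict for a new applicant (made-up values)
print(model.predict([[95000, 730]]))
```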

How Decision Trees Learn

The algorithm repeatedly asks:

Which feature best separates the data?

Examples:

Salary

Age

Credit Score

Experience

The best feature becomes the next split.


Entropy

Entropy measures disorder or uncertainty.

High Entropy:

Mixed Data

Low Entropy:

Pure Data

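Entropy Formula

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where p_i is the proportion of samples in S that belong to class i, and c is the number of classes.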


Entropy Example

Dataset:

10 Approvals

10 Rejections

Entropy is at its maximum (1 bit for two equally likely classes).

Dataset:

20 Approvals

0 Rejections

Entropy is zero.

Perfect separation.
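The two datasets above can be checked with a few lines of Python. This is a minimal sketch that assumes log base 2, so entropy is measured in bits; the helper name `entropy` is our own.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    # Skip empty classes: their contribution is taken as 0.
    # p * log2(1 / p) is the same as -p * log2(p).
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


print(entropy([10, 10]))  # 1.0 (10 approvals, 10 rejections)
print(entropy([20, 0]))   # 0.0 (20 approvals, 0 rejections)
```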


Information Gain

Decision Trees choose splits that maximize information gain.

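Information Gain = Entropy Before Split - Weighted Average Entropy After Split

Formula

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where H is entropy and S_v is the subset of S for which feature A takes value v.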


Goal

Choose the feature with:

Highest Information Gain

because it reduces uncertainty the most.
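As a worked example, the sketch below computes the information gain of splitting the five-row loan dataset on Salary > 80000. The helper names are our own, and entropy is measured in bits.

```python
from math import log2


def entropy(labels):
    """Shannon entropy in bits for a list of class labels."""
    total = len(labels)
    return sum(
        (labels.count(c) / total) * log2(total / labels.count(c))
        for c in set(labels)
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# (salary, loan_approved) rows from the sample dataset
rows = [(100000, "Yes"), (90000, "Yes"), (50000, "No"), (45000, "No"), (120000, "Yes")]

parent = [label for _, label in rows]
high = [label for salary, label in rows if salary > 80000]
low = [label for salary, label in rows if salary <= 80000]

gain = information_gain(parent, [high, low])
print(round(gain, 3))  # 0.971
```

Both child groups are pure here, so the split removes all of the parent's uncertainty and the gain equals the parent entropy.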


Gini Index

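Gini Index is another popular split metric, used by CART trees.

It measures impurity: how often a randomly chosen sample would be mislabeled if it were labeled at random according to the class distribution in the node.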

Lower Gini:

Better Split

Higher Gini:

Poor Split

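Formula

$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$

where p_i is the proportion of samples in S that belong to class i, and c is the number of classes.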
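A minimal sketch of Gini impurity, computed as one minus the sum of squared class proportions. The helper names `gini` and `split_gini` are our own.

```python
def gini(counts):
    """Gini impurity for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(left, right):
    """Size-weighted Gini impurity of the two groups produced by a split."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)


print(gini([10, 10]))              # 0.5 (maximum for two classes)
print(gini([20, 0]))               # 0.0 (pure)
print(split_gini([3, 0], [0, 2]))  # 0.0 (both sides pure: a perfect split)
```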


Decision Tree Training Process

flowchart TD

A[Training Dataset]

A --> B[Calculate Entropy/Gini]

B --> C[Choose Best Feature]

C --> D[Split Data]

D --> E[Repeat Recursively]

E --> F[Leaf Node]
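The loop in the diagram can be sketched in plain Python. This is a simplified illustration, not how any particular library implements it: it assumes numeric features, uses Gini impurity, tries midpoints between adjacent values as thresholds, and stores the tree as nested dictionaries. All function names are our own.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Try every feature/threshold pair; keep the lowest weighted Gini."""
    best = None  # (score, feature_index, threshold)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent values
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure, the depth limit is hit, or no split exists
    stop = len(set(y)) == 1 or depth == max_depth
    split = None if stop else best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, l) for r, l in zip(X, y) if r[f] <= t]
    right = [(r, l) for r, l in zip(X, y) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }


def predict(tree, row):
    while isinstance(tree, dict):  # walk down until a leaf is reached
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Loan dataset: [salary, credit_score]
X = [[100000, 750], [90000, 720], [50000, 650], [45000, 600], [120000, 800]]
y = ["Yes", "Yes", "No", "No", "Yes"]

tree = build_tree(X, y)
print(tree)
print(predict(tree, [95000, 730]))
```

On this tiny dataset a single split on salary already yields two pure leaves, so the recursion stops after one level.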

Classification Example

Predict:

Spam Email

or

Not Spam

Features:

Contains Offer?

Contains Free?

Contains Prize?

Decision Tree learns the rules automatically.
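Here is a small sketch of that idea. The emails and labels are invented for illustration, and `to_features` is a hypothetical helper that turns text into the three yes/no features listed above.

```python
from sklearn.tree import DecisionTreeClassifier

KEYWORDS = ["offer", "free", "prize"]


def to_features(text):
    """Encode an email as three 0/1 flags, one per keyword."""
    words = text.lower().split()
    return [int(keyword in words) for keyword in KEYWORDS]


# Made-up training emails
emails = [
    ("free prize offer inside", "Spam"),
    ("claim your free prize", "Spam"),
    ("special offer win a prize", "Spam"),
    ("free offer ends today", "Spam"),
    ("meeting moved to monday", "Not Spam"),
    ("job offer details attached", "Not Spam"),
    ("are you free for lunch", "Not Spam"),
    ("the prize ceremony is on friday", "Not Spam"),
]

X = [to_features(text) for text, _ in emails]
y = [label for _, label in emails]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

print(model.predict([to_features("win a free prize today")]))
```

In this toy data a single keyword is not enough to mark an email as spam, so the tree has to combine conditions, which is exactly what nested splits do.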


Regression Trees

Decision Trees can also predict numbers.

Examples:

House Price

Stock Price

Insurance Premium

Each leaf then predicts a number (typically the average of the training samples that reach it) instead of a category.
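A minimal regression sketch with scikit-learn's `DecisionTreeRegressor`. The house sizes and prices are made up, and `max_depth=1` is chosen only to keep the tree to a single split.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up data: house size in square feet -> price in thousands of dollars
X = [[800], [1000], [1200], [2000], [2200], [2400]]
y = [150, 160, 170, 300, 310, 320]

model = DecisionTreeRegressor(max_depth=1, random_state=0)
model.fit(X, y)

# With one split, each leaf predicts the mean price of its training rows
predictions = model.predict([[900], [2300]])
print(predictions)
```

The single split separates small houses from large ones, so the two predictions are the average prices of those two groups (160 and 310).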


Overfitting Problem

Decision Trees can memorize training data.

Example:

100% Training Accuracy

60% Testing Accuracy

This is called:

Overfitting

Overfitted Tree

flowchart TD

A[Root]

A --> B

B --> C

C --> D

D --> E

E --> F

F --> G

G --> H

Too many branches.

Poor generalization.
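The gap between training and testing accuracy is easy to reproduce. The sketch below uses synthetic data with deliberately noisy labels (an assumption for illustration); the exact test accuracy depends on the data and library version, so no specific number is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns about 20% of the labels
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"depth: {deep_tree.get_depth()}, leaves: {deep_tree.get_n_leaves()}")
```

The unrestricted tree memorizes the noisy training labels, so training accuracy reaches 100% while test accuracy stays noticeably lower.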


Preventing Overfitting

Techniques:

  • Maximum Depth
  • Minimum Samples Split
  • Minimum Samples Leaf
  • Pruning
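In scikit-learn these techniques map to constructor parameters of `DecisionTreeClassifier`. The sketch below uses synthetic noisy data, and the specific values (4, 20, 5, 0.01) are arbitrary starting points, not recommendations; in practice they are tuned with validation data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

limited_tree = DecisionTreeClassifier(
    max_depth=4,           # Maximum Depth
    min_samples_split=20,  # Minimum Samples Split
    min_samples_leaf=5,    # Minimum Samples Leaf
    random_state=0,
).fit(X_train, y_train)

pruned_tree = DecisionTreeClassifier(
    ccp_alpha=0.01,        # Pruning (cost-complexity pruning)
    random_state=0,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("limited", limited_tree), ("pruned", pruned_tree)]:
    print(
        f"{name:8s} depth={tree.get_depth():2d} "
        f"leaves={tree.get_n_leaves():3d} "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

The constrained trees end up much smaller than the full tree. Whether they also score better on the test set depends on the data, which is why these settings are usually chosen by cross-validation.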

What is Random Forest?

Random Forest is an ensemble learning algorithm.

Instead of one tree:

Decision Tree

It creates:

Hundreds Of Trees

and combines predictions.


Why Random Forest?

One tree may be wrong.

Many trees together are more reliable.


Random Forest Architecture

flowchart LR

A[Training Data]

A --> B[Tree 1]

A --> C[Tree 2]

A --> D[Tree 3]

A --> E[Tree 100]

B --> F[Voting]

C --> F

D --> F

E --> F

F --> G[Final Prediction]

Example

Predict Loan Approval.

Tree 1:

Approve

Tree 2:

Approve

Tree 3:

Reject

Majority Vote:

Approve

Final Prediction:

Approve
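The voting step itself is tiny. A minimal sketch, with `majority_vote` as a made-up helper name:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


tree_predictions = ["Approve", "Approve", "Reject"]
print(majority_vote(tree_predictions))  # Approve
```

Note that scikit-learn's `RandomForestClassifier` combines trees slightly differently: it averages each tree's predicted class probabilities and picks the class with the highest average. For regression, a forest averages the trees' numeric predictions.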

Ensemble Learning

Random Forest belongs to:

Ensemble Learning

Meaning:

Multiple Models

↓

One Strong Model

Bootstrap Sampling

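Each tree trains on a bootstrap sample: rows drawn at random, with replacement, from the training data.

Example:

Dataset

1000 Rows

Tree 1:

1000 Rows Drawn With Replacement (some repeat, some are left out)

Tree 2:

A Different Random Draw Of 1000 Rows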

This increases diversity.
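A small sketch of bootstrap sampling using only the standard library. Row indices stand in for the 1000 dataset rows, and `bootstrap_sample` is a made-up helper name.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_rows = 1000
dataset = list(range(n_rows))  # row indices stand in for real rows


def bootstrap_sample(rows):
    """Draw len(rows) rows at random, with replacement."""
    return random.choices(rows, k=len(rows))


sample_1 = bootstrap_sample(dataset)
sample_2 = bootstrap_sample(dataset)

print(len(sample_1))            # 1000 rows drawn
print(len(set(sample_1)))       # distinct rows seen by tree 1
print(len(set(sample_2)))       # distinct rows seen by tree 2
```

Because rows are drawn with replacement, each sample contains only about 63% of the distinct original rows on average. The rows a tree never saw are called out-of-bag rows and can be used to estimate its error.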


Feature Randomization

Each split considers only a random subset of features.

Example:

Salary

Age

Experience

Credit Score

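At each split, the tree considers only a random subset of these features, for example Salary and Age at one node and Experience and Credit Score at another.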

This reduces correlation between trees.
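As an illustration, the sketch below picks the candidate features for a few splits. Using the square root of the feature count is a common heuristic for classification, assumed here; in scikit-learn the subset size is controlled by the `max_features` parameter of `RandomForestClassifier`.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

features = ["Salary", "Age", "Experience", "Credit Score"]

# Common heuristic: consider about sqrt(number of features) at each split
k = max(1, int(math.sqrt(len(features))))


def candidate_features():
    """Pick the random subset of features one split is allowed to look at."""
    return random.sample(features, k)


for split in range(3):
    print(f"split {split + 1} considers: {candidate_features()}")
```

The split then picks the best feature among the candidates only, so a single dominant feature cannot end up at the top of every tree.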


Decision Tree vs Random Forest

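| Feature          | Decision Tree | Random Forest    |
|------------------|---------------|------------------|
| Accuracy         | Medium        | Typically Higher |
| Prediction Speed | Fast          | Slower           |
| Overfitting      | High Risk     | Lower Risk       |
| Interpretability | Excellent     | Moderate         |
| Training Time    | Fast          | Slower           |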

Banking Example

Loan Approval:

Features:

Salary

Credit Score

Debt

Employment

Random Forest is a common choice here when accuracy matters more than having a single, easily readable rule set.


Insurance Example

Predict:

Claim Fraud

Yes/No

Random Forest handles large feature sets effectively.


Healthcare Example

Predict:

Disease Risk

Heart Disease

Diabetes

Cancer

Based on patient data.


E-Commerce Example

Predict:

Customer Purchase Probability

Using:

Age

Location

Past Purchases

Browsing History
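When the goal is a probability, not just a yes/no answer, scikit-learn classifiers expose `predict_proba`. The sketch below uses made-up customers and a simplified numeric feature set (a categorical feature such as Location would need encoding first).

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up customers: [age, past_purchases, pages_viewed_this_week]
X = [
    [22, 0, 3], [35, 5, 20], [41, 8, 15], [19, 1, 2],
    [52, 12, 30], [30, 0, 1], [27, 6, 18], [45, 1, 4],
]
y = [0, 1, 1, 0, 1, 0, 1, 0]  # 1 = purchased, 0 = did not purchase

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One row of probabilities per customer; columns follow model.classes_ ([0, 1])
proba = model.predict_proba([[33, 7, 22]])[0]
purchase_probability = proba[1]

print(f"purchase probability: {purchase_probability:.2f}")
```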

Python Example

Decision Tree

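from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data (an assumption): swap in your own features X and labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)

model.fit(X_train, y_train)         # learn split rules from the training rows

prediction = model.predict(X_test)  # predict labels for unseen rows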

Random Forest

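from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): swap in your own features X and labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    random_state=42    # fixed seed for reproducible results
)

model.fit(X_train, y_train)         # train every tree on its own bootstrap sample

prediction = model.predict(X_test)  # combine the trees' predictions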

Advantages of Decision Trees

✅ Easy To Understand

✅ Easy To Visualize

✅ Handles Nonlinear Data

✅ No Feature Scaling Needed

✅ Fast Training


Limitations of Decision Trees

❌ Overfitting

❌ Unstable

❌ Sensitive To Data Changes

❌ Often Lower Accuracy Than Ensembles


Advantages of Random Forest

✅ High Accuracy

✅ Reduces Overfitting

✅ Handles Large Datasets

✅ Robust

✅ Works Well For Classification & Regression


Limitations of Random Forest

❌ Slower Training

❌ More Memory Usage

❌ Harder To Interpret

❌ Larger Models


Interview Questions

What is a Decision Tree?

A supervised learning algorithm that uses a tree structure to make predictions.


What is Entropy?

A measure of uncertainty in data.


What is Information Gain?

Reduction in entropy after splitting data.


What is Gini Index?

A measure of impurity used to select the best split.


What is Overfitting?

When a model memorizes training data and performs poorly on unseen data.


What is Random Forest?

An ensemble of multiple decision trees whose predictions are combined.


Why is Random Forest better?

Combining many trees trained on different samples and feature subsets reduces variance, which lowers overfitting and usually improves accuracy over a single tree.


Key Takeaways

  • Decision Trees are intuitive and interpretable machine learning models.
  • They choose splits using impurity measures such as Entropy (via Information Gain) or the Gini Index.
  • Trees can easily overfit.
  • Random Forest reduces this risk by combining many trees trained on different samples and feature subsets.
  • Random Forest is one of the most widely used machine learning algorithms.
  • Common applications include banking, insurance, healthcare, fraud detection, and recommendation systems.