Decision Trees and Random Forest Explained
Learn Decision Trees and Random Forest algorithms with real-world examples, entropy, information gain, Gini index, overfitting, ensemble learning, Python examples, advantages, limitations, and interview questions.
What You Will Learn
In this article, you'll learn:
- What is a Decision Tree?
- How Decision Trees work
- Entropy and Information Gain
- Gini Index
- Decision Tree Training Process
- Overfitting Problems
- What is Random Forest?
- Ensemble Learning
- Real-world Examples
- Python Examples
- Advantages and Limitations
- Interview Questions
Introduction
Imagine a bank wants to approve or reject a loan application.
Instead of manually reviewing every application, the bank creates rules:
Is Salary > $80,000?
  Yes → Is Credit Score > 700?
    Yes → Approve Loan
    No → Reject Loan
  No → Reject Loan
This decision-making process looks like a tree.
Machine learning uses the same concept through Decision Trees, one of the most interpretable machine learning algorithms.
What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm that makes decisions using a tree-like structure.
It splits data into smaller groups based on conditions.
The goal is:
Reduce Uncertainty
Increase Prediction Accuracy
Real World Analogy
Think about deciding whether to play cricket.
Is it raining?
  Yes → Stay Home
  No → Is the ground dry?
    Yes → Play Cricket
    No → Stay Home
Humans naturally make decisions using trees.
Decision Trees automate this process.
Decision Tree Structure
flowchart TD
A[Root Node]
A --> B[Condition 1]
B --> C[Condition 2]
B --> D[Condition 3]
C --> E[Prediction]
D --> F[Prediction]
Components of a Decision Tree
Root Node
The first decision.
Example:
Salary > 80000?
Internal Node
Intermediate decision points.
Example:
Credit Score > 700?
Branch
Possible outcomes.
Example:
Yes
No
Leaf Node
Final prediction.
Example:
Approve Loan
Reject Loan
Loan Approval Example
flowchart TD
A[Salary > 80000]
A -->|Yes| B[Credit Score > 700]
A -->|No| C[Reject Loan]
B -->|Yes| D[Approve Loan]
B -->|No| E[Reject Loan]
Sample Dataset
| Salary | Credit Score | Loan Approved |
|---|---|---|
| 100000 | 750 | Yes |
| 90000 | 720 | Yes |
| 50000 | 650 | No |
| 45000 | 600 | No |
| 120000 | 800 | Yes |
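As a quick check, the hand-written rules from the loan approval flowchart can be applied to the sample rows. This is a minimal sketch in plain Python; the function name `approve_loan` and the tuple layout are illustrative assumptions, not part of any library.

```python
# Sample rows from the table: (salary, credit_score, loan_approved)
SAMPLE_ROWS = [
    (100000, 750, "Yes"),
    (90000, 720, "Yes"),
    (50000, 650, "No"),
    (45000, 600, "No"),
    (120000, 800, "Yes"),
]


def approve_loan(salary, credit_score):
    """Hand-coded version of the loan approval tree (hypothetical helper)."""
    if salary > 80000:              # root node
        if credit_score > 700:      # internal node
            return "Yes"            # leaf: approve
        return "No"                 # leaf: reject
    return "No"                     # leaf: reject


predictions = [approve_loan(salary, score) for salary, score, _ in SAMPLE_ROWS]
labels = [label for _, _, label in SAMPLE_ROWS]
print(predictions == labels)  # True: the rules reproduce every label in the table
```

A learned Decision Tree aims to discover rules like these from the data instead of having them written by hand.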
How Decision Trees Learn
The algorithm repeatedly asks:
Which feature best separates the data?
Examples:
Salary
Age
Credit Score
Experience
The best feature becomes the next split.
Entropy
Entropy measures disorder or uncertainty.
High Entropy:
Mixed Data
Low Entropy:
Pure Data
Entropy Formula
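For a dataset S whose k classes occur with proportions p_1, ..., p_k, the standard Shannon entropy definition is given here. The symbols S, k, and p_i are notation assumed for this article.

```latex
H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
```

With two classes, H(S) ranges from 0 (all samples in one class) to 1 bit (an even 50/50 mix).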
Entropy Example
Dataset:
10 Approvals
10 Rejections
Entropy is at its maximum (1 bit for two classes), because the classes are evenly mixed.
Dataset:
20 Approvals
0 Rejections
Entropy is zero.
The dataset is perfectly pure.
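The two cases above can be checked numerically. This is a small sketch using only the standard library; the helper name `entropy` and its count-based input are assumptions made for illustration.

```python
from math import log2


def entropy(class_counts):
    """Shannon entropy in bits for a list of per-class counts."""
    total = sum(class_counts)
    # Skip empty classes: their contribution to entropy is zero.
    return sum(
        (count / total) * log2(total / count)
        for count in class_counts
        if count > 0
    )


mixed = entropy([10, 10])   # 10 approvals, 10 rejections
pure = entropy([20, 0])     # 20 approvals, 0 rejections
print(mixed, pure)          # 1.0 0.0
```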
Information Gain
Decision Trees choose splits that maximize information gain.
Information Gain
=
Entropy Before Split
-
Weighted Average Entropy After Split
Formula
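In the usual notation, with S the dataset at a node, A the candidate feature, S_v the subset of S where A takes the outcome v, and H the entropy, information gain can be written as follows. The symbol names are assumptions chosen for this article.

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

Each child's entropy is weighted by the fraction of samples it receives, so a pure but tiny child does not dominate the score.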
Goal
Choose the feature with:
Highest Information Gain
because it reduces uncertainty the most.
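To make this concrete, here is a sketch that computes the information gain of the split Salary > 80000 on the five-row sample dataset. The helper names `entropy` and `information_gain` are illustrative assumptions, not library functions.

```python
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return sum((c / total) * log2(total / c) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# (salary, loan_approved) pairs from the sample dataset
rows = [(100000, "Yes"), (90000, "Yes"), (50000, "No"), (45000, "No"), (120000, "Yes")]

parent = [label for _, label in rows]
high = [label for salary, label in rows if salary > 80000]
low = [label for salary, label in rows if salary <= 80000]

gain = information_gain(parent, [high, low])
print(round(gain, 3))  # 0.971
```

Both child groups are pure here, so the gain equals the full entropy of the parent: the split removes all uncertainty on this tiny dataset.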
Gini Index
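The Gini Index is another popular split metric, used by CART trees.
It measures impurity: the probability that a randomly chosen sample from a node would be mislabeled if it were labeled at random according to the node's class distribution.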
Lower Gini:
Better Split
Higher Gini:
Poor Split
Formula
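For a node S whose k classes occur with proportions p_1, ..., p_k, the standard Gini impurity is given here. The symbols are notation assumed for this article.

```latex
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2
```

For two classes, a 50/50 node gives 1 - (0.25 + 0.25) = 0.5, and a pure node gives 1 - 1 = 0.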
Decision Tree Training Process
flowchart TD
A[Training Dataset]
A --> B[Calculate Entropy/Gini]
B --> C[Choose Best Feature]
C --> D[Split Data]
D --> E[Repeat Recursively]
E --> F[Leaf Node]
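The steps in this flowchart can be sketched as a short recursive function. This is a simplified illustration in plain Python using Gini impurity, not how any particular library implements training; the names `gini`, `best_split`, `build_tree`, and `predict` are assumptions, and real implementations differ in many details such as how thresholds are chosen.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until no split improves impurity or the depth limit is hit."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Sample dataset: (salary, credit_score) -> loan approved
rows = [(100000, 750), (90000, 720), (50000, 650), (45000, 600), (120000, 800)]
labels = ["Yes", "Yes", "No", "No", "Yes"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (95000, 730)))
```

Because the sketch only tries observed values as thresholds, the split it learns on five rows will not match the hand-written rule of 80000 exactly, even though it separates the training rows.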
Classification Example
Predict:
Spam Email
or
Not Spam
Features:
Contains Offer?
Contains Free?
Contains Prize?
The Decision Tree learns the rules automatically from labeled examples.
Regression Trees
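Decision Trees can also predict continuous numbers instead of categories.
Examples:
House Price
Stock Price
Insurance Premium
Each leaf then predicts a number, typically the average target value of the training samples that reach it.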
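A regression tree picks splits that make the target values on each side as similar as possible. The sketch below finds the best single threshold on one feature by minimizing the sum of squared errors; the data and the helper names `sse` and `best_threshold` are hypothetical and chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (the cost of predicting the mean)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_threshold(xs, ys):
    """Threshold on a single feature that minimizes the total SSE of both sides."""
    best, best_cost = None, sse(ys)
    # Skip the largest value: it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best, best_cost = threshold, cost
    return best


# Hypothetical data: house size in square feet -> price in thousands of dollars
sizes = [800, 900, 1000, 2000, 2100, 2200]
prices = [150, 160, 170, 400, 410, 420]

threshold = best_threshold(sizes, prices)
left_prediction = mean([p for s, p in zip(sizes, prices) if s <= threshold])
right_prediction = mean([p for s, p in zip(sizes, prices) if s > threshold])
print(threshold, left_prediction, right_prediction)  # 1000 160.0 410.0
```

The two leaves predict 160 and 410, the average prices of the small and large houses respectively.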
Overfitting Problem
Decision Trees can memorize training data.
Example:
100% Training Accuracy
60% Testing Accuracy
This is called:
Overfitting
Overfitted Tree
flowchart TD
A[Root]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
Too many branches.
Poor generalization.
Preventing Overfitting
Techniques:
- Maximum Depth
- Minimum Samples Split
- Minimum Samples Leaf
- Pruning
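In scikit-learn, each of these techniques corresponds to a constructor argument of `DecisionTreeClassifier`. The following is a sketch on synthetic data; the dataset and the specific values (depth 3, 20 samples, and so on) are arbitrary assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unconstrained tree: free to keep splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: each argument maps to one technique from the list.
constrained_tree = DecisionTreeClassifier(
    max_depth=3,            # Maximum Depth
    min_samples_split=20,   # Minimum Samples Split
    min_samples_leaf=10,    # Minimum Samples Leaf
    ccp_alpha=0.001,        # Pruning (cost-complexity pruning strength)
    random_state=0,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("constrained", constrained_tree)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train:", round(tree.score(X_train, y_train), 3),
        "test:", round(tree.score(X_test, y_test), 3),
    )
```

Comparing the gap between training and test accuracy for the two trees is a simple way to see whether the constraints help on a given dataset.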
What is Random Forest?
Random Forest is an ensemble learning algorithm.
Instead of relying on a single Decision Tree, it builds many trees (often hundreds) and combines their predictions: a majority vote for classification, an average for regression.
Why Random Forest?
One tree may be wrong.
Many trees together are more reliable.
Random Forest Architecture
flowchart LR
A[Training Data]
A --> B[Tree 1]
A --> C[Tree 2]
A --> D[Tree 3]
A --> E[Tree 100]
B --> F[Voting]
C --> F
D --> F
E --> F
F --> G[Final Prediction]
Example
Predict Loan Approval.
Tree 1:
Approve
Tree 2:
Approve
Tree 3:
Reject
Majority Vote:
Approve
Final Prediction:
Approve
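The vote above can be expressed in a few lines of Python. This is a minimal sketch; `majority_vote` is a hypothetical helper, and libraries may combine trees differently (scikit-learn, for instance, averages the trees' class probabilities rather than counting hard votes).

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]


tree_predictions = ["Approve", "Approve", "Reject"]  # Tree 1, Tree 2, Tree 3
final = majority_vote(tree_predictions)
print(final)  # Approve
```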
Ensemble Learning
Random Forest belongs to:
Ensemble Learning
Meaning:
Multiple Models
↓
One Strong Model
Bootstrap Sampling
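Each tree trains on a bootstrap sample: rows drawn at random with replacement from the training data, usually as many rows as the original dataset.
Example:
Dataset
1000 Rows
Tree 1:
1000 rows drawn with replacement (some rows repeat, some are left out)
Tree 2:
A different 1000 rows drawn with replacement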
This increases diversity.
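A bootstrap sample is easy to simulate with the standard library. In this sketch the row ids stand in for a 1000-row dataset, and the seed is fixed only to make the run repeatable; both are assumptions for illustration.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
row_ids = list(range(1000))        # stand-in for a 1000-row dataset

# Draw 1000 row ids with replacement: duplicates are expected.
bootstrap_sample = rng.choices(row_ids, k=len(row_ids))

unique_rows = set(bootstrap_sample)
out_of_bag = set(row_ids) - unique_rows   # rows this tree never sees

print(len(bootstrap_sample), len(unique_rows), len(out_of_bag))
```

On average about 63% of the original rows appear in a bootstrap sample of the same size. The rest are "out-of-bag" rows for that tree and can be used to estimate its error.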
Feature Randomization
Each split considers only a random subset of features.
Example:
Salary
Age
Experience
Credit Score
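At each split, the tree may only choose among a random subset of these features, for example Salary and Credit Score at one split and Age and Experience at another.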
This reduces correlation between trees.
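The per-split feature sampling can be sketched as follows. Using roughly the square root of the number of features is a common heuristic for classification (in scikit-learn this is controlled by the `max_features` parameter); the seed and loop are assumptions for illustration only.

```python
import math
import random

features = ["Salary", "Age", "Experience", "Credit Score"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Common heuristic for classification: about sqrt(n_features) candidates per split.
k = max(1, int(math.sqrt(len(features))))

for split_number in range(1, 4):
    candidates = rng.sample(features, k)   # drawn without replacement
    print(f"Split {split_number}: choose the best of {candidates}")
```

Because a dominant feature such as Credit Score is not available at every split, different trees are pushed toward different structures.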
Decision Tree vs Random Forest
| Feature | Decision Tree | Random Forest |
|---|---|---|
| Typical Accuracy | Moderate | Usually Higher |
| Prediction Speed | Fast | Slower |
| Overfitting | High Risk | Lower Risk |
| Interpretability | Excellent | Moderate |
| Training Time | Fast | Slower |
Banking Example
Loan Approval:
Features:
Salary
Credit Score
Debt
Employment
Random Forest is a common choice here because it is usually more accurate than a single tree.
Insurance Example
Predict:
Claim Fraud
Yes/No
Random Forest handles large feature sets effectively.
Healthcare Example
Predict:
Disease Risk
Heart Disease
Diabetes
Cancer
Based on patient data.
E-Commerce Example
Predict:
Customer Purchase Probability
Using:
Age
Location
Past Purchases
Browsing History
Python Example
Decision Tree
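from sklearn.tree import DecisionTreeClassifier

# X_train, y_train, and X_test are assumed to come from an earlier train/test split
model = DecisionTreeClassifier()
model.fit(X_train, y_train)           # learn the splitting rules
prediction = model.predict(X_test)    # predict labels for unseen rows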
Random Forest
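from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, and X_test are assumed to come from an earlier train/test split
model = RandomForestClassifier(
    n_estimators=100  # number of trees in the forest
)
model.fit(X_train, y_train)
prediction = model.predict(X_test)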
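The snippets above assume that `X_train`, `y_train`, and `X_test` already exist. The following end-to-end sketch fills that gap with synthetic data from `make_classification`; the dataset, sizes, and seeds are assumptions for illustration, and results on real data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with a real dataset).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Evaluating on held-out test data, rather than on the training data, is what reveals overfitting.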
Advantages of Decision Trees
✅ Easy To Understand
✅ Easy To Visualize
✅ Handles Nonlinear Data
✅ No Feature Scaling Needed
✅ Fast Training
Limitations of Decision Trees
❌ Overfitting
❌ Unstable
❌ Sensitive To Data Changes
❌ Lower Accuracy
Advantages of Random Forest
✅ High Accuracy
✅ Reduces Overfitting
✅ Handles Large Datasets
✅ Robust
✅ Works Well For Classification & Regression
Limitations of Random Forest
❌ Slower Training
❌ More Memory Usage
❌ Harder To Interpret
❌ Larger Models
Interview Questions
What is a Decision Tree?
A supervised learning algorithm that uses a tree structure to make predictions.
What is Entropy?
A measure of uncertainty in data.
What is Information Gain?
Reduction in entropy after splitting data.
What is Gini Index?
A measure of impurity used to select the best split.
What is Overfitting?
When a model memorizes training data and performs poorly on unseen data.
What is Random Forest?
An ensemble of multiple decision trees whose predictions are combined.
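Why is Random Forest often better than a single Decision Tree?
Combining many decorrelated trees, built with bootstrap sampling and random feature selection, reduces variance. This lowers overfitting and usually improves accuracy on unseen data.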
Key Takeaways
- Decision Trees are intuitive and interpretable machine learning models.
- They choose splits using Information Gain (based on Entropy) or the Gini Index.
- Trees can easily overfit.
- Random Forest reduces this risk by combining many trees trained on different samples and feature subsets.
- Random Forest is one of the most widely used machine learning algorithms.
- Common applications include banking, insurance, healthcare, fraud detection, and recommendation systems.