Activation Functions Explained
Learn about Activation Functions in Neural Networks, including Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax, with their advantages, limitations, mathematical intuition, real-world examples, and interview questions.
What You Will Learn
In this article, you'll learn:
- What are Activation Functions?
- Why Activation Functions Matter
- Linear vs Non-Linear Models
- Sigmoid Function
- Tanh Function
- ReLU Function
- Leaky ReLU
- Softmax
- Real-World Applications
- Choosing the Right Activation Function
- Advantages and Limitations
- Interview Questions
Introduction
In the previous article, we learned:
Inputs
↓
Weights
↓
Bias
↓
Output
But there is a problem.
Without an activation function:
Neural Network
=
Linear Equation
Even if we add:
100 Hidden Layers
the model still behaves like:
One Simple Linear Model
This makes Neural Networks incapable of learning complex patterns.
Activation Functions solve this problem.
What is an Activation Function?
An Activation Function decides:
Should A Neuron Fire?
or
Should It Stay Silent?
It introduces:
Non-Linearity
into the network.
Why Activation Functions Matter
Suppose we want to predict:
House Price
Fraud Detection
Image Recognition
Language Translation
Relationships are complex.
Without non-linearity:
Neural Networks
Cannot Learn Complex Patterns
Neural Network Without Activation
flowchart LR
A[Inputs]
A --> B[Weighted Sum]
B --> C[Output]
Only linear relationships can be learned.
Neural Network With Activation
flowchart LR
A[Inputs]
A --> B[Weighted Sum]
B --> C[Activation Function]
C --> D[Output]
Now the network can learn complex patterns.
Biological Analogy
Human brain neurons:
Receive Signal
↓
Process Signal
↓
Fire Or Not Fire
Activation functions mimic this behavior.
Activation Function Workflow
flowchart TD
A[Inputs]
A --> B[Weights]
B --> C[Bias]
C --> D[Activation Function]
D --> E[Output]
Linear Activation Function
Simplest activation:
Output = Input
Problem With Linear Activation
Suppose:
Layer 1
↓
Layer 2
↓
Layer 3
Each layer remains linear.
Entire network becomes:
Single Linear Equation
No deep learning advantage.
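A minimal sketch of this collapse, using plain Python lists. The weights, biases, and input below are made-up values, and the helper names (`matvec`, `matmul`, `add`) are only for illustration. It shows that two linear layers give the same result as one linear layer with combined weights.

```python
# Two linear "layers" with no activation function between them.
# All numbers here are hypothetical values chosen for illustration.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Add two vectors element by element."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, 4.0]

# Pass x through layer 1, then through layer 2.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapse both layers into one: W = W2 * W1, b = W2 * b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers)
print(one_layer)
```

Both prints show the same vector. The same algebra applies to any number of stacked linear layers.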
Why Non-Linearity Matters
Real-world relationships are not linear.
Examples:
Customer Spending
Disease Risk
Image Pixels
Language Understanding
Neural networks need non-linear functions.
Types of Activation Functions
flowchart TD
A[Activation Functions]
A --> B[Sigmoid]
A --> C[Tanh]
A --> D[ReLU]
A --> E[Leaky ReLU]
A --> F[Softmax]
Sigmoid Function
One of the oldest activation functions.
Output range:
0 to 1
Formula
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid Curve
xychart-beta
title "Sigmoid Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" 0 --> 1
line [0.01,0.02,0.05,0.12,0.27,0.5,0.73,0.88,0.95,0.98,0.99]
Sigmoid Characteristics
Large Negative Input:
≈ 0
Large Positive Input:
≈ 1
Input Of 0:
= 0.5
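These characteristics can be checked with a few lines of Python. This is a minimal sketch that assumes the standard logistic form 1 / (1 + e^(-x)). It is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# One large negative input, zero, and one large positive input.
for x in (-5, 0, 5):
    print(x, round(sigmoid(x), 2))
```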
Sigmoid Use Cases
Commonly used for:
Binary Classification
Yes / No
Spam / Not Spam
Fraud / Not Fraud
Sigmoid Advantages
✅ Probability Output
✅ Easy Interpretation
✅ Smooth Curve
Sigmoid Limitations
❌ Vanishing Gradient Problem
❌ Slow Training
❌ Not Zero-Centered
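A rough sketch of why gradients vanish with Sigmoid. The slope of the sigmoid never exceeds 0.25, and backpropagation multiplies roughly one such factor per layer. The example below ignores the weights entirely, which is a simplifying assumption, and only looks at the best case where every slope is at its peak.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # the slope peaks at x = 0
print(max_grad)               # 0.25

# Best-case gradient factor after passing back through several sigmoid layers.
for depth in (1, 5, 10):
    print(depth, max_grad ** depth)
```

Even in this best case, ten layers shrink the signal to less than one millionth of its original size.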
Tanh Function
A rescaled, zero-centered version of Sigmoid.
Output range:
-1 to +1
Formula
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Tanh Curve
xychart-beta
title "Tanh Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" -1 --> 1
line [-1.00,-1.00,-1.00,-0.96,-0.76,0,0.76,0.96,1.00,1.00,1.00]
Advantages
✅ Zero-Centered
✅ Often Trains Better Than Sigmoid In Hidden Layers
✅ Stronger Gradients (Peak Slope 1 vs 0.25 For Sigmoid)
Limitations
❌ Still Suffers Vanishing Gradient
❌ Not Ideal For Deep Networks
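A short sketch comparing Tanh with Sigmoid, using only the standard library. It checks the identity tanh(x) = 2 * sigmoid(2x) - 1 and shows that the slope of tanh is 1 at zero but close to 0 for large inputs, which is why the vanishing gradient problem remains.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# tanh is a shifted, rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
for x in (-2.0, 0.0, 2.0):
    print(x, math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

print(tanh_grad(0.0))  # peak slope
print(tanh_grad(5.0))  # slope for a large input, close to zero
```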
ReLU Function
Most popular activation function for hidden layers today.
ReLU stands for:
Rectified Linear Unit
Formula
$$f(x) = \max(0, x)$$
ReLU Behavior
Negative Values
↓
0
Positive Values
↓
Keep Original Value
ReLU Example
| Input | Output |
|---|---|
| -5 | 0 |
| -2 | 0 |
| 0 | 0 |
| 3 | 3 |
| 8 | 8 |
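The table above can be reproduced with a one-line function. A minimal sketch, assuming ReLU is defined as max(0, x):

```python
def relu(x):
    """ReLU: keep positive values, replace negative values with 0."""
    return max(0, x)

inputs = [-5, -2, 0, 3, 8]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0, 0, 0, 3, 8]
```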
ReLU Curve
xychart-beta
title "ReLU Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" 0 --> 5
line [0,0,0,0,0,0,1,2,3,4,5]
Why ReLU Became Popular
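ReLU reduces the:
Vanishing Gradient Problem
Its slope is 1 for every positive input, so gradients do not shrink there, and it is cheap to compute, so training is fast.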
ReLU Advantages
✅ Fast Computation
✅ Simple
✅ Efficient
✅ Works Well In Deep Networks
✅ Industry Standard
ReLU Limitations
Problem:
Dying ReLU
Neuron becomes permanently inactive.
Dying ReLU Example
Negative inputs:
-10
-20
-30
Output always:
0
The gradient is also 0, so the weights never update and the neuron stops learning.
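A small sketch of the same situation in code. The helper `relu_grad` is an illustrative name, and setting the slope to 0 exactly at x = 0 is a common convention assumed here. For the negative inputs above, both the output and the slope are zero, so no learning signal flows back.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

pre_activations = [-10.0, -20.0, -30.0]
print([relu(z) for z in pre_activations])       # outputs
print([relu_grad(z) for z in pre_activations])  # gradients
```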
Leaky ReLU
Created to solve Dying ReLU.
Instead of:
Negative Input → 0
Use:
Negative Input → Small Value
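Formula
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$
Here α is a small positive slope, commonly 0.01.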
Advantages
✅ Prevents Dead Neurons
✅ Better Gradient Flow
✅ Improved Training
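A minimal sketch of Leaky ReLU and its slope. The default `alpha=0.01` is a commonly used value and is an assumption here, not something fixed by the definition.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-10.0, -1.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the slope for negative inputs is small but not zero, the neuron still receives a gradient and can recover.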
Leaky ReLU Visualization
flowchart LR
A[Negative Input]
A --> B[Small Negative Output]
C[Positive Input]
C --> D[Positive Output]
Softmax Function
Used for:
Multi-Class Classification
Examples:
Cat
Dog
Bird
Horse
Softmax Output
Converts scores into probabilities.
Example:
| Class | Probability |
|---|---|
| Cat | 0.70 |
| Dog | 0.20 |
| Bird | 0.05 |
| Horse | 0.05 |
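Softmax Formula
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Here z_i is the raw score for class i and K is the number of classes.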
Softmax Characteristics
All outputs:
Between 0 and 1
Total probability:
Always = 1
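Both properties can be verified with a short sketch. The raw scores below are hypothetical. Subtracting the maximum score before exponentiating is a common stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1, 0.1]  # hypothetical raw scores for four classes
probs = softmax(scores)
print([round(p, 2) for p in probs])
print(sum(probs))
```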
Classification Example
Input Image:
Animal Image
Output:
Cat = 70%
Dog = 20%
Bird = 5%
Horse = 5%
Prediction:
Cat
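Picking the prediction is just choosing the class with the highest probability. A minimal sketch using the numbers from the example above:

```python
classes = ["Cat", "Dog", "Bird", "Horse"]
probs = [0.70, 0.20, 0.05, 0.05]

# Index of the largest probability, then look up its class name.
best = max(range(len(probs)), key=lambda i: probs[i])
prediction = classes[best]
print(prediction)  # Cat
```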
Activation Function Comparison
| Function | Range | Use Case |
|---|---|---|
| Sigmoid | 0 to 1 | Binary Classification |
| Tanh | -1 to 1 | Hidden Layers (Older Models) |
| ReLU | 0 to ∞ | Deep Networks |
| Leaky ReLU | -∞ to ∞ | Deep Networks |
| Softmax | Probabilities | Multi-Class Output |
Real World Banking Example
Fraud Detection
Hidden Layers:
ReLU
Output Layer:
Sigmoid
Prediction:
Fraud
Not Fraud
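A toy forward pass for this setup, written in plain Python. Every number below is a hand-picked, hypothetical value, and `dense` is an illustrative helper. A real fraud model would learn its weights from data.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical scaled transaction features and hand-picked weights.
features = [0.9, 0.1, 0.4]
hidden_w = [[1.0, -2.0, 0.5], [-1.5, 0.5, 1.0]]
hidden_b = [0.0, 0.1]
output_w = [[2.0, -1.0]]
output_b = [-0.5]

hidden = dense(features, hidden_w, hidden_b, relu)                # hidden layer: ReLU
fraud_probability = dense(hidden, output_w, output_b, sigmoid)[0]  # output layer: Sigmoid
label = "Fraud" if fraud_probability >= 0.5 else "Not Fraud"
print(hidden, fraud_probability, label)
```

The 0.5 threshold is a common default. In practice it is often tuned, since missing a fraud case and raising a false alarm have different costs.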
Healthcare Example
Disease Classification
Hidden Layers:
ReLU
Output:
Softmax
Prediction (one class per patient):
Diabetes
Cancer
Heart Disease
ChatGPT Example
Transformer networks, the architecture behind models such as ChatGPT, typically use:
ReLU Variants
GELU
instead of Sigmoid in their feed-forward layers.
These tend to train better in deep architectures.
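For reference, a minimal sketch of GELU next to ReLU. It uses the exact definition x * Φ(x), where Φ is the standard normal CDF, computed with `math.erf`. Libraries often use a faster approximation instead, so exact values may differ slightly from framework outputs.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

# GELU follows ReLU for large inputs but is smooth around zero.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu(x), 4), relu(x))
```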
Neural Network Architecture
flowchart LR
A[Input Layer]
A --> B[ReLU]
B --> C[ReLU]
C --> D[Softmax]
D --> E[Prediction]
Python Example
ReLU
import tensorflow as tf
relu = tf.keras.layers.ReLU()
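Sigmoid
sigmoid_output = tf.keras.layers.Dense(1, activation='sigmoid')  # one unit for a binary output
Softmax
softmax_output = tf.keras.layers.Dense(10, activation='softmax')  # one unit per class (10 classes assumed)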
Keras Example
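import tensorflow as tf

model = tf.keras.Sequential([
    # Hidden layer: 64 units with ReLU
    tf.keras.layers.Dense(64, activation='relu'),
    # Output layer: 10 classes, Softmax turns scores into probabilities
    tf.keras.layers.Dense(10, activation='softmax'),
])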
Interview Questions
What is an Activation Function?
A mathematical function that introduces non-linearity into neural networks.
Why Are Activation Functions Needed?
Without them, neural networks behave like simple linear models.
What is Sigmoid Used For?
Binary classification problems.
Why Is ReLU Popular?
Fast computation and reduced vanishing gradients.
What is Dying ReLU?
A neuron that always outputs zero and stops learning.
Why Use Leaky ReLU?
To prevent dead neurons.
What is Softmax Used For?
Multi-class classification problems.
Which Activation Function Is Most Common?
ReLU for hidden layers and Softmax/Sigmoid for output layers.
Key Takeaways
- Activation Functions enable neural networks to learn complex patterns.
- Sigmoid is used for binary classification.
- Tanh improves upon Sigmoid by being zero-centered.
- ReLU is the most widely used activation function for hidden layers today.
- Leaky ReLU mitigates the Dying ReLU problem.
- Softmax converts outputs into probabilities.
- Activation functions are a critical foundation of Deep Learning.