Activation Functions Explained

Learn about activation functions in neural networks, including Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax, along with their advantages, limitations, mathematical intuition, real-world examples, and interview questions.

What You Will Learn

In this article, you'll learn:

  • What are Activation Functions?
  • Why Activation Functions Matter
  • Linear vs Non-Linear Models
  • Sigmoid Function
  • Tanh Function
  • ReLU Function
  • Leaky ReLU
  • Softmax
  • Real-World Applications
  • Choosing the Right Activation Function
  • Advantages and Limitations
  • Interview Questions

Introduction

In the previous article, we learned:

Inputs

↓

Weights

↓

Bias

↓

Output

But there is a problem.

Without an activation function:

Neural Network

=

Linear Equation

Even if we add:

100 Hidden Layers

the model still behaves like:

One Simple Linear Model

A purely linear network cannot learn complex, non-linear patterns, no matter how deep it is.

Activation Functions solve this problem.


What is an Activation Function?

An Activation Function decides:

Should A Neuron Fire?

or

Should It Stay Silent?

It introduces:

Non-Linearity

into the network.


Why Activation Functions Matter

Suppose we want to solve problems such as:

House Price Prediction

Fraud Detection

Image Recognition

Language Translation

The relationships between inputs and outputs in these problems are complex.

Without non-linearity:

Neural Networks

Cannot Learn Complex Patterns

Neural Network Without Activation

flowchart LR

A[Inputs]

A --> B[Weighted Sum]

B --> C[Output]

Only linear relationships can be learned.


Neural Network With Activation

flowchart LR

A[Inputs]

A --> B[Weighted Sum]

B --> C[Activation Function]

C --> D[Output]

Now the network can learn complex patterns.


Biological Analogy

Human brain neurons:

Receive Signal

↓

Process Signal

↓

Fire Or Not Fire

Activation functions loosely mimic this behavior.


Activation Function Workflow

flowchart TD

A[Inputs]

A --> B[Weights]

B --> C[Bias]

C --> D[Activation Function]

D --> E[Output]

Linear Activation Function

Simplest activation:

Output = Input

Problem With Linear Activation

Suppose:

Layer 1

↓

Layer 2

↓

Layer 3

Each layer remains linear.

Entire network becomes:

Single Linear Equation

No deep learning advantage.
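
The collapse into a single linear equation can be checked with a little arithmetic. The sketch below uses scalar layers of the form y = w * x + b; the weights and biases are made-up values for illustration, not taken from any trained model.

```python
# Each "layer" here is a scalar linear function: y = w * x + b.
# The (weight, bias) pairs are hypothetical values chosen for illustration.
layers = [(2.0, 1.0), (-0.5, 3.0), (4.0, -2.0)]


def forward_stacked(x):
    """Pass x through every linear layer in turn (no activation)."""
    for w, b in layers:
        x = w * x + b
    return x


def collapse(stack):
    """Fold a stack of linear layers into one equivalent (weight, bias)."""
    w_total, b_total = 1.0, 0.0
    for w, b in stack:
        # w * (w_total * x + b_total) + b  ==  (w * w_total) * x + (w * b_total + b)
        w_total, b_total = w * w_total, w * b_total + b
    return w_total, b_total


w_single, b_single = collapse(layers)

for x in [-3.0, 0.0, 1.5, 10.0]:
    # The three stacked layers and the single folded layer agree.
    print(x, forward_stacked(x), w_single * x + b_single)
```

Three layers behave exactly like one layer with a combined weight and bias, which is why depth alone adds nothing without a non-linear activation between layers.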


Why Non-Linearity Matters

Real-world relationships are not linear.

Examples:

Customer Spending

Disease Risk

Image Pixels

Language Understanding

Neural networks need non-linear functions.


Types of Activation Functions

flowchart TD

A[Activation Functions]

A --> B[Sigmoid]

A --> C[Tanh]

A --> D[ReLU]

A --> E[Leaky ReLU]

A --> F[Softmax]

Sigmoid Function

One of the oldest activation functions.

Output range:

0 to 1

Formula

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$


Sigmoid Curve

xychart-beta
title "Sigmoid Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" 0 --> 1
line [0.01,0.02,0.05,0.12,0.27,0.5,0.73,0.88,0.95,0.98,0.99]

Sigmoid Characteristics

Large Negative Input:

≈ 0

Large Positive Input:

≈ 1

Input Near Zero:

≈ 0.5 (exactly 0.5 at input 0)
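
These characteristics are easy to reproduce in code. The following is a minimal sketch of the Sigmoid function in plain Python; the two-branch form is one common way to avoid overflow for large negative inputs, not the only way to write it.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    # so that math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


for x in [-5, -1, 0, 1, 5]:
    print(x, round(sigmoid(x), 2))
```

The printed values match the corresponding points on the Sigmoid curve: close to 0 on the left, 0.5 at zero, close to 1 on the right.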

Sigmoid Use Cases

Commonly used for:

Binary Classification

Yes / No

Spam / Not Spam

Fraud / Not Fraud

Sigmoid Advantages

✅ Probability Output

✅ Easy Interpretation

✅ Smooth Curve


Sigmoid Limitations

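❌ Vanishing Gradient Problem (gradient is at most 0.25 and approaches 0 for large positive or negative inputs)

❌ Slow Training In Deep Networks

❌ Not Zero-Centered (outputs are always positive)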
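
To see why Sigmoid gradients vanish, it helps to look at the derivative directly. This is a simplified sketch: it only tracks the activation's own gradient and ignores the weight terms that also appear in the chain rule during backpropagation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in [0, 2, 5, 10]:
    print(x, sigmoid_grad(x))

# Backpropagation multiplies one such factor per layer. Even in the best
# case (0.25 per layer), the product shrinks geometrically with depth.
best_case_after_10_layers = 0.25 ** 10
print(best_case_after_10_layers)
```

After only ten Sigmoid layers, the best-case factor is already below one in a million, so early layers receive almost no learning signal.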


Tanh Function

A rescaled and shifted version of Sigmoid that is centered at zero.

Output range:

-1 to +1

Formula

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$


Tanh Curve

xychart-beta
title "Tanh Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" -1 --> 1
line [-1.000,-0.999,-0.995,-0.964,-0.762,0,0.762,0.964,0.995,0.999,1.000]

Advantages

✅ Zero-Centered

✅ Often Trains Better Than Sigmoid In Hidden Layers

✅ Stronger Gradients Near Zero (maximum slope 1 vs. 0.25 for Sigmoid)


Limitations

❌ Still Suffers Vanishing Gradient

❌ Not Ideal For Deep Networks
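
The advantages and limitations of Tanh can both be read off its derivative. The sketch below uses Python's built-in `math.tanh`; the helper names `tanh_grad` and `sigmoid` are introduced here for illustration only.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Zero-centered: tanh is an odd function, so tanh(-x) == -tanh(x).
print(math.tanh(-1.0), math.tanh(1.0))

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
x = 0.7
print(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# The gradient is 1 at x = 0 (sigmoid peaks at 0.25) but still
# approaches 0 for large |x|, which is why tanh also saturates.
for x in [0, 2, 5]:
    print(x, tanh_grad(x))
```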


ReLU Function

One of the most popular activation functions for hidden layers today.

ReLU stands for:

Rectified Linear Unit

Formula

$$f(x) = \max(0, x)$$


ReLU Behavior

Negative Values

↓

0

Positive Values

↓

Keep Original Value

ReLU Example

| Input | Output |
|-------|--------|
| -5    | 0      |
| -2    | 0      |
| 0     | 0      |
| 3     | 3      |
| 8     | 8      |
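
The table above can be reproduced with a one-line function. This is a minimal sketch in plain Python rather than any particular library's implementation.

```python
def relu(x):
    """ReLU: return x for positive inputs, 0 otherwise."""
    return x if x > 0 else 0


inputs = [-5, -2, 0, 3, 8]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0, 0, 0, 3, 8]
```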

ReLU Curve

xychart-beta
title "ReLU Function"
x-axis [-5,-4,-3,-2,-1,0,1,2,3,4,5]
y-axis "Output" 0 --> 5
line [0,0,0,0,0,0,1,2,3,4,5]

Why ReLU Became Popular

ReLU reduces the:

Vanishing Gradient Problem

because its gradient is 1 for every positive input, and it is very cheap to compute.


ReLU Advantages

✅ Fast Computation

✅ Simple

✅ Efficient

✅ Works Well In Deep Networks

✅ Common Default For Hidden Layers


ReLU Limitations

Problem:

Dying ReLU

A neuron can become permanently inactive.


Dying ReLU Example

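Suppose the weighted sum is negative for every input:

-10

-20

-30

Output always:

0

The gradient is also 0, so the weights are never updated and the neuron stops learning.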
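
The sketch below shows why such a neuron stops learning. It models a single hypothetical neuron with one weight and a large negative bias; the learning rate, upstream gradient, and input values are all assumptions made for illustration.

```python
def relu_grad(z):
    """Derivative of ReLU with respect to its input (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0


# Hypothetical single neuron: z = w * x + b, output = relu(z).
w, b = 0.5, -40.0        # a large negative bias pushes z below zero
learning_rate = 0.1
upstream_grad = 1.0      # gradient arriving from the loss (assumed)

for x in [10.0, 20.0, 30.0]:
    z = w * x + b                               # negative for every input
    grad_w = upstream_grad * relu_grad(z) * x   # chain rule
    w -= learning_rate * grad_w                 # gradient descent step
    print(x, z, grad_w, w)
```

Every weighted sum is negative, so every gradient is zero and the weight never moves from its starting value.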


Leaky ReLU

Created to address the Dying ReLU problem.

Instead of:

Negative Input → 0

Use:

Negative Input → Small Negative Value

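Formula

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$

where α is a small positive constant such as 0.01.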


Advantages

✅ Reduces The Risk Of Dead Neurons

✅ Better Gradient Flow

✅ Improved Training


Leaky ReLU Visualization

flowchart LR

A[Negative Input]

A --> B[Small Negative Output]

C[Positive Input]

C --> D[Positive Output]
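
A minimal sketch of Leaky ReLU and its gradient follows. The slope `alpha=0.01` is an assumed value chosen for illustration; libraries use different defaults.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) otherwise."""
    return 1.0 if x > 0 else alpha


for x in [-10.0, -2.0, 0.0, 3.0]:
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient for negative inputs is small but not zero, the weights can still receive updates and the neuron has a chance to recover.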

Softmax Function

Used for:

Multi-Class Classification

Examples:

Cat

Dog

Bird

Horse

Softmax Output

Converts scores into probabilities.

Example:

| Class | Probability |
|-------|-------------|
| Cat   | 0.70        |
| Dog   | 0.20        |
| Bird  | 0.05        |
| Horse | 0.05        |

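Softmax Formula

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where z_i is the raw score for class i and K is the number of classes.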


Softmax Characteristics

All outputs:

Between 0 and 1

Total probability:

Always = 1
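
Both characteristics can be verified with a small implementation. In this sketch the raw scores are hypothetical values for Cat, Dog, Bird, and Horse; subtracting the maximum score is a standard trick for numerical stability and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of raw scores (logits) into probabilities."""
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for Cat, Dog, Bird, Horse.
scores = [2.0, 0.75, -0.6, -0.6]
probs = softmax(scores)

print([round(p, 2) for p in probs])  # each value lies between 0 and 1
print(sum(probs))                    # the total is 1, up to rounding error
```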

Classification Example

Input Image:

Animal Image

Output:

Cat = 70%

Dog = 20%

Bird = 5%

Horse = 5%

Prediction:

Cat
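
Picking the prediction is simply a matter of taking the class with the highest probability. A small sketch using the probabilities from the example above:

```python
# Probabilities from the classification example above.
probabilities = {"Cat": 0.70, "Dog": 0.20, "Bird": 0.05, "Horse": 0.05}

# The predicted class is the one with the largest probability (argmax).
prediction = max(probabilities, key=probabilities.get)
print(prediction)  # Cat
```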

Activation Function Comparison

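| Function   | Range              | Use Case                             |
|------------|--------------------|--------------------------------------|
| Sigmoid    | 0 to 1             | Binary Classification (Output Layer) |
| Tanh       | -1 to 1            | Hidden Layers (Older Models)         |
| ReLU       | 0 to ∞             | Deep Networks (Hidden Layers)        |
| Leaky ReLU | -∞ to ∞            | Deep Networks (Hidden Layers)        |
| Softmax    | 0 to 1 (sums to 1) | Multi-Class Output                   |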

Real World Banking Example

Fraud Detection

Hidden Layers:

ReLU

Output Layer:

Sigmoid

Prediction:

Fraud

Not Fraud
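
The ReLU-then-Sigmoid layout can be sketched as a tiny forward pass. Everything numeric here is hypothetical: the weights, biases, input features, and the 0.5 decision threshold are hand-picked for illustration, whereas a real model would learn its parameters from data.

```python
import math


def relu(z):
    return z if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical, hand-picked parameters (a real model would learn these).
hidden_weights = [[0.8, -0.4], [-0.3, 0.9]]
hidden_biases = [0.1, -0.2]
output_weights = [[1.2, -0.7]]
output_biases = [-0.5]

# Hypothetical scaled transaction features, e.g. amount and account age.
transaction = [1.5, 0.2]

hidden = dense(transaction, hidden_weights, hidden_biases, relu)
fraud_probability = dense(hidden, output_weights, output_biases, sigmoid)[0]

# Assumed decision threshold of 0.5.
label = "Fraud" if fraud_probability >= 0.5 else "Not Fraud"
print(fraud_probability, label)
```

The hidden layer uses ReLU to build non-linear features, and the single Sigmoid output squashes the final score into a value between 0 and 1 that can be read as a fraud probability.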

Healthcare Example

Disease Classification

Hidden Layers:

ReLU

Output:

Softmax

Prediction:

Diabetes

Cancer

Heart Disease

ChatGPT Example

Transformer networks such as GPT-style models commonly use:

GELU

and other ReLU variants

in their feed-forward layers instead of Sigmoid.

These tend to perform better in deep architectures.
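
For comparison with ReLU, here is a minimal sketch of GELU using its exact definition, x multiplied by the standard normal CDF. Many libraries also offer a faster tanh-based approximation, which is not shown here.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return x if x > 0 else 0.0


# GELU tracks ReLU for large |x| but is smooth around zero and
# lets small negative values through instead of cutting them to 0.
for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, round(gelu(x), 4), relu(x))
```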


Neural Network Architecture

flowchart LR

A[Input Layer]

A --> B[ReLU]

B --> C[ReLU]

C --> D[Softmax]

D --> E[Prediction]

Python Example

ReLU

import tensorflow as tf

relu = tf.keras.layers.ReLU()

Sigmoid

# Output layer with one unit for binary classification
sigmoid_output = tf.keras.layers.Dense(1, activation='sigmoid')

Softmax

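# Output layer with one unit per class (10 classes assumed here)
softmax_output = tf.keras.layers.Dense(10, activation='softmax')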

Keras Example

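import tensorflow as tf

model = tf.keras.Sequential([
    # Hidden layer: 64 units with ReLU
    tf.keras.layers.Dense(
        64,
        activation='relu'
    ),
    # Output layer: 10 classes, Softmax turns scores into probabilities
    tf.keras.layers.Dense(
        10,
        activation='softmax'
    )
])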

Interview Questions

What is an Activation Function?

A mathematical function that introduces non-linearity into neural networks.


Why Are Activation Functions Needed?

Without them, neural networks behave like simple linear models.


What is Sigmoid Used For?

Binary classification problems.


Why Is ReLU Popular?

Fast computation and reduced vanishing gradients.


What is Dying ReLU?

A neuron that always outputs zero and stops learning.


Why Use Leaky ReLU?

To reduce the risk of dead neurons by keeping a small, non-zero gradient for negative inputs.


What is Softmax Used For?

Multi-class classification problems.


Which Activation Function Is Most Common?

ReLU (or a variant) for hidden layers; Sigmoid for binary outputs and Softmax for multi-class outputs.


Key Takeaways

  • Activation Functions enable neural networks to learn complex patterns.
  • Sigmoid is used in the output layer for binary classification.
  • Tanh improves upon Sigmoid by being zero-centered.
  • ReLU is the most widely used activation function for hidden layers today.
  • Leaky ReLU mitigates the Dying ReLU problem.
  • Softmax converts raw scores into probabilities that sum to 1.
  • Activation functions are a critical foundation of Deep Learning.