Deep Learning: Overview of Neurons and Activation Functions
This post is designed to be an overview of the concepts and terminology used in deep learning. Its goal is to provide an introduction to neural networks before describing some of the mathematics behind neurons and activation functions.
What is an Artificial Neural Network?
- Neural networks can learn complex patterns using layers of neurons which mathematically transform the data
- The layers between the input and output are referred to as “hidden layers”
- A neural network can learn relationships between the features that other algorithms cannot easily discover
The diagram above shows a Multilayer Perceptron (MLP). An MLP must have at least three layers: the input layer, a hidden layer and the output layer. MLPs are fully connected; each node in one layer connects with a weight to every node in the next layer.
The term “deep learning” refers to machine learning models built with many hidden layers: deep neural networks.
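As a rough sketch (not from the original post; the layer sizes, random weights, and the use of tanh as the activation are illustrative assumptions), a forward pass through a small fully connected network looks like this in NumPy:

```python
import numpy as np

# hypothetical layer sizes: 3 input features, 4 hidden neurons, 1 output neuron
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input  -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden -> output

def forward(x):
    # each layer applies a weighted sum plus bias, then a non-linear activation
    hidden = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ hidden + b2)

print(forward(np.array([0.5, -1.2, 3.0])))
```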
What is a neuron?
An artificial neuron (also referred to as a perceptron) is a mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together. This value is then passed to a non-linear function, known as an activation function, to become the neuron’s output.
- The x values refer to inputs, either the original features or inputs from a previous hidden layer
- At each layer there is also a bias term b, which helps the model fit the data better
- The neuron passes the value a to all neurons it is connected to in the next layer, or returns it as the final value
The calculation starts with a linear equation:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

before a non-linear activation function $g$ is applied to give the neuron’s output:

$$a = g(z) = g(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$$
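As a minimal sketch of a single neuron in NumPy (the example inputs, weights, and the choice of sigmoid as the activation g are illustrative assumptions; activation functions are covered in the next section):

```python
import numpy as np

def sigmoid(z):
    # one common choice of activation function, covered below
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g=sigmoid):
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return g(z)            # the activation function gives the output a

# two inputs with example weights and bias
print(neuron(x=np.array([1.0, 2.0]), w=np.array([0.3, -0.5]), b=0.1))
```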
Which brings us to our next question…
What is an Activation Function?
An activation function is a non-linear function applied by a neuron to introduce non-linear properties in the network.
A relationship is linear if a change in the first variable corresponds to a constant change in the second variable. A non-linear relationship means that a change in the first variable doesn’t necessarily correspond to a constant change in the second: the variables may still affect each other, just not at a fixed, predictable rate.
A quick visual example: by introducing non-linearity we can better capture the patterns in the data.
Linear Activation Function
- A straight-line function: the output is proportional to the input, so the gradient is a constant value
- Values can get very large
- The linear function alone doesn’t capture complex patterns
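A minimal sketch (the slope c is just an illustrative constant):

```python
def linear(z, c=1.0):
    # output is proportional to the input, so the gradient is the constant c
    return c * z
```

Note that stacking layers that only use a linear activation collapses into a single linear transformation, which is one way to see why the linear function alone cannot capture complex patterns.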
Sigmoid Activation Function
- A non-linear function so can capture more complex patterns
- Output values are bounded so don’t get too large
- Can suffer from “vanishing gradient”
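For reference, the sigmoid is $\sigma(z) = \frac{1}{1 + e^{-z}}$. A minimal sketch showing why the gradient can vanish for large $|z|$:

```python
import numpy as np

def sigmoid(z):
    # output is bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative peaks at 0.25 (at z = 0) and approaches 0 for large |z|,
    # which is the source of the "vanishing gradient" problem
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0), sigmoid_grad(10.0))  # ~0.25 vs ~4.5e-05
```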
Hyperbolic Tangent Activation Function
- A non-linear function so can capture more complex patterns
- Output values are bounded so don’t get too large
- Can suffer from “vanishing gradient”
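For reference, $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, which is bounded between -1 and 1. A minimal sketch:

```python
import numpy as np

def tanh_grad(z):
    # the derivative 1 - tanh(z)^2 is largest at z = 0 and shrinks towards 0
    # for large |z|, so the gradient can still vanish
    return 1.0 - np.tanh(z) ** 2

print(np.tanh(np.array([-2.0, 0.0, 2.0])), tanh_grad(2.0))
```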
Rectified Linear Unit (ReLU) Activation Function
- A non-linear function so can capture more complex patterns
- Values can get very large
- As negative inputs are all mapped to 0, certain patterns may not be captured
- The gradient is 0 for negative inputs, so those weights stop being updated: the “dying ReLU problem”
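For reference, $\text{ReLU}(z) = \max(0, z)$. A minimal sketch:

```python
import numpy as np

def relu(z):
    # positive values pass through unchanged, negative values become 0
    return np.maximum(0.0, z)

def relu_grad(z):
    # gradient is 1 for positive inputs and exactly 0 for negative inputs,
    # so a neuron stuck in the negative region stops learning ("dying ReLU")
    return (z > 0).astype(float)

print(relu(np.array([-3.0, 2.0])), relu_grad(np.array([-3.0, 2.0])))
```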
Leaky ReLU Activation Function
- A non-linear function so can capture more complex patterns
- Attempts to solve the “dying ReLU problem”
- Values can get very large
Alternatively, instead of using a fixed slope of 0.01 for negative values, the slope can be a parameter, α, which is learned during training alongside the weights. This is referred to as Parametric ReLU (PReLU); see the sketch below.
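A minimal sketch of both variants (0.01 is the conventional Leaky ReLU slope; alpha stands in for the learned PReLU parameter):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # a small fixed slope for negative inputs keeps the gradient non-zero
    return np.where(z > 0, z, slope * z)

def prelu(z, alpha):
    # same shape, but alpha is learned during training alongside the weights
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-3.0, 2.0])), prelu(np.array([-3.0, 2.0]), alpha=0.2))
```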
Softmax Activation Function
- Each value ranges between 0 and 1 and all values sum to 1, so the output can be used to model a probability distribution
- Only used in the output layer rather than throughout the network
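For reference, $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. A minimal sketch (subtracting the maximum is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    # each output lies between 0 and 1 and all outputs sum to 1
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```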
Summary
Hopefully this post was valuable in providing an overview of neural networks and activation functions. The next post continues by discussing which final-layer activation functions should be used with which loss function depending on the purpose of building the model.
Deep Learning: Which Loss and Activation Functions should I use?