Machine Learning: Trying to detect outliers or unusual behavior
This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.
The introductory post “Machine Learning: Where to begin…” can be found here, and Algorithm Explorer here
If you are looking to use machine learning to detect outliers or unusual behavior, you should look at Anomaly Detection Techniques.
Anomaly Detection Techniques
Anomaly Detection is a technique used to identify unusual events or patterns that do not conform to expected behavior. Those identified are often referred to as anomalies or outliers.
Use-Cases
- Detect abnormal behavior of equipment in a manufacturing plant using sensor data such as temperature, pressure and humidity
- Detect and prevent fraudulent spending by understanding normal customer spending amounts, locations and time between transactions
Most Common Anomaly Detection Algorithms
Below are introductions to the most common algorithms for anomaly detection: K-Means, One-Class Support Vector Machines, and Autoencoders.
K-Means
Clustering techniques are a common approach to anomaly detection. Clusters of “normal” characteristics are identified, and if the distance between a new point and every cluster is too great, the point is flagged as an anomaly.
K-Means Clustering aims to partition n observations (data points) into k clusters in which each observation belongs to the cluster with the nearest center.
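To make the distance-to-cluster idea concrete, here is a minimal sketch using scikit-learn’s KMeans. The synthetic data, the choice of k = 3, and the 95th-percentile distance threshold are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of K-Means-based anomaly detection (scikit-learn).
# The data, k, and the threshold are assumptions made for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))  # stand-in for "normal" historical data

# Partition the normal data into k user-defined clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

# transform() returns each point's distance to every cluster centre;
# keep only the distance to the nearest one.
train_dist = np.min(kmeans.transform(X_train), axis=1)

# Treat anything farther than 95% of the training points as anomalous.
threshold = np.percentile(train_dist, 95)

X_new = np.array([[0.1, -0.2],   # close to the normal data
                  [6.0, 6.0]])   # far from every cluster
is_anomaly = np.min(kmeans.transform(X_new), axis=1) > threshold
print(is_anomaly)  # expected: [False  True]
```

In practice the threshold would be tuned against known incidents or an acceptable false-alarm rate rather than a fixed percentile.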
For more examples of Clustering techniques that can be used for anomaly detection, see Machine Learning: Trying to discover structure in your data
Pros
- Simple and easy to implement
- Easy to interpret results
- Fast
Cons
- Sensitive to outliers
- You must define the number of clusters
- Assumes the clusters are spherical
- The clusters are found from a random starting point, so results may not be repeatable and multiple runs can be required to find an optimal solution
Vocabulary
Characteristics — Characteristics are common or similar values seen in the data based upon features, e.g. people who spend a lot of money on rent, or those who infrequently make purchases.
k — k is a user-defined value referring to the number of clusters the algorithm should find.
Observation — An observation is a single example, a data point or row in the data.
Cluster — A group of similar things that are close together.
One-Class Support Vector Machines
If you were to plot your data in an n-dimensional space (where n is the number of features), a One-Class Support Vector Machine (SVM) attempts to identify the region where most cases lie; these cases are considered “normal”. It then fits a hyperplane that best separates these “normal” examples from the rest. When a new data point arrives, it is labeled “normal” or an “anomaly” depending on how close it is to the “normal” boundary.
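As a rough illustration, the sketch below uses scikit-learn’s OneClassSVM. The RBF kernel and the nu and gamma settings are assumptions chosen for the example and would need tuning on real data.

```python
# A minimal sketch of One-Class SVM anomaly detection (scikit-learn).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))  # assumed "normal" training data

# nu bounds the fraction of training points allowed outside the boundary;
# the RBF kernel lets the learned "normal" region be non-linear.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.0, 0.3],    # near the normal region
                  [5.0, -5.0]])  # far outside it
# predict() returns +1 for "normal" and -1 for "anomaly".
print(ocsvm.predict(X_new))  # expected: [ 1 -1]
```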
Pros
- No assumptions on the distribution of the data
- Ability to find a “normal” boundary that is non-linear
- Can be used in high-dimensional space
Cons
- Choosing the right hyper-parameters to find the appropriate non-linear shape of the boundary can be difficult
- Can be slow to train on large datasets
- Memory intensive
Vocabulary
n-dimensional space — A 1-dimensional (1D) space is represented simply as a line, and a 2-dimensional (2D) space is the Cartesian plane, where you can move up or down and left or right. To generalize to any number of features, we speak of n-dimensional space.
hyperplane — A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.
boundary — This is the line, plane or hyperplane that divides the data between those that have been identified as ‘normal’ and those that are not.
Autoencoders
An Autoencoder is a technique for dimensionality reduction. It is a type of neural network in which the first part, called the encoder, reduces the input to a lower dimension, and the second part, called the decoder, aims to reconstruct the original input. The goal is to create a model whose output matches its input as closely as possible. A new data point can then be passed through the model, and if the error between the input and the reconstructed output is too great, the point can be flagged as an anomaly.
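Below is a minimal sketch of this reconstruction-error approach in PyTorch. The layer sizes, training schedule, and 95th-percentile error threshold are all illustrative assumptions; a real application would use a deeper network and careful tuning.

```python
# A minimal sketch of autoencoder-based anomaly detection (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(500, 8)  # stand-in for "normal" data, 8 features

# Encoder compresses 8 features down to 2; decoder reconstructs all 8.
model = nn.Sequential(
    nn.Linear(8, 2), nn.ReLU(),  # encoder
    nn.Linear(2, 8),             # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train the network to reproduce its own input.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), X_train)
    loss.backward()
    optimizer.step()

# Score points by per-example reconstruction error.
with torch.no_grad():
    errors = ((model(X_train) - X_train) ** 2).mean(dim=1)
    threshold = errors.quantile(0.95)  # cutoff chosen for illustration

    x_new = torch.full((1, 8), 4.0)  # a point unlike the training data
    new_err = ((model(x_new) - x_new) ** 2).mean(dim=1)
    print(bool(new_err > threshold))  # likely True -> flagged as anomaly
```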
Pros
- Can capture non-linear relationships and subtle connections between the features
- Variants can achieve state-of-the-art results
- If the data is temporal, LSTM (long short-term memory) autoencoders can be used
Cons
- Requires a very large amount of data
- Many hyper-parameters to tune
- Long training time
- Requires significant computing power for large datasets
Vocabulary
Dimensionality reduction — The initial number of dimensions is the number of features. The goal of dimensionality reduction is to reduce the number of dimensions without losing important information.
Neural network — Neural networks can learn complex patterns using “hidden layers” between inputs and the output. These layers are made of neurons which mathematically transform the data.
Input — The features are passed as inputs.
Model — After training, a machine learning algorithm produces a model: a mathematical function that takes a new observation and calculates an appropriate prediction.
Output — The value the model produces. This is usually the target variable we are trying to predict; for an autoencoder, it is the reconstruction of the input.
Non-linear relationships — A non-linear relationship means that a change in one variable does not correspond to a constant change in the other; the variables still affect each other, but not in a simple straight-line fashion.
Temporal — Temporal data is data relating to time.
Hyper-parameters — A hyper-parameter is a value that is set before building a model; these values are important as they impact the success of the model.
Further Reading
Other posts in this series:
- Machine Learning: Where to begin…
- Machine Learning: Trying to predict a numerical value
- Machine Learning: Trying to classify your data
- Machine Learning: Trying to discover structure in your data
- Machine Learning: Trying to make recommendations
Many Thanks
I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.
Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding some artistic flair to this work!