Machine Learning: Trying to detect outliers or unusual behavior
This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.
The introductory post “Machine Learning: Where to begin…” can be found here, and Algorithm Explorer here
If you are looking to use machine learning to detect outliers or unusual behavior, you should look at Anomaly Detection Techniques.
Anomaly Detection Techniques
Anomaly Detection is a technique used to identify unusual events or patterns that do not conform to expected behavior. Those identified are often referred to as anomalies or outliers.
Use-Cases
- Detect abnormal behavior of equipment in a manufacturing plant using sensor data such as temperature, pressure and humidity
- Detect and prevent fraudulent spending by understanding normal customer spending amounts, locations and time between transactions
Most Common Anomaly Detection Algorithms
Below are introductions to the most common algorithms for anomaly detection: K-Means, One-Class Support Vector Machines, and Autoencoders.
K-Means
Clustering techniques are a common approach to anomaly detection. Clusters of “normal” characteristics are identified, and if the distance between a new point and every cluster is too great, the point is flagged as an anomaly.
K-Means Clustering aims to partition n observations (data points) into k clusters in which each observation belongs to the cluster with the nearest center.
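To make the distance-to-cluster idea concrete, here is a minimal sketch using scikit-learn’s KMeans. The synthetic data, the choice of k = 3, and the 95th-percentile distance threshold are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of K-Means-based anomaly detection (scikit-learn).
# The data, k, and the threshold are assumptions made for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))  # stand-in for "normal" historical data

# Partition the normal data into k user-defined clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

# transform() returns each point's distance to every cluster centre;
# keep only the distance to the nearest one.
train_dist = np.min(kmeans.transform(X_train), axis=1)

# Treat anything farther than 95% of the training points as anomalous.
threshold = np.percentile(train_dist, 95)

X_new = np.array([[0.1, -0.2],   # close to the normal data
                  [6.0, 6.0]])   # far from every cluster
is_anomaly = np.min(kmeans.transform(X_new), axis=1) > threshold
print(is_anomaly)  # expected: [False  True]
```

In practice the threshold would be tuned against known incidents or an acceptable false-alarm rate rather than a fixed percentile.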
For more examples of Clustering techniques that can be used for anomaly detection, see Machine Learning: Trying to discover structure in your data
Pros
- Simple and easy to implement
- Easy to interpret results
- Fast
Cons
- Sensitive to outliers
- You must define the number of clusters
- Assumes the clusters are spherical
- The clusters are found from a random starting point, so results may not be repeatable and multiple runs can be required to find an optimal solution
Vocabulary
Characteristics — Characteristics are common or similar values seen in the data based upon features, e.g. people who spend a lot of money on rent, or those who infrequently make purchases.
k — k is a user-defined value referring to the number of clusters the algorithm should find.
Observation — An observation is a single example, a data point or row in the data.
Cluster — A group of similar things that are close together.
One-Class Support Vector Machines
If you were to plot your data in an n-dimensional space (where n is the number of features), a One-Class Support Vector Machine (SVM) attempts to identify the region where most cases lie; these cases are considered “normal”. It then fits a hyperplane that best separates these “normal” examples from the rest. When a new data point arrives, it is labeled “normal” or an “anomaly” depending on how close it is to the “normal” boundary.
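As a rough illustration, the sketch below uses scikit-learn’s OneClassSVM. The RBF kernel and the nu and gamma settings are assumptions chosen for the example and would need tuning on real data.

```python
# A minimal sketch of One-Class SVM anomaly detection (scikit-learn).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))  # assumed "normal" training data

# nu bounds the fraction of training points allowed outside the boundary;
# the RBF kernel lets the learned "normal" region be non-linear.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.0, 0.3],    # near the normal region
                  [5.0, -5.0]])  # far outside it
# predict() returns +1 for "normal" and -1 for "anomaly".
print(ocsvm.predict(X_new))  # expected: [ 1 -1]
```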
Pros
- No assumptions on the distribution of the data
- Ability to find a “normal” boundary that is non-linear
- Can be used in high-dimensional space
Cons
- Choosing the right hyper-parameters to find the appropriate non-linear shape of the boundary can be difficult
- Can be slow to train on large datasets
- Memory intensive
Vocabulary
n-dimensional space — A 1-dimensional (1D) space is represented simply as a line, and a 2-dimensional (2D) space is the Cartesian plane, where you can move up or down and left or right. To generalize to any number of features, we speak of n-dimensional space.
hyperplane — A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.
boundary — This is the line, plane or hyperplane that divides the data between those that have been identified as ‘normal’ and those that are not.
Autoencoders
An Autoencoder is a technique for dimensionality reduction. It is a type of neural network in which the first part, called the encoder, reduces the input to a lower dimension, and the second part, called the decoder, aims to reconstruct the original input. The goal is to create a model whose output matches its input as closely as possible. A new data point can then be passed through the model, and if the error between the input and the reconstructed output is too great, the point can be flagged as an anomaly.
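Below is a minimal sketch of this reconstruction-error approach in PyTorch. The layer sizes, training schedule, and 95th-percentile error threshold are all illustrative assumptions; a real application would use a deeper network and careful tuning.

```python
# A minimal sketch of autoencoder-based anomaly detection (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(500, 8)  # stand-in for "normal" data, 8 features

# Encoder compresses 8 features down to 2; decoder reconstructs all 8.
model = nn.Sequential(
    nn.Linear(8, 2), nn.ReLU(),  # encoder
    nn.Linear(2, 8),             # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train the network to reproduce its own input.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), X_train)
    loss.backward()
    optimizer.step()

# Score points by per-example reconstruction error.
with torch.no_grad():
    errors = ((model(X_train) - X_train) ** 2).mean(dim=1)
    threshold = errors.quantile(0.95)  # cutoff chosen for illustration

    x_new = torch.full((1, 8), 4.0)  # a point unlike the training data
    new_err = ((model(x_new) - x_new) ** 2).mean(dim=1)
    print(bool(new_err > threshold))  # likely True -> flagged as anomaly
```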
Pros
- Can capture non-linear relationships and subtle connections between the features
- Variants can achieve state-of-the-art results
- If the data is temporal, LSTM (long short-term memory) autoencoders can be used
Cons
- Requires a very large amount of data
- Many hyper-parameters to tune
- Long training time
- Requires significant computing power for large datasets
Vocabulary
Dimensionality reduction — The initial number of dimensions is the number of features. The goal of dimensionality reduction is to reduce the number of dimensions without losing important information.
Neural network — Neural networks can learn complex patterns using “hidden layers” between inputs and the output. These layers are made of neurons which mathematically transform the data.
Input — The features are passed as inputs.
Model — After training, a machine learning algorithm produces a model: a mathematical function that takes a new observation and calculates an appropriate prediction.
Output — The value the model produces. This is usually the target variable we are trying to predict; for an autoencoder, it is the reconstruction of the input.
Non-linear relationships — A non-linear relationship means that a change in one variable does not correspond to a constant change in the other; the variables still affect each other, but not in a simple straight-line fashion.
Temporal — Temporal data is data relating to time.
Hyper-parameters — A hyper-parameter is a value that is set before building a model; these values are important as they impact the success of the model.
Further Reading
Other posts in this series:
- Machine Learning: Where to begin…
- Machine Learning: Trying to predict a numerical value
- Machine Learning: Trying to classify your data
- Machine Learning: Trying to discover structure in your data
- Machine Learning: Trying to make recommendations
Many Thanks
I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.
Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding some artistic flair to this work!