# Machine Learning: Trying to detect outliers or unusual behavior

This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post “Machine Learning: Where to begin…” can be found here, and Algorithm Explorer here.

If you are looking to use machine learning to detect outliers or unusual behavior, you should look to Anomaly Detection Techniques.

# Anomaly Detection Techniques

Anomaly Detection is a technique used to identify unusual events or patterns that do not conform to expected behavior. The events or patterns identified are often referred to as anomalies or outliers.

## Use-Cases

- Detect abnormal behavior of equipment in a manufacturing plant using sensor data such as temperature, pressure and humidity
- Detect and prevent fraudulent spending by understanding normal customer spending amounts, locations and time between transactions

## Most Common Anomaly Detection Algorithms

Below are introductions to the most common algorithms for anomaly detection: *K-Means*, *One-Class Support Vector Machines*, and *Autoencoders*.

# K-Means

Clustering techniques are a common approach for anomaly detection. Clusters of “normal” *characteristics* are identified, and if the distance between a new point and all existing clusters is too great, the point is identified as an anomaly.

K-Means Clustering aims to partition n *observations* (data points) into *k* clusters in which each *observation* belongs to the *cluster* with the nearest center.

For more examples of clustering techniques that can be used for anomaly detection, see **Machine Learning: Trying to discover structure in your data**.
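As a minimal sketch of this idea, assuming scikit-learn is available (the toy data and the distance threshold below are made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "normal" data: two tight clusters (hypothetical values for illustration)
rng = np.random.default_rng(0)
normal = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(100, 2)),
])

# Fit k-means with a user-chosen k=2 on the "normal" data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)

def is_anomaly(point, model, threshold=1.0):
    """Flag a point as an anomaly if its distance to the nearest
    cluster center exceeds the (arbitrary) threshold."""
    distances = np.linalg.norm(model.cluster_centers_ - point, axis=1)
    return distances.min() > threshold

print(is_anomaly(np.array([0.1, 0.2]), kmeans))    # near a cluster -> False
print(is_anomaly(np.array([10.0, -3.0]), kmeans))  # far from both -> True
```

In practice the threshold would be tuned on held-out data, for example by looking at the distribution of distances among known “normal” points.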

## Pros

- Simple and easy to implement
- Easy to interpret results
- Fast

## Cons

- Sensitive to outliers
- You must define the number of clusters
- Assumes the clusters are spherical
- The clusters are found using a random starting point, so results may not be repeatable and can require multiple runs to find an optimal solution

## Vocabulary

**Characteristics** — *Characteristics are common or similar values seen in the data based upon features, e.g. people that spend a lot of money on rent, those that infrequently make purchases, etc.*

**k** — *k is a user-defined value referring to the number of clusters the algorithm should find.*

**Observation** — *An observation is a single example, a data point or row in the data.*

**Cluster** — *A group of similar things that are close together.*

# One-Class Support Vector Machines

If you were to plot your data in an *n-dimensional space* (where n is the number of features), One-Class Support Vector Machines (SVM) attempt to identify the region where most cases lie; these are considered “normal”. The algorithm then fits a *hyperplane* that best separates these “normal” examples from the rest. When you have a new data point, it is labeled as “normal” or an “anomaly” depending on how close it is to the “normal” *boundary*.

## Pros

- No assumptions on the distribution of the data
- Ability to find normal boundary that is non-linear
- Can be used in high-dimensional space

## Cons

- Choosing the right hyper-parameters to find the appropriate non-linear shape of the boundary can be difficult
- Can be slow to train on large datasets
- Memory intensive
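A rough sketch of a one-class SVM anomaly detector, assuming scikit-learn is available (the toy data and the `nu` value are illustrative assumptions, not from the original post):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy "normal" data: points scattered around the origin (for illustration)
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# An RBF kernel lets the learned boundary be non-linear; `nu` roughly bounds
# the fraction of training points allowed to fall outside the boundary.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

# predict() returns +1 for points inside the "normal" region, -1 for anomalies
pred = ocsvm.predict([[0.1, -0.2], [6.0, 6.0]])
print(pred)  # expected: first point inside (+1), second flagged as anomaly (-1)
```

The kernel choice and `nu` are exactly the hyper-parameters mentioned in the cons above; they control how tightly the non-linear boundary hugs the training data.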

## Vocabulary

**n-dimensional space** — *A 1-dimensional (1D) space is represented simply as a line, and a 2-dimensional (2D) space is referred to as the Cartesian plane, where you can move up or down and right or left. To generalize, n-dimensional space is used.*

**Hyperplane** — *A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.*

**Boundary** — *This is the line, plane or hyperplane that divides the data between those that have been identified as “normal” and those that are not.*

# Autoencoders

An Autoencoder is a technique for *dimensionality reduction*. It is a type of *neural network* where the first part of the network, called the encoder, reduces the *input* to a lower dimension. The second part of the network, called the decoder, aims to reconstruct the original input. The goal is to create a *model* where the input and *output* are the same. A new data point can be passed through the model and, if the error between the input data and the computed output is too great, it can be flagged as an anomaly.

## Pros

- Can capture *non-linear relationships* and subtle connections between the features
- Variations of autoencoders achieve state-of-the-art results
- If the data is *temporal*, LSTM (long short-term memory) autoencoders can be used

## Cons

- Requires a very large amount of data
- Many *hyper-parameters* to tune
- Long training time
- Requires significant computing power for large datasets
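As a minimal, hypothetical sketch of reconstruction-error anomaly detection, a bottleneck network trained to reproduce its own input acts as an autoencoder. This uses scikit-learn's `MLPRegressor` rather than a deep-learning framework, and the toy data and threshold logic are invented for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy "normal" data: 4 features that are strongly correlated (made up)
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base, 2 * base, -base, 0.5 * base]) \
    + rng.normal(scale=0.05, size=(500, 4))

# Hidden layer of 2 units is the bottleneck: input -> 2 dims -> reconstruction.
# Training the network to output its own input makes it an autoencoder.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                           solver="lbfgs", max_iter=1000,
                           random_state=0).fit(X, X)

def reconstruction_error(points):
    """Mean squared error between the input and the model's reconstruction."""
    return np.mean((points - autoencoder.predict(points)) ** 2, axis=1)

normal_point = np.array([[1.0, 2.0, -1.0, 0.5]])  # follows the learned pattern
anomaly = np.array([[1.0, -2.0, 1.0, 3.0]])       # breaks the correlations
print(reconstruction_error(normal_point), reconstruction_error(anomaly))
```

A point that follows the correlations the encoder has learned reconstructs with low error, while a point that breaks them reconstructs poorly and can be flagged once its error exceeds a chosen threshold.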

## Vocabulary

**Dimensionality reduction** — *The initial number of dimensions is the number of features. The goal of dimensionality reduction is to reduce the number of dimensions without losing important information.*

**Neural network** — *Neural networks can learn complex patterns using “hidden layers” between the inputs and the output. These layers are made of neurons which mathematically transform the data.*

**Input** — *The features are passed as inputs.*

**Model** — *Machine learning algorithms create a model after training; this is a mathematical function that takes a new observation and calculates an appropriate prediction.*

**Output** — *This is the target variable, the thing we are trying to predict.*

**Non-linear relationships** — *A non-linear relationship means that a change in the first variable doesn’t necessarily correspond to a constant change in the second. The variables may impact each other, but in a way that appears unpredictable.*

**Temporal** — *Temporal data is data relating to time.*

**Hyper-parameters** — *A hyper-parameter is a value that is set prior to building a model; these values are important as they impact the success of the model.*

# Further Reading

Other posts in this series:

- Machine Learning: Where to begin…
- Machine Learning: Trying to predict a numerical value
- Machine Learning: Trying to classify your data
- Machine Learning: Trying to discover structure in your data
- Machine Learning: Trying to make recommendations

# Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding an artistic flair to this work!