Machine Learning: Trying to discover structure in your data

Stacey Ronaghan
Jul 31, 2018


This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post “Machine Learning: Where to begin…” can be found here and Algorithm Explorer here

If you are looking to use machine learning to solve a business problem requiring you to understand structure in your data, but you don’t have a target value you can use for prediction, you should look to Clustering Techniques.

Clustering Techniques

Clustering algorithms are machine learning techniques that divide data into a number of groups such that points within the same group share similar traits. They are unsupervised learning methods and therefore do not require labelled training examples.

Use-Cases

  • Segmenting your market based upon similar collections of customers using their location, spending habits and demographics
  • Understanding topics in your documents, whether they are emails, reports, or customer call transcripts, by exploring the common words

Most Common Clustering Algorithms

Below are introductions to the most common algorithms for discovering structure in your data: K-Means, GMM, DBSCAN, and Agglomerative Hierarchical Clustering.

K-Means

K-Means Clustering aims to partition the observations into k clusters. The algorithm determines which observation belongs to which cluster and also the center of each cluster. A new observation is assigned to the cluster whose center is nearest to it.
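As a sketch of how this might look in practice, here is a minimal example using scikit-learn (an assumed dependency, not mentioned in the original). Note how k is chosen by the user, and how multiple random restarts (`n_init`) mitigate the sensitivity to the starting point discussed below:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of observations
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# k (n_clusters) is user-defined; n_init reruns the algorithm from
# several random starting points and keeps the best solution
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# A new observation is assigned to the cluster whose center is nearest
new_point = np.array([[8.1, 8.0]])
label = model.predict(new_point)[0]
```

The fitted model exposes both the cluster assignment of each observation (`model.labels_`) and the learned centers (`model.cluster_centers_`).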

Pros

  • Simple and easy to implement
  • Easy to interpret results
  • Fast

Cons

  • Sensitive to outliers
  • You must define the number of clusters
  • Assumes the clusters are spherical
  • The clusters are found using a random starting point, so results may not be repeatable and can require multiple runs to find an optimal solution

Vocabulary

k — k is a user-defined value referring to the number of clusters the algorithm should find.

Observation — An observation is a single example, a data point or row in the data.

Cluster — A group of similar things that are close together.

Gaussian Mixture Model

With a Gaussian Mixture Model (GMM), we assume that the k clusters are normally distributed. The algorithm tries to find the mean and standard deviation of each of these clusters. For a new observation, the probability that it belongs to each cluster is calculated, resulting in a soft assignment.
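A minimal sketch of the soft assignment described above, again assuming scikit-learn. The key difference from K-Means is `predict_proba`, which returns a probability of membership in each cluster rather than a single hard label:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two clusters drawn from normal distributions with different means
a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([a, b])

# n_components is the user-defined number of clusters, k
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per cluster, summing to 1
probs = gmm.predict_proba([[0.1, -0.2]])
```

Because this point sits near the first cluster's mean, its probability for that cluster will be close to 1, but a point midway between the two clusters would receive a more evenly split assignment.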

Pros

  • Does not enforce the clusters to be circular
  • Points can be assigned to multiple clusters

Cons

  • You must define the number of clusters
  • Difficult to interpret

Vocabulary

k — k is a user-defined value referring to the number of clusters the algorithm should find.

Normal distribution — The normal, or Gaussian, distribution is a common probability distribution informally referred to as a bell curve due to its shape.

Mean — The mean is the average, calculated by summing all values and dividing by the count of values.

Standard deviation — Standard deviation is a measure of variance in the data, i.e. how spread out the data is.

Observation — An observation is a single example, a data point or row in the data.

Probability — Probability measures to what extent something is likely to happen or to be a particular case.

Cluster — A group of similar things that are close together.

Soft assignment — Rather than being assigned to one cluster, each point is assigned to all clusters with different probabilities

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) attempts to find dense areas of data points and identifies these as a cluster. If data points are close enough to each other, and there are a sufficient number of them, they form a cluster. If not, they are labeled as noise and ignored.
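The behaviour described above can be sketched as follows, assuming scikit-learn. The two density parameters the cons list refers to are `eps` (the neighborhood radius) and `min_samples` (how many points make a region "dense"); points that belong to no dense region come back labeled `-1`:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense area 1
                   [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],   # dense area 2
                   [9.0, 0.0]])                                       # isolated outlier

# eps: how close points must be to count as neighbors
# min_samples: how many neighbors are needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# The outlier is labeled -1 (noise) and belongs to no cluster
```

Notice that the number of clusters was never specified; DBSCAN discovered the two dense areas on its own, which is also why picking `eps` and `min_samples` well matters so much.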

Pros

  • Can find arbitrarily shaped clusters
  • Does not require defining the number of clusters
  • Robust to outliers

Cons

  • Cannot cluster datasets with large differences in densities
  • Can perform poorly on high dimensional data
  • Choosing the right parameters for density can be difficult

Vocabulary

Dense — Closely compact areas of observations.

Cluster — A group of similar things that are close together.

High dimensional — High dimensional data has a very large number of features. If your data is represented in a CSV file, database, or Excel file, and there are many columns you will use to build a model, it is high dimensional.

Parameters — A parameter is a value that the user defines and that is used by the algorithm; these values are important as they impact the success of the model.

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is an algorithm that builds a hierarchy of clusters. Initially, every point starts as its own cluster; the two nearest clusters then merge into a single cluster. This process continues, clusters merging, until only one cluster containing all the data points remains. To identify the significant clusters, a threshold is chosen.
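A minimal sketch of this, assuming scikit-learn: setting `n_clusters=None` together with a `distance_threshold` tells the algorithm to keep merging until the remaining clusters are farther apart than the threshold, rather than stopping at a user-defined number of clusters:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],   # tight group 1
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.0]])  # tight group 2

# Every point starts as its own cluster; the nearest pair merges
# repeatedly until all remaining clusters are farther apart than
# distance_threshold (the "threshold" used to cut the hierarchy)
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=2.0).fit(points)
```

After fitting, `model.n_clusters_` reports how many clusters survived the threshold, and `model.labels_` gives each observation's assignment. Because merges are slow on large datasets, this approach suits smaller data, as the cons below note.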

Pros

  • The resulting hierarchy (dendrogram) is easy to visualize and interpret
  • Does not require defining the number of clusters
  • Clusters can be arbitrarily shaped

Cons

  • Slow and therefore not suitable for big data
  • Can be hard to identify the correct number of clusters
  • Interpretation can be confusing

Vocabulary

Hierarchy of clusters — A tree-based representation called a dendrogram, where each leaf is an observation and branches join them together. The similarity of observations can be inferred from the height of the branch connecting them.

Cluster — A group of similar things that are close together.

Threshold — A boundary to use to make a decision. For clustering, this threshold might be used to stop the algorithm after a certain number of merges have occurred or when a certain number of clusters are identified.

Further Reading

Other posts in this series:

Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding an artistic flair to this work!
