Machine Learning: Trying to discover structure in your data

Stacey Ronaghan
Jul 31, 2018


This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post “Machine Learning: Where to begin…” can be found here and Algorithm Explorer here

If you are looking to use machine learning to solve a business problem requiring you to understand structure in your data, but you don’t have a target value you can use for prediction, you should look to Clustering Techniques.

Clustering Techniques

Clustering algorithms are machine learning techniques that divide data into a number of groups such that points within the same group share similar traits. They are unsupervised learning methods and therefore do not require labelled training examples.

Use-Cases

  • Segmenting your market based upon similar collections of customers using their location, spending habits and demographics
  • Understanding topics in your documents, whether they are emails, reports, or customer call transcripts, by exploring the common words

Most Common Clustering Algorithms

Below are introductions to the most common algorithms for discovering structure in your data: K-Means, GMM, DBSCAN, and Agglomerative Hierarchical Clustering.

K-Means

K-Means Clustering aims to partition the observations into k clusters. The algorithm determines which observation belongs to which cluster and also the center of each cluster. A new observation is assigned to the cluster whose center is nearest to it.
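As a sketch of how this might look in practice, here is a minimal example using scikit-learn (an assumed dependency, not mentioned in the original). Note how k is chosen by the user, and how multiple random restarts (`n_init`) mitigate the sensitivity to the starting point discussed below:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of observations
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# k (n_clusters) is user-defined; n_init reruns the algorithm from
# several random starting points and keeps the best solution
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# A new observation is assigned to the cluster whose center is nearest
new_point = np.array([[8.1, 8.0]])
label = model.predict(new_point)[0]
```

The fitted model exposes both the cluster assignment of each observation (`model.labels_`) and the learned centers (`model.cluster_centers_`).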

Pros

  • Simple and easy to implement
  • Easy to interpret results
  • Fast

Cons

  • Sensitive to outliers
  • You must define the number of clusters
  • Assumes the clusters are spherical
  • The clusters are found using a random starting point, so results may not be repeatable and can require multiple runs to find an optimal solution

Vocabulary

k — k is a user-defined value referring to the number of clusters the algorithm should find.

Observation — An observation is a single example, a data point or row in the data.

Cluster — A group of similar things that are close together.

Gaussian Mixture Model

With a Gaussian Mixture Model (GMM), we assume that the k clusters are normally distributed. The algorithm tries to find the mean and standard deviation of each of these clusters. For a new observation, the probability that it belongs to each cluster is calculated, resulting in a soft assignment.
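A minimal sketch of the soft assignment described above, again assuming scikit-learn. The key difference from K-Means is `predict_proba`, which returns a probability of membership in each cluster rather than a single hard label:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two clusters drawn from normal distributions with different means
a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([a, b])

# n_components is the user-defined number of clusters, k
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per cluster, summing to 1
probs = gmm.predict_proba([[0.1, -0.2]])
```

Because this point sits near the first cluster's mean, its probability for that cluster will be close to 1, but a point midway between the two clusters would receive a more evenly split assignment.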

Pros

  • Does not enforce the clusters to be circular
  • Points can be assigned to multiple clusters

Cons

  • You must define the number of clusters
  • Difficult to interpret

Vocabulary

k — k is a user-defined value referring to the number of clusters the algorithm should find.

Normal distribution — The normal, or Gaussian, distribution is a common probability distribution informally referred to as a bell curve due to its shape.

Mean — The mean is the average, calculated by summing all values and dividing by the count of values.

Standard deviation — Standard deviation is a measure of variance in the data, i.e. how spread out the data is.

Observation — An observation is a single example, a data point or row in the data.

Probability — Probability measures to what extent something is likely to happen or to be a particular case.

Cluster — A group of similar things that are close together.

Soft assignment — Rather than being assigned to one cluster, each point is assigned to all clusters with different probabilities

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) attempts to find dense areas of data points and identifies these as a cluster. If data points are close enough to each other, and there are a sufficient number of them, they form a cluster. If not, they are labeled as noise and ignored.
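The behaviour described above can be sketched as follows, assuming scikit-learn. The two density parameters the cons list refers to are `eps` (the neighborhood radius) and `min_samples` (how many points make a region "dense"); points that belong to no dense region come back labeled `-1`:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense area 1
                   [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],   # dense area 2
                   [9.0, 0.0]])                                       # isolated outlier

# eps: how close points must be to count as neighbors
# min_samples: how many neighbors are needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# The outlier is labeled -1 (noise) and belongs to no cluster
```

Notice that the number of clusters was never specified; DBSCAN discovered the two dense areas on its own, which is also why picking `eps` and `min_samples` well matters so much.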

Pros

  • Can find arbitrarily shaped clusters
  • Does not require defining the number of clusters
  • Robust to outliers

Cons

  • Cannot cluster datasets with large differences in densities
  • Can perform poorly on high dimensional data
  • Choosing the right parameters for density can be difficult

Vocabulary

Dense — Closely compact areas of observations.

Cluster — A group of similar things that are close together.

High dimensional — High dimensional data has a very large number of features. If your data is represented in a CSV file, database, or Excel file, and there are many columns you will use to build a model, it is high dimensional.

Parameters — A parameter is a value that the user defines and that is used by the algorithm; these values are important as they impact the success of the model.

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is an algorithm that builds a hierarchy of clusters. Initially, every point starts as its own cluster; the two nearest clusters then merge into a single cluster. This process continues, clusters merging, until only one cluster containing all the data points remains. To identify the significant clusters, a threshold is chosen.
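A minimal sketch of this, assuming scikit-learn: setting `n_clusters=None` together with a `distance_threshold` tells the algorithm to keep merging until the remaining clusters are farther apart than the threshold, rather than stopping at a user-defined number of clusters:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],   # tight group 1
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.0]])  # tight group 2

# Every point starts as its own cluster; the nearest pair merges
# repeatedly until all remaining clusters are farther apart than
# distance_threshold (the "threshold" used to cut the hierarchy)
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=2.0).fit(points)
```

After fitting, `model.n_clusters_` reports how many clusters survived the threshold, and `model.labels_` gives each observation's assignment. Because merges are slow on large datasets, this approach suits smaller data, as the cons below note.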

Pros

  • The resulting hierarchy (dendrogram) is easy to visualize and interpret
  • Does not require defining the number of clusters
  • Clusters can be arbitrarily shaped

Cons

  • Slow and therefore not suitable for big data
  • Can be hard to identify the correct number of clusters
  • Interpretation can be confusing

Vocabulary

Hierarchy of clusters — A tree-based representation called a dendrogram, where each leaf is an observation and branches join them together. The similarity of observations can be inferred from the height of the branch connecting them.

Cluster — A group of similar things that are close together.

Threshold — A boundary to use to make a decision. For clustering, this threshold might be used to stop the algorithm after a certain number of merges have occurred or when a certain number of clusters are identified.

Further Reading

Other posts in this series:

Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding an artistic flair to this work!
