# Machine Learning: Trying to discover structure in your data

This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post, “Machine Learning: Where to begin…”, can be found here, and Algorithm Explorer here.

If you are looking to use machine learning to solve a business problem requiring you to understand structure in your data, but you don’t have a target value you can use for prediction, you should look to Clustering Techniques.

# Clustering Techniques

Clustering algorithms are machine learning techniques to divide data into a number of groups where points in the groups have similar traits. They are unsupervised learning tasks and therefore do not require labelled training examples.

## Use-Cases

- Segmenting your market based upon similar collections of customers using their location, spending habits and demographics
- Understanding topics in your documents, whether they are emails, reports, or customer call transcripts, by exploring the common words

## Most Common Clustering Algorithms

Below are introductions on the most common algorithms for discovering structure in your data: *K-Means, GMM, DBSCAN*, and *Agglomerative Hierarchical Clustering.*

# K-Means

K-Means Clustering aims to partition the observations into **k** clusters. The algorithm determines which *observation* is in which *cluster*, and also the center of each cluster. A new observation is assigned to the cluster whose center it is nearest.

## Pros

- Simple and easy to implement
- Easy to interpret results
- Fast

## Cons

- Sensitive to outliers
- You must define the number of clusters
- Assumes the clusters are spherical
- The clusters are found from a random starting point, so results may not be repeatable and can require multiple runs to find an optimal solution

## Vocabulary

**k** — k is a user-defined value referring to the number of clusters the algorithm should find.

**Observation** — An observation is a single example, a data point or row in the data.

**Cluster** — A group of similar things that are close together.
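To make this concrete, here is a minimal K-Means sketch using scikit-learn (assuming scikit-learn is installed; the data is an illustrative toy example, not from the post):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of points, far apart.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# k (n_clusters) must be chosen by the user; n_init reruns the algorithm
# from several random starting points to mitigate non-repeatability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the center of each cluster

# A new observation is assigned to the cluster whose center is nearest.
print(kmeans.predict([[1.1, 0.9]]))
```

Note `n_init` and `random_state`: because of the random starting point mentioned in the cons, fixing these makes the result repeatable.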

# Gaussian Mixture Model

With a Gaussian Mixture Model (GMM), we assume that the **k** clusters are *normally distributed*. The algorithm tries to find the *mean* and *standard deviation* of each of these clusters. For each new *observation*, the *probability* that it belongs to each *cluster* is calculated, resulting in a *soft assignment*.

## Pros

- Does not enforce the clusters to be circular
- Points can be assigned to multiple clusters

## Cons

- You must define the number of clusters
- Difficult to interpret

## Vocabulary

**k** — k is a user-defined value referring to the number of clusters the algorithm should find.

**Normal distribution** — The normal, or Gaussian, distribution is a common probability distribution informally referred to as a bell curve due to its shape.

**Mean** — The mean is the average, calculated by summing all values and dividing by the number of values.

**Standard deviation** — Standard deviation is a measure of variance in the data: how spread out the data is.

**Observation** — An observation is a single example, a data point or row in the data.

**Probability** — Probability is the extent to which something is likely to happen or be a particular case.

**Cluster** — A group of similar things that are close together.

**Soft assignment** — Rather than being assigned to one cluster, each point is assigned to all clusters with different probabilities.
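The soft assignment is the key difference from K-Means, and scikit-learn exposes it directly. A minimal sketch (assuming scikit-learn is installed; toy data for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One-dimensional data drawn around two separated centers.
X = np.array([[1.0], [1.1], [0.9], [10.0], [10.2], [9.8]])

# n_components is k: the number of Gaussians (clusters) to fit.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)        # estimated mean of each Gaussian
print(gmm.covariances_)  # estimated variance of each Gaussian

# Soft assignment: a probability per cluster, summing to 1.
print(gmm.predict_proba([[1.05]]))
```

`predict_proba` returns the probability that the new observation belongs to each cluster, rather than a single hard label.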

# DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) attempts to find **dense** areas of data points and identifies these as a *cluster*. If data points are close enough to each other, and there are a sufficient number of them, they form a cluster. If not, they are labeled as noise and ignored.

## Pros

- Can find arbitrarily shaped clusters
- Does not require defining the number of clusters
- Robust to outliers

## Cons

- Cannot cluster datasets with large differences in densities
- Can perform poorly on *high dimensional* data
- Choosing the right *parameters* for density can be difficult

## Vocabulary

**Dense** — Closely compact areas of observations.

**Cluster** — A group of similar things that are close together.

**High dimensional** — High dimensional data has a very large number of features. If your data is represented in a CSV, database, or Excel file and there are many columns you will use to build the model, it is high dimensional.

**Parameters** — A parameter is a value that the user defines and the algorithm uses; these values are important as they impact the success of the model.
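A minimal DBSCAN sketch (assuming scikit-learn is installed; toy data for illustration). Here `eps` and `min_samples` are the density parameters the cons warn about, and points labeled `-1` are the noise the algorithm ignores:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense area 1
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1], [8.0, 7.9],   # dense area 2
              [50.0, 50.0]])                                     # isolated outlier

# eps: maximum distance for two points to count as neighbors;
# min_samples: how many neighbors are needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

print(db.labels_)  # cluster index per point; -1 marks noise
```

Notice that the number of clusters is never specified: DBSCAN discovers it from the density parameters, and the isolated outlier is simply labeled as noise rather than distorting a cluster.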

# Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is an algorithm that builds a **hierarchy of clusters**. Initially, every point starts as its own *cluster*; then the two nearest clusters are merged into one. This process continues, with clusters merging, until only one cluster containing all the data points remains. To identify the significant clusters, a *threshold* is chosen.

## Pros

- The hierarchy provides an interpretable view of how clusters relate
- Does not require defining the number of clusters
- Clusters can be arbitrarily shaped

## Cons

- Slow and therefore not suitable for big data
- Can be hard to identify the correct number of clusters
- Interpretation can be confusing

## Vocabulary

**Hierarchy of clusters** — A tree-based representation called a dendrogram, where each leaf is an observation and observations are combined by branches. The similarity of observations can be inferred from the height of the branch connecting them.

**Cluster** — A group of similar things that are close together.

**Threshold** — A boundary used to make a decision. For clustering, this threshold might be used to stop the algorithm after a certain number of merges have occurred or when a certain number of clusters are identified.
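A minimal agglomerative clustering sketch (assuming scikit-learn is installed; toy data for illustration). Here `distance_threshold` plays the role of the threshold described above: merging stops once the remaining clusters are further apart than this value, so the number of clusters does not need to be fixed in advance:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [9.0, 9.0], [9.1, 8.9], [8.9, 9.1]])

# n_clusters=None lets the distance threshold decide where to cut
# the hierarchy instead of a user-specified cluster count.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0).fit(X)

print(agg.n_clusters_)  # number of clusters found below the threshold
print(agg.labels_)
```

With the threshold at 2.0, the two tight blobs merge internally but never with each other, so two clusters remain.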

# Further Reading

Other posts in this series:

- Machine Learning: Where to begin…
- Machine Learning: Trying to predict a numerical value
- Machine Learning: Trying to classify your data
- Machine Learning: Trying to make recommendations
- Machine Learning: Trying to detect outliers or unusual behavior

# Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired so thank you to Mary Kim for adding an artistic flare to this work!