Machine Learning: Trying to classify your data

Stacey Ronaghan
7 min read · Jul 31, 2018


This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post, “Machine Learning: Where to begin…”, can be found here, and Algorithm Explorer here.

If you’re looking to use machine learning to solve a business problem requiring you to predict a categorical outcome, you should look to Classification Techniques.

Classification Techniques

Classification algorithms are machine learning techniques for predicting which category the input data belongs to. They are supervised learning tasks, which means they require labelled training examples.

Use-Cases

  • Predicting a clinical diagnosis based upon symptoms, laboratory results, and historical diagnosis
  • Predicting whether a healthcare claim is fraudulent using data such as claim amount, drug predisposition, disease and provider

Most Common Classification Algorithms

Below are introductions to the most common algorithms for predicting a categorical outcome: Support Vector Machines, Naive Bayes, Logistic Regression, Decision Trees, and Neural Networks.

Support Vector Machines

If you plot your data in an n-dimensional space (where n is the number of features), Support Vector Machines (SVM) attempt to fit a hyperplane that best separates the categories. When you have a new data point, its position in relation to the hyperplane will predict which category the point belongs to.
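
A minimal sketch of an SVM classifier in scikit-learn, assuming the wine dataset used in the notebook linked below (the scaling step and RBF kernel are illustrative choices, not prescriptions from this post):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling the features matters for SVMs; the RBF kernel lets the model
# separate categories that are not linearly separable in the original space.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```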

Pros

  • High accuracy
  • Able to find solutions even when the classes are not linearly separable
  • Good for high-dimensional space (lots of features)

Cons

  • Hard to interpret
  • Can be slow to train large data sets
  • Memory-intensive

Vocabulary

n-dimensional space — A 1-dimensional (1D) space is simply a line, and a 2-dimensional (2D) space is the Cartesian plane, where you can move up or down and left or right. To generalize to any number of dimensions, the term n-dimensional space is used.

hyperplane — A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.

categories — The terms categories and classes can be used interchangeably.

Example Python Notebook

Predicting Wine Types with SVM

Naive Bayes

Naive Bayes assumes that all features are independent: each feature independently contributes to the probability of the target variable’s class. This does not always hold true, which is why the method is referred to as “naive”. Probabilities and likelihoods are calculated from the frequencies with which values appear in the data, and the final class probabilities are calculated using a formula called Bayes’ Theorem.
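
A minimal sketch using scikit-learn’s GaussianNB on the wine dataset from the linked notebook (Gaussian Naive Bayes is one of several Naive Bayes variants; the choice here is illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# GaussianNB treats every feature as independent, models each one with a
# per-class normal distribution, then combines them via Bayes' Theorem.
model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict_proba(X_test[:1]))  # per-class probabilities for one observation
```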

Pros

  • Simple and easy to interpret
  • Computationally fast
  • Good for high-dimensional space (lots of features)

Cons

  • Performance will be inhibited if there is significant dependence between variables
  • If a feature value appears in the test data that did not appear in the training data, the model will assign it a probability of zero

Vocabulary

Independent — Two features are independent if the value of one does not affect the value of the other. Two events are independent if the probability of one occurring does not affect the probability of the other occurring.

Probability — Probability is the extent to which something is likely to happen or to be a particular case.

Target variable — This is what we are trying to predict, e.g. whether an action is fraudulent or not, or the price of a product.

Likelihood — The probability of observing some evidence given that an event occurred is referred to as the likelihood of that event given the evidence.

Bayes’ Theorem — A mathematical formula for determining conditional probability: P(A|B) = P(B|A) × P(A) / P(B).
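
To make the theorem concrete, here is a small worked example tied to the fraud use-case above; all of the numbers are hypothetical, chosen only for illustration:

```python
# Hypothetical rates: 1% of claims are fraudulent, and a screening rule
# flags 90% of fraudulent claims but also 5% of legitimate ones.
p_fraud = 0.01
p_flag_given_fraud = 0.90
p_flag_given_legit = 0.05

# Total probability that a claim is flagged
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

# Bayes' Theorem: P(fraud | flag) = P(flag | fraud) * P(fraud) / P(flag)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(round(p_fraud_given_flag, 3))  # ~0.154: most flagged claims are still legitimate
```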

Example Python Notebook

Predicting Wine Types with Naive Bayes

Logistic Regression

Logistic regression predicts the probability of a binary outcome. A new observation is predicted to belong to the class if its probability is above a set threshold. There are also methods, such as one-vs-rest, for applying logistic regression when there are more than two classes.
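
A minimal sketch in scikit-learn, reducing the wine dataset to a binary outcome so the probability-and-threshold idea is visible (the binarisation and the 0.5 threshold are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
y_binary = (y == 0).astype(int)  # class 0 vs. everything else, for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, random_state=42)

model = LogisticRegression(max_iter=5000)  # L2-regularized by default
model.fit(X_train, y_train)

# Predicted probabilities, turned into class labels with a 0.5 threshold
probs = model.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
print(np.mean(preds == y_test))  # accuracy
```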

Pros

  • Quick to compute and can be updated easily with new data
  • Output can be interpreted as probability; this can be used for ranking
  • Regularization techniques can be used to prevent overfitting

Cons

  • Unable to learn complex relationships
  • Difficult to capture non-linear relationships (without first transforming data which can be complicated)

Vocabulary

Probability — Probability is the extent to which something is likely to happen or to be a particular case.

Binary outcome — A binary outcome means the variable will be one of two possible values, a 1 or a 0. A 1 indicates that the observation is in the class and a 0 would indicate it isn’t.

Observation — An observation is a single example, a data point or row in the data.

Threshold — A boundary used to make a decision. For classification, if the probability of an observation being within a class is greater than the threshold, it will be classified as that class.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t correspond with a constant change in the second. The two variables may still affect each other, just not in a way that can be described by a straight line.

Example Python Notebook

Predicting Wine Types with Logistic Regression

Decision Trees

Decision trees learn how to best split the dataset into separate branches, allowing them to learn non-linear relationships.

Random Forests (RF) and Gradient Boosted Trees (GBT) are two algorithms that build many individual trees, pooling their predictions. As they use a collection of results to make a final decision, they are referred to as “Ensemble techniques”.
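
A minimal sketch comparing a single tree with a random forest on the wine dataset (the hyperparameters are scikit-learn defaults and purely illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single tree is fast to train but prone to overfitting
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A random forest builds many trees and pools their votes
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(tree.score(X_test, y_test), forest.score(X_test, y_test))
```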

Pros

  • A single decision tree is fast to train
  • Robust to noise and missing values
  • RF performs very well “out-of-the-box”

Cons

  • Single decision trees are prone to overfitting (which is where ensembles come in!)
  • Complex trees are hard to interpret

Vocabulary

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t correspond with a constant change in the second. The two variables may still affect each other, just not in a way that can be described by a straight line.

Pooling — A way to combine the individual trees’ predictions, usually by taking a majority vote (for classification) or the mean (for regression).

Noise — Noise refers to data points that are incorrect, which may result in the discovery of patterns that are untrue. Noisy points are often identified as outliers, meaning they are very different from the rest of the data set. However, be cautious, as some outliers may be valid data points and worth investigating.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Example Python Notebook

Predicting Wine Types with Decision Trees & Random Forest

Neural Networks

Neural networks can learn complex patterns using layers of neurons which mathematically transform the data. The layers between the input and output are referred to as “hidden layers”. A neural network can learn relationships between the features that other algorithms cannot easily discover.
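
A minimal sketch using scikit-learn’s MLPClassifier on the wine dataset (the layer sizes and other settings are illustrative; larger problems typically use dedicated deep-learning libraries):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two hidden layers of neurons transform the data; scaling helps training converge
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=42),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```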

Pros

  • Extremely powerful/state-of-the-art for many domains (e.g. computer vision, speech recognition)
  • Can learn even very complex relationships
  • Hidden layers reduce need for feature engineering (less need to understand underlying data)

Cons

  • Require a very large amount of data!
  • Prone to overfitting
  • Long training time
  • Requires significant computing power for large datasets (computationally expensive)
  • Model is a “black box”, unexplainable

Vocabulary

Neurons — An artificial neuron is a mathematical function. It takes one or more inputs that are multiplied by values called ‘weights’ and added together. This value is then passed to a non-linear function, referred to as an ‘activation function’, which becomes the output.
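
A single neuron can be written in a few lines; the weights, bias, and ReLU activation below are made-up values, purely for illustration:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a non-linear activation."""
    z = np.dot(inputs, weights) + bias  # multiply inputs by weights and add them up
    return max(0.0, z)                  # ReLU, a common activation function

print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.4, 0.1, 0.2]), bias=0.05))  # ≈ 0.73
```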

Input — The features are passed as inputs, e.g. size, brand, location, etc.

Output — This is the target variable, the thing we are trying to predict, e.g. the price of an item.

Hidden layers — These are groups of neurons which mathematically transform the data. They are referred to as ‘hidden’ because the user is only concerned with the input layer, where the features are passed in, and the output layer, where the prediction is made.

Feature engineering — Feature engineering is the process of transforming the raw data into something more meaningful; this usually involves working with someone who has domain expertise.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Model — Machine learning algorithms create a model during training: a mathematical function that can then take a new observation and calculate an appropriate prediction.

Example Python Notebook

Predicting Wine Types with Neural Networks

Further Reading

Other posts in this series:

Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding artistic flair to this work!

