Machine Learning: Trying to classify your data

Stacey Ronaghan
7 min read · Jul 31, 2018


This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post, “Machine Learning: Where to begin…”, can be found here, and Algorithm Explorer here.

If you’re looking to use machine learning to solve a business problem requiring you to predict a categorical outcome, you should look to Classification Techniques.

Classification Techniques

Classification algorithms are machine learning techniques for predicting which category the input data belongs to. They are supervised learning tasks, which means they require labelled training examples.

Use-Cases

  • Predicting a clinical diagnosis based upon symptoms, laboratory results, and historical diagnosis
  • Predicting whether a healthcare claim is fraudulent using data such as claim amount, drug predisposition, disease and provider

Most Common Classification Algorithms

Below are introductions to the most common algorithms for predicting a categorical outcome: Support Vector Machines, Naive Bayes, Logistic Regression, Decision Trees, and Neural Networks.

Support Vector Machines

If you plot your data in an n-dimensional space (where n is the number of features), Support Vector Machines (SVM) attempt to fit a hyperplane that best separates the categories. When you have a new data point, its position in relation to the hyperplane will predict which category the point belongs to.
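
A minimal sketch of an SVM classifier in scikit-learn, assuming the wine dataset used in the notebook linked below (the scaling step and RBF kernel are illustrative choices, not prescriptions from this post):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling the features matters for SVMs; the RBF kernel lets the model
# separate categories that are not linearly separable in the original space.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```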

Pros

  • High accuracy
  • Able to find solutions even when the classes are not linearly separable
  • Good for high-dimensional space (lots of features)

Cons

  • Hard to interpret
  • Can be slow to train large data sets
  • Memory-intensive

Vocabulary

n-dimensional space — A 1-dimensional (1D) space is simply a line, and a 2-dimensional (2D) space is the Cartesian plane, where you can move up or down and left or right. To generalize to any number of dimensions, the term n-dimensional space is used.

hyperplane — A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.

categories — The terms categories and classes can be used interchangeably.

Example Python Notebook

Predicting Wine Types with SVM

Naive Bayes

Naive Bayes assumes that all features are independent: each feature independently contributes to the probability of the target variable’s class. This does not always hold true, which is why the method is referred to as “naive”. Probabilities and likelihoods are calculated from the frequencies with which values appear in the data, and the final class probabilities are calculated using a formula called Bayes’ Theorem.
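
A minimal sketch using scikit-learn’s GaussianNB on the wine dataset from the linked notebook (Gaussian Naive Bayes is one of several Naive Bayes variants; the choice here is illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# GaussianNB treats every feature as independent, models each one with a
# per-class normal distribution, then combines them via Bayes' Theorem.
model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict_proba(X_test[:1]))  # per-class probabilities for one observation
```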

Pros

  • Simple and easy to interpret
  • Computationally fast
  • Good for high-dimensional space (lots of features)

Cons

  • Performance will be inhibited if there is significant dependence between variables
  • If a feature value appears in the test data that did not appear in the training data, the model will assign it a probability of zero

Vocabulary

Independent — Two features are independent if the value of one does not affect the value of the other. Two events are independent if the probability of one occurring does not affect the probability of the other occurring.

Probability — Probability is the extent to which something is likely to happen or to be a particular case.

Target variable — This is what we are trying to predict, e.g. whether an action is fraudulent or not, or the price of a product.

Likelihood — The probability of observing some evidence given that an event occurred is referred to as the likelihood of that event given the evidence.

Bayes’ Theorem — A mathematical formula for determining conditional probability: P(A|B) = P(B|A) × P(A) / P(B).
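
To make the theorem concrete, here is a small worked example tied to the fraud use-case above; all of the numbers are hypothetical, chosen only for illustration:

```python
# Hypothetical rates: 1% of claims are fraudulent, and a screening rule
# flags 90% of fraudulent claims but also 5% of legitimate ones.
p_fraud = 0.01
p_flag_given_fraud = 0.90
p_flag_given_legit = 0.05

# Total probability that a claim is flagged
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

# Bayes' Theorem: P(fraud | flag) = P(flag | fraud) * P(fraud) / P(flag)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(round(p_fraud_given_flag, 3))  # ~0.154: most flagged claims are still legitimate
```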

Example Python Notebook

Predicting Wine Types with Naive Bayes

Logistic Regression

Logistic regression predicts the probability of a binary outcome. A new observation is predicted to belong to the class if its probability is above a set threshold. There are also methods, such as one-vs-rest, for applying logistic regression when there are more than two classes.
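
A minimal sketch in scikit-learn, reducing the wine dataset to a binary outcome so the probability-and-threshold idea is visible (the binarisation and the 0.5 threshold are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
y_binary = (y == 0).astype(int)  # class 0 vs. everything else, for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, random_state=42)

model = LogisticRegression(max_iter=5000)  # L2-regularized by default
model.fit(X_train, y_train)

# Predicted probabilities, turned into class labels with a 0.5 threshold
probs = model.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
print(np.mean(preds == y_test))  # accuracy
```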

Pros

  • Quick to compute and can be updated easily with new data
  • Output can be interpreted as probability; this can be used for ranking
  • Regularization techniques can be used to prevent overfitting

Cons

  • Unable to learn complex relationships
  • Difficult to capture non-linear relationships (without first transforming data which can be complicated)

Vocabulary

Probability — Probability is the extent to which something is likely to happen or to be a particular case.

Binary outcome — A binary outcome means the variable will be one of two possible values, a 1 or a 0. A 1 indicates that the observation is in the class and a 0 would indicate it isn’t.

Observation — An observation is a single example, a data point or row in the data.

Threshold — A boundary used to make a decision. For classification, if the probability of an observation being within a class is greater than the threshold, it will be classified as that class.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t correspond with a constant change in the second. The two variables may still affect each other, just not in a way that can be described by a straight line.

Example Python Notebook

Predicting Wine Types with Logistic Regression

Decision Trees

Decision trees learn how to best split the dataset into separate branches, allowing them to learn non-linear relationships.

Random Forests (RF) and Gradient Boosted Trees (GBT) are two algorithms that build many individual trees, pooling their predictions. As they use a collection of results to make a final decision, they are referred to as “Ensemble techniques”.
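
A minimal sketch comparing a single tree with a random forest on the wine dataset (the hyperparameters are scikit-learn defaults and purely illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single tree is fast to train but prone to overfitting
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A random forest builds many trees and pools their votes
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(tree.score(X_test, y_test), forest.score(X_test, y_test))
```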

Pros

  • A single decision tree is fast to train
  • Robust to noise and missing values
  • RF performs very well “out-of-the-box”

Cons

  • Single decision trees are prone to overfitting (which is where ensembles come in!)
  • Complex trees are hard to interpret

Vocabulary

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t correspond with a constant change in the second. The two variables may still affect each other, just not in a way that can be described by a straight line.

Pooling — A way to combine the individual trees’ predictions, usually by taking a majority vote (for classification) or the mean (for regression).

Noise — Noise refers to data points that are incorrect, which may result in the discovery of patterns that are untrue. Noisy points are often identified as outliers, meaning they are very different from the rest of the data set. However, be cautious, as some outliers may be valid data points and worth investigating.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Example Python Notebook

Predicting Wine Types with Decision Trees & Random Forest

Neural Networks

Neural networks can learn complex patterns using layers of neurons which mathematically transform the data. The layers between the input and output are referred to as “hidden layers”. A neural network can learn relationships between the features that other algorithms cannot easily discover.
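
A minimal sketch using scikit-learn’s MLPClassifier on the wine dataset (the layer sizes and other settings are illustrative; larger problems typically use dedicated deep-learning libraries):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two hidden layers of neurons transform the data; scaling helps training converge
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=42),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```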

Pros

  • Extremely powerful/state-of-the-art for many domains (e.g. computer vision, speech recognition)
  • Can learn even very complex relationships
  • Hidden layers reduce need for feature engineering (less need to understand underlying data)

Cons

  • Require a very large amount of data!
  • Prone to overfitting
  • Long training time
  • Requires significant computing power for large datasets (computationally expensive)
  • Model is a “black box”, unexplainable

Vocabulary

Neurons — An artificial neuron is a mathematical function. It takes one or more inputs that are multiplied by values called ‘weights’ and added together. This value is then passed to a non-linear function, referred to as an ‘activation function’, which becomes the output.
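
A single neuron can be written in a few lines; the weights, bias, and ReLU activation below are made-up values, purely for illustration:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a non-linear activation."""
    z = np.dot(inputs, weights) + bias  # multiply inputs by weights and add them up
    return max(0.0, z)                  # ReLU, a common activation function

print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.4, 0.1, 0.2]), bias=0.05))  # ≈ 0.73
```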

Input — The features are passed as inputs, e.g. size, brand, location, etc.

Output — This is the target variable, the thing we are trying to predict, e.g. the price of an item.

Hidden layers — These are groups of neurons which mathematically transform the data. They are referred to as ‘hidden’ because the user is only concerned with the input layer, where the features are passed in, and the output layer, where the prediction is made.

Feature engineering — Feature engineering is the process of transforming the raw data into something more meaningful; this usually involves working with someone who has domain expertise.

Overfitting — An overfit model will have very high accuracy on the training data, having discovered patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Model — Machine learning algorithms create a model during training: a mathematical function that can then take a new observation and calculate an appropriate prediction.

Example Python Notebook

Predicting Wine Types with Neural Networks

Further Reading

Other posts in this series:

Many Thanks

I wish to thank Sam Rose for his great front end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding artistic flair to this work!

