Machine Learning: Trying to predict a numerical value

Stacey Ronaghan
5 min read · Jul 31, 2018


This post is part of a series introducing Algorithm Explorer: a framework for exploring which data science methods relate to your business needs.

The introductory post “Machine Learning: Where to begin…” can be found here, and Algorithm Explorer here.

If you’re looking to use machine learning to solve a business problem requiring you to predict a numerical value, you should look to Regression Techniques.

Regression Techniques

Regression algorithms are machine learning techniques for predicting continuous numerical values. They are supervised learning methods, which means they require labelled training examples.

Use-Cases

  • Predicting the appropriate price for a product based upon size, brand, and location
  • Predicting the number of sales each day based upon store location, public holidays, day of the week, and the closest competitor

Most Common Regression Algorithms

Below are introductions to the most common algorithms for predicting a numerical value: Linear Regression, Decision Trees, Neural Networks, and K-Nearest Neighbors.

Linear Regression

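Linear regression attempts to fit a hyperplane to your dataset that lies as close as possible to all the data points, typically by minimizing the sum of squared differences between the predictions and the observed values. It is most suitable when the relationships between the variables in the dataset are linear.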
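
As a rough illustration of the idea, the sketch below fits a straight line to a single feature using ordinary least squares in plain Python. The data and the names `fit_line` and `predict` are made up for this example; with more than one feature the line becomes a hyperplane and a library would normally do the fitting.

```python
# Minimal sketch: fit y = slope * x + intercept by ordinary least squares.
# The data below is invented purely for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target for a new value of the feature."""
    return slope * x + intercept


sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [12.0, 14.0, 16.0, 18.0, 20.0]  # generated as price = 2 * size + 10

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)
print(predict(6.0, slope, intercept))
```

Because the made-up prices follow a perfectly linear rule, the fitted line recovers that rule; real data would leave some error around the line.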

Pros

  • Quick to compute and can be updated easily with new data
  • Relatively easy to understand and explain
  • Regularization techniques can be used to prevent overfitting
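
One common regularization technique is ridge regression, which adds a penalty on the size of the coefficients. The sketch below is a simplified one-feature version, assuming the intercept is left unpenalized; `fit_ridge_slope` and the data are hypothetical and only meant to show how the penalty shrinks the slope.

```python
# Minimal sketch of ridge (L2) regularization for a single feature.
# Assumption: the intercept is not penalized, so the slope has the
# closed form sxy / (sxx + penalty). A penalty of 0 is ordinary least squares.

def fit_ridge_slope(xs, ys, penalty):
    """Return the ridge-regularized slope for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + penalty)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # invented, roughly y = 2 * x

ols_slope = fit_ridge_slope(xs, ys, penalty=0.0)
ridge_slope = fit_ridge_slope(xs, ys, penalty=10.0)
print(ols_slope, ridge_slope)
```

A larger penalty pulls the slope closer to zero, which makes the model less sensitive to quirks of the training data at the cost of a slightly worse fit to it.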

Cons

  • Unable to learn complex relationships
  • Difficult to capture non-linear relationships (without first transforming the data, which can be complicated)

Vocabulary

hyperplane — A hyperplane in a 1-dimensional (1D) space is a point. In a 2-dimensional (2D) space, it is a line. A hyperplane in 3-dimensional (3D) space is a plane, a flat surface. To generalize for any dimension, the concept is referred to as a hyperplane.

linear relationships — A relationship is linear if a given change in the first variable always corresponds to the same, constant change in the second variable.

Overfitting — An overfit model will have very high accuracy on the training data, having learned patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.
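
To make overfitting concrete, the sketch below compares two models on invented data: a straight line, and a deliberately extreme "memorizer" that simply returns the target of the closest training point. Both the data and the helper names are assumptions for illustration; the memorizer stands in for any model that fits the training data too closely.

```python
# Minimal sketch: a model that memorizes training data versus a simple line.
# All data is made up. The underlying trend is y = 2 * x; the training
# targets carry noise of +1 / -1, the test targets follow the trend exactly.

def mse(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def memorized_prediction(x, train_x, train_y):
    """Return the target of the closest training point (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [3.0, 3.0, 7.0, 7.0, 11.0, 11.0]
test_x = [1.4, 2.4, 3.4, 4.4, 5.4]
test_y = [2.8, 4.8, 6.8, 8.8, 10.8]

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
memo_train = mse([memorized_prediction(x, train_x, train_y) for x in train_x], train_y)
memo_test = mse([memorized_prediction(x, train_x, train_y) for x in test_x], test_y)

print("line:      train", round(line_train, 3), "test", round(line_test, 3))
print("memorizer: train", round(memo_train, 3), "test", round(memo_test, 3))
```

On this data the memorizer has no error at all on the training set but a larger error than the line on the test set, which is the pattern described above.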

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t necessarily correspond to a constant change in the second. The variables may still affect each other, but not at a constant rate.

Example Python Notebook

Predicting Yacht Resistance with Linear Regression

Decision Trees

Decision trees learn how to best split the dataset into separate branches, allowing them to learn non-linear relationships.
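
The sketch below shows one way a single split might be chosen for a regression tree with one feature: try each threshold and keep the one that leaves the smallest squared error when each side predicts its own mean. The function name `best_split` and the data are hypothetical, and real tree implementations repeat this recursively over many features.

```python
# Minimal sketch: choose the best single split (a "decision stump") on one
# feature by minimizing the summed squared error of the two branches.

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    def squared_error(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values to split")

    best = None
    # Candidate thresholds sit halfway between neighbouring feature values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


# Invented data with a jump that no straight line can follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```

The split lands at the jump in the data, so each branch can predict a different value. This is how trees capture relationships that are not linear.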

Random Forests (RF) and Gradient Boosted Trees (GBT) are two algorithms that build many individual trees, pooling their predictions. As they use a collection of results to make a final decision, they are referred to as “Ensemble techniques”.
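
The pooling idea can be sketched with a simplified ensemble: several one-split trees, each trained on a random resample of the data, with their predictions averaged. This is plain bagging rather than a full Random Forest or Gradient Boosted Trees implementation, and every name and value below is an assumption for illustration.

```python
# Minimal sketch of an ensemble: bagged one-split trees whose predictions
# are pooled by taking the mean. Not a full Random Forest or GBT.
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns a prediction function."""
    overall_mean = sum(ys) / len(ys)
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every sampled x was identical, so fall back to predicting the mean.
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees, seed=0):
    """Train n_trees stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def pooled_prediction(trees, x):
    """Pool the individual predictions by taking their mean."""
    predictions = [tree(x) for tree in trees]
    return sum(predictions) / len(predictions)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]  # invented data

trees = fit_bagged_stumps(xs, ys, n_trees=25)
print(pooled_prediction(trees, 1.0), pooled_prediction(trees, 6.0))
```

Each tree sees slightly different data and so makes slightly different mistakes; averaging them tends to smooth those mistakes out.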

Pros

  • A single decision tree is fast to train
  • Robust to noise and missing values
  • RF performs very well “out-of-the-box”

Cons

  • Single decision trees are prone to overfitting (which is where ensembles come in!)
  • Complex trees are hard to interpret

Vocabulary

non-linear relationships — A non-linear relationship means that a change in the first variable doesn’t necessarily correspond to a constant change in the second. The variables may still affect each other, but not at a constant rate.

Pooling — This is a way to combine data and is usually done by taking the mean (average).

Noise — Noise refers to data points that are incorrect, which may result in the model discovering patterns that are untrue. Noisy points are often identified as outliers, meaning they are very different from the rest of the data set. However, be cautious, as some outliers may be valid data points that are worth investigating.

Overfitting — An overfit model will have very high accuracy on the training data, having learned patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Example Python Notebook

Predicting Yacht Resistance with Decision Trees & Random Forest

Neural Networks

Neural networks can learn complex patterns using layers of neurons which mathematically transform the data. The layers between the input and output are referred to as “hidden layers”. A neural network can learn relationships between the features that other algorithms cannot easily discover.

Pros

  • Extremely powerful/state-of-the-art for many domains (e.g. computer vision, speech recognition)
  • Can learn even very complex relationships
  • Hidden layers reduce need for feature engineering (less need to understand underlying data)

Cons

  • Requires a very large amount of data
  • Prone to overfitting
  • Long training time
  • Requires significant computing power for large datasets (computationally expensive)
  • Model is a “black box” that is difficult to explain

Vocabulary

Neurons — An artificial neuron is a mathematical function. It takes one or more inputs, multiplies each by a value called a ‘weight’, and adds the results together. This sum is then passed to a non-linear function, referred to as an ‘activation function’, whose result becomes the neuron’s output.
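
A single neuron as described here can be sketched in a few lines. The `bias` term and the choice of `tanh` as the activation function are assumptions added for this example (both are common, but neither is specified above), and the weights are picked by hand rather than learned.

```python
# Minimal sketch of one artificial neuron: weighted sum, then activation.
import math


def neuron(inputs, weights, bias=0.0, activation=math.tanh):
    """Multiply inputs by weights, add them up, and apply the activation."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Hand-picked values for illustration only.
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25])
print(output)

stronger = neuron(inputs=[1.0, 2.0], weights=[0.5, 0.25])
print(stronger)
```

Training a neural network amounts to adjusting the weights (and biases) of many such neurons so that the final output gets closer to the target.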

Input — The features are passed as inputs, e.g. size, brand, location, etc.

Output — This is the target variable, the thing we are trying to predict, e.g. the price of an item.

Hidden layers — These are layers of neurons that mathematically transform the data. They are referred to as ‘hidden’ because the user is only concerned with the input layer, where the features are passed in, and the output layer, where the prediction is made.
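
The sketch below passes two input features through one hidden layer of three neurons and then through a single output neuron. All weights are hand-picked placeholders rather than trained values, and the layer sizes, the `tanh` activation, and the untransformed output are assumptions chosen to keep the example small.

```python
# Minimal sketch of a forward pass: input -> hidden layer -> output.
# Weights and biases are arbitrary placeholders, not a trained model.
import math


def layer(inputs, weight_rows, biases, activation):
    """Apply one layer: each row of weights defines one neuron."""
    return [
        activation(sum(i * w for i, w in zip(inputs, row)) + b)
        for row, b in zip(weight_rows, biases)
    ]


def identity(value):
    """Output activation for regression: leave the value unchanged."""
    return value


HIDDEN_WEIGHTS = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]  # 3 neurons, 2 inputs each
HIDDEN_BIASES = [0.0, 0.0, 0.0]
OUTPUT_WEIGHTS = [[1.0, -2.0, 0.5]]  # 1 output neuron, 3 hidden inputs
OUTPUT_BIASES = [10.0]


def predict(features):
    """Run the features through the hidden layer, then the output layer."""
    hidden = layer(features, HIDDEN_WEIGHTS, HIDDEN_BIASES, math.tanh)
    output = layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)
    return output[0]


print(predict([0.0, 0.0]))
print(predict([1.5, -0.5]))
```

Only the features going in and the prediction coming out are visible to the user; the three intermediate values in `hidden` are the "hidden" part.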

Feature engineering — Feature engineering is the process of transforming the raw data into something more meaningful. This usually involves working with someone who has domain expertise.

Overfitting — An overfit model will have very high accuracy on the training data, having learned patterns that are specific to the data it has seen. However, it will have low accuracy on test data because it cannot generalize.

Model — Machine learning algorithms create a model after training. This is a mathematical function that takes a new observation and calculates an appropriate prediction.

Example Python Notebook

Predicting Yacht Resistance with Neural Networks

K-Nearest Neighbors

K-Nearest Neighbors (KNN) makes a prediction for a new observation by searching for the most similar training observations and pooling their values.
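
A minimal version of this procedure might look like the sketch below. It assumes "most similar" means the smallest Euclidean distance and that pooling means taking the mean of the neighbours' targets; the data, the function name `knn_predict`, and the choice of `k=3` are all made up for illustration.

```python
# Minimal sketch of K-Nearest Neighbors regression in plain Python.
import math


def knn_predict(query, train_features, train_targets, k=3):
    """Average the targets of the k training points closest to the query."""
    distances = []
    for features, target in zip(train_features, train_targets):
        # Euclidean distance between the query and this training point.
        distance = math.sqrt(sum((q - f) ** 2 for q, f in zip(query, features)))
        distances.append((distance, target))
    distances.sort(key=lambda pair: pair[0])
    nearest_targets = [target for _, target in distances[:k]]
    return sum(nearest_targets) / len(nearest_targets)


# Invented observations: two features each, with a numerical target.
train_features = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0)]
train_targets = [10.0, 12.0, 14.0, 50.0, 54.0]

prediction = knn_predict((1.5, 1.5), train_features, train_targets, k=3)
print(prediction)
```

Note that nothing is learned up front: every prediction scans the full training set, which is why KNN needs no training but can be slow to predict.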

Pros

  • Simple
  • Powerful
  • No training involved

Cons

  • Expensive & slow to predict new instances
  • Performs poorly on high dimensional datasets

Vocabulary

Observation — An observation is a single example, a data point or row in the data.

Pooling — This is a way to combine data and is usually done by taking the mean (average).

High dimensional — High dimensional data has a very large number of features. If your data is stored in a CSV, database or Excel file and there are a lot of columns that you will use to build a model, it is high dimensional.

Example Python Notebook

Predicting Yacht Resistance with K Nearest Neighbors


Many Thanks

I wish to thank Sam Rose for his great front-end development work (and patience!), converting my raw idea into something much more consumable, streamlined and aesthetically pleasing.

Similarly, my drawing skills leave much to be desired, so thank you to Mary Kim for adding an artistic flair to this work!
