Standardization & Normalization

So you've collected all your data and it's time to start your machine learning project. Each feature in the data you have collected has two important properties: a unit and a magnitude. For example, the feature 'age' has units of years, and its magnitude is the numeric value itself.

Introduction

Each feature in your dataset will have a different magnitude and unit. Algorithms that compute distances between features are biased towards numerically larger values, so it is important to scale this data before handing it to your ML algorithm.
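
To see why this matters, here is a minimal sketch (the wine values are made up for illustration): the Euclidean distance between two samples is dominated by whichever feature happens to have the larger numbers.

PYTHON
import numpy as np

# Two hypothetical wines: [alcohol (% vol), density (g/cm^3)]
wine_a = np.array([9.4, 0.9978])
wine_b = np.array([12.8, 0.9946])

# The distance is ~3.4, driven almost entirely by alcohol --
# density barely contributes, only because its numbers are small
print(np.linalg.norm(wine_a - wine_b))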

The two most common techniques to do this are normalization and standardization. Let's take a closer look at the two.

What is Normalization?

This method, also known as Min-Max Scaling, rescales values into a common range, typically [0,1] or [-1,1]. Each value is transformed as x' = (x - min) / (max - min). It is best suited to data without outliers, since a single extreme value would squeeze all the other values into a narrow band. In Python, we use a transformer from the Scikit-Learn package called MinMaxScaler for normalization.

First, we will import our dataset as follows.

PYTHON
import pandas as pd
import numpy as np

df = pd.read_csv("/content/winequality-red.csv")

# Choosing 3 features for our analysis
wine_df = df.loc[:, ['quality', 'alcohol', 'density']]
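
Before scaling, it helps to check the raw ranges. In the red wine quality dataset, alcohol spans roughly 8 to 15 while density stays close to 1, so the features sit on very different scales:

PYTHON
# Compare the min and max of each feature to see the scale mismatch
print(wine_df.describe().loc[['min', 'max']])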

Next, using MinMaxScaler, we will perform normalization on our data.

PYTHON
from sklearn.preprocessing import MinMaxScaler

scaling = MinMaxScaler()

# Returns a NumPy array with each column rescaled to [0, 1]
scaling.fit_transform(wine_df[['quality', 'alcohol']])
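
For intuition, this is the same [0,1] rescaling done by hand; a minimal sketch of the formula MinMaxScaler applies per column:

PYTHON
# Manual min-max scaling: x' = (x - min) / (max - min)
cols = wine_df[['quality', 'alcohol']]
normalized = (cols - cols.min()) / (cols.max() - cols.min())

# Every value now lies in [0, 1]
print(normalized.min(), normalized.max())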

Try this exercise. Is this statement true or false?

Normalization is also known as Min-Max Scaling.

Press true if you believe the statement is correct, or false otherwise.

What is Standardization?

This method, also known as Z-Score Normalization, rescales values based on the standard normal distribution, giving each feature a mean of 0 and a standard deviation of 1. Each value is transformed as z = (x - mean) / std. Unlike normalization, it is less affected by outliers since there is no predefined range for the transformed features. In Python, we use a transformer from the Scikit-Learn package called StandardScaler for standardization.

Using the same dataframe, we will standardize our data with StandardScaler.

PYTHON
from sklearn.preprocessing import StandardScaler

scaling = StandardScaler()

# Returns a NumPy array where each column has mean 0 and standard deviation 1
scaling.fit_transform(wine_df[['quality', 'alcohol']])
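
Again for intuition, here is the same transformation done by hand. One subtlety: StandardScaler uses the population standard deviation, so we pass ddof=0 to pandas to match its output:

PYTHON
# Manual standardization: z = (x - mean) / std
cols = wine_df[['quality', 'alcohol']]
standardized = (cols - cols.mean()) / cols.std(ddof=0)

# Each column now has mean ~0 and standard deviation ~1
print(standardized.mean(), standardized.std(ddof=0))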

Are you sure you're getting this? Is this statement true or false?

Standardization involves the rescaling of data between a range of 0,1.

Press true if you believe the statement is correct, or false otherwise.

What Do I Choose?

Normalization is a good option when you don't know the distribution of your data. It is useful when your features have varying scales and the algorithm you are using makes no assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.

Standardization, on the other hand, is a good option when you can assume your data has a Gaussian distribution. Although not strictly necessary, the technique is most effective when that assumption holds. It is useful when your features have varying scales and the algorithm you are using does make assumptions about the distribution, such as linear regression, logistic regression, and linear discriminant analysis.
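
In practice, the scaler is usually chained to the model in a Pipeline so the parameters learned from the training data are reapplied to new data. Here is a minimal sketch pairing MinMaxScaler with k-nearest neighbors; the train/test split and the choice of quality as the target are illustrative assumptions, not part of the lesson above:

PYTHON
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Predict wine quality from the two numeric features used earlier
X = df[['alcohol', 'density']]
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler learns min/max from the training fold only,
# then applies the same transformation at prediction time
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))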

Try this exercise. Is this statement true or false?

The Normalization method is most appropriate when your data has a Gaussian distribution.

Press true if you believe the statement is correct, or false otherwise.

Conclusion

Feature scaling is an important preprocessing step in ML. Choosing between normalization and standardization depends on your data and the algorithm you are using.

One Pager Cheat Sheet

  • Scaling numerical data by either normalizing or standardizing the features is important for many machine learning algorithms in order to account for differences in magnitude and unit.
  • Normalization, commonly referred to as Min-Max Scaling, rescales values into a common range, typically [0,1] or [-1,1], and is performed with the MinMaxScaler transformer from the Scikit-Learn package.
  • Standardization, also known as Z-Score Normalization, rescales the data based on the standard normal distribution, and is performed using the StandardScaler transformer in Scikit-Learn.
  • Standardization computes each value's z-score relative to the mean and standard deviation of the dataset.
  • When deciding between Normalization and Standardization, consider whether you know the distribution of your data and whether your algorithm makes assumptions about that distribution: k-nearest neighbors favors normalization, while linear regression favors standardization.
  • Normalization is not the most appropriate method when data has a Gaussian distribution; Standardization should be used instead, particularly when the algorithm makes assumptions about that distribution.
  • Feature scaling is an important preprocessing step for machine learning models, with Normalization and Standardization being the two main techniques available.