Univariate, Bivariate, Multivariate Analysis

Univariate, bivariate, and multivariate analysis are categories of exploratory analysis, which can be described as:

‘The critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.’

We need to conduct this analysis to understand our data better and also to guide us in the right direction when choosing what machine learning model to implement.

Univariate Analysis

Let’s start with the simplest form of analyzing data, univariate analysis. “Uni” means “one,” so we analyze only one feature of our dataset. Using the single feature sepal_length, we will try to determine the output species.

PYTHON
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the dataset
df = sns.load_dataset('iris')

First, we divide our data frame into three separate data frames, one for each species.

PYTHON
# Boolean masks selecting the rows of each species
setosa_filter = df['species'] == 'setosa'
virginica_filter = df['species'] == 'virginica'
versicolor_filter = df['species'] == 'versicolor'

# One data frame per species
setosa_df = df.loc[setosa_filter, :]
virginica_df = df.loc[virginica_filter, :]
versicolor_df = df.loc[versicolor_filter, :]
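The same split can also be sketched with pandas' groupby, which avoids writing one filter per category. The helper name split_by_category is hypothetical, and the column name 'species' is assumed from the iris data frame used above.

```python
import pandas as pd


def split_by_category(df: pd.DataFrame, column: str) -> dict:
    """Return a dict mapping each category in `column` to its sub-frame."""
    # groupby yields (category_value, sub_frame) pairs, one per distinct value
    return {name: group for name, group in df.groupby(column)}


# Usage with the data frame loaded earlier (assumed to be named df):
# frames = split_by_category(df, 'species')
# setosa_df = frames['setosa']
```

This keeps working unchanged if the dataset contains more or fewer categories than the three iris species.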

Next, we plot the feature. Since this is a univariate analysis, we set every value on the Y axis to zero.

PYTHON
# Plot each species along the X axis, with Y fixed at zero
plt.plot(setosa_df['sepal_length'], np.zeros_like(setosa_df['sepal_length']), 'o', label='setosa')
plt.plot(virginica_df['sepal_length'], np.zeros_like(virginica_df['sepal_length']), 'o', label='virginica')
plt.plot(versicolor_df['sepal_length'], np.zeros_like(versicolor_df['sepal_length']), 'o', label='versicolor')
plt.xlabel('Sepal length', fontweight='bold')
plt.legend()
plt.show()

As you can see from our plot, we can read off the range of sepal length for each species and use it to judge which category a data point is likely to fall into.
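The ranges can also be computed rather than read off the plot. The sketch below assumes a data frame with a 'species' column like the one loaded earlier; the helper name feature_ranges is hypothetical.

```python
import pandas as pd


def feature_ranges(df: pd.DataFrame, feature: str, group_col: str = 'species') -> pd.DataFrame:
    """Return the minimum and maximum of `feature` for each group."""
    # One row per group, with a 'min' and a 'max' column
    return df.groupby(group_col)[feature].agg(['min', 'max'])


# Usage with the iris data frame loaded earlier (assumed to be named df):
# print(feature_ranges(df, 'sepal_length'))
```

Comparing the minimum of one species with the maximum of another shows where their ranges overlap.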

Try this exercise. Click the correct answer from the options.

Using the univariate example above, what is the approximate range for sepal_length of the species Versicolor?

Click the option that best answers the question.

  • 4.1 - 4.9
  • 4.9 - 7.1
  • 7.1 - 8.0

Bivariate Analysis

However, not all datasets can be separated as clearly with a single feature. Often the points overlap considerably, which leads us to the next type of analysis.

Bivariate analysis, 'Bi' meaning 'two', involves analyzing two variables of our dataset.

Using petal_length and sepal_length, we will try to determine the species. To do this we will use Seaborn’s FacetGrid class.

PYTHON
# `height` replaces the older `size` argument in recent Seaborn releases
sns.FacetGrid(df, hue="species", height=5).map(plt.scatter, "petal_length", "sepal_length")

plt.xlabel('Petal length', fontweight='bold')
plt.ylabel('Sepal length', fontweight='bold')

plt.legend(loc=(0.7, 0.8))
plt.show()

From the plot, it is easy to classify the species Setosa. The other two species are harder to classify because their points overlap.
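A scatter plot shows the relationship between two variables visually. To put a number on how strongly they move together within each species, one option is the Pearson correlation per group. This is a minimal sketch: the helper name correlation_by_group is hypothetical, and the column names are assumed from the iris data frame.

```python
import pandas as pd


def correlation_by_group(df, x, y, group_col='species'):
    """Return the Pearson correlation between columns x and y within each group."""
    # Series.corr computes the Pearson correlation by default
    result = {name: group[x].corr(group[y]) for name, group in df.groupby(group_col)}
    return pd.Series(result, name=f'corr({x}, {y})')


# Usage with the iris data frame loaded earlier (assumed to be named df):
# print(correlation_by_group(df, 'petal_length', 'sepal_length'))
```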

Since the points overlap, a linear model would likely not separate these two species cleanly. Non-linear classification methods such as decision trees or random forests would be preferred to reduce errors. With that in mind, we move on to the next type of analysis.

Try this exercise. Is this statement true or false?

Bivariate analysis involves analysing 3 or more variables.

Press true if you believe the statement is correct, or false otherwise.

Multivariate Analysis

Multivariate analysis, 'Multi' meaning 'many', involves analysing multiple variables and their relationships. This type of analysis lets us visualize relationships that we could not otherwise see, since we cannot view a 4-D diagram directly.

To carry out this type of analysis we will use Seaborn’s built-in pairplot function. It lets us observe the pairwise relationships between all numeric variables in our dataset.

PYTHON
# `height` replaces the older `size` argument in recent Seaborn releases
sns.pairplot(df, hue="species", height=3)
plt.show()

From our graphs we can see that all the variables of our dataset appear along both the X and Y axes. Along the diagonal we see layered kernel density estimates (KDEs) showing the distribution of each feature for each species.

If two features are highly correlated, they carry largely the same information about the output, so we can consider dropping one of them, since it is generally better to use fewer features.
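One way to find such candidates is to scan the correlation matrix for pairs above a cutoff. The sketch below is illustrative only: the helper name highly_correlated_pairs is hypothetical, and the 0.9 threshold is an assumption rather than a rule from this lesson.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    # Keep numeric columns only, so a label column such as 'species' is ignored
    corr = df.select_dtypes(include='number').corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


# Usage with the iris data frame loaded earlier (assumed to be named df):
# print(highly_correlated_pairs(df))
```

Whether to actually drop one feature of a reported pair remains a judgment call that depends on the model and the data.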

Observations

  1. petal_length and petal_width are the most useful features to identify various flower types.
  2. While Setosa can be easily identified (linearly separable), Versicolor and Virginica have some overlap (almost linearly separable).
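The idea of a species being separable on a single feature can be checked directly: one group is separable by a threshold when its values lie entirely below or entirely above those of every other group. This is a minimal sketch with a hypothetical helper name, separable_by_threshold, and it assumes a 'species' column as in the iris data frame.

```python
import pandas as pd


def separable_by_threshold(df, feature, target, group_col='species'):
    """Return True if `target`'s values of `feature` do not overlap any other group's range."""
    is_target = df[group_col] == target
    target_vals = df.loc[is_target, feature]
    other_vals = df.loc[~is_target, feature]
    # Separable if the target lies entirely below or entirely above the rest
    below = target_vals.max() < other_vals.min()
    above = target_vals.min() > other_vals.max()
    return bool(below or above)


# Usage with the iris data frame loaded earlier (assumed to be named df):
# print(separable_by_threshold(df, 'petal_length', 'setosa'))
```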

Build your intuition. Click the correct answer from the options.

Using the multivariate example above, what species appears to have the smallest petal width and petal length?

Click the option that best answers the question.

  • Setosa
  • Versicolor
  • Virginica

Conclusion

Exploratory analysis can be classified as univariate, bivariate, and multivariate analysis. Univariate analysis examines one variable, bivariate analysis examines two variables and their relationship, and multivariate analysis examines three or more variables and their relationships.

One Pager Cheat Sheet

  • Analyzing data with Univariate, Bivariate, and Multivariate methods is an important part of Exploratory Analysis, which helps to uncover patterns, anomalies and inform model selection.
  • Univariate Analysis involves analyzing one feature of our dataset to determine the output, as shown by plotting the sepal_length feature for each species and finding their ranges.
  • The sepal_length of the species Versicolor ranges from approximately 4.9 to 7.1.
  • Bivariate Analysis is used to analyze two variables in a dataset to determine the species, with non-linear classification methods preferred due to the overlap of points.
  • Bivariate analysis is a statistical analysis that looks at the correlation between two variables, whereas Multivariate analysis looks at the relationships between three or more.
  • Multivariate analysis is a technique used to analyze multiple variables and their relationships by plotting them in a graph to explore their correlation with each other and with the output, and can help to identify what features are most useful.
  • The Pairplot graph highlights that Setosa has the smallest petal width and petal length from the clear patterning of the data points as well as from the layered kernel density estimates.
  • Exploration of a dataset can involve Univariate, Bivariate, and Multivariate Analysis of variables and their relationships.