Data visualization is an important step in machine learning. It is used to inspect the distribution of a variable, the relationship between two variables, and so on.
So, let’s take a deep dive into univariate and bivariate analysis using seaborn.
Univariate Analysis
- Histogram
- Distplot
- Box plot
- Countplot
Bivariate Analysis on Categorical Variables
- Barplot
- Pointplot
Bivariate Analysis on Continuous Variables
- Scatterplot
- lmplot
- lineplot
- regplot
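I have taken the Iris data set and performed univariate and bivariate analysis on it. The code in this article assumes that seaborn is imported as sns and that the data is in a pandas DataFrame named df, with the measurement columns "sepal length", "petal length" and "petal width" and the species in the column "iris".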
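The article does not show how df was built, so the following is only a sketch of one possible setup. It assumes the copy of the Iris data bundled with scikit-learn (any other source would do) and renames the columns to the names used in this article. The helper name load_iris_frame is made up for illustration.

```python
import pandas as pd
import seaborn as sns  # used by the plotting calls in the rest of the article
from sklearn.datasets import load_iris


def load_iris_frame():
    """Return the Iris data with the column names used in this article."""
    bunch = load_iris(as_frame=True)
    frame = bunch.frame.copy()
    # scikit-learn names the columns like "sepal length (cm)"; drop the unit suffix.
    frame.columns = [name.replace(" (cm)", "") for name in frame.columns]
    # Replace the integer class codes (0, 1, 2) with the species names.
    frame["iris"] = frame.pop("target").map(dict(enumerate(bunch.target_names)))
    return frame


df = load_iris_frame()
print(df.head())
```

If your copy of the data already uses these column names, a plain pd.read_csv call on that file is all you need.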
Univariate Analysis
1. Histogram
A histogram shows the distribution of a variable by splitting its range into bins and counting the observations that fall into each bin.
sns.histplot(df["sepal length"], color="darkorange")
From the histogram, we can infer that the sepal length ranges from about 4 to 8, and that most flowers have a sepal length between 5.5 and 6.5.
To draw the histogram with horizontal bars, we can switch the axis by passing the variable as y.
sns.histplot(y="sepal length", data=df, color="darkorange")
Histogram for categorical variables
To include a categorical variable, the hue parameter is used. Each category gets its own color, so one histogram is drawn per category.
sns.histplot(x="sepal length", data=df, hue="iris")
From the plot, we can compare the sepal length distribution of the different iris species.
2. distplot
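Distplot draws a histogram with a kernel density estimate (KDE) curve on top of it. It is used for the distribution of a single variable. Note that sns.distplot has been deprecated since seaborn 0.11 in favour of histplot and displot.
sns.distplot(df["sepal length"], color="darkorange")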
To draw the density curve alone, without the histogram, we can pass hist=False.
sns.distplot(df["sepal length"], hist=False, color="darkorange")
3. Boxplot
A box plot visualizes the descriptive statistics of a variable through its five-point summary, and it is commonly used to detect outliers.
Five Point Summary
min, lower quartile (Q1), median, upper quartile (Q3), max
sns.boxplot(x=df["sepal length"], color="darkorange")
Interpreting boxplot
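Boxplot is the pictorial representation of descriptive statistics. The box spans from the lower quartile (Q1) to the upper quartile (Q3), and the line inside the box marks the median. By default, the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range (Q3 - Q1) of the box. Points drawn beyond the whiskers are treated as outliers.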
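The five-point summary and the 1.5 x IQR outlier rule behind a box plot can also be computed by hand. The sketch below uses only the Python standard library. The function names and the sample numbers are made up for illustration, and quartile conventions differ between libraries, so the values may not match a given plot exactly.

```python
from statistics import quantiles


def five_point_summary(values):
    """Return min, Q1, median, Q3 and max of a sequence of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between neighbouring data points.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


def iqr_outliers(values, whisker=1.5):
    """Return the values lying more than whisker * IQR outside the box."""
    summary = five_point_summary(values)
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - whisker * iqr
    high = summary["q3"] + whisker * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements for illustration only (not taken from the Iris data).
sample = [4.3, 4.9, 5.1, 5.8, 6.4, 7.0, 7.9, 12.0]
print(five_point_summary(sample))
print(iqr_outliers(sample))
```

On a real column, the same functions can be called with list(df["sepal length"]).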
4. countplot
A count plot shows the distribution of a categorical variable by drawing one bar per category, with the bar height equal to the number of observations in that category.
sns.countplot(x="iris", data=df)
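The numbers behind a count plot are simply a frequency table. As a minimal sketch, the table can be built with the standard library; the labels below are made up for illustration and stand in for the values of df["iris"].

```python
from collections import Counter

# Made-up species labels for illustration; the real column is df["iris"].
labels = ["setosa", "virginica", "setosa", "versicolor", "virginica", "setosa"]

# Count how often each label occurs; each count is one bar height.
counts = Counter(labels)
for species, count in counts.most_common():
    print(f"{species}: {count}")
```

With pandas, df["iris"].value_counts() returns the same kind of table directly.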
Bivariate Analysis on Categorical Variables
1. Barplot
Barplot aggregates a numerical variable within each category and draws the result as the bar height. The default aggregate is the mean.
sns.barplot(x="iris", y="sepal length", data=df)
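To make the aggregation step concrete, the sketch below computes the per-category mean with pandas and then asks barplot for a different aggregate through its estimator parameter. The toy values are made up for illustration; only the column names follow the article.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up toy data for illustration; column names follow the article.
toy = pd.DataFrame({
    "iris": ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"],
    "sepal length": [4.8, 5.0, 5.5, 6.1, 6.5, 7.8],
})

# What the default bar heights represent: the mean per category.
group_means = toy.groupby("iris")["sepal length"].mean()
# An alternative aggregate: the median per category.
group_medians = toy.groupby("iris")["sepal length"].median()
print(group_means)
print(group_medians)

# Passing another aggregate via estimator changes the bar heights.
ax = sns.barplot(x="iris", y="sepal length", data=toy, estimator=np.median)
```

The median is less sensitive to a single extreme value than the mean, which is one reason to switch the estimator.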
2. pointplot
Pointplot shows a point estimate (the mean by default) of a numerical variable at each level of the categorical variable, with a confidence interval drawn as an error bar.
sns.pointplot(x="iris", y="petal width", data=df, color="darkorange")
Bivariate Analysis on Continuous Variables
1. scatterplot
The scatterplot draws one point per observation, which reveals how two numerical variables are related, for example whether they are correlated.
sns.scatterplot(x="petal length", y="petal width", data=df, color="darkorange")
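A scatterplot gives a visual impression of correlation; the Pearson correlation coefficient puts a number on it. The sketch below computes it with NumPy on a handful of made-up value pairs, which are an assumption for illustration and not the real Iris measurements.

```python
import numpy as np

# Made-up paired measurements for illustration (not the real Iris values).
petal_length = np.array([1.4, 1.5, 4.5, 4.7, 5.1, 6.0])
petal_width = np.array([0.2, 0.3, 1.5, 1.4, 1.9, 2.5])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(petal_length, petal_width)[0, 1]
print(f"Pearson r = {r:.3f}")
```

On the DataFrame itself, df["petal length"].corr(df["petal width"]) computes the same coefficient.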
2. lineplot
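Lineplot connects the data points with a line to show how one numerical variable changes with another. When several rows share the same x value, seaborn plots their mean and shades a confidence interval around it by default.
sns.lineplot(x="petal length", y="petal width", data=df, color="darkorange")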
3. regplot
Regplot is a scatterplot with a fitted linear regression line drawn through it.
sns.regplot(x="petal length", y="petal width", data=df, color="darkorange")
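The line in a regplot is, by default, a straight line fitted by least squares. As a rough sketch of that kind of fit, the slope and intercept can be estimated with NumPy. The points below are made up so that they lie exactly on a known line; they are not the Iris data, and this is not a reproduction of seaborn's internal code.

```python
import numpy as np

# Made-up points that lie exactly on y = 0.4 * x - 0.3, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.4 * x - 0.3

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Replacing x and y with df["petal length"] and df["petal width"] gives the coefficients of the line for the real data.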
Conclusion
In this article, I have covered the different types of univariate and bivariate analysis using the iris data set. Thanks for reading!