An intro to regression analysis



Regression analysis is one of the most fundamental concepts in machine learning. This article briefly introduces linear and logistic regression and how to implement them using Python and Scikit-learn.

When you take your first steps into machine learning, the first concept you will come across is probably regression or regression analysis. What is regression analysis, though? 

In statistics, regression is a technique used to examine the relationship between one dependent variable and one or more independent variables. The main goal is to model this relationship so that we can make predictions or understand the underlying patterns of the data.

There are various types of regression analysis used in statistics today. The best two to start with are linear and logistic regression. Both provide a straightforward model of the relationship between the two sides of an equation (the dependent variable and the independent variables). The difference between them lies in the kind of outcome they predict: linear regression models a continuous outcome as a linear function of the inputs, while logistic regression models the probability of a categorical outcome, such as yes or no.

In this article, we will explore the basics of linear and logistic regression using Python and Scikit-learn. So, let’s get right to it!

What is linear regression?

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
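To make "fitting a linear equation" concrete, here is a minimal sketch of ordinary least squares for a single independent variable, written in plain Python. The function name fit_line and the sample points are assumptions made up for illustration; they do not come from any library or dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points that lie exactly on the line y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

Libraries such as Scikit-learn do this fitting for you and extend it to several independent variables, so you would rarely write it by hand.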

Linear regression is used in different areas of our lives, such as:

  • Real estate price prediction based on square footage, location, and number of bedrooms.
  • Stock price prediction based on historical stock prices, trading volume, and economic indicators.
  • Sales forecasting using advertising spend and product price.
  • Healthcare outcome prediction based on patient age, treatment type, and medical history.

Since linear regression is the simplest form of regression analysis, it is easy to implement in Python. Let’s start with a simple linear regression example, with one dependent variable and one independent variable, in 7 steps:

  1. Load the Dataset
  2. Explore the Dataset
  3. Prepare the Data
  4. Split the Data into Training and Testing Sets
  5. Train the Model
  6. Make Predictions
  7. Evaluate the Model

In code, this would look something like:


For this example, let’s consider a hypothetical dataset stored in a file named data.csv with two columns: Hours_Studied and Test_Score.
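The first two steps, loading and exploring the dataset, might look like the sketch below. Because data.csv is hypothetical, the rows here are invented for illustration and read from an in-memory string; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for data.csv: these rows are invented purely for illustration.
csv_text = """Hours_Studied,Test_Score
1.0,52
2.0,58
3.0,61
4.0,70
5.0,74
6.0,79
7.0,85
8.0,88
9.0,94
10.0,97
"""

# Step 1: load the dataset (with a real file: pd.read_csv("data.csv"))
data = pd.read_csv(io.StringIO(csv_text))

# Step 2: explore the dataset
print(data.head())        # first five rows
print(data.describe())    # summary statistics for each column
print(data.isna().sum())  # number of missing values per column
```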


Then, we need to split the data into the independent variable X and the dependent variable Y.
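The remaining steps (prepare, split, train, predict, evaluate) can be sketched with Scikit-learn as follows. This is one reasonable way to do it, not the only one. The data is synthetic and generated in the script so that it runs on its own; the 80/20 split, the random seeds, and the choice of mean squared error and R² as metrics are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data.csv (invented for illustration only):
# scores rise by roughly 5 points per hour studied, plus random noise.
rng = np.random.default_rng(seed=0)
hours = rng.uniform(0, 10, size=50)
scores = 50 + 5 * hours + rng.normal(0, 3, size=50)
data = pd.DataFrame({"Hours_Studied": hours, "Test_Score": scores})

# Step 3: prepare the data. Scikit-learn expects X to be two-dimensional,
# hence the double brackets. y is the dependent variable (Y in the text).
X = data[["Hours_Studied"]]
y = data["Test_Score"]

# Step 4: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: make predictions on data the model has not seen
y_pred = model.predict(X_test)

# Step 7: evaluate the model
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

The fitted slope estimates how many points the test score changes for each additional hour studied, and an R² close to 1 means the line explains most of the variation in the held-out scores.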


Logistic regression

Linear regression predicts a continuous value, which is not always what we need. Sometimes the outcome we want to predict is a category rather than a number, and that calls for a different type of regression. One such type is logistic regression, which is used when the dependent variable is categorical. It is typically applied to binary classification problems, i.e., when the outcome can be one of two classes. For example, it can be used to predict whether an email is spam or not spam, or whether a tumor is malignant or benign.

The logistic regression model estimates the probability that a given instance belongs to a particular category, and then turns that probability into a prediction using a threshold value. If the estimated probability is greater than the threshold (often set to 0.5), the model predicts that the instance belongs to that category; otherwise, it predicts that it does not.
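As a rough illustration of the thresholding step: logistic regression computes a weighted sum of the inputs (called z below) and passes it through the sigmoid function to get a probability. The helper names sigmoid and predict_class are assumptions chosen for this sketch, not library functions.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Return 1 if the estimated probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


print(sigmoid(0.0))         # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```

Raising the threshold makes the model more reluctant to predict the positive class, which can be useful when false positives are costly.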

Though logistic regression is slightly more complex than linear regression, its implementation in Python is just as simple, thanks to Scikit-learn!

Let’s go through an example of logistic regression using Python. Suppose we have a dataset stored in a file called health_data.csv with two feature columns, “Age” and “BMI”, and we want to predict the binary outcome “HasDisease”.

We can do that by following the same steps we did for linear regression. If we write it down as code, it would look like this:
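A minimal sketch of that workflow with Scikit-learn is shown below. Since health_data.csv is hypothetical, the script generates synthetic data with the same column names so that it runs on its own. The rule used to generate “HasDisease”, the 80/20 split, and the max_iter setting are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for health_data.csv (invented for illustration only).
rng = np.random.default_rng(seed=0)
n = 200
age = rng.uniform(20, 80, size=n)
bmi = rng.uniform(18, 40, size=n)
# Assumed rule: the chance of disease rises with both age and BMI.
z = 0.08 * (age - 50) + 0.2 * (bmi - 28)
has_disease = (rng.uniform(size=n) < 1 / (1 + np.exp(-z))).astype(int)
data = pd.DataFrame({"Age": age, "BMI": bmi, "HasDisease": has_disease})

# Prepare the data: two feature columns and one binary target
X = data[["Age", "BMI"]]
y = data["HasDisease"]

# Split the data, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the model (a higher max_iter gives the solver room to converge)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions: class labels, and the probability of class 1
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
```

Here predict returns the class labels directly, while predict_proba exposes the underlying probabilities in case you want to apply a threshold of your own.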



Regression analysis is a powerful statistical method that lets you uncover the relationship between two or more variables. These variables are often referred to as the dependent and the independent variables. In statistics and machine learning, there are many types of regression analysis, and each has its best use case. Perhaps the simplest and most commonly used ones are linear and logistic regression. You can implement these two regressions (and more) using Python with libraries such as scikit-learn.

This article briefly introduced linear and logistic regression and their implementation using Python. However, real-world scenarios often involve multiple independent variables and may require more complex models such as multiple linear regression, polynomial regression, or other advanced regression techniques. Nevertheless, the overall workflow is much the same for all of these algorithms; which one you choose depends on your desired output and the type of data you have. So, as a beginner, explore and experiment with different models and datasets to deepen your understanding of regression analysis, until you know which regression model suits a specific application and the results it requires.

