[MLP-03] Machine Learning: Multiple Linear Regression

Posted Nov 8, 2023

By Randy Frans Fela 3 min read

Believe me, in most linear cases, linear regression is a powerful tool in data analysis and predictive modeling. It allows us to explore and exploit the relationships between multiple independent variables and a dependent variable. In this blog post, I’ll write my notes about multiple linear regression using Python.

Setting the Stage

Well, I am never bored of repeating this. Before we dive into the Python code, let’s clarify what multiple linear regression is all about. In essence, it’s a statistical method used to model the relationship between a dependent variable and two or more independent variables by fitting a linear equation to the observed data.

The fundamental equation for multiple linear regression can be expressed as:

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \varepsilon\]

Here:

$Y$ represents the dependent variable we’re trying to predict.
$\beta_0$ is the intercept.
$\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients associated with the independent variables $X_1, X_2, \ldots, X_n$.
$\varepsilon$ is the error term, which accounts for unexplained variation.

Let’s Move to Python

Download the dataset here: “50_Startups.csv”

Importing the Libraries

  
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Our journey starts by importing the necessary Python libraries. NumPy, Matplotlib, and Pandas are our trusty companions for data manipulation, visualization, and analysis.

Loading the Dataset

  
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

We load our dataset, which contains information about different startups, including R&D spending, administration spending, marketing spending, and state. Our goal is to predict the profit of these startups.

Encoding Categorical Data

  
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Dealing with categorical data like “state” is a crucial step. We use a one-hot encoder to convert the categorical variable into numerical values, ensuring that our model can understand it.

Splitting the Dataset

  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We split our dataset into a training set and a test set. This allows us to train our model on one portion of the data and evaluate its performance on the other.

Training the Model

  
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

The heart of the matter: we train a multiple linear regression model using the LinearRegression class from scikit-learn. It learns the relationships between our independent variables (R&D spending, administration spending, marketing spending, and state) and the dependent variable (profit).

Making Predictions

  
y_pred = regressor.predict(X_test)

With our model trained, we can now make predictions. We apply our model to the test set to estimate the profits of startups based on the features we’ve provided.

Putting It All Together

  
print(y_pred)
print(regressor.coef_)
print(regressor.intercept_)

We can print out the predictions, the coefficients (the $\beta$ values), and the intercept ($\beta_0$) of our model. These coefficients represent the impact of each independent variable on the dependent variable.

And that’s a wrap! We’ve successfully journeyed through the land of multiple linear regression using Python. This tool is invaluable for understanding the relationships between variables and making predictions based on those relationships.

Now you’re equipped to explore and apply multiple linear regression to your own datasets, unlocking valuable insights and predictions.

References:

Practical Programming, Python

This post is licensed under CC BY 4.0 by the author.