Post

[MLR-01] Machine Learning: Data Preprocessing

Data preprocessing might not be the most glamorous part of machine learning, but it’s a crucial step in ensuring that your models deliver accurate and reliable results. In this blog post, I’ll walk you through the process of data preprocessing using R, making it as straightforward as possible.

Importing the Dataset

Let’s start at the beginning. The first step is importing your dataset. In this example, we’re using a CSV file named ‘Data.csv’. We can load it into R using the read.csv() function. This step sets the stage for all subsequent data manipulation and analysis.

1
2
# Importing the dataset
dataset = read.csv('Data.csv')

Handling Missing Data

Real-world datasets often have missing values, which can wreak havoc on your analysis. To address this, we’ll replace missing values with the mean of the respective column. The ifelse() function helps us achieve this by checking for missing values and replacing them with the mean using the ave() function.

1
2
3
4
5
6
7
# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

Encoding Categorical Data

Machine learning models often require numerical data, so we need to encode categorical variables into numerical form. Here, we’re converting ‘Country’ and ‘Purchased’ into factors, assigning numerical labels to categories.

1
2
3
4
5
6
7
# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

Splitting the Dataset

Before building and testing our models, we need to divide our dataset into a training set and a test set. This division allows us to train our models on one portion and evaluate their performance on another. We use the sample.split() function from the ‘caTools’ library to achieve this.

1
2
3
4
5
6
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123) # for reproducibility
split = sample.split(dataset$Purchased, SplitRatio = 0.8) # 80% for training
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Feature Scaling

Lastly, we perform feature scaling to ensure that all our numerical features have the same scale. This step prevents certain variables from dominating others during model training. We use the scale() function to standardize our features.

1
2
3
# Feature scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

And there you have it – a comprehensive guide to data preprocessing in machine learning using R. Each step plays a vital role in preparing your data for model building, making your machine learning journey a smoother one. If you want to dive deeper into any of these topics, there are numerous resources available online, as well as comprehensive documentation within R itself.

Happy preprocessing and happy modeling! 🚀

This post is licensed under CC BY 4.0 by the author.