Data Pre-Processing Module
The modules start with data pre-processing, the most crucial step in any machine learning project. This step helps the data scientist produce a clean, useful dataset from which actionable intelligence can be extracted. It is also repetitive in nature: data scientists often find themselves returning to it at various stages of the project. That makes it a foundational step.
We need data preprocessing for various reasons. The data may contain inconsistencies: for example, to describe the male gender, a dataset might contain the values ‘Male’, ‘M’, ‘boy’, ‘m’, and ‘man’. To a human these all mean the same thing, but to a computer they are different values. It is therefore crucial that such differences be identified and rectified before we start to experiment with the dataset. The data can also be noisy, or contain outliers, errors, or empty values, any of which can induce errors in an algorithm applied to it.
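The gender example above can be fixed with a small normalization pass. This is a minimal sketch, assuming a hypothetical `gender` column and an agreed canonical spelling for each category:

```python
import pandas as pd

# Hypothetical column with inconsistent labels for the same category
df = pd.DataFrame({"gender": ["Male", "M", "boy", "m", "man", "Female", "F"]})

# Every known variant maps to one canonical value
male_variants = {"male", "m", "boy", "man"}
female_variants = {"female", "f", "girl", "woman"}

def normalize_gender(value):
    v = str(value).strip().lower()
    if v in male_variants:
        return "male"
    if v in female_variants:
        return "female"
    return "unknown"  # flag anything unexpected for manual review

df["gender"] = df["gender"].map(normalize_gender)
print(df["gender"].unique())  # only canonical values remain
```

Mapping unknown variants to a sentinel like `"unknown"` instead of guessing keeps the cleanup auditable.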
There are various steps involved in data preprocessing:
Step 1: Import the libraries
Step 2: Import the data-set
Step 3: Identify missing values
Step 4: Encoding the Categorical Values
Step 5: Splitting the data-set into Training and Test Set
Step 6: Feature Scaling
Python has a wide variety of data libraries, such as Pandas and NumPy, which provide high-performance, easy-to-use data structures and data analysis tools. I really liked the ease of use of both libraries and the built-in functions they provide for data cleaning and data structuring.
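Step 1 is just a pair of imports under their conventional aliases. A minimal sketch of the two libraries working together:

```python
# Conventional aliases for the two libraries mentioned above
import numpy as np
import pandas as pd

# NumPy supplies fast numerical arrays; pandas wraps them
# in labeled structures with built-in analysis methods
arr = np.array([1.0, 2.0, 3.0])
s = pd.Series(arr, name="example")
print(s.mean())  # 2.0
```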
After importing the libraries, the next step is to use them to import the dataset. The entire dataset can be loaded into a pandas DataFrame for easy manipulation, and pandas also provides functions to check for missing values. Missing values are handled in two ways: deleting the rows that contain them, or replacing them with the mean, median, or mode of the other values in that column. Removing the data leads to a loss of information, which may hurt the results when predicting the output. With the second method, we calculate the mean, median, or mode of the feature and use it in place of the missing values. This is an approximation and can add variance to the dataset, but it avoids the loss of data and generally yields better results than removing rows and columns.
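Both strategies are one-liners in pandas. A sketch with a hypothetical two-column dataset (in practice the DataFrame would come from something like `pd.read_csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 80000],
})

# Count missing values per column
print(df.isnull().sum())

# Option 1: drop any row that contains a missing value
dropped = df.dropna()

# Option 2: replace missing values with each column's mean
filled = df.fillna(df.mean())
```

`dropna` shrinks the dataset (losing information), while `fillna(df.mean())` keeps every row at the cost of the approximation described above.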
These strategies work well with numerical data, which is why categorical data needs special handling. Machine learning models are based on mathematical equations, and those equations only operate on numbers, so categorical values cannot be fed into them directly. To solve this problem, the categorical values are encoded using methods such as one-hot encoding.
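One-hot encoding turns each category into its own binary column. A minimal sketch using pandas, with a hypothetical `city` column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Tokyo"]})

# get_dummies creates one binary column per category:
# each row has a 1 in its category's column and 0 elsewhere
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```

Keeping each category in a separate column avoids implying a false ordering, which would happen if we simply numbered the categories 0, 1, 2.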
The next step is to split the dataset into training and test sets. Generally, data is split in a 70-30 or 80-20 ratio, where the larger chunk is used as the training data. The final step is feature scaling: limiting the range of variables so that they can be compared on common ground. If we have an Age column and a Salary column, these variables are on very different scales, and that can cause problems in a machine learning model. Various methods such as re-scaling, mean normalization, and standardization are used to solve this.
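Both steps can be sketched with scikit-learn (my choice here, not named in the text above), using a hypothetical Age/Salary feature matrix and an 80-20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age and salary live on very different scales
X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000], [45, 90000]])
y = np.array([0, 0, 1, 1, 1])

# 80-20 split; the larger chunk trains the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardization: rescale each feature to zero mean and unit variance.
# Fit only on the training set so no test information leaks in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training data and merely transforming the test data is the standard way to keep the test set untouched by training statistics.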