Handle Missing Data in Datasets


When you work on Machine Learning projects, the dataset you receive is often not clean. A failed load or an incomplete extraction can leave some values missing. Handling missing values is part of data pre-processing, and choosing the right way to handle them can improve the resulting model. Let's look at how to handle missing data.

1. Load the dataset

This is an example of a dataset with missing values. You can download it from my GitHub.


Missing Data Example

I use Python with pandas to load this Data.csv file.

import os
import pandas as pd
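# Build the path with os.path.join so the backslash is not read as an escape character.
file = os.path.realpath(os.path.join("Data Preprocessing", "Data.csv"))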
data = pd.read_csv(file)
Import Data.csv file

You can see the NaN values in the dataset. These mark the missing entries, and we need to replace them with actual numbers before training a model.
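Before replacing anything, it helps to check where the NaN values are. Here is a minimal sketch. Since the real Data.csv is not reproduced here, it uses a small made-up DataFrame as a stand-in; the column names and values are assumptions, not the actual file contents.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv; the column names and values are assumptions.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Count the missing values in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value.
rows_with_missing = data[data.isna().any(axis=1)]
print(rows_with_missing)
```

With your own file, replace the stand-in DataFrame with the one returned by pd.read_csv.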

2. Handle Missing Data in Datasets with Sklearn

scikit-learn (imported as sklearn) is a powerful third-party Machine Learning library for Python.

Step 1: Import SimpleImputer from sklearn

from sklearn.impute import SimpleImputer
import numpy as np

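SimpleImputer is a class that sklearn provides to handle missing values. We also need numpy to specify what a missing value looks like.

Step 2: Create a SimpleImputer

The constructor takes a few parameters. The two used here are missing_values and strategy. Both are optional and default to np.nan and "mean", but setting them explicitly makes the intent clear. Here are 3 examples explaining the parameters.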

Example 1: Take the Mean of the Column

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

Example 2: Take the Median of the Column

imputer = SimpleImputer(missing_values=np.nan, strategy="median")

Example 3: Take the Most Frequent value of the Column

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

missing_values: The placeholder that marks a missing value. pandas represents missing numeric values as NaN, so we set missing_values to np.nan.

strategy: How the replacement value is computed for each column: "mean", "median", or "most_frequent".
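To see how the three strategies differ, here is a small sketch on a made-up column (the values are only for illustration). The known values are 1, 2, 2 and 7, so the mean is 3.0 while the median and the most frequent value are both 2.0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (made-up values) whose last entry is missing.
column = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the fill value and applies it in one call;
    # [-1, 0] picks the entry that was missing.
    filled[strategy] = float(imputer.fit_transform(column)[-1, 0])

print(filled)
```

The mean is pulled up by the outlier 7, which is one reason the median is sometimes preferred for skewed columns.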

Go back to the example with Data.csv.

  • I will replace the NaN values in each column with that column's mean.
  • The two columns to impute are column 1 and column 2 (indices start at 0).

Source code:

from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Select columns 1 and 2 (zero-based), the ones that contain missing values.
X = data.iloc[:, [1, 2]]
Setup Missing data column

Step 3: Fit data

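Fit the SimpleImputer on your data. With strategy="mean", this step computes the mean of each selected column, ignoring the NaN values.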

imputer = imputer.fit(X)
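After fitting, you can inspect the values the imputer has learned through its statistics_ attribute. A minimal sketch, using a made-up X in place of the real columns from Data.csv (the column names and numbers are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the two numeric columns of Data.csv.
X = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X)

# statistics_ holds one learned fill value per column, in column order.
learned_means = imputer.statistics_
print(learned_means)
```

These values should match X.mean(), because pandas also skips NaN when it computes a column mean.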

Step 4: Transform data

Replace the missing values with the values computed in the fit step. Note that transform returns a NumPy array, not a DataFrame.

X = imputer.transform(X)
Fit Transform Missing Value
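Steps 3 and 4 can also be combined with fit_transform, and the result can be written back into the original DataFrame. This is only a sketch under assumptions: the DataFrame is a made-up stand-in for Data.csv, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for Data.csv; column names and values are assumptions.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Look up the names of columns 1 and 2 (zero-based).
cols = data.columns[[1, 2]]

# fit_transform returns a NumPy array, so assign it back to the same columns.
data[cols] = imputer.fit_transform(data[cols])

print(data)
```

Writing the result back keeps the other columns (here Country) next to the imputed ones, which is usually what you want before the next pre-processing step.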

3. Conclusion

Handling missing data is an important step in cleaning a dataset. Do not skip it: a well-chosen imputation strategy usually helps the model perform better.
