Mastering Missing Data: Four Effective Imputation Techniques

Chapter 1: Understanding Missing Data

If you have delved into data analysis or data science, you are likely aware that approximately 80% of the effort revolves around data collection and preparation, while only 20% is dedicated to actual data analysis and model development. This principle holds true even if you haven’t yet engaged in practical applications. A pivotal aspect of data preparation involves managing missing values.

In the realm of data analysis, encountering a dataset devoid of missing values is quite rare. If you do come across such an anomaly, it is likely because the data has been previously cleaned, or the dataset is small and straightforward, making it easier to identify and rectify missing entries. When dealing with larger datasets, dropping incomplete observations may not be feasible if every entry is significant. In such cases, employing methods for imputing missing values becomes essential.
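To make the cost of dropping observations concrete, here is a minimal sketch on a toy frame (the column names echo the insurance data, but the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with scattered missing values (illustrative only):
toy = pd.DataFrame({'age': [19, np.nan, 28, 33],
                    'bmi': [27.9, 33.8, np.nan, 22.7],
                    'charges': [16884.9, 1725.6, 4449.5, np.nan]})

# Listwise deletion keeps only fully observed rows:
complete = toy.dropna()
lost = 1 - len(complete) / len(toy)
print(f'Rows lost to dropna: {lost:.0%}')  # 3 of 4 rows have a gap somewhere
```

Even though each column is only 25% missing, three quarters of the rows disappear — which is exactly why imputation is often preferable.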

Let’s envision a practical scenario. After executing your database queries, merging tables, and performing pivot operations, you end up with a final table ready for analysis. Yet, you cannot commence your analysis or model building until you address the missing data.

This article presents four techniques for imputing missing values, along with Python code examples. For demonstration purposes, we will utilize the insurance.csv dataset, which is readily available on Kaggle. If you wish to follow along, please download the dataset and use Google Colab.

# Import libraries:
import pandas as pd
import numpy as np
import random

# Import and load dataset:
df = pd.read_csv('/content/insurance.csv')
df

Dataset overview

Upon initial visual inspection, it appears that our dataframe contains no missing values. We can confirm this by executing the following command:

# Check for NAN values in dataframe:
print(df.isnull().values.any(), df.isnull().sum().sum())

Data validation check

To proceed, we need to artificially introduce missing values into our dataframe. We will insert 10% of missing values while ensuring that no single row is entirely filled with NAN values. Keep in mind that our dataframe is structured as an algebraic matrix with i rows and j columns.

# Iterations with i (rows) and j (columns):
ij = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]

Next, we will randomly insert NAN values into various locations:

# Insert NAN values:
for row, col in random.sample(ij, int(round(.1 * len(ij)))):
    df.iat[row, col] = np.nan

If we check our dataframe once more, we should see approximately 10% of entries as NAN:

Dataframe after inserting NAN values

To validate the presence of NAN values across the entire dataset, we can execute the following command:

# Check NAN in data frame:
print(df.isnull().values.any(), df.isnull().sum().sum())

Summary of missing data

To assess NAN values in specific columns, we can check each relevant column with these commands:

print('Age NAN: ', df['age'].isnull().sum())
print('Sex NAN: ', df['sex'].isnull().sum())
print('BMI NAN: ', df['bmi'].isnull().sum())
print('Children NAN: ', df['children'].isnull().sum())
print('Smoker NAN: ', df['smoker'].isnull().sum())
print('Region NAN: ', df['region'].isnull().sum())
print('Charges NAN: ', df['charges'].isnull().sum())

Analysis of missing values by column
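As an aside, the seven print statements above can be collapsed into a single pandas call; a toy frame stands in for the insurance data here:

```python
import pandas as pd
import numpy as np

# Toy stand-in for df (illustrative values only):
toy = pd.DataFrame({'age': [19, np.nan, 28], 'bmi': [27.9, 33.8, np.nan]})

# One call returns the NAN count for every column at once:
counts = toy.isnull().sum()
print(counts)
```

On the real dataframe, `df.isnull().sum()` gives the same per-column breakdown in one line.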

Now that we have introduced NAN values, we can begin the process of replacing them with suitable alternatives.

  1. Using a Constant Value

    One of the most straightforward methods for imputing missing values is to utilize a constant value based on the analyst's understanding of the problem. For instance, to replace all NAN values in the 'children' column with 0, you can use the following code:

# Fill NAN with 0 in children column using pandas:
df['children'] = df['children'].fillna(0)

Alternatively, the same fill can be written with pandas' replace and np.nan:

# Fill NAN with 0 in children column using replace:
df['children'] = df['children'].replace(np.nan, 0)

After executing the above code, you can verify the changes in the 'children' column:

Result after filling NAN with a constant
  2. Imputing with Mean or Mode

    The next technique involves using the mean for numerical variables or the mode for categorical variables. Here, we will address missing values for the 'bmi' (numerical) and 'region' (categorical) columns.

To start, compute the mean for the 'bmi' column:

# Get mean for column bmi:
print('Mean BMI is: ', df["bmi"].mean())

Mean BMI calculation

Now, replace the NAN values in the 'bmi' column accordingly:

# Replace NAN values with the computed mean (avoids hardcoding a stale constant):
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
print('BMI NAN: ', df['bmi'].isnull().sum())

Updated BMI column without NAN

Next, to handle missing values for categorical variables, determine the mode:

# Get mode for column region:
print('Mode region is: ', df["region"].mode())

Mode calculation for region

Then, fill the NAN values in the 'region' column:

# Replace NAN values with the mode (mode() returns a Series, so take the first entry):
df['region'] = df['region'].fillna(df['region'].mode()[0])
print('Region NAN: ', df['region'].isnull().sum())

Updated region column without NAN
  3. Random Value Imputation

    Another approach is to generate random values based on the observed distribution to replace NAN entries. This method requires careful consideration of the distribution's nature (normal vs. non-normal).

To determine if the 'age' variable follows a normal distribution, we can visualize it:

import seaborn as sns

# Build and display histogram:
sns.histplot(data=df["age"], kde=False)

Age distribution histogram

Observing the histogram, we can see that the data does not follow a normal distribution.
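The visual check can be backed by a formal test such as D'Agostino-Pearson (`scipy.stats.normaltest`, assuming scipy is available). On the real data you would pass `df['age'].dropna()`; here a synthetic, roughly uniform sample stands in:

```python
import numpy as np
from scipy import stats

# Synthetic ages, roughly uniform like the histogram above (illustrative only):
rng = np.random.default_rng(0)
ages = rng.integers(18, 65, size=1000).astype(float)

# Null hypothesis: the sample comes from a normal distribution.
stat, p = stats.normaltest(ages)
print(f'p-value: {p:.3g}')  # a small p-value rejects normality
```

A p-value below the usual 0.05 threshold agrees with the eyeball verdict: treat the variable as non-normal.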

Non-Normal Distribution:

Since the distribution is non-normal, we can generate random values between the minimum and maximum observed values. First, we check the minimum and maximum values:

# Get maximum and minimum values:
print('Min age: ', df["age"].min())
print('Max age: ', df["age"].max())

Age value range

Next, we define a function to generate values between 18 and 64 and replace the NAN entries:

# Create a function to generate random values and replace NAN:
def fill_missing(value):
    if np.isnan(value):
        # randint's upper bound is exclusive, so use 65 to include age 64
        value = np.random.randint(18, 65)
    return value

Now, we can apply this function to the 'age' column:

# Replace NAN values with random generated values:
df['age'] = df['age'].apply(fill_missing)
print('Age NAN: ', df['age'].isnull().sum())

Updated age column without NAN

Normal Distribution:

For normally distributed data, we need to exercise caution when generating random values to avoid skewing the distribution. First, we compute the mean and standard deviation:

# Get mean and standard deviation:
mean = df['age'].mean()
std = df['age'].std()

Mean and standard deviation of age

Next, we define a function to generate values based on a normal distribution:

# Create a function to generate random values and replace NAN:
def fill_missing_normal(value):
    if np.isnan(value):
        # Omitting the size argument returns a scalar rather than a length-1 array
        value = np.random.normal(mean, std)
    return value

Finally, we apply this function to the 'age' column:

# Replace NAN values with random generated values:
df['age'] = df['age'].apply(fill_missing_normal)
print('Age NAN: ', df['age'].isnull().sum())

Updated age column with normal distribution values
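One caveat with normal draws: they can land outside the plausible range (a negative age, say, when the standard deviation is large). Clipping the draws to the observed bounds is a common safeguard; the moments and bounds below are illustrative stand-ins for the real column's statistics:

```python
import numpy as np

# Illustrative moments and bounds for an age-like variable:
mean, std, lo, hi = 39.2, 14.0, 18, 64
rng = np.random.default_rng(42)

draws = rng.normal(mean, std, size=1000)
clipped = np.clip(draws, lo, hi)  # force every draw into [18, 64]
print(clipped.min() >= lo, clipped.max() <= hi)
```

In `fill_missing_normal`, the equivalent one-liner would be `np.clip(np.random.normal(mean, std), 18, 64)`.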
  4. Linear Regression for Continuous Variables

    The last method involves utilizing linear regression to predict missing values for a continuous variable, specifically the 'charges' column. Initially, we need to remove any rows with NAN values, as linear regression models do not accept such entries.

# Drop NAN:
df_lm = df.dropna()

Next, we can create a Linear Regression model:

from sklearn.linear_model import LinearRegression

# Define X and y, dropping non-numerical variables:
X = df_lm.drop(['charges', 'sex', 'region', 'smoker'], axis=1)
y = df_lm['charges']

# Fit the model:
lm = LinearRegression().fit(X, y)
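Before trusting the imputations, it is worth checking how well the model actually fits. On the real data you would simply call `lm.score(X, y)`; the sketch below uses synthetic stand-ins for the three numeric features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the age, bmi and children features (illustrative only):
rng = np.random.default_rng(0)
X = rng.uniform(18, 64, size=(200, 3))
y = 250 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 50, size=200)

lm = LinearRegression().fit(X, y)
r2 = lm.score(X, y)  # R^2 on the training data
print(f'R^2: {r2:.3f}')
```

A low R² would mean the regression is essentially imputing noise, and one of the simpler techniques above might serve just as well.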

Now, similar to the previous sections, we create a function to replace missing values. Because the model predicts from a row's features, this function operates on whole rows rather than on single values:

# Define function:
def fill_missing_lm(row):
    if np.isnan(row['charges']):
        # Predict from the numeric features the model was trained on
        features = row[['age', 'bmi', 'children']].astype(float).to_frame().T
        row['charges'] = lm.predict(features)[0]
    return row

Finally, apply the function row-wise to the dataframe:

# Replace NAN values with predictions from the linear model:
df = df.apply(fill_missing_lm, axis=1)
print('Charges NAN: ', df['charges'].isnull().sum())

Updated charges column without NAN
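As a closing sanity check, a single call confirms that the whole frame is now NAN-free; a tiny, already-filled toy frame stands in for the real df here:

```python
import pandas as pd

# Toy frame whose gaps have already been filled (illustrative values only):
toy = pd.DataFrame({'age': [19.0, 41.0], 'charges': [16884.9, 8240.6]})

# Total NAN count across every column; zero means all four passes succeeded:
remaining = toy.isnull().sum().sum()
print('Total NAN values remaining:', remaining)
```

On the real data, `df.isnull().sum().sum()` should likewise come back as 0.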

The video "Dealing with Missing Data in R" provides additional insights into handling missing values in datasets, showcasing practical examples and techniques.

The video "18. Missing Data" offers a comprehensive overview of various strategies for addressing missing data in your analysis.

Thank you for reading! I welcome any suggestions for additional techniques. If you enjoyed this article, please follow me to stay updated on future publications.
