Deep Dive
20 December 2023
Filling in the Blanks: Missing Value Treatment in ML

Introduction

Missing values are a common issue in real-world datasets, and handling them appropriately is crucial for accurate and reliable machine learning models. In this blog, we will explore various methods for handling missing values in machine learning, along with detailed explanations of each approach.

Identifying the Type of Missing Values

Before delving into different methods, it is essential to identify and understand the patterns of missing values in the dataset. Values can be missing completely at random (MCAR, where the missingness is unrelated to any data), missing at random (MAR, where the missingness depends only on observed values), or missing not at random (MNAR, where the missingness depends on the unobserved value itself). Identifying which pattern applies gives insight into the potential reasons behind the missingness and helps determine which treatment methods are appropriate.
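As a first diagnostic step, a few pandas one-liners reveal how much is missing per feature and whether missingness in one column co-occurs with another, which can hint at MAR rather than MCAR (the small frame below is a hypothetical example):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 34, 41, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
    "city": ["NY", "LA", np.nan, "SF", "NY"],
})

# How many values are missing per feature, and what fraction?
print(df.isna().sum())
print(df.isna().mean())

# Correlated missingness across columns can hint that values are
# missing at random (MAR) rather than completely at random (MCAR)
print(df.isna().astype(int).corr())
```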

Deletion

The simplest approach is to remove data points (rows) or features (columns) with missing values from the dataset before training the model. This technique is known as deletion. There are two primary methods: list-wise deletion (also known as complete-case analysis), which drops every row containing at least one missing value, and pairwise deletion, which computes each statistic using only the rows where the relevant variables are present. While deletion is straightforward, it has drawbacks: discarding data points can cause a significant loss of information, especially when many values are missing, which can reduce the accuracy and generalizability of the trained model.
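Both flavors of deletion are easy to demonstrate with pandas (the toy frame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 34, 41],
    "income": [50_000.0, 62_000.0, np.nan, 58_000.0],
})

# List-wise (complete-case) deletion: drop any row with a missing value
complete_cases = df.dropna()

# Pairwise deletion: each pairwise statistic is computed from the rows
# available for that pair of columns; pandas' corr() does this by default
pairwise_corr = df.corr()
```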

Imputation

Imputation refers to the process of estimating and filling in missing values based on available information.

Various imputation techniques exist, including:

Mean/Median Imputation

This method replaces the missing values with the mean or median of that particular feature. While simple, this method assumes that the missing values are missing at random and may not be suitable for categorical variables.
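In practice this is a one-liner with scikit-learn's SimpleImputer; the toy array below is purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan],
              [5.0, 14.0]])

# Replace each NaN with its column mean; strategy="median" works the
# same way for the median
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```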

Mode Imputation

Mode imputation replaces missing categorical values with the mode (most frequent value) of that feature. Similar to mean/median imputation, this technique assumes randomness in the missingness.
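The same SimpleImputer handles categorical columns with the "most_frequent" strategy (the color column below is a made-up example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"color": ["red", "blue", np.nan, "red", "red"]})

# Fill missing categories with the most frequent value of the column
imputer = SimpleImputer(strategy="most_frequent")
df[["color"]] = imputer.fit_transform(df[["color"]])
```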

Regression Imputation

Regression imputation involves using a regression model to predict missing values based on other variables. This approach can capture the relationships between variables, but it assumes a linear relationship.
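As a sketch of the idea (the toy frame and column names are our own), we fit a linear model on the complete cases and predict the holes:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: "income" is partly missing and roughly linear in "age"
df = pd.DataFrame({
    "age":    [25, 30, 35, 40, 45, 50],
    "income": [40.0, 50.0, np.nan, 70.0, np.nan, 90.0],
})

observed = df["income"].notna()

# Fit the regression on complete cases only...
model = LinearRegression()
model.fit(df.loc[observed, ["age"]], df.loc[observed, "income"])

# ...then predict the missing values from the observed predictor
df.loc[~observed, "income"] = model.predict(df.loc[~observed, ["age"]])
```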

Multiple Imputation

Multiple imputation creates several imputed datasets using a statistical model and combines them to address uncertainty in imputations. It provides more accurate results compared to single imputation methods.
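One way to sketch this with scikit-learn is to run IterativeImputer several times with sample_posterior=True, so each run draws imputations from a posterior predictive distribution rather than using point estimates (note the imputer is still flagged experimental and must be enabled explicitly):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::7, 0] = np.nan  # knock out some values

# Draw several imputed datasets; different seeds plus posterior sampling
# give the between-imputation variability that multiple imputation needs
imputations = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputations.append(imp.fit_transform(X))

# Downstream, the model is fit on each dataset and the results pooled
# (e.g. averaging estimates and widening intervals per Rubin's rules)
```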

K-nearest Neighbors (KNN) Imputation

The KNN method imputes missing values by finding the K nearest neighbors and using their values to fill in the missing ones. It considers the non-missing values in other features and finds similar instances.
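scikit-learn's KNNImputer implements exactly this; in the illustrative array below, the missing entry is filled from its two nearest neighbors, with distances computed on the features that are present:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [8.0, 8.0]])

# Each NaN is replaced by the mean of that feature over the K nearest
# neighbours, found using only the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```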

Advanced Imputation Techniques

While the previously mentioned imputation techniques are commonly used, there are also advanced methods that can be effective in specific scenarios.

These include:

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative approach for handling missing values in machine learning. It aims to estimate the missing values and the model parameters simultaneously, iterating between two steps: Expectation and Maximization.
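A minimal sketch of this idea for a multivariate-normal model is shown below. The function name em_impute is our own, and the sketch omits the conditional-covariance correction used in the full EM update, so it is illustrative rather than a complete implementation:

```python
import numpy as np

def em_impute(X, n_iter=50):
    """Simplified EM for missing data under a multivariate-normal model:
    alternate between filling missing entries with their conditional
    means given the current parameters (E-step) and re-estimating the
    mean and covariance from the filled-in data (M-step)."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)

    # Initialize with column-mean imputation
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        mu = X.mean(axis=0)                      # M-step: update parameters
        cov = np.cov(X, rowvar=False)
        for i in range(X.shape[0]):              # E-step: conditional means
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            reg = cov[np.ix_(m, o)] @ np.linalg.pinv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + reg @ (X[i, o] - mu[o])
    return X
```

Because the E-step uses the conditional mean only, the covariance estimate is slightly understated; full EM adds the conditional covariance of the missing entries in the M-step.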

Deep Learning Imputation

Deep learning models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), can be employed to impute missing values using their inherent ability to learn complex patterns and generate plausible values.

Summary of Pros and Cons

Deletion (complete case)
  Pros:
  • Simple and easy to implement
  • Preserves the inherent distribution of the remaining data
  Cons:
  • Reduces sample size
  • May result in biased models
  • Removes potentially valuable information

Deletion (pairwise)
  Pros:
  • Retains all available information
  • Does not reduce sample size
  Cons:
  • Can introduce bias if data is not missing completely at random
  • Increases computational complexity by requiring multiple analyses

Mean/Median Imputation
  Pros:
  • Quick and straightforward
  • Preserves sample size
  Cons:
  • May distort variance and covariance
  • Assumes values are missing completely at random

Mode Imputation
  Pros:
  • Applicable to categorical variables
  • Preserves sample size
  Cons:
  • Assumes values are missing completely at random
  • May introduce bias by over-representing the mode

Regression Imputation
  Pros:
  • Utilizes relationships between variables
  • Preserves sample size
  Cons:
  • Assumes a linear relationship between variables
  • May introduce errors if relationships are weak

Multiple Imputation
  Pros:
  • Captures uncertainty in the imputations
  • Accounts for complex relationships between variables
  Cons:
  • Requires iterative implementation
  • Increases computational complexity

K-nearest Neighbors (KNN) Imputation
  Pros:
  • Considers non-missing values in other features
  • Preserves sample size
  Cons:
  • Highly dependent on the choice of K
  • Computationally expensive for large datasets

Expectation-Maximization
  Pros:
  • Accounts for complex dependencies between features
  • Estimates missing values and model parameters simultaneously
  Cons:
  • Requires iterative estimation of missing values
  • Sensitive to initial parameter estimates

Deep Learning Imputation
  Pros:
  • Captures complex patterns and relationships
  • Flexible in handling various types of missing values
  Cons:
  • Requires a large amount of training data
  • May require significant computational resources

Conclusion

Handling missing values in machine learning is imperative to ensure accurate and reliable models. It is vital to identify the pattern of missingness before selecting an appropriate method for handling them. While deletion and basic imputation techniques provide simple solutions, advanced imputation methods and deep learning approaches can capture complex dependencies and generate more robust imputations. Choosing the appropriate method depends on the nature of the missing values and the dataset at hand.
