Dealing with Missing Data

Author: Matthew Miller


Missing data are, unfortunately, a common occurrence. According to Dillon Niederhut in his paper Safe handling instructions for missing data, this causes two main problems. When data are missing from a feature (or variable), estimates of the variance of that feature, and any tests that use those variances, become unreliable. In addition, when data are missing according to a pattern, model parameters that are learned from the remaining data become biased by that pattern (see Dillon’s paper for more information). Plus, on a more practical note, most statistical tests and machine learning algorithms are not built to handle missing values. Missing data, then, must be dealt with in some way, but the recommended methods depend on how the data came to be missing.

Throughout the rest of this post, I will discuss the different types of missing data, some ways to help distinguish between them (with coded examples), and how to deal with them (with coded examples). The coded examples are in R, but I link to appropriate Python libraries when applicable.


Missing Value Types

The common framework for distinguishing between types of missing data was introduced by Donald B. Rubin in his 1976 paper Inference and missing data, and is discussed in the greater context of missing data analysis in Roderick J. A. Little and Donald B. Rubin’s book Statistical Analysis with Missing Data. It distinguishes three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) or Non-Ignorable (NI). All of these types represent data that shouldn’t be missing but are. The implications that each can have on your analysis are important, and each must be dealt with in a different way.

In contrast, there is data that Little and Rubin did not consider to be “real” missing data, and it doesn’t need to be worried about in the same way as the other three types. This is often called “structurally missing” data: categorical data where the missingness represents an additional category. For example, a participant taking a multiple-choice questionnaire might not answer a question because the category the participant wants to choose is not among the given choices, or because the question itself logically doesn’t apply to them. This type of missingness will not be discussed further.

Missing Completely at Random (MCAR)

Missing data is considered MCAR when the reason the data is missing (its “missingness”) is unrelated both to the values of the variable that contains the missing data and to any other variables included in the dataset. For example, if data is missing because of some accident, like a mechanical/software error, corrupted data, accidental deletion by a human, etc., then it’s likely to be MCAR.

Missing at Random (MAR)

Missing data is considered MAR when its missingness is related to some other variable in the dataset, but unrelated to the values of the variable itself. For example, if income data is missing more often for people with higher levels of education than for people with low or mid-level education, then it is likely to be MAR.

Missing Not at Random (MNAR) or Non-Ignorable (NI)

Missing data is considered MNAR when its missingness is related to the values of the variable itself. For example, if a question in a survey asks if someone has depression, and it is known (perhaps from past studies) that someone who is very depressed is less likely to answer that question, then it is likely to be MNAR.
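
To make the three mechanisms concrete, here is a minimal base R sketch that imposes each of them on simulated data. Nothing in it comes from a real dataset: the education and income variables, the 30% missingness rate, and the logistic missingness probabilities are all assumptions chosen only for illustration.

```r
# Simulated data; every name and number below is illustrative.
set.seed(42)
n <- 1000
education <- rnorm(n, mean = 14, sd = 3)          # years of schooling
income <- 20 + 2 * education + rnorm(n, sd = 5)   # income depends on education

# MCAR: every income value has the same 30% chance of going missing.
miss_mcar <- runif(n) < 0.3

# MAR: the chance of a missing income rises with education (an observed variable).
miss_mar <- runif(n) < plogis(-1 + 0.8 * scale(education)[, 1])

# MNAR: the chance of a missing income rises with income itself.
miss_mnar <- runif(n) < plogis(-1 + 0.8 * scale(income)[, 1])

income_mcar <- ifelse(miss_mcar, NA, income)
income_mar  <- ifelse(miss_mar,  NA, income)
income_mnar <- ifelse(miss_mnar, NA, income)

# Mean of the values that remain observed, next to the true mean.
observed_means <- c(
  true = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar,  na.rm = TRUE),
  mnar = mean(income_mnar, na.rm = TRUE)
)
print(round(observed_means, 2))
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means are pulled downward because higher values are dropped more often. The difference between the last two is that the MAR bias can in principle be corrected using education, which is fully observed, whereas the MNAR bias depends on values that were never recorded.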


Distinguishing Between Types

In practice the MCAR pattern is the only pattern that is directly testable, and I will discuss the two most common methods for MCAR testing next. While the MAR pattern cannot be tested directly, one of the MCAR tests used can indirectly help to establish evidence for the MAR pattern.

Also, a note on the data used from here on out. The data is called freetrade and is available in R through the Amelia package. It contains economic and political data on nine developing countries in Asia from 1981 to 1999. It has 171 observations and 10 variables: year, country, average tariff rates, Polity IV score, total population, gross domestic product per capita, gross international reserves, a dummy variable for whether the country had signed an IMF agreement in that year, a measure of financial openness, and a measure of US hegemony.

library(Amelia)
library(tidyverse)
data(freetrade)

# I exclude the variable "signed", treating it as the outcome variable
freetrade <- freetrade %>%
  select(-signed)

summary (freetrade)
##       year        country              tariff           polity      
##  Min.   :1981   Length:171         Min.   :  7.10   Min.   :-8.000  
##  1st Qu.:1985   Class :character   1st Qu.: 16.30   1st Qu.:-2.000  
##  Median :1990   Mode  :character   Median : 25.20   Median : 5.000  
##  Mean   :1990                      Mean   : 31.65   Mean   : 2.905  
##  3rd Qu.:1995                      3rd Qu.: 40.80   3rd Qu.: 8.000  
##  Max.   :1999                      Max.   :100.00   Max.   : 9.000  
##                                    NA's   :58       NA's   :2       
##       pop                gdp.pc           intresmi          fiveop     
##  Min.   : 14105080   Min.   :  149.5   Min.   :0.9036   Min.   :12.30  
##  1st Qu.: 19676715   1st Qu.:  420.1   1st Qu.:2.2231   1st Qu.:12.50  
##  Median : 52799040   Median :  814.3   Median :3.1815   Median :12.60  
##  Mean   :149904501   Mean   : 1867.3   Mean   :3.3752   Mean   :12.74  
##  3rd Qu.:120888400   3rd Qu.: 2462.9   3rd Qu.:4.4063   3rd Qu.:13.20  
##  Max.   :997515200   Max.   :12086.2   Max.   :7.9346   Max.   :13.20  
##                                        NA's   :13       NA's   :18     
##      usheg       
##  Min.   :0.2558  
##  1st Qu.:0.2623  
##  Median :0.2756  
##  Mean   :0.2764  
##  3rd Qu.:0.2887  
##  Max.   :0.3083  

MCAR Tests

First, it must be said that none of the methods discussed below can determine the type of missing data with 100% accuracy. Analysts who decide to use these tests to help with deciding on how to handle missing data should think about the variables and use their knowledge of the data in conjunction with these tests.

Likely the most common way to test for MCAR missing data is to create dummy variables for the variables that contain missing data (where these dummy variables mark when data is missing or not) and then perform multiple t-tests (continuous data) and/or chi-square tests (categorical data) between the dummy variables and other variables to see if the missingness is related to the values of the other variables. If these tests show that the missingness of variables with missing values is related to the values of the other variables, then this lends indirect evidence for MAR data. In R, most common statistical tests are provided in the base stats package. In Python, they can be found in the statsmodels library.

# Add a logical "<column>_missing" dummy for every column that contains NAs
freetrade_withDummies <- freetrade %>%
  mutate_if(.predicate = ~ any(is.na(.)),
            .funs = list(missing = ~ if_else(is.na(.), TRUE, FALSE)))

x_arg_num_cols <- freetrade %>%
  select(-fiveop) %>%
  select_if(.predicate = is.numeric) %>%
  colnames()

# tidy_t.test is my personal function (not defined in this post)
tidy_t.test(col_list = x_arg_num_cols,
            df = freetrade_withDummies,
            label = "fiveop_missing")

x_arg_notNum_cols <- freetrade %>%
  select(-fiveop) %>%
  select_if(.predicate = is.character) %>%
  colnames()

# tidy_chiSqTest is my personal function (not defined in this post)
tidy_chiSqTest(x_arg_notNum_cols, freetrade_withDummies, "fiveop_missing")
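
Since the two helper functions above are personal functions, here is a rough base R sketch of the same idea using only t.test() and chisq.test() from the stats package. It is not the code behind tidy_t.test() or tidy_chiSqTest(). The data is simulated, and the column names (gdp, pop, country, fiveop) are only stand-ins that loosely echo freetrade.

```r
# Simulated stand-in data; column names and values are hypothetical.
set.seed(1)
n <- 200
df <- data.frame(
  gdp     = rnorm(n, mean = 7, sd = 1),
  pop     = rnorm(n, mean = 17, sd = 1.5),
  country = sample(c("A", "B", "C"), n, replace = TRUE),
  fiveop  = rnorm(n, mean = 12.7, sd = 0.3)
)

# Impose missingness on fiveop that depends on gdp.
df$fiveop[runif(n) < plogis(-1.5 + df$gdp - 7)] <- NA

# Dummy variable marking where fiveop is missing.
df$fiveop_missing <- is.na(df$fiveop)

# Welch t-test of each numeric column, split by the missingness dummy.
num_cols <- c("gdp", "pop")
t_results <- do.call(rbind, lapply(num_cols, function(col) {
  tt <- t.test(df[[col]][df$fiveop_missing], df[[col]][!df$fiveop_missing])
  data.frame(variable = col,
             t_statistic = unname(tt$statistic),
             p_value = tt$p.value)
}))
print(t_results)

# Chi-square test of a categorical column against the missingness dummy.
chi <- chisq.test(table(df$country, df$fiveop_missing))
print(chi$p.value)
```

Because the missingness here was made to depend on gdp, one would expect the gdp test to show a clear difference between the two groups, which is the kind of indirect evidence for MAR described above.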

The other most often cited test for MCAR is called “Little’s MCAR Test”. This test uses an EM algorithm to estimate the means and covariances of the data and then performs a chi-square test (see the Statistics How To article on the EM algorithm in the Works Cited for a plain-language introduction to how the algorithm works). A word of caution though: as that article states, the EM algorithm can be very slow, and it works best when the percentage of missing data is small and the dimensionality of the data isn’t too big. For R, an implementation of Little’s MCAR Test can be found in the BaylorEdPsych package by A. Alexander Beaujean. For Python, the only implementation I know of can be found in the Impyute package by Elton Law, but the package is still in its infancy and its mcar_test() function is still a work-in-progress (WIP).

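# LittleMCAR() needs numeric input, so country is recoded as an integer
freetrade_cleaned <- freetrade %>%
  mutate(country = (as.factor(country) %>%
                      as.numeric()))

# findCorrelation() is from the caret package
cor_vars_to_drop <-
  findCorrelation(cor(freetrade_cleaned,
                      use = "pairwise.complete.obs"),
                  cutoff = 0.5)

# LittleMCAR() is from the BaylorEdPsych package
test_output <-
  LittleMCAR(freetrade_cleaned %>%
               select_at(.vars = -cor_vars_to_drop))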
## this could take a while
test_output [-6]
## $chi.square
## [1] 88.00145
## $df
## [1] 27
## $p.value
## [1] 2.192608e-08
## $missing.patterns
## [1] 6
## $amount.missing
##                 country     tariff     polity gdp.pc    intresmi
## Number Missing        0 58.0000000 2.00000000      0 13.00000000
## Percent Missing       0  0.3391813 0.01169591      0  0.07602339
##                     fiveop usheg
## Number Missing  18.0000000     0
## Percent Missing  0.1052632     0

The null hypothesis (H0) is that the data is MCAR, so a non-significant p-value is consistent with the data being MCAR and suggests that listwise deletion is a possibility. In our case, the p-value is significant, lending evidence to the data not being MCAR.
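
As a quick check on how the reported p-value relates to the other two numbers, the upper tail of the chi-square distribution can be computed directly with pchisq() from base R. The statistic and degrees of freedom are copied from the output above. The 0.05 threshold is my assumption, not something the test prescribes.

```r
# Values copied from the LittleMCAR() output shown above.
chi_square <- 88.00145
deg_free <- 27

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_square, df = deg_free, lower.tail = FALSE)
print(p_value)

# Assumed significance threshold; pick your own as appropriate.
alpha <- 0.05
reject_mcar <- p_value < alpha
print(reject_mcar)
```

The printed p-value should agree with the $p.value element of the output, and rejecting H0 here means rejecting the hypothesis that the data is MCAR.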

However, both methods are not without their problems. As discussed by Craig Enders in his book Applied Missing Data Analysis (see pages 17-21 of the sample chapter listed in the Works Cited), running multiple statistical tests has the potential to produce many correlated statistics that can lead to multiple-comparison problems, so it may be worthwhile to use one of a variety of correction methods. In addition, small group sizes can decrease power, so it may be useful to use a measure of effect size such as Cohen’s d. With regard to Little’s MCAR Test, the practical problems are that it doesn’t identify the specific variables that violate MCAR, and it tends to have low power when the number of variables that violate MCAR is small, the relationship between the data and missingness is weak, or the data are MNAR (raising the likelihood of Type II errors).
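
Both suggestions are easy to try in base R. The sketch below uses p.adjust() from the stats package for the correction and a small hand-written cohens_d() helper for the effect size. The helper is my own illustrative function, not part of any package, and the p-values are invented purely to show the mechanics.

```r
# Made-up raw p-values standing in for several missingness t-tests.
p_raw <- c(var_a = 0.012, var_b = 0.030, var_c = 0.200, var_d = 0.004)

# Holm and Benjamini-Hochberg corrections from base R.
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")
print(rbind(raw = p_raw, holm = p_holm, bh = p_bh))

# Cohen's d with a pooled standard deviation (illustrative helper).
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Example: values of some variable where another variable is missing
# versus where it is observed.
d <- cohens_d(c(1, 2, 3), c(2, 3, 4))
print(d)
```

Unlike a p-value, Cohen's d does not shrink just because one of the groups is small, which is why it can be a useful companion to the t-tests when few rows are missing.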

(Note: For more information on the output values of the LittleMCAR() function, please follow this link. Also, for a technical explanation of how the function works, please refer to the “Little’s MCAR Test” section of Craig Enders’ paper. In addition, the source code for the LittleMCAR() function can be found here.)


Handling Missing Values

The most common way that researchers have dealt with missing data has been to remove observations with missing values (listwise deletion) or replace missing data with a measure of central tendency for a given feature. From my experience, this seems to be the way most machine learning practitioners handle the problem as well, and many articles on the subject aimed at machine learning practitioners recommend these simple approaches (although more and more seem to suggest using algorithms to impute the missing data). However, the academic literature on missing data that I reviewed all comes to the same conclusion: unless the MCAR assumption is met, these simple fixes will introduce bias into a model. The current techniques that are considered the “practical state of the art” are what are known as maximum likelihood (ML) and Bayesian multiple imputation (MI), and a fairly technical summary of those methods can be found in the paper “Missing Data: Our View of the State of the Art” by Joseph L. Schafer and John W. Graham. None of these methods, however, are implemented in the common packages used for machine learning in either R or Python, so I will not be discussing those methods further. What follows are some examples of what can be most easily accomplished using the most common machine learning packages.


If it can be determined that the data is MCAR, it is possible to simply delete the observations, or rows, that contain missing values, without biasing your data. For most statistical functions in R, this is the default option, but the default is to do nothing in Python. The safe option is to simply handle missing values yourself, which can easily be done in both R (tidyr package) and Python (pandas library) using drop_na() and dropna() respectively.

dim (freetrade)
## [1] 171   9
# R (tidyr)
freetrade_droppedNA <- freetrade %>% drop_na()

dim (freetrade_droppedNA)
## [1] 97  9

However, even if the data is MCAR, and simply dropping rows becomes an option in terms of not biasing your data, it can still reduce the power of any statistical test to detect differences in your data due to the decreased sample size. In our example, the sample size was reduced from 171 rows to 97, ~57% of the original size! Unless very few rows will be deleted, listwise deletion shouldn’t be used.
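
To see R's default behaviour in action, the following base R sketch fits a linear model to simulated data with missing predictors and checks how many rows the model actually used. The data, the variable names, and the number of missing values are all assumptions made for the illustration.

```r
# Simulated data with missing values in two predictors (illustrative only).
set.seed(7)
n <- 171
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 0.5 * df$x1 + rnorm(n)
df$x1[sample(n, 40)] <- NA
df$x2[sample(n, 30)] <- NA

# Rows with no missing values at all.
n_complete <- sum(complete.cases(df))

# lm() uses na.action = na.omit by default, so incomplete rows are
# dropped silently before fitting.
fit <- lm(y ~ x1 + x2, data = df)

print(c(total_rows = n, complete_rows = n_complete, rows_used = nobs(fit)))
print(round(n_complete / n, 2))  # share of the data that survives
```

The model is fitted without any warning, so it is worth checking nobs() (or the degrees of freedom in summary()) to see how much data was quietly discarded.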


For MCAR and MAR data, some kind of imputation is your best bet. The simplest method, and likely the most common one, is to replace the missing values with the mean or median value of a variable. This can easily be done manually in R using mutate() (or one of its scoped variants mutate_all(), mutate_at(), or mutate_if()) from the dplyr package. In Python, the easiest method is to use the SimpleImputer class from scikit-learn.

# R (dplyr and tidyr)
freetrade_meanImpute <- freetrade %>%
  # replace numeric NAs with mean
  mutate_if(.predicate = is.numeric,
            .funs = ~ replace_na(., replace = mean(., na.rm = TRUE)))
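
Mean imputation keeps the sample size, but it comes at a cost that is easy to demonstrate. The base R sketch below uses simulated data (the variables x and y and the 150 deleted values are assumptions for illustration) to show what happens to the variance of the imputed variable and to its correlation with another variable.

```r
# Simulated pair of correlated variables (illustrative only).
set.seed(123)
n <- 500
x <- rnorm(n, mean = 50, sd = 10)
y <- 0.6 * x + rnorm(n, sd = 8)

# Delete 150 values of y completely at random (MCAR).
y_obs <- y
y_obs[sample(n, 150)] <- NA

# Mean imputation: every NA is replaced by the observed mean.
y_mean <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

var_observed <- var(y_obs, na.rm = TRUE)
var_imputed  <- var(y_mean)
cor_observed <- cor(x, y_obs, use = "complete.obs")
cor_imputed  <- cor(x, y_mean)

print(round(c(var_observed = var_observed, var_imputed = var_imputed), 2))
print(round(c(cor_observed = cor_observed, cor_imputed = cor_imputed), 3))
```

The mean itself is unchanged, but the variance shrinks because the imputed values all sit exactly at the mean, and the correlation with x is weakened because those imputed values carry no information about x. This is the variance problem described at the start of the post.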

In addition, along with simple median imputation, more advanced (and generally more accurate) model-based imputation methods can be done in R with the caret package’s preProcess() function. In Python, model-based imputation can be performed using the IterativeImputer class from scikit-learn (however, it is still experimental for now).

# Use "knnImpute" or "bagImpute" as the method to impute using a
# "k-nearest neighbors" or "bagged trees" model.
freetrade_knnImpute <- 
  preProcess (x = freetrade, method = "knnImpute")%>%
  predict (object =., newdata = freetrade)

freetrade_bagImpute <- 
  preProcess (x = freetrade, method = "bagImpute")%>%
  predict (object =., newdata = freetrade)

freetrade_medianImpute <- 
  preProcess (x = freetrade, method = "medianImpute")%>%
  predict (object =., newdata = freetrade)

Now, let’s run some diagnostic plots to get a sense for how well the four different imputation methods did:

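# Order must match plot_title_vector below
imputed_dfs_list <- list(freetrade_meanImpute,
                         freetrade_medianImpute,
                         freetrade_knnImpute,
                         freetrade_bagImpute)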

plot_title_vector <-paste (c ("Mean", "Median", "KNN", "Bag"), "Impute")

# Using knnImpute will automatically trigger preprocess to center and scale the
# data, so the original data must also be centered and scaled to be plotted
# properly.
freetrade_centeredScaled <-
  preProcess (freetrade)%>%
  predict (freetrade)
map2 (.x = imputed_dfs_list, .y = plot_title_vector, .f = function (df, title) {
  if (title == "KNN Impute") {
    ggplot () +
      geom_point (data = freetrade_knnImpute,
                 mapping = aes (x = tariff,
                               y = gdp.pc,
                               color = "Imputed Data")) +
      geom_point (data = freetrade_centeredScaled,
                 mapping = aes (x = tariff,
                               y = gdp.pc,
                               color = "Observed Data")) +
      labs (title = "Knn Impute",
           color = "Data Type") +
      theme (plot.title = element_text (hjust = 0.5))
  } else {
    ggplot () +
      geom_point (data = df,
                 mapping = aes (x = tariff,
                               y = gdp.pc,
                               color = "Imputed Data")) +
      geom_point (data = freetrade,
                 mapping = aes (x = tariff,
                               y = gdp.pc,
                               color = "Observed Data")) +
      labs (title = title,
           color = "Data Type") +
      theme (plot.title = element_text (hjust = 0.5))
  }
})

map2 (.x = imputed_dfs_list, .y = plot_title_vector, .f = function (df, title) {
  if (title == "KNN Impute") {
    ggplot () +
      geom_density (data = df,
                   mapping = aes (x = tariff,
                                 fill = "Imputed Data")) +
      geom_density (data = freetrade_centeredScaled,
                   mapping = aes (x = tariff,
                                 fill = "Observed Data",
                                 alpha = 0.25)) +
      labs (title = title,
           fill = "Data Type") +
      theme (plot.title = element_text (hjust = 0.5)) +
      guides (alpha = "none")
  } else {
    ggplot () +
      geom_density (data = df,
                   mapping = aes (x = tariff,
                                 fill = "Imputed Data")) +
      geom_density (data = freetrade,
                   mapping = aes (x = tariff,
                                 fill = "Observed Data",
                                 alpha = 0.25)) +
      labs (title = title,
           fill = "Data Type") +
      theme (plot.title = element_text (hjust = 0.5)) +
      guides (alpha = "none")
  }
})

As discussed by Stef van Buuren in his book Flexible Imputation of Missing Data in the section on diagnostics,

One of the best tools to assess the plausibility of imputations is to study the discrepancy between the observed and imputed data. The idea is that good imputations have a distribution similar to the observed data. In other words, the imputations could have been real values had they been observed. Except under MCAR, the distributions do not need to be identical, since strong MAR mechanisms may induce systematic differences between the two distributions. However, any dramatic differences between the imputed and observed data should certainly alert us to the possibility that something is wrong.

Given this, it seems clear that the model-based approaches have imputed values that are much more likely to have been real values had they been observed.
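
With real data the true values are never available, so plots like these are the best check on offer. On simulated data, though, the truth is known and the comparison can be made numerically. The sketch below is not the caret code used above: it swaps in plain regression imputation with lm() as a simple stand-in for a model-based method, and the gdp and tariff variables, the coefficients, and the MAR mechanism are all assumptions.

```r
# Simulated data where the true values are known (illustrative only).
set.seed(2024)
n <- 300
sim <- data.frame(gdp = rnorm(n))
sim$tariff_true <- 30 - 8 * sim$gdp + rnorm(n, sd = 4)

# MAR: tariff is more likely to be missing when gdp is high.
is_missing <- runif(n) < plogis(-1 + sim$gdp)
sim$tariff <- ifelse(is_missing, NA, sim$tariff_true)

# Mean imputation: one value for every missing row.
mean_fill <- mean(sim$tariff, na.rm = TRUE)

# Regression imputation: lm() is fitted on the complete rows only,
# then used to predict the missing tariffs from gdp.
fit <- lm(tariff ~ gdp, data = sim)
model_fill <- predict(fit, newdata = sim[is_missing, ])

# Root mean squared error against the true (normally unknown) values.
rmse <- function(estimate, truth) sqrt(mean((estimate - truth)^2))
truth <- sim$tariff_true[is_missing]
imputation_rmse <- c(mean = rmse(mean_fill, truth),
                     regression = rmse(model_fill, truth))
print(round(imputation_rmse, 2))
```

In this setup the regression-based values land much closer to the truth than the single mean value, because they make use of the relationship between the two variables.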

In addition to what I have shown here, an interesting experiment is detailed in the “Safe handling instructions for missing data” paper introduced at the start of this post. In it, Dillon Niederhut discusses how, because the machine learning community appears less concerned with statistical inference, they seem to feel relatively comfortable with the idea of using these simple fixes for missing data. However, he disagrees, so he set out to lend evidence to his assertion through running an experiment to compare the quality of different imputation techniques. Ultimately, his experiment showed that the typical simple fixes will produce biased results and increase a model’s prediction error. In addition, he performed a case-study, where model results on a complete dataset were compared to model results on the same data where a MAR pattern was imposed. His results showed that the model used on datasets imputed using multiple imputation returned feature importances that were similar to those found in the model run on the complete dataset. In addition, the models used on datasets imputed using the typical simple fixes underestimated the importance of the true important features and overestimated the importance of features that were previously unimportant.



While this post is not an exhaustive summary of all techniques within the field of missing data research, I hope it has been helpful for understanding the importance of not simply removing observations with missing values or imputing using single values (such as a central tendency statistic). In general, model-based imputation methods are likely to impute values that are much more accurate, and they are easy to use in both R and Python through caret and scikit-learn. Before deciding to simply remove observations with missing values or impute using single values, take care to explore the impact that these simple fixes could have on your results.

Final Recommendations:

1.  Do some exploratory analysis to establish evidence for the type of missingness.
2.  If evidence points to MCAR, and very little data will be lost, use listwise deletion.
3.  If evidence points to non-MCAR, impute the values using a model-based imputation technique.
4.  Create some diagnostic plots to help determine which imputation technique is producing the most likely estimates.


Works Cited


– Enders, Craig K. “An Introduction to Missing Data: Testing the Missing Completely at Random Mechanism.” Applied Missing Data Analysis, Guilford Publications, 2010, pp. 17–21.
– Buuren, Stef van. “Imputation in Practice: Diagnostics.” Flexible Imputation of Missing Data.
– Glen, Stephanie. “Effect Size (Measures of Association) Definition and Use in Research.” Statistics How To, Data Science Central, 12 Jan. 2015.
– Glen, Stephanie. “EM Algorithm (Expectation-Maximization): Simple Definition.” Statistics How To, Data Science Central, 7 Sept. 2015.
– Glen, Stephanie. “Statistical Power: What It Is, How to Calculate It.” Statistics How To, Data Science Central, 31 Oct. 2017.
– Little, Roderick J. A., and Donald B. Rubin. “Introduction: Mechanisms That Lead to Missing Data.” Statistical Analysis with Missing Data, Third Edition, John Wiley & Sons, 2019.
– Mangiafico, S. S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.3.2.
– McDonald, J. H. 2014. Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. This web page contains the content of pages 254-260 in the printed version.
– Niederhut, Dillon. “Safe Handling Instructions for Missing Data.” Proceedings of the 17th Python in Science Conference (SciPy 2018).
– Rubin, Donald B. “Inference and Missing Data.” Biometrika, vol. 63, no. 3, Dec. 1976, pp. 581–592.
– Schafer, Joseph L., and John W. Graham. “Missing Data: Our View of the State of the Art.” Psychological Methods, vol. 7, no. 2, 2002, pp. 144–177.

R Packages/Libraries

– A. Alexander Beaujean (2012). BaylorEdPsych: R Package for Baylor University Educational Psychology Quantitative Courses. R package version 0.5.
– Hadley Wickham (2017). Tidyverse: Easy Install and Load the ‘Tidyverse’. R package version 1.2.1.
– Hao Zhu (2019). KableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0.
– James Honaker, Gary King and Matthew Blackwell (2018). Amelia: A Program for Missing Data. R package version 1.7.5.
– Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2019). caret: Classification and Regression Training. R package version 6.0-84.
– R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
– Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.

Python Packages/Libraries

– Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011)
– Law, Elton. Impyute.
– Perktold, Josef, et al. StatsModels: Statistics in Python.
– Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)


Session Info

print (sessionInfo (), locale = F)
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## Matrix products: default
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] mvnmle_0.1-11.1   Amelia_1.7.5      Rcpp_1.0.1       
##  [4] kableExtra_1.1.0  knitr_1.23        caret_6.0-84     
##  [7] lattice_0.20-35   BaylorEdPsych_0.5 forcats_0.4.0    
## [10] stringr_1.4.0     dplyr_0.8.2       purrr_0.3.2      
## [13] readr_1.3.1       tidyr_0.8.3       tibble_2.1.3     
## [16] ggplot2_3.2.0     tidyverse_1.2.1  
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0         jsonlite_1.6       viridisLite_0.3.0 
##  [4] splines_3.5.0      foreach_1.4.4      prodlim_2018.04.18
##  [7] modelr_0.1.4       assertthat_0.2.1   highr_0.8         
## [10] stats4_3.5.0       cellranger_1.1.0   yaml_2.2.0        
## [13] ipred_0.9-9        pillar_1.4.2       backports_1.1.4   
## [16] glue_1.3.1         digest_0.6.19      rvest_0.3.4       
## [19] colorspace_1.4-1   recipes_0.1.5      htmltools_0.3.6   
## [22] Matrix_1.2-14      plyr_1.8.4         timeDate_3043.102 
## [25] pkgconfig_2.0.2    broom_0.5.2        haven_2.1.0       
## [28] scales_1.0.0       webshot_0.5.1      RANN_2.6.1        
## [31] gower_0.2.1        lava_1.6.5         generics_0.0.2    
## [34] withr_2.1.2        nnet_7.3-12        lazyeval_0.2.2    
## [37] cli_1.1.0          survival_2.41-3    magrittr_1.5      
## [40] crayon_1.3.4       readxl_1.3.1       evaluate_0.14     
## [43] fansi_0.4.0        nlme_3.1-137       MASS_7.3-51.4     
## [46] xml2_1.2.0         foreign_0.8-70     class_7.3-14      
## [49] tools_3.5.0        data.table_1.12.2  hms_0.4.2         
## [52] munsell_0.5.0      compiler_3.5.0     rlang_0.4.0       
## [55] grid_3.5.0         iterators_1.0.10   rstudioapi_0.10   
## [58] labeling_0.3       rmarkdown_1.13     gtable_0.3.0      
## [61] ModelMetrics_1.2.2 codetools_0.2-15   reshape2_1.4.3    
## [64] R6_2.4.0           lubridate_1.7.4    utf8_1.1.4        
## [67] zeallot_0.1.0      stringi_1.4.3      vctrs_0.1.0       
## [70] rpart_4.1-13       tidyselect_0.2.5   xfun_0.8

