Missing data: what to do?
Missing data occurs in almost all research.
Data can be missing for many reasons. In longitudinal studies, for example, subjects often drop out before the study is completed because they have moved out of the area, died, or experienced adverse events; in surveys, participants may not know the answer to an item or may skip it accidentally.
A natural question when missing values occur is: what would we have seen had the data been complete? The analyst's aim is to find suitable criteria for imputing the missing data with appropriate values. The choice of an appropriate method depends on the missing data pattern and on the missing data mechanism.
The standard classification of missing data mechanisms (Rubin, 1976) is:
- missing completely at random (MCAR), i.e. the reason for missing data does not depend on either the observed or the unobserved data;
- missing at random (MAR), i.e. the reason for missing data can be explained by the observed data and, after accounting for this (conditionally missing at random), there is no further information in the unobserved data;
- missing not at random (MNAR), i.e. even after considering the information in the observed data, the reason for missing observations depends on the unobserved values.
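The three mechanisms above can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions: the variables `x` (always observed) and `y` (sometimes missing), the logistic form of the missingness probability, and the helper names `simulate` and `observed_mean` are all invented here for illustration and do not come from Rubin (1976).

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (xs, ys); ys[i] is None when y is missing.

    Assumed toy model: x ~ N(0, 1), y = x + noise, so the true mean of y is 0.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_miss = 0.3                          # same for every unit
        elif mechanism == "MAR":
            p_miss = 1 / (1 + math.exp(-2 * x))   # depends only on observed x
        elif mechanism == "MNAR":
            p_miss = 1 / (1 + math.exp(-2 * y))   # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        xs.append(x)
        ys.append(None if rng.random() < p_miss else y)
    return xs, ys


def observed_mean(mechanism):
    """Mean of y computed from the observed values only."""
    _, ys = simulate(mechanism)
    return statistics.fmean(v for v in ys if v is not None)


for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(observed_mean(mech), 3))
```

In this toy setting the observed mean of `y` should stay close to the true value 0 under MCAR and be pulled below 0 under MAR and MNAR, because large values are more likely to go missing. The difference is that under MAR the distortion can be corrected using `x`, which is always observed, whereas under MNAR the observed data alone are not enough.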
How should one proceed in the presence of missing data? The main options are:
1. Complete/Available case analysis: one possibility is to discard the units whose information is incomplete (list-wise deletion). However, deleting all units with incomplete data from the analysis can be inefficient and problematic, because the sample size changes from one analysis to another or becomes very small. In addition, unless the data are MCAR, and in particular when the missing data mechanism is MNAR, a completers analysis can give biased estimates and invalid inferences.
2. Filling in the missing values: another possibility is to impute the missing values so that the resulting data set is complete. This is potentially more efficient than case deletion, since maintaining the full sample helps to prevent the loss of power that results from a diminished sample size.
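To illustrate the case-deletion option, here is a minimal sketch of complete-case (list-wise) deletion next to available-case analysis. The records and the variable names `age`, `weight`, and `score` are made up for the example, with `None` marking a missing value.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "weight": 70.5, "score": 7},
    {"age": 51, "weight": None, "score": 5},
    {"age": None, "weight": 82.0, "score": 6},
    {"age": 45, "weight": 64.2, "score": None},
    {"age": 29, "weight": 58.9, "score": 8},
    {"age": 62, "weight": None, "score": 4},
]


def complete_cases(records):
    """List-wise deletion: keep only units with no missing value at all."""
    return [r for r in records if all(v is not None for v in r.values())]


def available_cases(records, variable):
    """Available-case analysis: use every observed value of one variable."""
    return [r[variable] for r in records if r[variable] is not None]


complete = complete_cases(rows)
n_available = {var: len(available_cases(rows, var)) for var in rows[0]}

print("complete cases:", len(complete), "of", len(rows))
print("available cases per variable:", n_available)
```

List-wise deletion keeps only the units observed on every variable, so a few scattered gaps can remove a large share of the sample. Available-case analysis keeps more data, but each variable is then analysed on a different subset of units, which is the changing sample size mentioned in option 1.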
Methods available for imputing missing data can be divided into two main categories:
- Single imputation: assigns one value to each missing value.
- Multiple imputation: generates multiple simulated values for each missing value, taking into account the uncertainty linked to the missing data.
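The contrast between the two categories can be sketched for the simplest possible problem, estimating a mean. This is an illustrative sketch only, under assumptions that are not in the text: the data are simulated, the missingness is MCAR, single imputation is done by mean substitution, and multiple imputation draws from a normal model with a noninformative prior and pools the results with the usual combining rules (within-imputation plus between-imputation variance).

```python
import math
import random
import statistics

rng = random.Random(42)

# Assumed toy data: 500 units, of which 350 are observed and 150 are missing.
n, n_obs = 500, 350
n_mis = n - n_obs
y_obs = [rng.gauss(10, 2) for _ in range(n_obs)]

ybar = statistics.fmean(y_obs)
s2 = statistics.variance(y_obs)

# Single imputation: replace every missing value with the observed mean,
# then analyse the completed data as if nothing had been missing.
completed = y_obs + [ybar] * n_mis
se_single = math.sqrt(statistics.variance(completed) / n)

# Multiple imputation: M completed data sets, each built from a fresh draw
# of the model parameters followed by a draw of the missing values.
M = 50
estimates, within = [], []
for _ in range(M):
    # sigma^2 ~ (n_obs - 1) * s2 / chi-square(n_obs - 1)
    sigma2 = (n_obs - 1) * s2 / rng.gammavariate((n_obs - 1) / 2, 2)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n_obs))
    imputed = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n_mis)]
    completed_m = y_obs + imputed
    estimates.append(statistics.fmean(completed_m))
    within.append(statistics.variance(completed_m) / n)

# Pooling: total variance = within + (1 + 1/M) * between.
q_bar = statistics.fmean(estimates)
W = statistics.fmean(within)
B = statistics.variance(estimates)
se_mi = math.sqrt(W + (1 + 1 / M) * B)

# Reference: standard error based on the observed values alone.
se_ref = math.sqrt(s2 / n_obs)

print("SE, single (mean) imputation:", round(se_single, 4))
print("SE, multiple imputation:     ", round(se_mi, 4))
print("SE, observed data only:      ", round(se_ref, 4))
```

Mean substitution adds 150 values with no spread, so the completed data set looks more precise than the 350 real observations justify and its standard error comes out too small. The multiple-imputation standard error includes the between-imputation variability and should land close to the observed-data reference.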
In conclusion, unless the proportion of missing values is so small that it is unlikely to affect inferences, single imputation should be avoided, since it treats the imputed values as if they had been observed and therefore does not properly reflect the statistical uncertainty in the data.
Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis with Missing Data, 2nd ed., Wiley.
Rubin, D. B. (1976) Inference and missing data. Biometrika, 63:581-592.