Missing Data in the Big Data Era

‘Big data’, often observational and heterogeneous rather than experimental and homogeneous, poses specific missing-data challenges: missing values are structured and not independent of the outcome variables of interest. Deleting incomplete observations causes information loss at best and, at worst, distorted conclusions due to selection bias.
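To make the selection-bias point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the synthetic data, the logistic missingness mechanism, the variable names); it is not taken from the project's data or code. A covariate goes missing more often when the outcome is high, so dropping incomplete rows shifts the estimated mean of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic outcome with true mean 0, and a covariate correlated with it.
y = rng.normal(loc=0.0, scale=1.0, size=n)
x = y + rng.normal(size=n)

# Assumed mechanism: the covariate is more likely to be missing when y is large.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * y))
x_obs = np.where(rng.random(n) < p_missing, np.nan, x)

# Complete-case analysis keeps only the rows where x was observed.
complete = ~np.isnan(x_obs)

full_mean = y.mean()           # close to the true mean of 0
cc_mean = y[complete].mean()   # pulled towards negative values

print(f"all rows:       {full_mean:+.3f}")
print(f"complete cases: {cc_mean:+.3f} ({complete.mean():.0%} of rows kept)")
```

About half of the rows are discarded, and the ones that remain over-represent low outcomes, so the complete-case mean is clearly below the mean over all rows.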

MissingBigData is funded by the DATAIA Institute. The project is motivated by applications to medical data, notably the Traumabase and the UK Biobank, which feature a great diversity of missing values. In particular, we would like to tackle the problem of causal inference with inverse propensity weighting when the data are incomplete.
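As a rough illustration of why incomplete covariates matter for inverse propensity weighting, the sketch below simulates a binary confounder, estimates propensities by stratum frequencies, and compares three estimates of a treatment effect whose true value is 2. The data, the helper `ipw_ate` and the choice to code missing values as an extra category are all hypothetical; this is one simple scenario, not the estimator studied in the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Binary confounder that drives both treatment and outcome.
x = rng.integers(0, 2, size=n)
t = rng.random(n) < np.where(x == 1, 0.8, 0.2)   # treated more often when x == 1
y = 2.0 * t + 3.0 * x + rng.normal(size=n)       # true average treatment effect: 2

def ipw_ate(y, t, strata):
    """Normalised IPW estimate, propensities estimated as per-stratum frequencies."""
    e = np.empty(len(y))
    for s in np.unique(strata):
        mask = strata == s
        e[mask] = t[mask].mean()
    w1 = t / e            # weights for treated units
    w0 = (~t) / (1 - e)   # weights for control units
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Unadjusted comparison, confounded by x.
naive = y[t].mean() - y[~t].mean()

# IPW with the confounder fully observed.
ate_full = ipw_ate(y, t, x)

# Hide 40% of x completely at random and code "missing" as its own stratum (-1).
x_obs = np.where(rng.random(n) < 0.4, -1, x)
ate_missing_cat = ipw_ate(y, t, x_obs)

print(f"naive difference:          {naive:.2f}")
print(f"IPW, x fully observed:     {ate_full:.2f}")
print(f"IPW, missing as a stratum: {ate_missing_cat:.2f}")
```

In this simulated setting the hidden value of the confounder still influences treatment, so confounding is left uncorrected inside the "missing" stratum and the last estimate stays biased. Under other assumptions about how treatment depends on observed values, the picture can differ.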

We propose to use more powerful models that can benefit from large sample sizes, specifically autoencoders, to impute missing values, even when they are generated by a non-ignorable mechanism. We also consider alternatives to imputation that directly adapt models such as random forests to handle missing values in the features.
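To give a flavour of what adapting a tree-based model to missing values can look like, here is a small sketch of a single regression split that, for each candidate threshold, tries sending the missing values to the left or to the right child and keeps the better option. This is an illustrative stand-in, not the project's implementation: it is one stump rather than a random forest, and the functions `fit_stump_mia` and `predict_stump` as well as the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump_mia(x, y):
    """Fit a one-split regression stump on a feature that may contain NaN.

    For every threshold, missing values are tried in the left and in the
    right child; the combination with the lowest squared error is kept.
    """
    observed = ~np.isnan(x)
    best = None
    for thr in np.unique(x[observed]):
        below = observed & (x <= thr)
        for missing_left in (True, False):
            left = (below | ~observed) if missing_left else below
            right = ~left
            if not left.any() or not right.any():
                continue  # skip splits that leave one child empty
            left_value, right_value = y[left].mean(), y[right].mean()
            sse = ((y[left] - left_value) ** 2).sum() + ((y[right] - right_value) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, thr, missing_left, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    """Route each value through the split; NaN follows the learned direction."""
    thr, missing_left, left_value, right_value = stump
    go_left = np.where(np.isnan(x), missing_left, x <= thr)
    return np.where(go_left, left_value, right_value)

# Synthetic data: the feature tends to be missing when its value is large.
rng = np.random.default_rng(0)
n = 500
x = rng.random(n)
y = (x > 0.5).astype(float) + 0.1 * rng.normal(size=n)
x_obs = np.where((x > 0.7) & (rng.random(n) < 0.8), np.nan, x)

stump = fit_stump_mia(x_obs, y)
pred = predict_stump(stump, x_obs)
mse = np.mean((pred - y) ** 2)
print("threshold:", stump[0], "missing goes left:", stump[1], "mse:", mse)
```

Because missingness here carries information about the outcome, the split uses it directly: no imputation step is needed, and the rows with a missing feature are routed to the child whose outcomes they resemble.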

MissingBigData is a joint project between Parietal, CMAP and CNRS.

An introductory interview with Julie and Gaël about MissingBigData.
