Missing Data in the Big Data Era

‘Big data’, often observational and compound rather than experimental and homogeneous, poses specific missing-data challenges: missing values are structured and not independent of the outcome variables of interest. Deleting incomplete observations loses information at best and, at worst, distorts conclusions through selection bias.
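The selection-bias point can be illustrated with a toy simulation. This is only a sketch on synthetic data, not drawn from the project: the variable names and the rule that large values go missing are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic variable of interest (illustrative, not project data).
x = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Assumed missingness rule: large values are never observed,
# so missingness depends on the value itself.
observed = x < 1.0

full_mean = x.mean()                      # what we would like to estimate
complete_case_mean = x[observed].mean()   # what deletion actually gives

print(f"mean on all data:       {full_mean:.3f}")
print(f"mean on complete cases: {complete_case_mean:.3f}")
print(f"fraction of rows kept:  {observed.mean():.2%}")
```

Dropping the incomplete rows both shrinks the sample and shifts the estimate downward, because the rows that remain are not representative of the whole population.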

MissingBigData is funded by the DATAIA Institute. The project is motivated by applications to tabular medical data, in particular the Traumabase and UK Biobank, which feature a great diversity of missing values. One specific goal is to tackle causal inference when the data are incomplete.

We propose to use more powerful models that can exploit large sample sizes, specifically autoencoders, to impute the missing values, even when they are generated by a non-ignorable mechanism (that is, when the chance that a value is missing depends on the unobserved value itself). We also consider alternatives to imputation, directly adapting models such as random forests to handle missing values in the features.

This project has also led to contributions to open-source projects, such as R-miss-tastic, a web platform dedicated to missing values, and scikit-learn.

MissingBigData is a joint project between Parietal, CMAP and CNRS.

A video presenting the underlying science and the package can be found here (the keynote starts at ~30 min):
