Data science with statistical learning
Some of the research at Soda is on the statistical foundations of machine learning, in particular for problems important to data science. The goal is to use machine-learning models as non-parametric estimators for common data-science problems. Beyond mere prediction accuracy, this raises questions of statistical control.
Statistical learning with missing values
Statistical inference with missing values has been studied for decades, but modern machine-learning practice brings new trade-offs. In particular, we have shown that the classical view on imputation may not give the best-performing predictors, and that missing-not-at-random settings can be tackled by machine-learning models.
Machine learning for causal inference
Modern causal inference builds on estimating response functions for treated and untreated individuals, or the probability of treatment or of trial inclusion. We study the use of machine-learning models to estimate these quantities. Indeed, as we deal with increasingly complex data, such as Electronic Health Records, simple parametric models are no longer enough to leverage the data: it is spread across multiple tables, with many missing values and non-normalized text inputs.
One specific problem that we have focused on is that of generalizing an effect inferred on a study sample that has a selection bias relative to the target population. This question relates to the external validity of a study.
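The generalization problem can be sketched with inverse probability of sampling weighting, a standard reweighting estimator: a classifier estimates how likely an individual is to belong to the trial rather than to the target sample, and trial outcomes are reweighted accordingly. The simulated trial, the covariate shift, and the effect `1 + x` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trial, n_target = 5000, 5000

# Hypothetical setting: the trial over-represents large values of the
# covariate, and the treatment effect (1 + x) grows with it.
X_trial = rng.normal(loc=1.0, size=(n_trial, 1))
X_target = rng.normal(loc=0.0, size=(n_target, 1))
target_ate = 1.0  # E[1 + x] when x has mean 0 in the target population

A = rng.binomial(1, 0.5, size=n_trial)  # randomized treatment in the trial
Y = (
    X_trial[:, 0]
    + A * (1.0 + X_trial[:, 0])
    + rng.normal(scale=0.5, size=n_trial)
)

# Model the probability of being in the trial given the covariates.
X_all = np.vstack([X_trial, X_target])
S = np.concatenate([np.ones(n_trial, dtype=int), np.zeros(n_target, dtype=int)])
p_trial = LogisticRegression().fit(X_all, S).predict_proba(X_trial)[:, 1]

# Odds of belonging to the target sample: an estimate of the density
# ratio between target and trial, up to a constant.
w = (1 - p_trial) / p_trial


def weighted_mean(values, weights):
    """Self-normalized weighted average."""
    return np.sum(weights * values) / np.sum(weights)


trial_ate = Y[A == 1].mean() - Y[A == 0].mean()
ipsw_ate = weighted_mean(Y[A == 1], w[A == 1]) - weighted_mean(Y[A == 0], w[A == 0])
print(f"effect in the trial sample: {trial_ate:.2f}")
print(f"reweighted to the target:   {ipsw_ate:.2f} (simulated target effect: {target_ate})")
```

The unweighted trial estimate answers a question about the trial population; the reweighted one targets the population of interest, at the price of a larger variance when the two populations overlap poorly.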
Publications
Missing values
- Julie Josse, Jacob M. Chen, Nicolas Prost, Erwan Scornet, Gaël Varoquaux. On the consistency of supervised learning with missing values. 2024.
- Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet. Causal effect on a target population: a sensitivity analysis to handle missing covariates. Journal of Causal Inference, 2022, 10 (1), pp.372-414. ⟨10.1515/jci-2021-0059⟩
- Alexandre Perez-Lebel, Gaël Varoquaux, Marine Le Morvan, Julie Josse, Jean-Baptiste Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, In press, ⟨10.1093/gigascience/giac013⟩
- Marine Le Morvan, Julie Josse, Erwan Scornet, Gaël Varoquaux. What’s a good imputation to predict with missing values? NeurIPS 2021 – 35th Conference on Neural Information Processing Systems, Dec 2021, Virtual, France. ⟨10.48550/arXiv.2106.00311⟩
- Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. NeurIPS 2020 – 34th Conference on Neural Information Processing Systems, Dec 2020, Vancouver / Virtual, Canada
- Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS 2020 – International Conference on Artificial Intelligence and Statistics, Aug 2020, Online, France. pp.3165-3174
Causal inference
- Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet. Risk ratio, odds ratio, risk difference… Which causal measure is easier to generalize? 2024.
- Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, Shu Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, In press
- Robin Noyelle, Vivien Guette, Akim Viennet, Bénédicte Colnet, Davide Faranda, Andreia Hisi, Pascal Yiou. Decrease of the spatial variability and local dimension of the Euro-Atlantic eddy-driven jet stream with global warming. Climate Dynamics, 2023, ⟨10.1007/s00382-023-07022-z⟩
- Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet. Reweighting the RCT for generalization: finite sample error and variable selection. 2022.
- Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet. Causal effect on a target population: a sensitivity analysis to handle missing covariates. Journal of Causal Inference, 2022, 10 (1), pp.372-414. ⟨10.1515/jci-2021-0059⟩