Representation learning for databases

Representation learning for heterogeneous databases

We develop a deep-learning methodology for heterogeneous databases, comprising relational data and tabular datasets at large scale.
The goals are (1) to build machine-learning models that apply readily to raw, uncurated data, avoiding manual cleaning, formatting, and integration, and (2) to extract reusable representations that reduce sample complexity on new databases by transforming the data into well-distributed vectors.

The challenges are:

  • Tokens (symbolic entries), often with high cardinality, such as the ICD-10 classification of diseases
  • Numerical entries with different, non-Gaussian marginals, such as salary (long-tailed distribution) or age
  • Missing values, often missing not at random and thus not ignorable without causing bias
  • Short text (possibly corresponding to non-normalized entities)
  • Operations across tables: aggregation (learning on sequences), joins (optimal transport)
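To make the second and third challenges concrete, here is a minimal sketch, using only the Python standard library, of one common way to feed a long-tailed numerical column with missing entries to a model: replace each observed value by its normalized rank and keep an explicit missingness indicator. The function name `rank_transform_with_mask`, the constant 0.5 used for missing entries, and the salary values are illustrative assumptions, not the project's method.

```python
from bisect import bisect_left, bisect_right


def rank_transform_with_mask(values):
    """Map a numeric column to (0, 1) by rank and return a missingness mask.

    `values` is a list of floats in which None marks a missing entry.
    Each observed value is replaced by its mid-rank divided by the number
    of observed values, so the output is roughly uniform whatever the
    input marginal (long-tailed or not). Missing entries are set to 0.5
    and flagged in the mask, so that a downstream model can still exploit
    the missingness pattern (useful when data are missing not at random).
    """
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    transformed, mask = [], []
    for v in values:
        if v is None:
            transformed.append(0.5)  # placeholder value (assumption)
            mask.append(1)
        else:
            lo = bisect_left(observed, v)   # number of observed values < v
            hi = bisect_right(observed, v)  # number of observed values <= v
            transformed.append((lo + hi) / (2 * n))  # mid-rank, handles ties
            mask.append(0)
    return transformed, mask


# Hypothetical long-tailed salary column with one missing entry.
salaries = [20_000.0, 25_000.0, None, 30_000.0, 1_000_000.0]
ranks, mask = rank_transform_with_mask(salaries)
print(ranks)  # [0.125, 0.375, 0.5, 0.625, 0.875]
print(mask)   # [0, 0, 1, 0, 0]
```

The extreme salary no longer dominates the scale, and the mask column keeps the information that an entry was missing rather than silently imputing it.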

These challenges call for dedicated architectures that borrow from the neural architectures used in natural language processing and add new techniques. The fundamental ingredients are:

– Entity embeddings in tabular and relational data;
– Neural-network formulations of classical statistical techniques, such as Generalized Additive Models or EM for missing values (Le Morvan et al., 2020);
– Simple attention and self-attention mechanisms, such as those we developed for short texts (Chen et al., 2021).
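As an illustration of the first and last ingredients, the sketch below looks up embeddings for the tokens of a short text and mixes them with scaled dot-product self-attention. It is a deliberately simplified version with no learned query, key, or value projections, written with the standard library only; the embedding table and its values are hypothetical, and this is not the architecture of Chen et al. (2021).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a convex combination of all input vectors,
    weighted by the softmax of their scaled dot products with the query.
    """
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        )
    return outputs


# Hypothetical 2-dimensional entity embeddings for a few tokens.
embeddings = {
    "acute": [1.0, 0.0],
    "chronic": [0.8, 0.2],
    "bronchitis": [0.0, 1.0],
}

tokens = ["acute", "bronchitis"]
contextualized = self_attention([embeddings[t] for t in tokens])
for token, vector in zip(tokens, contextualized):
    print(token, [round(x, 3) for x in vector])
```

In a trained model the embedding table and the attention projections would be learned jointly with the downstream task; here they are fixed to keep the mechanism visible.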

Current research focus

  • Learning despite database normalization errors
  • Tabular deep learning

Publications

HAL publications of the ANR project ANR-17-CE23-0018

Title
How to remove or control confounds in predictive models, with applications to brain biomarkers
Authors
Darya Chyzhyk, Gaël Varoquaux, Michael Milham, Bertrand Thirion
Reference
GigaScience, Oxford Univ Press, 2022, 11, ⟨10.1093/gigascience/giac014⟩
Full text
https://hal.inria.fr/hal-03607651/file/giac014.pdf

Title
Benchmarking missing-values approaches for predictive models on health databases
Authors
Alexandre Perez-Lebel, Gaël Varoquaux, Marine Le Morvan, Julie Josse, Jean-Baptiste Poline
Reference
GigaScience, Oxford Univ Press, In press, ⟨10.1093/gigascience/giac013⟩
Full text
https://hal.archives-ouvertes.fr/hal-03526292/file/Benchmarking%20missing-values%20approaches%20for%20predictive%20models%20on%20health%20databases.pdf

Title
Analytics on Non-Normalized Data Sources: more Learning, rather than more Cleaning
Authors
Alexis Cvetkov-Iliev, Alexandre Allauzen, Gaël Varoquaux
Reference
IEEE Access, IEEE, In press, 10, pp.42420-42431. ⟨10.1109/ACCESS.2022.3168013⟩
Full text
https://hal.archives-ouvertes.fr/hal-03647434/file/final.pdf

Title
Causal effect on a target population: a sensitivity analysis to handle missing covariates
Authors
Bénédicte Colnet, Julie Josse, Erwan Scornet, Gaël Varoquaux
Reference
2021
Full text
https://hal.archives-ouvertes.fr/hal-03473691/file/missing-cov.pdf

Title
AI as statistical methods for imperfect theories
Authors
Gaël Varoquaux
Reference
NeurIPS 2021 – 35th Conference on Neural Information Processing Systems. Workshop: AI for Science, Dec 2021, Virtual, France
Full text
https://hal.archives-ouvertes.fr/hal-03474791/file/paper.pdf

Title
What’s a good imputation to predict with missing values?
Authors
Marine Le Morvan, Julie Josse, Erwan Scornet, Gaël Varoquaux
Reference
NeurIPS 2021 – 35th Conference on Neural Information Processing Systems, Dec 2021, Virtual, France
Full text
https://hal.archives-ouvertes.fr/hal-03243931/file/LeMorvan2021_ImputeThenRegress.pdf

Title
Accounting for variance in machine learning benchmarks
Authors
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Reference
MLSys 2021 – 4th Conference on Machine Learning and Systems, Apr 2021, San Francisco (virtual), United States
Full text
https://hal.archives-ouvertes.fr/hal-03177159/file/main.pdf

Title
A lightweight neural model for biomedical entity linking
Authors
Lihu Chen, Gaël Varoquaux, Fabian Suchanek
Reference
AAAI 2021 – The Thirty-Fifth Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, Feb 2021, Palo Alto (virtual), United States. pp.12657-12665
Full text
https://hal.archives-ouvertes.fr/hal-03086044/file/Biomedical_Entity_Linking.pdf

Title
Preventing dataset shift from breaking machine-learning biomarkers
Authors
Jérôme Dockès, Gaël Varoquaux, Jean-Baptiste Poline
Reference
GigaScience, Oxford Univ Press, In press
Full text
https://hal.archives-ouvertes.fr/hal-03293375/file/main.pdf

Title
NeuMiss networks: differentiable programming for supervised learning with missing values
Authors
Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux
Reference
NeurIPS 2020 – 34th Conference on Neural Information Processing Systems, Dec 2020, Vancouver / Virtual, Canada
Full text
https://hal.archives-ouvertes.fr/hal-02888867/file/main.pdf

Title
Causal inference methods for combining randomized trials and observational studies: a review
Authors
Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, Shu Yang
Reference
2020
Full text
https://hal.archives-ouvertes.fr/hal-03008276/file/main.pdf

Title
Linear predictor on linearly-generated data with missing values: non consistency and solutions
Authors
Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux
Reference
AISTATS 2020 – International Conference on Artificial Intelligence and Statistics, Aug 2020, Online, France. pp.3165-3174
Full text
https://hal.archives-ouvertes.fr/hal-02464569/file/aistats.pdf

Title
On the consistency of supervised learning with missing values
Authors
Julie Josse, Nicolas Prost, Erwan Scornet, Gaël Varoquaux
Reference
2020
Full text
https://hal.archives-ouvertes.fr/hal-02024202/file/main.pdf

Title
Encoding high-cardinality string categorical variables
Authors
Patricio Cerda, Gaël Varoquaux
Reference
IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, In press, ⟨10.1109/TKDE.2020.2992529⟩
Full text
https://hal.inria.fr/hal-02171256/file/article.pdf

Title
Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing
Authors
Meyer Scetbon, Gaël Varoquaux
Reference
NeurIPS 2019 – 33rd Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada
Full text
https://hal.inria.fr/hal-02292545/file/NIPS_L1_test-HAL-v2%20%281%29.pdf