Presentation
Overall objectives
Context
Application context: richer data in health and social sciences
Opportunistic data accumulations, often observational, bear great promise for the social and health sciences. But these data are too big and complex for the standard statistical methodologies of these sciences.
Health databases
Increasingly rich health data is accumulated during routine clinical practice as well as for research. Its large coverage brings new promises for public health and personalized medicine, but it does not fit easily in standard biostatistical practice because it is not acquired and formatted for a specific medical question.
Social, educational, and behavioral sciences
Better data sheds new light on human behavior and psychology, for instance with on-line learning platforms. Machine learning can be used both as a model for human intelligence and as a tool to leverage these data, for instance improving education.
Likewise, activity traces can provide empirical evidence for economics or political science, but their complexity requires new statistical practices.
AI in society
AI increasingly impacts multiple aspects of society. As such, it calls for rigorous evaluation, whether benchmarking its abilities or assessing its broader impacts.
Related data-science challenges
Data management: preparing tabular data for analytics
Assembling, curating, and transforming data for analysis is very labor-intensive. These data-preparation steps are often considered the number-one bottleneck of data science. They mostly rely on data-management techniques. A typical problem is to establish correspondences between entries that denote the same entities but appear in different forms (entity linking, including deduplication and record linkage). Another time-consuming process is to join and aggregate data across multiple tables with repetitions at different levels (as with panel data in econometrics and epidemiology) to form a unique set of “features” describing each individual. This process is related to database denormalization and may require schema alignment when performed across multiple data sources with imperfect correspondence in columns.
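To make this concrete, here is a minimal pandas sketch of the join-and-aggregate step; the patient and visit tables, and their column names, are made up for illustration.

```python
# Hypothetical tables: repeated measurements per individual are summarized,
# then joined back to a base table to form one row of features per individual
# (a denormalization step).
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2], "age": [54, 61]})
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "blood_pressure": [130, 128, 145, 150, 142],
})

# Aggregate the repeated measurements per patient...
per_patient = visits.groupby("patient_id").agg(
    mean_bp=("blood_pressure", "mean"),
    n_visits=("blood_pressure", "count"),
).reset_index()

# ...then join them to obtain a single feature table.
features = patients.merge(per_patient, on="patient_id", how="left")
print(features)
```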
Progress in machine learning increasingly helps automate data preparation and process data with less curation.
From machine learning to statistically-valid answers
Machine learning can be a tool to answer complex domain questions by providing non-parametric estimators. Yet, much work remains, e.g. to go beyond point estimators, to derive non-parametric procedures that account for a variety of biases (censoring, sampling biases, non-causal associations), or to provide theoretical and practical tools to assess the validity of estimates and conclusions in weakly-parametric settings.
A question of increasing importance in all applications of machine learning is that of auditing the model used in practice. It arises in fundamental-research settings (medical research, political science…) for statistical validity, and in applied settings to assess societal biases or the safety of AI systems.
Last activity report: 2024
Results
New results
Table representation learning
Tabular deep learning
Neural networks traditionally underperform tree-based learners on tabular data. However, Holzmüller et al. 3 show that an array of modifications (initializations, learning-rate scheduler, feature standardization…) enables classic architectures, such as the multi-layer perceptron, to catch up. This work suggests that defaults must be adapted to the data modality, and tables call for new defaults.
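The scikit-learn sketch below only illustrates the general idea, adapting preprocessing and optimization defaults to tabular data; the dataset and hyperparameters are illustrative, not the recipe studied in the paper.

```python
# Illustrative only: an MLP pipeline whose preprocessing and optimization
# settings are adapted to tabular data rather than left at generic defaults.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

mlp = make_pipeline(
    StandardScaler(),                    # feature standardization
    MLPClassifier(
        hidden_layer_sizes=(256, 256),   # illustrative architecture
        learning_rate_init=1e-3,         # non-default learning rate
        early_stopping=True,             # stop on an internal validation split
        max_iter=500,
        random_state=0,
    ),
)
print(cross_val_score(mlp, X, y, cv=5).mean())
```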
Table foundation models
Much of the success of deep learning has been driven by the ability to reuse pretrained models fitted on very large datasets; foundation models push this idea very far, providing background information useful for a wide variety of downstream tasks. A crucial part of these foundation models is the attention mechanism, stacked in a transformer architecture, which brings associative memory to the inputs by contextualizing them.
With the CARTE model 4, we adapted these ideas to tables. The strings, in the table entries and column names, carry the information that enables transfer from one table to another: data semantics. Here, the key is to have an architecture that 1) models both strings and numerical values and 2) applies to any set of tables while using the column names to route the information. For this purpose, CARTE uses a new dedicated attention mechanism that accounts for column names. It is pre-trained on a very large knowledge base. As a result, it outperforms the best models (including tree-based models) in small-sample settings.
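The sketch below is only a conceptual illustration, not the actual CARTE architecture: each cell is embedded together with its column name, and attention then contextualizes the cells of a row.

```python
# Conceptual sketch: cell values embedded with their column names, then
# contextualized by attention (all sizes and vocabularies are hypothetical).
import torch
import torch.nn as nn

d = 32                       # embedding dimension (illustrative)
n_columns, vocab = 5, 100    # hypothetical table: 5 columns, small vocabulary

column_name_emb = nn.Embedding(n_columns, d)  # one embedding per column name
value_emb = nn.Embedding(vocab, d)            # e.g. string or categorical entries
attention = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# One row of the table: one value per column.
values = torch.randint(0, vocab, (1, n_columns))
columns = torch.arange(n_columns).unsqueeze(0)

# Each cell token carries both its content and its column-name semantics,
# so that attention can route information according to what the columns mean.
tokens = value_emb(values) + column_name_emb(columns)
contextualized, _ = attention(tokens, tokens, tokens)
print(contextualized.shape)  # (1, n_columns, d)
```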
This result is very significant, as it opens the door to foundation models for tables, and it has given rise to a very active line of research.
Statistical aspects of machine learning
Prediction with missing values
Asymptotic results show that, to predict well with missing values, it is neither necessary nor sufficient to impute these missing values by their most likely value. Le Morvan et al. 5 studied the finite-sample question empirically, in missing-at-random settings where, in theory, imputation is most likely to give benefits. Results show that better recovery of missing values indeed leads to better prediction, but with diminishing returns: a large improvement in recovery quality, which typically comes at a sizable computational cost, leads to a small improvement in prediction accuracy. Additionally, the more flexible the final learner, the weaker the link. However, adding a missing-value indicator, an extra column that flags which values have been imputed, is always beneficial.
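As an illustration of the missing-value indicator, here is a minimal scikit-learn sketch; it is a toy pipeline, not the experimental setup of the paper.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# add_indicator=True appends one binary column per feature containing missing
# values, telling the downstream learner which entries were imputed.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(),
)
model.fit(X, y)
print(model.predict(X))
```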
Assessment of large language models
Large language models (LLMs), such as ChatGPT, may produce answers that are plausible but not factually correct, the so-called “hallucinations”. A variety of approaches try to assess how likely a statement is to be true, for instance by sampling multiple responses from the language model. The challenge, however, is to threshold these assessments, or to assign a probability of correctness.
Chen et al. 1 investigate the confidence of LLMs in their answers. The work shows that the computed probabilities are not only overconfident, but also heterogeneous (grouping loss): on some groups of queries the overconfidence is more pronounced than on others. For instance, for an answer about a notable individual, the LLM’s confidence is reasonably calibrated if the individual is from the United States, but severely overconfident for individuals from South-East Asia (see the figure below). Characterizing the corresponding groups opens the door to correcting the corresponding bias, a “reconfidencing” procedure.
Figure: Observed error rate as a function of predicted probability of correctness for the birth date, when a large language model (here Mistral 7B) gives information on a given notable individual. The different curves give the corresponding calibration for different nationalities, revealing that the probability is much more trustworthy for a citizen of the United States than for other countries, and particularly poor for people from South-East Asia. Figure from 1.
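The per-group calibration analysis can be illustrated with a small sketch on synthetic data; the groups and numbers below are made up, only the methodology is the point.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
n = 2000
confidence = rng.uniform(0.5, 1.0, size=n)            # model-reported confidence
group = rng.choice(["US", "South-East Asia"], size=n)

# Synthetic correctness: one group's answers are right less often than claimed.
p_correct = np.where(group == "US", confidence, confidence - 0.2)
correct = rng.uniform(size=n) < p_correct

# Calibration computed separately per group reveals heterogeneous over-confidence.
for g in ["US", "South-East Asia"]:
    frac_correct, mean_conf = calibration_curve(
        correct[group == g], confidence[group == g], n_bins=5
    )
    print(g, np.round(mean_conf - frac_correct, 2))   # > 0 means over-confidence
```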
Machine learning for health and social sciences
Causal machine learning on large scale observational data
Causal approaches offer a compromise between purely predictive machine-learning models, which have no causal interpretation, and randomized experiments, which are costly and difficult to organize. Causal machine learning proposes a framework of strong assumptions to assess the causal effect of a treatment on an outcome. These assumptions focus on the inclusion of confounding variables in the model and on a non-zero probability of receiving the treatment for any unit in the dataset (positivity).
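To fix ideas, here is a minimal sketch of one such estimator, simple inverse-propensity weighting on synthetic data; the actual studies rely on more elaborate causal machine-learning estimators.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
confounder = rng.normal(size=(n, 1))
# Treatment assignment depends on the confounder; the true effect is 1.0.
treatment = rng.binomial(1, 1 / (1 + np.exp(-confounder[:, 0])))
outcome = 1.0 * treatment + 2.0 * confounder[:, 0] + rng.normal(size=n)

# Propensity score: probability of treatment given the confounders
# (positivity requires these probabilities to stay away from 0 and 1).
propensity = (
    LogisticRegression().fit(confounder, treatment).predict_proba(confounder)[:, 1]
)

# Inverse-propensity-weighted difference estimates the average treatment effect.
ate = np.mean(treatment * outcome / propensity) - np.mean(
    (1 - treatment) * outcome / (1 - propensity)
)
print(round(ate, 2))  # close to the true effect of 1.0, up to estimation noise
```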
Dumas et al. 2 apply this strategy to systematically analyze the impact of chronic diseases and medications at the time of breast-cancer (BC) diagnosis on cancer survival, using the French Social Security data (SNDS). Doing this with randomized experiments on actual BC patients would be infeasible and inefficient, for cost and ethical reasons, but the scale and exhaustiveness of the SNDS data enable such a study, at least to narrow down potential therapeutic candidates. In the analysis of a cohort of 235,368 French women and 288 medications with a subcohort size sufficient to draw statistical conclusions, several medications have a statistically significant positive or negative effect on BC survival. These results should not be directly interpreted as candidates for additional treatment after the BC diagnosis, as the chronic conditions considered here pre-exist the diagnosis, but they offer insights about potential drug interactions or mechanisms that affect the onset of BC, in particular through the immune system. This large-scale systematic study also provides a proof of concept of the relevance and precision of medical knowledge that can be extracted from large claims or EHR datasets.
Reinforcement learning for adaptive recommendation of learning resources
Massive Open Online Courses (MOOCs) have greatly contributed to making education more accessible. However, many MOOCs maintain a rigid, one-size-fits-all structure that fails to address the diverse needs and backgrounds of individual learners. Learning path personalization aims to address this limitation, by tailoring sequences of educational content to optimize individual student learning outcomes. Existing approaches, however, often require either massive student interaction data or extensive expert annotation, limiting their broad application.
Vassoyan et al. 6 framed learning path personalization as a partially observable Markov decision process. This is the first RL environment for dynamic cognitive diagnosis, where we assume that students learn when shown documents within their frontier of knowledge (i.e. zone of proximal development), and our goal is to optimize their learning outcomes. By propagating information on a bipartite graph of keywords and documents, we learn a policy (using the REINFORCE algorithm) for selecting the best learning resource for learning a topic. Using word embeddings of the documents' content alleviates item cold-start. We conducted experiments with simulated students on a real corpus of MOOCs, and substantially reduced the data needed to provide relevant recommendations: dozens of episodes instead of thousands. We also showed that our method generalizes to unseen corpora of documents. This is a collaboration with Anan Schütt & Elisabeth André from U. Augsburg, Arun Narayanan from U. Pittsburgh, and Nicolas Vayatis from Centre Borelli.
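As a generic illustration of the policy-gradient part, here is a minimal REINFORCE sketch in PyTorch; the corpus, the document embeddings, and the simulated reward are placeholders, not the authors' environment or policy.

```python
import torch
import torch.nn as nn

n_documents, d = 20, 16                         # hypothetical corpus and dimension
doc_embeddings = torch.randn(n_documents, d)    # e.g. word embeddings of documents
policy = nn.Linear(d, 1)                        # scores each candidate document
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)


def run_episode(horizon=5):
    """Recommend documents; collect log-probabilities and simulated rewards."""
    log_probs, rewards = [], []
    for _ in range(horizon):
        scores = policy(doc_embeddings).squeeze(-1)
        dist = torch.distributions.Categorical(logits=scores)
        action = dist.sample()                  # document shown to the student
        log_probs.append(dist.log_prob(action))
        rewards.append(torch.rand(()))          # placeholder learning gain
    return torch.stack(log_probs), torch.stack(rewards)


# A few dozen episodes, in the spirit of the data-frugal setting described above.
for _ in range(50):
    log_probs, rewards = run_episode()
    returns = rewards.flip(0).cumsum(0).flip(0)  # rewards-to-go
    loss = -(log_probs * returns).sum()          # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```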
Turn-key machine-learning tools for socio-economic impact
Releases of scikit-learn
With three major releases in 2024 (1.4 in January, 1.5 in May, and 1.6 in December), scikit-learn keeps improving, adding features for better and easier machine learning in Python. We list below a few highlights that are certainly not exhaustive but illustrate the continuous progress made.
Figure: Monotonic constraints in trees. Here, a random forest is fitted on the data, comparing an unconstrained version (blue) to one with a monotonic constraint on the corresponding feature (orange). scikit-learn.org/dev/auto_examples/release_highlights/plot_release_highlights_1_4_0.html
- FixedThresholdClassifier and TunedThresholdClassifierCV can adjust the decision threshold to maximize a given utility, either set theoretically with a cost matrix, or empirically to minimize the cost on a validation set (see the sketch after this list).
- FrozenEstimator, so as to have an object that is no longer modified at fit time. This is useful to inject pre-trained models into pipelines, as it enables reusing standard model-evaluation tools.
- The HistGradientBoosting classifier and regressor use a dedicated categorical splitter in the trees for categorical columns.
- With the set_output method of an estimator, transformers can output a polars dataframe, respecting the column names of the input if any.
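A minimal sketch of threshold tuning (assuming scikit-learn 1.5 or later); the dataset and scoring choice are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# Tune the decision threshold by cross-validation to maximize F2 (recall-heavy),
# instead of the default 0.5 cut-off on predicted probabilities.
model = TunedThresholdClassifierCV(
    LogisticRegression(),
    scoring=make_scorer(fbeta_score, beta=2),
)
model.fit(X, y)
print("tuned threshold:", round(model.best_threshold_, 2))
```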
skrub
The first release of skrub was in late 2023. There have been three releases in 2024, leading to version 0.4 in December 2024. Skrub is a package to facilitate machine learning on tables. The major features added in 2024 are listed below (with a short usage sketch after the list):
- TableReport gives an interactive display of dataframes, enabling inspection of the different columns and their distributions; it can be easily embedded, including in the programming environment used by data scientists.
- TextEncoder uses a pretrained deep-learning language model to embed the strings of a given column.
- The tabular_learner function builds a preprocessing pipeline that encodes messy dataframes in a way that is well suited to a given predictor.
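A minimal usage sketch of these features (assuming skrub around version 0.4; the dataframe is made up for illustration).

```python
import pandas as pd
from skrub import TableReport, tabular_learner

df = pd.DataFrame({
    "position title": ["Senior Engineer", "Office Assistant II", "Senior Engineer"],
    "hire date": ["2015-03-02", "2019-07-15", "2012-11-30"],
    "salary": [110_000, 45_000, 125_000],
})

# Interactive overview of the columns and their distributions
# (renders inline in a notebook, or can be exported to HTML).
TableReport(df)

# Preprocessing pipeline plus predictor, suited to messy heterogeneous columns.
model = tabular_learner("regressor")
model.fit(df[["position title", "hire date"]], df["salary"])
```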
joblib
joblib is a very simple computation engine in Python that is massively used worldwide, including as a dependency of packages such as scikit-learn for parallel computing.
Release 1.4 (May 2024) brought many changes to follow evolutions of the ecosystem and improve behaviors (e.g. better error handling). Major changes are:
- The ability to cache coroutines
- Optional unordered execution of parallel loops, to better use multiple CPUs (see the sketch below)
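A minimal sketch of the unordered parallel loop (assuming joblib 1.4 or later):

```python
import time

from joblib import Parallel, delayed


def slow_square(i):
    time.sleep(0.1 * (i % 3))  # uneven task durations
    return i * i


# Results are yielded as soon as any worker finishes, not in submission order,
# which keeps all CPUs busy when task durations vary.
results = Parallel(n_jobs=4, return_as="generator_unordered")(
    delayed(slow_square)(i) for i in range(10)
)
for r in results:
    print(r)
```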