Research – Soda – Computational and mathematical methods to understand health and society with data

Presentation

Overall objectives

Context

Application context: richer data in health and social sciences

Opportunistic data accumulations, often observational, hold great promise for social and health sciences. But the data are too big and complex for the standard statistical methodologies of these sciences.

Health databases

Increasingly rich health data is accumulated during routine clinical practice as well as for research. Its large coverage brings new promises for public health and personalized medicine, but it does not fit easily in standard biostatistical practice because it is not acquired and formatted for a specific medical question.

Social, educational, and behavioral sciences

Better data sheds new light on human behavior and psychology, for instance with on-line learning platforms. Machine learning can be used both as a model for human intelligence and as a tool to leverage these data, for instance to improve education.

Likewise, activity traces can provide empirical evidence for economics or political science, but their complexity requires new statistical practices.

Related data-science challenges

Data management: preparing dirty data for analytics

Assembling, curating, and transforming data for data analysis is very labor intensive. These data-preparation steps are often considered the number one bottleneck to data-science. They mostly rely on data-management techniques. A typical problem is to establish correspondences between entries that denote the same entities but appear in different forms (entity linking, including deduplication and record linkage). Another time-consuming process is to join and aggregate data across multiple tables with repetitions at different levels (as with panel data in econometrics and epidemiology) to form a unique set of “features” to describe each individual. This process is related to database denormalization and might require schema alignment when performed across multiple data sources with imperfect correspondence in columns.
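As a rough illustration of the join-and-aggregate step described above, the sketch below summarizes a one-to-many table and joins the result onto a table with one row per individual. It uses only the Python standard library; the tables, the column names, and the helper aggregate_then_join are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: one row per patient, and zero or more visits per patient.
patients = [
    {"patient_id": 1, "age": 54},
    {"patient_id": 2, "age": 37},
    {"patient_id": 3, "age": 61},
]
visits = [
    {"patient_id": 1, "systolic_bp": 130},
    {"patient_id": 1, "systolic_bp": 142},
    {"patient_id": 2, "systolic_bp": 118},
]


def aggregate_then_join(main_rows, aux_rows, key, value):
    """Summarize the one-to-many table per key, then left-join onto the main table."""
    grouped = defaultdict(list)
    for row in aux_rows:
        grouped[row[key]].append(row[value])

    features = []
    for row in main_rows:
        values = grouped.get(row[key], [])
        features.append({
            **row,
            # Individuals without any matching row get a missing value.
            f"{value}_mean": mean(values) if values else None,
            f"{value}_count": len(values),
        })
    return features


features = aggregate_then_join(patients, visits, key="patient_id", value="systolic_bp")
for row in features:
    print(row)
```

Each individual ends up with exactly one feature row, whatever the number of matching entries in the second table; individuals with no match keep a missing value that the downstream model has to handle.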

Progress in machine learning increasingly helps automate data preparation and process data with less curation.

From machine learning to statistically-valid answers

Machine learning can be a tool to answer complex domain questions by providing non-parametric estimators. Yet, it still requires much work, e.g., to go beyond point estimators, to derive non-parametric procedures that account for a variety of biases (censoring, sampling biases, non-causal associations), or to provide theoretical and practical tools to assess the validity of estimates and conclusions in weakly-parametric settings.

A question that is increasingly important in all applications of machine learning is that of auditing the model used in practice. This question arises in fundamental-research settings (medical research, political science…) for statistical validity, and in applications to assess societal biases, or safety of AI systems.

Last activity report : 2023

2023 : PDF – HTML
2022 : PDF – HTML

Results

New results

Table representation learning

Acronym Disambiguation: benchmark and model

[14] Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and value of our new benchmark.

The structure of positional encodings in language models

[15] Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. They are needed for transformers to model sequences rather than sets of words, and they significantly enhance the quality of sentence representations. However, the embeddings capture specific inductive biases and the corresponding contribution to language models is not fully understood. Indeed, recent findings highlight that various positional encodings are insensitive to word order. We conducted a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style). This study revealed two common properties, Locality and Symmetry, core to the function of PEs, that vary across the encodings used in practice (figure 2). We showed that these two properties are closely correlated with performance on downstream tasks (figure 3). We quantified the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly. We believe that these results are the basis for developing better PEs for transformer-based language models and that they explain the choice of local attention structure in the recent language model Mistral 7B.

Figure 2: Locality and symmetry values of positional encodings. The green points are fixed and human-designed positional encodings while the orange points are positional encodings after pre-training.


Figure 3: Empirical studies of the properties of locality and symmetry on the MR sentiment analysis dataset.
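To give some intuition for what locality and symmetry can look like, the sketch below uses the classic fixed sinusoidal encoding as one example of a human-designed positional encoding and inspects the dot product between the encodings of two positions. This is only an illustration: the formal definitions of Locality and Symmetry used in the study are not reproduced here, and the dimension, base, and function names are assumptions made for the example.

```python
import math


def sinusoidal_encoding(position, dim=64, base=10000.0):
    """Fixed sinusoidal encoding of one position, as interleaved sine/cosine pairs."""
    vec = []
    for k in range(dim // 2):
        freq = base ** (-2 * k / dim)
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec


def similarity(i, j, dim=64):
    """Dot product between the encodings of positions i and j."""
    a = sinusoidal_encoding(i, dim)
    b = sinusoidal_encoding(j, dim)
    return sum(x * y for x, y in zip(a, b))


center = 50
# Similarity between a reference position and its neighbors, indexed by offset.
profile = {offset: similarity(center, center + offset) for offset in range(-10, 11)}

for offset in sorted(profile):
    print(f"offset {offset:+3d}: {profile[offset]:.3f}")
```

For this particular encoding, each sine/cosine pair contributes the cosine of the frequency times the offset, so the similarity depends only on the offset, takes the same value for +d and -d, and is largest at offset 0. Learned encodings need not behave this way, which is the kind of variation that figure 2 reports.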

Mathematical aspects of statistical learning for data science

Risk ratio, odds ratio, risk difference… Which causal measure is easier to generalize?

[1] There are many measures to report a so-called treatment or causal effect: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, e.g., absolute versus relative, is often debated because it leads to different appreciations of the same phenomenon; but it also implies different heterogeneity of treatment effect. In addition, some measures (but not all) have appealing properties such as collapsibility, matching the intuition of a population summary. We review common measures and the pros and cons typically brought forward. Doing so, we clarify notions of collapsibility and treatment effect heterogeneity, unifying different existing definitions. Our main contribution is to propose to reverse the thinking: rather than starting from the measure, we start from a non-parametric generative model of the outcome. Depending on the nature of the outcome, some causal measures disentangle treatment modulations from baseline risk. Therefore, our analysis outlines an understanding of what heterogeneity and homogeneity of treatment effect mean, not through the lens of the measure, but through the lens of the covariates. Our goal is the generalization of causal measures. We show that different sets of covariates are needed to generalize an effect to a different target population depending on (i) the causal measure of interest, (ii) the nature of the outcome, and (iii) the generalization method itself (generalizing either conditional outcome or local effects).
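The measures listed above, and the notion of collapsibility, can be made concrete on a toy example. The sketch below assumes a binary outcome, a randomized treatment, and two equally sized strata with made-up risks; the numbers and function names are hypothetical and serve only as an illustration.

```python
def odds(p):
    return p / (1 - p)


def causal_measures(p_control, p_treated):
    """Common effect measures for a binary outcome, computed from the two risks."""
    risk_difference = p_treated - p_control
    return {
        "risk_difference": risk_difference,
        "risk_ratio": p_treated / p_control,
        "odds_ratio": odds(p_treated) / odds(p_control),
        "number_needed_to_treat": 1 / abs(risk_difference),
    }


# Hypothetical strata: (risk under control, risk under treatment).
strata = {"A": (0.2, 0.5), "B": (0.5, 0.8)}

per_stratum = {name: causal_measures(*risks) for name, risks in strata.items()}

# With equally sized strata and randomized treatment, the population risks
# are the plain averages of the stratum risks.
p_control = sum(risks[0] for risks in strata.values()) / len(strata)
p_treated = sum(risks[1] for risks in strata.values()) / len(strata)
population = causal_measures(p_control, p_treated)

for name, measures in {**per_stratum, "population": population}.items():
    print(name, {key: round(value, 3) for key, value in measures.items()})
```

In this example the risk difference is the same in both strata and in the whole population, as expected from a collapsible measure. The odds ratio is also the same in both strata, yet the population odds ratio is smaller than either stratum value, which illustrates non-collapsibility. The risk ratio differs between the strata although the risk difference does not, so whether the effect looks homogeneous depends on the measure chosen.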

Validating probabilistic classifiers: beyond calibration

[3] Ensuring that a classifier gives reliable confidence scores is essential for informed decision-making. For instance, before using a clinical prognostic model, we want to establish that, for a given individual, it attributes probabilities of different clinical outcomes that can indeed be trusted. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities, if it is over-confident for some samples and under-confident for others. This is captured by the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that, given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. We propose an estimator to approximate the grouping loss [17]. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.
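The difference between calibration and grouping loss can be seen on a small population where the true posterior probabilities are known. The sketch below uses the standard decomposition of the expected Brier score into calibration, grouping, and irreducible terms; it is not the estimator proposed in [17], which has to work without access to the true posteriors. The four groups and their values are hypothetical.

```python
# Hypothetical population made of four equally likely groups of individuals.
# Each pair is (confidence score S, true posterior Q = P(Y = 1 | X)).
groups = [(0.75, 0.6), (0.75, 0.9), (0.25, 0.1), (0.25, 0.4)]
weight = 1 / len(groups)


def calibrated_score(score):
    """C = E[Y | S = score]: mean true posterior among groups sharing that score."""
    same_score = [q for s, q in groups if s == score]
    return sum(same_score) / len(same_score)


# Calibration loss: gap between the score and the average outcome at that score.
calibration_loss = sum(weight * (s - calibrated_score(s)) ** 2 for s, q in groups)
# Grouping loss: spread of the true posteriors hidden behind a single score.
grouping_loss = sum(weight * (calibrated_score(s) - q) ** 2 for s, q in groups)
# Irreducible loss: randomness of the outcome itself.
irreducible_loss = sum(weight * q * (1 - q) for s, q in groups)

# Expected Brier score E[(S - Y)^2] with Y drawn from Bernoulli(Q).
brier = sum(weight * (q * (s - 1) ** 2 + (1 - q) * s ** 2) for s, q in groups)

print("calibration loss:", round(calibration_loss, 4))
print("grouping loss:   ", round(grouping_loss, 4))
print("irreducible loss:", round(irreducible_loss, 4))
print("Brier score:     ", round(brier, 4))
```

Here the classifier is perfectly calibrated, and thresholding its scores at 0.5 gives the same decisions as thresholding the true posteriors, so its accuracy cannot be improved. Its scores are nonetheless off by 0.15 for every individual, and this error shows up only in the grouping loss.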

Machine learning for health and social sciences

Figure 4: The Dice Similarity Coefficient (DSC) is not sensitive to the number of objects detected, while this might be what is important for the application.

Understanding metric-related pitfalls in image analysis validation

[4] Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: while taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, we provided the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.

Figure 5: Heterogeneity in the scores may be invisible in a box plot. Adding the relevant stratum information on a scatter plot may reveal it.
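The pitfall summarized in the caption of figure 4 can be reproduced with a few lines of code. The sketch below works on a one-dimensional binary mask to stay dependency-free; the mask sizes and the helper names are hypothetical and only meant as an illustration.

```python
def dice(reference, prediction):
    """Dice Similarity Coefficient between two binary masks (flat lists of 0/1)."""
    overlap = sum(r and p for r, p in zip(reference, prediction))
    return 2 * overlap / (sum(reference) + sum(prediction))


def count_objects(mask):
    """Number of connected runs of 1s in a one-dimensional mask."""
    return sum(1 for prev, cur in zip([0] + mask, mask) if cur and not prev)


# Hypothetical 1D "image": one large object (90 pixels) and one small one (2 pixels).
reference = [1] * 90 + [0] * 5 + [1] * 2 + [0] * 3
# The prediction segments the large object perfectly but misses the small one.
prediction = [1] * 90 + [0] * 10

print("objects in reference: ", count_objects(reference))
print("objects in prediction:", count_objects(prediction))
print("DSC:", round(dice(reference, prediction), 3))
```

The DSC stays close to 1 although half of the objects are missed, because the score is dominated by the pixels of the large object. When the number of detected objects is what matters for the application, a detection-oriented metric is needed in addition to, or instead of, the DSC.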

Synchronous female bilateral breast cancers

[2] Synchronous bilateral breast cancer (sBBC) is a very rare situation where tumors in both breasts are detected in a patient. This configuration provides insights to understand the relationships among tumor, host (shared for the two tumors), immunity and response to treatment. However, little evidence exists regarding immune infiltration and response to treatment in sBBCs. With Institut Curie we analysed the health records of 404 patients with sBBCs (figure 6), out of 17,575 female patients with non-metastatic breast cancer between 2005 and 2015. We showed that the impact of the subtype of breast cancer on levels of tumor infiltrating lymphocytes (TIL) and on pathologic complete response (pCR) rates differs according to the concordant or discordant subtype of breast cancer of the contralateral tumor: luminal breast tumors with a discordant contralateral tumor had higher TIL levels and higher pCR rates than those with a concordant contralateral tumor. Our study indicates that tumor-intrinsic characteristics may have a role in the association of tumor immunity and pCR and demonstrates that the characteristics of the contralateral tumor are also associated with immune infiltration and response to treatment.

Figure 6: Flowchart representing the multi-source analysis of a cohort of synchronous bilateral breast cancer patients in collaboration with Institut Curie.

Additionally, on a subset of patients with available frozen tissue, tumor sequencing revealed that left and right tumors were independent regarding somatic mutations, copy number alterations and clonal phylogeny, whereas primary tumor and residual disease were closely related both from the somatic mutation and from the transcriptomic point of view.

Learning path personalization as recommending nodes on a bipartite graph

[5] Adaptive learning is an area of educational technology that consists in delivering personalized learning experiences to address the unique needs of each learner. An important subfield of adaptive learning is learning path personalization: it aims at designing systems that recommend sequences of educational activities to maximize students’ learning outcomes. In this work we framed learning path personalization as the recommendation of nodes on a bipartite graph of keywords and documents, and learned a policy for recommending documents based on prior user feedback, using reinforcement learning. Our model is based on a graph neural network, as such networks can be trained on some graphs and reused on new graphs, no matter the number of nodes, which makes the approach scalable as new documents are added. We evaluated it on simulated data, and showed good results compared to a baseline, even in the low-data regime.

Turn-key machine-learning tools for socio-economic impact

New release of scikit-learn

Scikit-learn is always improving, adding features for better and easier machine learning in Python. We list below a few highlights that are certainly not exhaustive but illustrate the continuous progress made.


Figure 7: A visualization of clusters estimated with the HDBSCAN algorithm using default hyper-parameters: the same data is used on the left and on the right, but on the right it is scaled by a factor of 0.5. On both datasets the clustering algorithm recovers the same set of 3 clusters, where the number of clusters is inferred from the data. This demonstrates the robustness of the algorithm to scaling in the data, a feature that makes it popular. scikit-learn.org/stable/auto_examples/cluster/plot_hdbscan.html

Release 1.3 (June 2023), with a large number of changes; the most notable ones are:

Addition of HDBSCAN, a modern hierarchical density-based clustering algorithm. Like OPTICS, it can be seen as a generalization of DBSCAN that allows for hierarchical instead of flat clustering; however, its approach differs from that of OPTICS. This algorithm is very robust with respect to its hyperparameters’ values and can be used on a wide variety of data without much, if any, tuning.
Addition of TargetEncoder, a categorical encoding based on the mean of the target conditioned on the value of the category.
Decision trees now natively handle missing values.
Addition of the class ValidationCurveDisplay, which allows easy plotting of validation curves.
The gradient-boosting models now support the Gamma-deviance loss function, useful for modeling strictly positive targets with a right-skewed distribution.
Similarly to OneHotEncoder, the OrdinalEncoder now supports aggregating infrequent categories into a single output for each feature.

skrub

skrub is a much younger package that strives to facilitate statistical learning on relational data, even when the data are messy. It is an evolution of the dirty-cat package, with a broader scope.

Release 0.1 (Dec 2023)

The TableVectorizer (to vectorize a table) is now more suitable for automatic usage:
automatic casting of types in transform,
no dimensionality explosion when a feature has two unique values, thanks to a OneHotEncoder that drops one of the two vectors,
transform can now return features without modification.

DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, …). It is now the default transformer used in the TableVectorizer for datetime columns.

AggJoiner: performs a join on an external table and aggregates the results in case of a one-to-many match. Useful for assembling features across multiple tables for learning.

joblib

joblib is a very simple computation engine in Python that is massively used worldwide, including as a dependency of packages such as scikit-learn for parallel computing.

Release 1.3 (August 2023). Many changes to follow evolutions of the ecosystem and improve behaviors (e.g., better error handling). Major changes are:

Addition of limits on the age and number of items in the cache.
Parallel computing can now return results asynchronously, enabling dynamic map-reduce-like behavior that decreases memory usage.