Learning from heterogeneous data

Our key insight is that machine learning itself can cope well with errors and with qualitative or noisy data. Hence, we aim to perform statistical analysis directly on heterogeneous data. The ongoing projects are:

Joint analysis of heterogeneous data sources

The tech world is abuzz with ‘big data’, in which many observations of the same phenomenon make it possible to build very rich data-driven models. However, in a wide variety of fields of study, observations are difficult to acquire and require manual operations. In such settings, the growth in dimensionality of the data with a limited number of observations leads to a challenging statistical problem, the curse of dimensionality. Yet many application fields accumulate weakly related datasets, with observations of different natures coming from numerous related data acquisitions.

The goal of this project is to develop a statistical-learning framework that can leverage the weak links across datasets to improve the statistical task on each of the datasets. Technically, one option to explore is to learn latent factors, or ‘representations’ as they are called in deep learning, that are common to the multiple tasks. Non-linear mappings or kernels may be necessary to deal with the differing natures of the data. Such a framework should make it possible to use a wide variety of datasets to improve prediction in specific, separate tasks.
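As a rough illustration of the shared-factor idea, the sketch below pools two small datasets and extracts a single latent direction common to both, then re-expresses one dataset along it. It is not the project's method: it assumes, for simplicity, that both datasets describe their observations with the same two features and that one linear factor suffices. The function names `shared_direction` and `project` and the toy values are hypothetical.

```python
import math

def shared_direction(datasets, n_iter=200):
    """Leading principal direction of the pooled data (power iteration)."""
    p = len(datasets[0][0])
    centered = []
    for rows in datasets:
        means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        # Center each dataset separately so that dataset-specific offsets
        # do not dominate the shared factor.
        centered += [[r[j] - means[j] for j in range(p)] for r in rows]
    # Assumption: this starting vector is not orthogonal to the leading direction.
    v = [1.0] * p
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(r, v)) for r in centered]
        v = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

def project(rows, direction):
    """Scores of one dataset along the shared direction, after centering."""
    p = len(direction)
    means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    return [sum((r[j] - means[j]) * direction[j] for j in range(p)) for r in rows]

# Made-up toy datasets: same two features, different offsets and samples.
dataset_a = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
dataset_b = [[10.0, 5.0], [11.0, 6.2], [12.0, 6.8], [13.0, 8.0]]

direction = shared_direction([dataset_a, dataset_b])
scores_a = project(dataset_a, direction)
```

Each dataset is then summarized by one score per observation, which a downstream predictor could use in place of the raw features. Data of genuinely different natures would call for the non-linear mappings or kernels mentioned above rather than this linear pooling.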

Meta-analysis of Brain Responses to shape a Cognitive Atlas

Neuroimaging studies the brain activation evoked by various central concepts of cognitive science. Relating these studies is challenging, as there are no accepted links across these concepts: are attention and vigilance sub-notions of consciousness, or is consciousness a property of its own?

This project proposes to use tools from distributional semantics, pioneered in text processing, to learn a semantic structure of cognitive science from the similarities in the brain responses associated with these concepts. The project relies on the emerging large-scale databases of brain activation images and coordinates, linked to the corresponding cognitive-science publications. The resulting structure over cognitive concepts will be used in brain-decoding tasks, that is, predicting the subject’s behavior from brain activity. This machine-learning task grounds the association of brain regions with specific cognitive aspects of behavior.
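To make the distributional idea concrete, here is a minimal sketch in which each concept is represented by a vector of activation levels and two concepts count as related when their vectors point in similar directions (cosine similarity). The region count, the activation values and the helper names `cosine` and `nearest_concept` are all made up for illustration; they are not measurements or part of the project's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up activation profiles over four hypothetical brain regions.
activation = {
    "attention": [0.9, 0.8, 0.1, 0.0],
    "vigilance": [0.8, 0.9, 0.2, 0.1],
    "language": [0.1, 0.0, 0.9, 0.8],
}

def nearest_concept(name, profiles):
    """Concept whose activation profile is most similar to that of `name`."""
    others = (c for c in profiles if c != name)
    return max(others, key=lambda c: cosine(profiles[name], profiles[c]))

closest = nearest_concept("attention", activation)
```

With these toy profiles, "attention" lands closer to "vigilance" than to "language", which is the kind of link between concepts that the semantic structure is meant to capture.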

Crawling and structuring Open Data

Machine learning has inspired new markets and applications by extracting new insights from complex and noisy data. However, the most costly step in such analyses is often preparing the data. This entails correcting input errors and inconsistencies, as well as transforming the data into a single matrix-shaped table that comprises all descriptors of interest for all observations under study.

This project aims to explore these concerns using French open data.
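The preparation step described above can be sketched on a toy case: two sources that identify the same observations with inconsistent spelling and write numbers in different conventions are harmonized and merged into one table with a row per observation and a column per descriptor. The source names, keys, values and the helpers `normalize_key`, `parse_number` and `build_table` are hypothetical; real open data would need far more cleaning rules than these.

```python
def normalize_key(name):
    """Harmonize an identifier: trim, collapse inner whitespace, lowercase."""
    return " ".join(name.split()).lower()

def parse_number(text):
    """Parse a number written with a decimal comma or point; None if invalid."""
    cleaned = text.strip().replace(" ", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

def build_table(sources):
    """Merge {column: {raw_key: raw_value}} into one row per observation."""
    columns = sorted(sources)
    rows = {}
    for column, records in sources.items():
        for raw_key, raw_value in records.items():
            # Missing descriptors stay None so every row has every column.
            row = rows.setdefault(normalize_key(raw_key), dict.fromkeys(columns))
            row[column] = parse_number(raw_value)
    return rows

# Made-up sources with inconsistent keys and number formats.
sources = {
    "population": {"Ville A ": "12 000", "VILLE B": "3500"},
    "area_km2": {"ville a": "105,4", "Ville  B": "47,9", "Ville C": "n/a"},
}

table = build_table(sources)
```

Every row of `table` carries the same columns, with `None` marking values that were absent or unparseable, so the result can be turned directly into the matrix-shaped input that most learning tools expect.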
