Our key insight is that machine learning itself can deal well with errors, qualitative and noisy data. Hence, we aim to do statistical analysis directly on heterogeneous data. The ongoing projects are:
- Joint analysis of heterogeneous data sources
- Learning from dirty categorical data
- Using large corpora of neuroimaging studies for brain mapping
- Crawling and structuring Open Data
- Missing Data in the Big Data Era
Joint analysis of heterogeneous data sources
The tech world is abuzz with ‘big data’, in which many observations of the same phenomenon enable building very rich data-driven models. However, for a wide variety of fields of study, observations are difficult to acquire and require performing manual operations. Conversely, the growth in dimensionality of the data with a limited number of observation leads to a challenging statistical problem, the curse of dimensionality. Yet, many application fields face an accumulation of weakly-related datasets with observations of different nature and from numerous related data acquisitions.
The goal of this project is to develop a statistical-learning framework that can leverage the weak links across datasets to improve the statistical task on each of the dataset. Technically, one option to explore would be to learn latent factors, or ‘representations’ as they are called in deep learning, common to the multiple tasks. Non-linear mappings or kernels may be necessary to deal with the multiple nature of the data. This framework should help using a wide variety of datasets to improve prediction in specific, separate tasks.
Learning from dirty categorical data
Most statistical learning algorithms require a numerical feature matrix as input . In the presence of categorical variables in the data, feature engineering is needed to encode the different categories into a suitable feature vector. In controlled data-collection settings, categorical variables are standardized: the set of categories is finite, well-known a priori and categories are mutually exclusive. In this case, the common approach is to use one-hot encoding.
On the contrary, many of real-world data contain non-standardized categorical variables, which can arise from a variety of mechanisms: typographical errors, use of special characters, concatenated hierarchical data, etc. From a data-integration point of view, “dirty” categories may be seen as a data cleaning problem, addressed, for instance, with entity resolution. The goal of this project is to study novel methods to represent dirty categories in a robust way in order to perform statistical learning without a data-cleaning step.
One simple approach to solve to this problem is similarity encoding, which creates embeddings based on morphological string similarities. Implementations and examples on learning with dirty categories can be found in the Python package dirty-cat.
This research is supported by the DirtyData project.
Crawling and structuring Open Data
Machine learning has inspired new markets and applications by extracting new insights from complex and noisy data. However, to perform such analyses, the most costly step is often to prepare the data. It entails correcting input errors and inconsistencies as well as transforming the data into a single matrix-shaped table that comprises all interesting descriptors for all observations to study.
This project aims to explore these concerns using French open data.