Gaël Varoquaux, Inria Saclay

28 June 2019, 10:30-11:30

ENS, S16

**Statistics on tables with non-curated entries**

“Dirty data” is said to be the data scientist's worst time sink. We investigate a specific data-quality challenge at the intersection of database curation and statistical learning.

Data tables often contain many non-numerical entries. Knowledge engineering in databases typically strives to recognize entities in these entries. For instance, deduplication and record linkage are used to match entities expressed differently across the data.

On the other hand, statistical techniques, as in machine learning, tend to cast all entries to numerical vectors, given that statistical models and regularities are easier to formulate in vector spaces. To analyze data with entries that represent discrete entities, a standard pipeline is to curate them with deduplication approaches, after which the resulting categories are “one-hot encoded”: represented in a vector space by orthogonal binary vectors. The success of such a pipeline depends crucially on the quality of the deduplication. In addition, it can create very high-dimensional vector representations that lead to statistical and computational problems in the machine learning step.
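As a minimal sketch of the one-hot step described above, the snippet below encodes a small column in plain Python. The function name `one_hot_encode` and the sample entries are hypothetical and chosen only for illustration. The example assumes no deduplication has been applied, so every spelling variant becomes its own category.

```python
def one_hot_encode(entries):
    """Map each distinct string to an orthogonal binary vector."""
    categories = sorted(set(entries))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for entry in entries:
        vector = [0] * len(categories)
        vector[index[entry]] = 1
        vectors.append(vector)
    return categories, vectors


# Hypothetical non-curated column: three spellings of one entity, plus another entity.
entries = ["Inria Saclay", "INRIA Saclay", "Inria, Saclay", "ENS Paris"]
categories, vectors = one_hot_encode(entries)

# Dot product between the encodings of two spellings of the same entity.
overlap = sum(a * b for a, b in zip(vectors[0], vectors[1]))
print(len(categories), overlap)
```

Without curation, the number of dimensions grows with the number of distinct spellings, and two variants of the same entity are exactly as dissimilar as two unrelated entities.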

I will introduce statistical models of strings, useful to build low-dimensional representations of the entries that capture their morphological variations. These capture the string similarities between entries. They can also reveal latent categories that interpolate smoothly between various categories of entries without the need for cleaning or deduplication. Finally, we show that they lead to computationally and statistically efficient machine learning on non-curated tables.
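To illustrate how string similarity can replace exact category matching, here is a rough sketch based on character 3-gram overlap (Jaccard similarity). This is a simple stand-in chosen for illustration, not the statistical models presented in the talk. The helper names `char_ngrams`, `ngram_similarity`, and `similarity_encode`, as well as the sample entries and prototypes, are assumptions.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(entries, prototypes, n=3):
    """Represent each entry by its similarity to a few reference strings."""
    return [[ngram_similarity(entry, p, n) for p in prototypes] for entry in entries]


# Hypothetical non-curated entries and reference strings.
entries = ["Inria Saclay", "Inria, Saclay", "ENS Paris"]
prototypes = ["Inria Saclay", "ENS Paris"]
encoded = similarity_encode(entries, prototypes)

for entry, row in zip(entries, encoded):
    print(entry, [round(value, 2) for value in row])
```

In this sketch the dimension of the representation is set by the number of reference strings rather than by the number of distinct spellings, and a misspelled entry stays close to its clean counterpart instead of becoming orthogonal to it.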

Bio: Gaël Varoquaux is a computer-science researcher at Inria. His research focuses on statistical learning tools for data science and scientific inference. He has pioneered the use of machine learning on brain images to map cognition and pathologies. More generally, he develops tools to make machine learning easier, with statistical models suited for real-life, uncurated data, and software for data science. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has contributed key methods for learning on spatial data, matrix factorizations, and modeling covariance matrices. He has a PhD in quantum physics and is a graduate of Ecole Normale Superieure, Paris.