Alexis Cvetkov-Iliev: “Analytics across data sources: more learning, rather than more cleaning”

Alexis Cvetkov-Iliev will present his work with Gael Varoquaux on October 1st, 2pm. It will be online on Zoom at

Title: Analytics across data sources: more learning, rather than more cleaning

Abstract: Aggregating data across multiple sources is challenging because knowledge-representation conventions vary from one source to the next. To answer an analytical question, a typical workflow requires entity matching: merging variants across sources into clean categorical variables for the analysis. This task still requires labor-intensive manual supervision from the analyst, despite great progress in record linkage and deduplication techniques.
Here we argue that advanced statistical tools can directly address many analytic tasks across data sources with less manual data cleaning. Reframing analytical questions as machine learning tasks enables the use of vector representations of the data to leverage similarities between entries instead of relying on an exact matching of entities. We benchmark such an approach against manually supervised entity matching, answering analytic questions typical of socio-economic studies across 14 real-world employee databases. Approaches based on continuous embeddings and machine learning models are competitive with classical techniques, while requiring considerably less human labor. In this light, we believe that more research is needed on blending machine learning into analytic data stores for the purpose of analysis, rather than cleaning.
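
To make the idea of "similarities between entries instead of exact matching" concrete, here is a minimal sketch in Python. It is not the method benchmarked in the talk: it swaps in a deliberately simple technique (character trigram counts, cosine similarity, and a similarity-weighted average over the nearest entries) as a stand-in for the continuous embeddings and machine learning models mentioned in the abstract. The records, job titles, salaries, and function names are all hypothetical and exist only for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Map a string to a sparse bag of character n-grams (a crude embedding)."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy records: (job title as written in some source, salary).
records = [
    ("Police Officer", 52000),
    ("police officer II", 58000),
    ("POLICE OFFCR", 55000),
    ("Librarian", 41000),
    ("Senior Librarian", 47000),
]


def exact_match_mean(query, records):
    """Classical route: average over entries whose title matches exactly."""
    values = [salary for title, salary in records if title == query]
    return sum(values) / len(values) if values else None


def similarity_weighted_mean(query, records, k=3):
    """Learning route: average over the k most similar entries,
    weighted by string similarity, with no manual cleaning step."""
    q = ngram_vector(query)
    scored = sorted(
        ((cosine(q, ngram_vector(title)), salary) for title, salary in records),
        reverse=True,
    )[:k]
    total = sum(weight for weight, _ in scored)
    return sum(w * s for w, s in scored) / total if total else None


query = "Police Ofcr"  # a spelling variant that appears in no source verbatim
print(exact_match_mean(query, records))          # no exact match is found
print(similarity_weighted_mean(query, records))  # estimate from similar titles
```

Without a manual matching step, the exact-match route finds nothing for the unseen variant, while the similarity-based route still produces an estimate drawn from the police-related entries, since the librarian titles share no trigrams with the query.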
