ConnectionLens: graph integration of structured, semistructured and unstructured data
Data-intensive applications need to work with heterogeneous data sources, which can be structured (e.g., relational or CSV), semi-structured (e.g., JSON, XML or RDF), or unstructured (e.g., text or PDF). We have developed ConnectionLens, a system for integrating heterogeneous, independently authored data sources into a single graph. It is particularly suited to workloads that explore connections across data sources, data formats, and granularities, such as data journalism projects. To discover connections across data sources and enhance their value for the user, ConnectionLens leverages Information Extraction (Named Entity Recognition) and Named Entity Disambiguation. Further, ConnectionLens allows querying the integrated graph by means of flexible keyword queries.
ConnectionLens is developed as part of the ANR/DGA AI Chair SourcesSay and also benefits from the support of the national “Plan IA” and of the DIM RFSI program. We explore applications in collaboration with Le Monde and WeDoData.
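To make the idea concrete, here is a minimal, illustrative sketch in Python. It is not the ConnectionLens implementation: the `Graph` class, the helper functions, the sample data, and the dictionary-based entity matcher (a crude stand-in for real Named Entity Recognition) are all assumptions made for this example. It shows how records and text from different sources can land in one graph, how shared entity nodes link the sources, and how a two-keyword query can be answered by searching for a connecting path.

```python
from collections import defaultdict, deque

# Hypothetical toy "NER": a fixed list of entity names.
# A real system would use a trained extractor and a disambiguation step.
KNOWN_ENTITIES = ["Alice Martin", "Acme Corp"]


class Graph:
    """Undirected graph stored as adjacency sets; nodes are plain tuples."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)


def extract_entities(text):
    """Return the known entity names that occur in the text."""
    return [e for e in KNOWN_ENTITIES if e in text]


def add_value(graph, parent, value):
    """Attach a leaf value node to its parent and link the entities found in it."""
    leaf = ("value", parent, value)
    graph.add_edge(parent, leaf)
    for entity in extract_entities(value):
        # Entity nodes are shared by all sources; this is what connects them.
        graph.add_edge(leaf, ("entity", entity))


def add_record(graph, source, record):
    """Integrate a flat record (CSV row or JSON object) as source -> field -> value."""
    for field, value in record.items():
        field_node = ("field", source, field)
        graph.add_edge(("source", source), field_node)
        add_value(graph, field_node, str(value))


def add_text(graph, source, text):
    """Integrate an unstructured text as a single value under its source."""
    add_value(graph, ("source", source), text)


def node_matches(node, keyword):
    """A node matches if its label (last tuple component) contains the keyword."""
    return keyword.lower() in str(node[-1]).lower()


def connect(graph, kw1, kw2):
    """Breadth-first search for a shortest path between nodes matching kw1 and kw2."""
    starts = [n for n in graph.adj if node_matches(n, kw1)]
    parent = {n: None for n in starts}
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        if node_matches(node, kw2):
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Made-up sample data: one CSV-like row, one JSON-like object, one text.
g = Graph()
add_record(g, "grants.csv", {"recipient": "Alice Martin", "amount": 5000})
add_record(g, "companies.json", {"name": "Acme Corp", "ceo": "Alice Martin"})
add_text(g, "article.txt", "Acme Corp announced a new product.")

path = connect(g, "5000", "product")
for node in path:
    print(node[0], "->", node[-1])
```

In this toy setting, the grant amount and the article are connected only through the two extracted entities, which is the kind of cross-source link the description above refers to. Breadth-first search is used here purely for brevity; it is a swapped-in technique, not the system's keyword search algorithm.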
You can find the system here: https://gitlab.inria.fr/cedar/connectionlens
- (Reference publication) “Graph integration of structured, semistructured and unstructured data for data journalism” by Angelos-Christos Anadiotis, Oana Balalau, Catarina Conceição, Helena Galhardas, Mhd Yamen Haddad, Ioana Manolescu, Tayeb Merabti, Jingmao You. In Elsevier Journal of Information Systems, 2021
This article provides a complete description of the vision, the system architecture, and an experimental assessment as of early 2021.
- (Application paper) “Empowering Investigative Journalism with Graph-based Heterogeneous Data Management”, by Angelos-Christos Anadiotis, Oana Balalau, Théo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stéphane Horel, Ioana Manolescu, Youssr Youssef. Accepted for publication in a special issue of the IEEE Data Engineering Bulletin, 2021.
Here we describe an application of ConnectionLens to the detection of conflicts of interest in the biomedical domain. To scale up its search, we also describe a novel, in-memory, parallel query answering engine.
- “What do the Sources Say? Exploring Journalistic Data as a Graph”, Ioana Manolescu, invited seminar at IRISA (March 18, 2021) and at DOLAP 2021 (March 23, 2021)
- “Integrating (Very) Heterogeneous Data Sources: A Structured and an Unstructured Perspective”, Ioana Manolescu, ADBIS 2020 – 24th European Conference on Advances in Databases and Information Systems, Aug 2020, Lyon, France. pp.15-20, ⟨10.1007/978-3-030-54832-2_3⟩
- “From Data to the Press: Data Management for Journalism and Fact-Checking”, Ioana Manolescu, DATA 2020 – 9th International Conference on Data Science, Technology and Applications, Jul 2020, Paris / Virtual, France
- “Discovering Conflicts of Interest across Heterogeneous Data Sources with ConnectionLens”, by Angelos-Christos Anadiotis, Oana Balalau, Théo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stéphane Horel, Ioana Manolescu and Youssr Youssef. Demonstration accepted for publication at CIKM 2021. (Demonstration video)
- “Toward Generic Abstractions for Data of Any Model” by Nelly Barret, Ioana Manolescu, Prajna Upadhyay, BDA 2021 – Informal publication only, Oct 2021, Paris, France
- “Efficiently identifying disguised nulls in heterogeneous text data” by Théo Bouganim, Helena Galhardas, Ioana Manolescu, BDA (Conférence sur la Gestion de Données – Principes, Technologies et Applications), Oct 2021, Paris, France
- “Graph-based keyword search in heterogeneous data sources” by Angelos Christos Anadiotis, Mhd Yamen Haddad and Ioana Manolescu, in 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA) 2020, informal publication (presentation slides)
- “Graph integration of structured, semistructured and unstructured data for data journalism” by Oana Balalau, Catarina Conceição, Helena Galhardas, Ioana Manolescu, Tayeb Merabti, Jingmao You and Youssr Youssef, in 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA) 2020, informal publication (presentation slides)
- “Toward Visual Interactive Exploration of Heterogeneous Graphs” by Irène Burger, Ioana Manolescu, Emmanuel Pietriga, Fabian Suchanek. SEAdata 2020 – Workshop on Searching, Exploring and Analyzing Heterogeneous Data in conjunction with EDBT/ICDT, Mar 2020, Copenhagen, Denmark
- “ConnectionLens: Finding Connections Across Heterogeneous Data Sources” by Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh-Huong Le Nguyen, Ioana Manolescu. Proceedings of the VLDB Endowment (PVLDB), VLDB Endowment, 2018, 11, pp. 2030-2033. 10.14778/3229863.3236252 (also accepted for informal presentation at Bases de Données Avancées 2018)
- Angelos Anadiotis (Ecole Polytechnique, CEDAR)
- Oana Balalau (Inria, CEDAR)
- Nelly Barret (Inria, CEDAR)
- Théo Bouganim (Inria, CEDAR)
- Francesco Chimienti (Inria, CEDAR)
- Helena Galhardas (University of Lisbon and IST, Portugal)
- Mhd-Yamen Haddad (Inria, CEDAR)
- Ioana Manolescu (Inria, CEDAR)
- Madhulika Mohanty (Inria, CEDAR)
- Daniel Quintao (X 2018)
- Prajna Upadhyay (Inria, CEDAR)
- Julien Leblay (AIST, Japan)
- Catarina Conceição (U. Lisbon and IST, Portugal)
- Tayeb Merabti (Inria)
- Camille Chanial, Rédouane Dziri, Minh-Huong Le Nguyen (X 2015); Lucas Elbert (X 2016); Irène Burger, Jérémie Feitz, Jingmao You (X 2017); Youssr Youssef (ENSTA).