ConnectionLens: graph integration of structured, semistructured and unstructured data
Data-intensive applications need to work with heterogeneous data sources, which can be structured (e.g., relational or CSV), semi-structured (e.g., JSON, XML or RDF), or unstructured (e.g., text or PDF). We have developed ConnectionLens, a system for integrating heterogeneous, independently authored data sources in a single graph. It is particularly suited to workloads that explore connections across data sources, data formats and granularities, such as data journalism projects. To discover connections across data sources and enhance their value for the user, ConnectionLens leverages Information Extraction (Named Entity Recognition) and Named Entity Disambiguation. ConnectionLens also allows querying the integrated graph by means of flexible keyword queries.
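To make the idea concrete, here is a minimal, self-contained Python sketch of the general approach described above: load a CSV, a JSON and a text source into one graph, connect them through shared entity mentions, and answer a two-keyword query with a connecting path. It is an illustration only and not ConnectionLens code. All names (`Graph`, `add_csv`, `add_json`, `add_text`, `find_connection`) and the sample data are hypothetical. A simple regular expression stands in for Named Entity Recognition, case-insensitive string equality stands in for Named Entity Disambiguation, and a breadth-first search stands in for the actual keyword search algorithm.

```python
import csv
import io
import json
import re
from collections import defaultdict, deque


class Graph:
    """Undirected graph; value nodes with the same normalized text are shared."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.label = {}

    def node(self, node_id, label):
        self.label.setdefault(node_id, label)
        self.adj[node_id]  # make sure the node has an adjacency entry
        return node_id

    def value(self, text):
        # Assumption: equal strings (ignoring case) denote the same entity.
        text = text.strip()
        return self.node(("value", text.lower()), text)

    def edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)


def add_csv(g, name, content):
    # One node per row, linked to one value node per cell.
    for i, row in enumerate(csv.DictReader(io.StringIO(content))):
        row_node = g.node((name, i), f"{name} row {i}")
        for cell in row.values():
            g.edge(row_node, g.value(cell))


def add_json(g, name, content):
    # One node for a flat JSON object, linked to its field values.
    obj_node = g.node((name, 0), f"{name} object")
    for v in json.loads(content).values():
        g.edge(obj_node, g.value(str(v)))


# Toy stand-in for NER: sequences of two or more capitalized words.
ENTITY = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def add_text(g, name, content):
    text_node = g.node((name, 0), f"{name} text")
    for mention in ENTITY.findall(content):
        g.edge(text_node, g.value(mention))


def find_connection(g, kw_a, kw_b):
    """Return the labels on a shortest path between nodes matching each keyword."""
    def matches(kw):
        return {n for n, lab in g.label.items() if kw.lower() in lab.lower()}

    starts, targets = matches(kw_a), matches(kw_b)
    parent = {n: None for n in starts}
    queue = deque(starts)
    while queue:
        cur = queue.popleft()
        if cur in targets:
            path = []
            while cur is not None:
                path.append(g.label[cur])
                cur = parent[cur]
            return path[::-1]
        for nxt in g.adj[cur]:
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None


g = Graph()
add_csv(g, "companies.csv", "company,ceo\nAcme Corp,Alice Martin\n")
add_json(g, "donations.json", '{"donor": "Acme Corp", "recipient": "Green Party"}')
add_text(g, "article.txt", "Alice Martin met Bob Durand in Paris.")

path = find_connection(g, "Durand", "Green")
print(" -> ".join(path))
```

In this toy graph, the query for "Durand" and "Green" is answered by a path that crosses all three sources, because the text and the CSV row share a mention of one person and the CSV row and the JSON object share a company name.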
ConnectionLens is developed as part of the ANR/DGA AI Chair SourcesSay and also benefits from the support of the national “Plan IA” and of the DIM RFSI program. We explore applications in collaboration with Le Monde and WeDoData.
Download
You can find the system here: https://gitlab.inria.fr/cedar/connectionlens
Publications
- (Reference publication) “Graph integration of structured, semistructured and unstructured data for data journalism” by Angelos-Christos Anadiotis, Oana Balalau, Catarina Conceicao, Helena Galhardas, Mhd Yamen Haddad, Ioana Manolescu, Tayeb Merabti, Jingmao You. In Elsevier Journal of Information Systems, 104:101846, 2022
This article provides a complete description of the vision, the system architecture, and an experimental assessment as of early 2021.
- (Application paper) “Empowering Investigative Journalism with Graph-based Heterogeneous Data Management”, by Angelos-Christos Anadiotis, Oana Balalau, Theo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stephane Horel, Ioana Manolescu, Youssr Youssef. Accepted for publication in a special issue of the IEEE Data Engineering Bulletin, 2021.
Here we describe an application of ConnectionLens to the detection of conflicts of interest in the biomedical domain. To scale up its search, we also describe a novel, in-memory, parallel query answering engine.
Invited talks
- “What do the Sources Say? Exploring Journalistic Data as a Graph”, Ioana Manolescu, invited seminar at IRISA (March 18, 2021) and at DOLAP 2021 (March 23, 2021)
- “Integrating (Very) Heterogeneous Data Sources: A Structured and an Unstructured Perspective”, Ioana Manolescu, ADBIS 2020 – 24th European Conference on Advances in Databases and Information Systems, Aug 2020, Lyon, France. pp.15-20, ⟨10.1007/978-3-030-54832-2_3⟩
- “From Data to the Press: Data Management for Journalism and Fact-Checking”, Ioana Manolescu, DATA 2020 – 9th International Conference on Data Science, Technology and Applications, Jul 2020, Paris / Virtual, France
Conference and journal papers
- “PathWays: entity-focused exploration of heterogeneous data graphs”, by Nelly Barret, Antoine Gauquier, Jean Jia Law and Ioana Manolescu (demonstration), ESWC 2023
- “More power to SPARQL: From paths to trees”, by Angelos Anadiotis, Ioana Manolescu and Madhulika Mohanty (demonstration), ESWC 2023
- “Integrating Connection Search in Graph Queries”, by Angelos Anadiotis, Ioana Manolescu and Madhulika Mohanty, IEEE ICDE 2023
- “Abstra: Toward Generic Abstractions for Data of Any Model”, by Nelly Barret, Ioana Manolescu and Prajna Upadhyay (demonstration), CIKM 2022. (Demonstration video)
- “Discovering Conflicts of Interest across Heterogeneous Data Sources with ConnectionLens”, by Angelos-Christos Anadiotis, Oana Balalau, Théo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stéphane Horel, Ioana Manolescu and Youssr Youssef (demonstration), CIKM 2021. (Demonstration video)
- “Toward Generic Abstractions for Data of Any Model” by Nelly Barret, Ioana Manolescu, Prajna Upadhyay, BDA 2021 – Informal publication only, Oct 2021, Paris, France
- “Efficiently identifying disguised nulls in heterogeneous text data” by Théo Bouganim, Helena Galhardas, Ioana Manolescu, BDA (Conférence sur la Gestion de Données – Principes, Technologies et Applications), Oct 2021, Paris, France
- “Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data” by Théo Bouganim, Helena Galhardas, Ioana Manolescu, Transactions on Large-Scale Data- and Knowledge-Centered Systems LI, 13410, Springer Berlin Heidelberg, pp.97-118, 2022
- “Graph-based keyword search in heterogeneous data sources” by Angelos Christos Anadiotis, Mhd Yamen Haddad and Ioana Manolescu, in 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA) 2020, informal publication (presentation slides)
- “Graph integration of structured, semistructured and unstructured data for data journalism” by Oana Balalau, Catarina Conceição, Helena Galhardas, Ioana Manolescu, Tayeb Merabti, Jingmao You and Youssr Youssef, in 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA) 2020, informal publication (presentation slides)
- “Toward Visual Interactive Exploration of Heterogeneous Graphs”, Irène Burger, Ioana Manolescu, Emmanuel Pietriga, Fabian Suchanek, SEAdata 2020 – Workshop on Searching, Exploring and Analyzing Heterogeneous Data in conjunction with EDBT/ICDT, Mar 2020, Copenhagen, Denmark
- “ConnectionLens: Finding Connections Across Heterogeneous Data Sources”, Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh-Huong Le Nguyen, Ioana Manolescu, Proceedings of the VLDB Endowment (PVLDB), VLDB Endowment, 2018, 11, pp. 2030-2033. 10.14778/3229863.3236252 (also accepted for informal presentation at Bases de Données Avancées 2018)