ConnectionLens: graph integration of structured, semistructured and unstructured data
Data-intensive applications need to work with heterogeneous data sources, which can be structured (e.g., relational or CSV), semi-structured (e.g., JSON, XML or RDF), or unstructured (e.g., text or PDF). We have developed ConnectionLens, a system for integrating heterogeneous, independently authored data sources in a single graph. It is particularly suited to workloads that explore connections across data sources, data formats and granularities, such as data journalism projects. To discover connections across data sources and enhance their value for the user, ConnectionLens leverages Information Extraction (Named Entity Recognition) and Named Entity Disambiguation. Further, ConnectionLens allows querying the integrated graph by means of flexible keyword queries.
ConnectionLens is developed as part of the ANR/DGA AI Chair SourcesSay and also benefits from the support of the national “Plan IA” and of the DIM RFSI program. We explore applications in collaboration with Le Monde and WeDoData.
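To make the idea above concrete, here is a minimal toy sketch in Python. It is not ConnectionLens code and does not use its API: the `Graph` class, the `KNOWN_ENTITIES` list (a dictionary lookup standing in for Named Entity Recognition), the sample file names and the sample values are all hypothetical. It only illustrates how values from a CSV row, a JSON document and a text can end up connected through shared entity nodes, and how a two-keyword query can return a path linking the matching nodes.

```python
from collections import defaultdict, deque

# Hypothetical stand-in for an NER model: a fixed list of entity names.
KNOWN_ENTITIES = ["Alice Martin", "Acme Corp"]


class Graph:
    def __init__(self):
        self.labels = {}             # node id -> label
        self.adj = defaultdict(set)  # node id -> neighbor ids (undirected)
        self.entity_nodes = {}       # entity name -> shared entity node id

    def add_node(self, label):
        node_id = len(self.labels)
        self.labels[node_id] = label
        return node_id

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def add_value(self, parent, text):
        """Add a value node under `parent`; link it to each entity it mentions."""
        value = self.add_node(text)
        self.add_edge(parent, value)
        for name in KNOWN_ENTITIES:
            if name.lower() in text.lower():
                # One node per entity, shared by all sources that mention it.
                if name not in self.entity_nodes:
                    self.entity_nodes[name] = self.add_node(name)
                self.add_edge(value, self.entity_nodes[name])
        return value

    def match(self, keyword):
        """Return ids of nodes whose label contains the keyword (case-insensitive)."""
        kw = keyword.lower()
        return [n for n, label in self.labels.items() if kw in label.lower()]

    def connect(self, kw1, kw2):
        """Breadth-first search: labels on a shortest path between matches, or None."""
        targets = set(self.match(kw2))
        starts = self.match(kw1)
        parent = {s: None for s in starts}
        queue = deque(starts)
        while queue:
            node = queue.popleft()
            if node in targets:
                path = []
                while node is not None:
                    path.append(self.labels[node])
                    node = parent[node]
                return path[::-1]
            for nxt in self.adj[node]:
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


def build_demo_graph():
    """Load three made-up sources of different formats into one graph."""
    g = Graph()
    csv_row = g.add_node("csv:grants.csv/row1")
    g.add_value(csv_row, "Acme Corp")
    g.add_value(csv_row, "50000")
    json_doc = g.add_node("json:board.json")
    g.add_value(json_doc, "Alice Martin")
    text_doc = g.add_node("text:article.txt")
    g.add_value(text_doc, "Alice Martin advised Acme Corp last year.")
    return g
```

Calling `build_demo_graph().connect("50000", "board.json")` returns a path that leaves the CSV row, crosses the text through the two shared entity nodes, and ends in the JSON document. The real system relies on trained extraction and disambiguation models and a far more elaborate search than this breadth-first traversal.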
Download
You can find the system here: https://gitlab.inria.fr/cedar/connectionlens
Publications
- (Reference publication) “Graph integration of structured, semistructured and unstructured data for data journalism” by Angelos-Christos Anadiotis, Oana Balalau, Catarina Conceicao, Helena Galhardas, Mhd Yamen Haddad, Ioana Manolescu, Tayeb Merabti, Jingmao You. In Elsevier Journal of Information Systems, 104:101846, 2022
This article provides a complete description of the vision, the system architecture, and an experimental assessment as of early 2021.
- (Application paper) “Empowering Investigative Journalism with Graph-based Heterogeneous Data Management”, by Angelos-Christos Anadiotis, Oana Balalau, Theo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stephane Horel, Ioana Manolescu, Youssr Youssef. Accepted for publication in a special issue of the IEEE Data Engineering Bulletin, 2021.
Here we describe an application of ConnectionLens to the detection of conflicts of interest in the biomedical domain. To scale up its search, we also describe a novel, in-memory, parallel query answering engine.
All conference and journal papers
- “Finding the PG schema of any (semi)structured dataset: a tale of graphs and abstraction”, by Nelly Barret, Tudor Enache, Ioana Manolescu and Madhulika Mohanty, SEAGraph Workshop at ICDE 2024.
- “Graph lenses over any data: the ConnectionLens experience”, by Oana Balalau, Nelly Barret, Simon Ebel, Théo Galizzi, Ioana Manolescu and Madhulika Mohanty, SEAGraph Workshop at ICDE 2024.
- “User-friendly exploration of highly heterogeneous data lakes”, by Nelly Barret, Simon Ebel, Théo Galizzi, Ioana Manolescu and Madhulika Mohanty, CoopIS 2023 (ConnectionStudio Code and Website).
- “Full-Power Graph Querying: State of the Art and Challenges”, by Ioana Manolescu and Madhulika Mohanty (Tutorial), VLDB 2023. (Tutorial Webpage)
- “Exploring heterogeneous data graphs through their entity paths”, by Nelly Barret, Antoine Gauquier, Jean Jia Law and Ioana Manolescu, ADBIS 2023
- “PathWays: entity-focused exploration of heterogeneous data graphs”, by Nelly Barret, Antoine Gauquier, Jean Jia Law and Ioana Manolescu (demonstration), ESWC 2023
- “More power to SPARQL: From paths to trees”, by Angelos Anadiotis, Ioana Manolescu and Madhulika Mohanty (demonstration), ESWC 2023
- “Integrating Connection Search in Graph Queries”, by Angelos Anadiotis, Ioana Manolescu and Madhulika Mohanty, IEEE ICDE 2023
- “Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data”, by Théo Bouganim, Helena Galhardas and Ioana Manolescu, Transactions on Large-Scale Data and Knowledge-Centered Systems (TLDKS), 2022
- “Discovering Conflicts of Interest across Heterogeneous Data Sources with ConnectionLens”, by Angelos-Christos Anadiotis, Oana Balalau, Théo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stéphane Horel, Ioana Manolescu and Youssr Youssef (demonstration), CIKM 2021. (Demonstration video)
- “Toward Visual Interactive Exploration of Heterogeneous Graphs”, by Irène Burger, Ioana Manolescu, Emmanuel Pietriga and Fabian Suchanek, SEAdata 2020 – Workshop on Searching, Exploring and Analyzing Heterogeneous Data in conjunction with EDBT/ICDT, Mar 2020, Copenhagen, Denmark
- “ConnectionLens: Finding Connections Across Heterogeneous Data Sources”, by Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh-Huong Le Nguyen and Ioana Manolescu, Proceedings of the VLDB Endowment (PVLDB), VLDB Endowment, 2018, 11, pp. 2030-2033. doi:10.14778/3229863.3236252 (also accepted for informal presentation at Bases de Données Avancées 2018)