ConnectionLens: graph integration of structured, semistructured and unstructured data


Data-intensive applications need to work with heterogeneous data sources, which can be structured (e.g., relational or CSV), semi-structured (e.g., JSON, XML or RDF), or unstructured (e.g., text or PDF). We have developed ConnectionLens, a system for integrating heterogeneous, independently authored data sources in a single graph. It is particularly suited to workloads that explore connections across data sources, data formats and granularities, such as data journalism projects. To discover connections across data sources and enhance their value for the user, ConnectionLens leverages Information Extraction (Named Entity Recognition) and Named Entity Disambiguation. Further, ConnectionLens allows querying the integrated graph by means of flexible keyword queries.
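
As a rough illustration of the idea above, the following Python sketch loads a small JSON-like record and a text snippet into one graph, links both to shared entity nodes, and then looks for a path connecting two keywords. It is not ConnectionLens code and does not use its API: the Graph class, the connect function and the sample data are hypothetical, a fixed name list stands in for Named Entity Recognition, and a plain breadth-first search stands in for the system's keyword query algorithm.

```python
from collections import defaultdict, deque

# Assumption: a fixed name list replaces a real Named Entity Recognition model.
KNOWN_ENTITIES = ["Alice Martin", "Acme Corp"]


class Graph:
    """Toy undirected graph holding nodes from several data sources."""

    def __init__(self):
        self.labels = {}             # node id -> label
        self.adj = defaultdict(set)  # node id -> neighbouring node ids
        self.entity_ids = {}         # normalized entity name -> node id

    def add_node(self, label):
        node_id = len(self.labels)
        self.labels[node_id] = label
        return node_id

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def entity_node(self, name):
        # One node per entity, shared by every source that mentions it.
        key = name.lower()
        if key not in self.entity_ids:
            self.entity_ids[key] = self.add_node(name)
        return self.entity_ids[key]

    def add_text(self, text, parent=None):
        # Unstructured input: one node for the text, linked to its entities.
        node = self.add_node(text)
        if parent is not None:
            self.add_edge(parent, node)
        for name in KNOWN_ENTITIES:
            if name.lower() in text.lower():
                self.add_edge(node, self.entity_node(name))
        return node

    def add_record(self, source_name, record):
        # Semi-structured input: source -> field -> value, values scanned as text.
        root = self.add_node(source_name)
        for key, value in record.items():
            field = self.add_node(key)
            self.add_edge(root, field)
            self.add_text(str(value), parent=field)
        return root


def connect(graph, kw1, kw2):
    """Return the labels on a shortest path from a kw1 match to a kw2 match."""
    starts = [n for n, lab in graph.labels.items() if kw1.lower() in lab.lower()]
    targets = {n for n, lab in graph.labels.items() if kw2.lower() in lab.lower()}
    parent = {s: None for s in starts}
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(graph.labels[node])
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph.adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


g = Graph()
g.add_record("contracts.json", {"supplier": "Acme Corp", "amount": 120000})
g.add_text("Alice Martin joined the board of Acme Corp in 2019.")
path = connect(g, "amount", "Alice")
print(" -> ".join(path))
```

In this toy setting, the contract record and the sentence become connected only because both mention the same organization, which is the kind of cross-source link that entity extraction is meant to surface.
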
ConnectionLens is developed as part of the ANR/DGA AI Chair SourcesSay and also benefits from the support of the national “Plan IA” and of the DIM RFSI program. We explore applications in collaboration with Le Monde and WeDoData.

Download

You can find the system here: https://gitlab.inria.fr/cedar/connectionlens

Publications

  • (Reference publication) “Graph integration of structured, semistructured and unstructured data for data journalism”, by Angelos-Christos Anadiotis, Oana Balalau, Catarina Conceicao, Helena Galhardas, Mhd Yamen Haddad, Ioana Manolescu, Tayeb Merabti, Jingmao You. In Elsevier Journal of Information Systems, 2021.

    This article provides a complete description of the vision, the system architecture, and an experimental assessment as of early 2021.

  • (Application paper) “Empowering Investigative Journalism with Graph-based Heterogeneous Data Management”, by Angelos-Christos Anadiotis, Oana Balalau, Theo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stephane Horel, Ioana Manolescu, Youssr Youssef. Accepted for publication in a special issue of the IEEE Data Engineering Bulletin, 2021.
    Here we describe an application of ConnectionLens to the detection of conflicts of interest in the biomedical domain. To scale up its search, we also describe a novel, in-memory, parallel query answering engine.

ConnectionLens team:

  • Angelos Anadiotis (Ecole Polytechnique, CEDAR)
  • Oana Balalau (Inria, CEDAR)
  • Nelly Barret (Inria, CEDAR)
  • Théo Bouganim (Inria, CEDAR)
  • Francesco Chimienti (Inria, CEDAR)
  • Helena Galhardas (University of Lisbon and IST, Portugal)
  • Mhd-Yamen Haddad (Inria, CEDAR)
  • Ioana Manolescu (Inria, CEDAR)
  • Madhulika Mohanty (Inria, CEDAR)
  • Daniel Quintao (X 2018)
  • Prajna Upadhyay (Inria, CEDAR)

History

Work on ConnectionLens started in the ContentCheck ANR project and through the WebClaimExplain Associated Team between Inria and AIST Japan. Past contributors to the project are:

  • Julien Leblay (AIST, Japan)
  • Catarina Conceiçao (U. Lisbon and IST, Portugal)
  • Tayeb Merabti (Inria)
  • Camille Chanial, Rédouane Dziri, Minh-Huong Le Nguyen (X 2015); Lucas Elbert (X 2016); Irène Burger, Jérémie Feitz, Jingmao You (X 2017); Youssr Youssef (ENSTA).

 

 
