
RDFSummary: Query-oriented summarization of RDF graphs

In this work, we study the problem of RDF summarization: given an input RDF graph, find an RDF summary graph that summarizes the input dataset as accurately as possible while being, in most cases, orders of magnitude smaller than the original graph. Such a summary can be used in a variety of contexts: to help an RDF application designer get acquainted with a new dataset, as a first-level user interface, or as a support for query optimization, as is traditionally the case in semi-structured graph data management.

While semi-structured data summarization has been studied before, our work is the first focused on partially explicit, partially implicit RDF graphs. The semantics of the original graph is always preserved in its summary, regardless of the number of implicit triples in the input.

Moreover, we ensure full query representativeness: any query that has answers on the input graph will also have answers on the summary. This property can be leveraged in query optimization to statically analyze a query: if the query has no answers on the summary, we know it is empty on the input graph as well, so we can avoid running it on the large input graph, saving time and resources.
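The pruning idea can be illustrated with a minimal sketch. It is not the summarization method of the publications listed below: the summary here is a plain quotient graph built from a hand-written, hypothetical node grouping (`group`), the data and the helper names `match` and `may_have_answers` are invented for illustration, and queries are assumed to use variables for subjects and objects and constants only for properties.

```python
def match(pattern, graph, binding=None):
    """Yield variable bindings under which every triple pattern occurs in graph."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    (s, p, o), rest = pattern[0], pattern[1:]
    for gs, gp, go in graph:
        new = dict(binding)
        ok = True
        for term, value in ((s, gs), (p, gp), (o, go)):
            if term.startswith("?"):
                # A variable must keep the same value across all patterns.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, new)


def may_have_answers(query, summary):
    """False means the query is certainly empty on the input graph."""
    return next(match(query, summary), None) is not None


# Toy input graph (hypothetical data).
graph = {
    ("alice", "birthPlace", "paris"),
    ("bob", "birthPlace", "lyon"),
    ("alice", "knows", "bob"),
}

# Hypothetical grouping of input nodes into summary nodes.
group = {"alice": "N1", "bob": "N1", "paris": "N2", "lyon": "N2"}
summary = {(group[s], p, group[o]) for s, p, o in graph}

q1 = [("?x", "birthPlace", "?y")]
q2 = [("?x", "birthPlace", "?y"), ("?y", "knows", "?z")]

for name, query in (("q1", q1), ("q2", q2)):
    if may_have_answers(query, summary):
        print(name, "answers on input graph:", len(list(match(query, graph))))
    else:
        print(name, "is empty on the summary; skipping the input graph")
```

Because every triple of the input graph is mapped onto a summary triple, any match of such a query on the input graph carries over to the summary. A query with no match on the small summary can therefore be discarded without touching the input graph.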

We introduce two flavors of summaries: a baseline summary, which is compact and simple and satisfies certain accuracy and representativeness properties, at the expense of potentially oversimplifying the RDF graph; and a refined summary, which trades some of these properties for a more accurate representation of the graph structure.

A sample summary of a subset of the DBpedia dataset related to person data is given below.

Refined summary of a subset of DBpedia


Another interesting use case of summaries is data quality validation, as illustrated by the DBpedia example. We notice that the summary contains birthPlace and deathPlace edges that both originate and end in a node of type Person. People are usually born and die in places, not in other people. The summary has therefore helped us discover faulty triples in the data.
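A check of this kind could be automated along the following lines. This is only a sketch under stated assumptions: the summary edges, the node types, and the `expected_range` table are hypothetical stand-ins written by hand, not output of the actual summarization tool or part of the DBpedia ontology files.

```python
# Hypothetical summary: edges between summary nodes, plus the types of each node.
summary_edges = {
    ("N1", "birthPlace", "N2"),
    ("N1", "birthPlace", "N1"),
    ("N1", "deathPlace", "N1"),
}
summary_types = {"N1": {"Person"}, "N2": {"Place"}}

# Assumed expectation: these properties should point to a Place.
expected_range = {"birthPlace": "Place", "deathPlace": "Place"}


def suspicious_edges(edges, types, expected):
    """Return summary edges whose target node lacks the expected type."""
    return sorted(
        (s, p, o)
        for s, p, o in edges
        if p in expected and expected[p] not in types.get(o, set())
    )


for s, p, o in suspicious_edges(summary_edges, summary_types, expected_range):
    print(f"{p}: {s} -> {o} (target typed {sorted(summary_types.get(o, set()))})")
```

Each flagged summary edge stands for at least one input triple of the same shape, so it indicates where to look for faulty data in the original graph.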


Refined summary of BBC Radio John Peel Sessions RDF dataset


Refined summary of an INSEE RDF dataset

Publications

[1] Šejla Čebirić, François Goasdoué, Ioana Manolescu. Query-Oriented Summarization of RDF Graphs. Proceedings of the VLDB Endowment, Aug 2015, Kohala Coast, Hawaii, United States. 8 (12), <http://www.vldb.org/2015/>.

[2] Šejla Čebirić, François Goasdoué, Ioana Manolescu. Query-Oriented Summarization of RDF Graphs. Data Science – 30th British International Conference on Databases, BICOD 2015, Edinburgh, UK, July 6-8, 2015, Proceedings, Jul 2015, Edinburgh, United Kingdom. pp. 87–91, <http://conferences.inf.ed.ac.uk/BICOD2015/>.

[3] Šejla Čebirić, François Goasdoué, Ioana Manolescu. Query-Oriented Summarization of RDF Graphs. BDA (Bases de Données Avancées), Sep 2015, Île de Porquerolles, France. <http://bda2015.univ-tln.fr/>.
