PhD Defence of Pawel Guzewicz on Oct 6 2021

Pawel Guzewicz defended his PhD thesis entitled “ExpRalytics: Expressive and Efficient Analytics for RDF Graphs”.

The PhD defence took place on Wednesday, October 6, 2021 at 2:00 p.m. (Paris time) in a hybrid format: in person in the Gilles Kahn room (Alan Turing building, 1 Rue Honoré d’Estienne d’Orves, École polytechnique campus, 91120 Palaiseau) and by Zoom videoconference.

The PhD committee consisted of:

– Ms. Sihem AMER-YAHIA, research director, Laboratoire d’Informatique de Grenoble (LIG) (reviewer)
– Mr. Volker MARKL, professor, Technische Universität Berlin (reviewer)
– Ms. Angela BONIFATI, professor, Lyon 1 University (examiner)
– Mr. Fabian SUCHANEK, professor, Télécom Paris, Institut Polytechnique de Paris (examiner)
– Mr. Federico ULLIANA, associate professor, Montpellier University (examiner)
– Mr. Benoit GROZ, assistant professor, Université Paris-Saclay (examiner)
– Ms. Ioana MANOLESCU, research director, Inria, École polytechnique, Institut Polytechnique de Paris (thesis director)
– Ms. Yanlei DIAO, professor, École polytechnique, Institut Polytechnique de Paris, Inria (co-supervisor)
– Mr. François GOASDOUE, professor, Univ. Rennes 1 (guest)

Following are some photos of the defence.

Abstract:

Large (Linked) Open Data are increasingly shared as RDF graphs today. However, such data do not yet reach their full potential in terms of sharing and reuse. The main bottleneck lies in the capacity of human users to explore, discover, and grasp the content and insights of RDF graphs, which are inherently heterogeneous and can be both large and complex.

In the first part of this thesis, we provide new methods to meaningfully summarize data graphs, with a particular focus on RDF graphs.
One class of tools for this task is structural RDF graph summaries, which allow users to grasp the different connections between RDF graph nodes. To this end, we introduce RDFQuotient, a novel tool that finds compact yet informative RDF graph summaries that can serve as first-sight visualizations of an RDF graph’s structure. These summaries, based on the notion of quotient graphs, are easy for casual users to understand; they provide an overview of the complete structure of an RDF graph while typically being many orders of magnitude smaller. Our summarization algorithms have linear time complexity in the size of the input graph. Further, we propose incremental summarization algorithms that make the smallest adjustments needed for a summary to reflect modifications to the input graph. We also propose novel algorithms for building the summaries in a parallel shared-nothing architecture and instantiate them on the Apache Spark platform.
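
As a rough illustration of the quotient-graph idea mentioned above, here is a minimal Python sketch. It is not the RDFQuotient implementation: the equivalence relation used here (two nodes are equivalent when they have the same set of outgoing properties) is an assumption chosen for brevity, and the function name `quotient_summary` and the sample triples are hypothetical.

```python
from collections import defaultdict


def quotient_summary(triples):
    """Summarize (subject, property, object) triples as a quotient graph."""
    # Assumed equivalence: nodes with the same set of outgoing properties
    # fall into the same class (an illustrative choice only).
    out_props = defaultdict(set)
    nodes = set()
    for s, p, o in triples:
        out_props[s].add(p)
        nodes.update((s, o))

    # Each distinct property set identifies one equivalence class.
    class_of = {n: frozenset(out_props.get(n, ())) for n in nodes}

    # A summary edge exists whenever some triple links members of two classes.
    return {(class_of[s], p, class_of[o]) for s, p, o in triples}


# Hypothetical sample data.
triples = [
    ("alice", "worksFor", "inria"),
    ("bob", "worksFor", "polytechnique"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
]
summary = quotient_summary(triples)
```

Under this assumed equivalence, `alice` and `bob` collapse into one summary node and all four property-less nodes into another, so the four input triples yield two summary edges. The sketch makes two passes over the triples, which keeps it linear in the input size.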

In the second part of this thesis, we consider the problem of automatically identifying the k most interesting aggregate queries that can be evaluated on an RDF graph, given an integer k and a user-specified interestingness function. Aggregate queries are routinely used to learn insights from relational data warehouses, and some prior research has addressed the problem of automatically recommending interesting aggregate queries.
However, the RDF setting is quite different: (a) in an RDF graph, the facts, dimensions, and measures that compose aggregate queries are not given and must be identified; (b) relational OLAP algorithms for efficiently evaluating multiple aggregates cannot handle multi-valued dimensions for a given fact, yet such dimensions are frequent in RDF data, where a fact may have zero, one, or more values for a dimension.
We devise Spade, an extensible end-to-end framework that enables the identification and evaluation of interesting aggregates based on MVDCube, our new RDF-compatible one-pass algorithm for efficiently evaluating a lattice of aggregates, and a novel early-stop technique (with probabilistic guarantees) that can prune uninteresting aggregates and, as a result, reduce the aggregate evaluation cost.
Experiments using both real and synthetic graphs demonstrate the ability of our framework to find interesting aggregates in a large search space, the efficiency of our algorithms, and scalability as the data size and complexity grow.
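
The difficulty raised in point (b) above can be seen in a small Python sketch. It does not show MVDCube; it only illustrates, on hypothetical data, why deriving a coarser aggregate from a finer one (the usual relational roll-up) goes wrong when a dimension is multi-valued or missing. All names and values below are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical facts: one year, zero or more keywords, one citation count.
facts = [
    {"id": "p1", "year": 2020, "keywords": ["rdf", "olap"], "citations": 10},
    {"id": "p2", "year": 2020, "keywords": ["rdf"], "citations": 5},
    {"id": "p3", "year": 2021, "keywords": [], "citations": 7},
]


def sum_by_year(facts):
    """Aggregate the measure directly over the single-valued dimension."""
    totals = defaultdict(int)
    for f in facts:
        totals[f["year"]] += f["citations"]
    return dict(totals)


def sum_by_year_and_keyword(facts):
    """Finer aggregate: a fact contributes once per keyword it has."""
    totals = defaultdict(int)
    for f in facts:
        for kw in f["keywords"]:
            totals[(f["year"], kw)] += f["citations"]
    return dict(totals)


def roll_up_to_year(by_year_kw):
    """Naive roll-up: sum the finer aggregate over the keyword dimension."""
    totals = defaultdict(int)
    for (year, _kw), value in by_year_kw.items():
        totals[year] += value
    return dict(totals)


direct = sum_by_year(facts)
rolled = roll_up_to_year(sum_by_year_and_keyword(facts))
```

Here `direct` is `{2020: 15, 2021: 7}`, while `rolled` is `{2020: 25}`: paper `p1` is counted twice because it has two keywords, and `p3` disappears because it has none.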

Congratulations, Pawel!
