Jesús Camacho Rodríguez’s thesis defense

For the second time in a week, we had the pleasure of seeing one of our colleagues in the OAK team obtain his PhD. On September 25, Jesús Camacho Rodríguez defended his PhD thesis, entitled “Efficient techniques for large-scale Web data management”.


After the defense, Jesús and his family kindly invited the attendees to a reception, where they were able to taste several Spanish specialties in a casual atmosphere.

Congratulations on your very good work, Jesús!

Thesis jury
M. Reza Akbarinia, Researcher, Inria and Université Montpellier II (examiner)
M. Marc Baboulin, Professor, Université Paris-Sud and Inria (examiner)
M. Dario Colazzo, Professor, Université Paris-Dauphine (thesis director)
M. Donald Kossmann, Professor, ETH Zürich (thesis reporter)
Mme Ioana Manolescu, Research Director, Inria and Université Paris-Sud (thesis director)
M. Philippe Rigaux, Professor, Conservatoire National des Arts et Métiers (examiner)

Thesis abstract
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing.

In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the now well-known MapReduce model and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems.

This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively.

First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users bear monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they reduce not only query evaluation time but also, importantly, the monetary costs associated with the exploitation of the cloud-based warehouse.

Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.

Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.

Permanent link to this article:

M. Tamer Özsu: Web Data Management in the RDF Age

When: Wednesday, October 1, at 11.00

Where: PCRI building, room 455

Who: M. Tamer Özsu, University of Waterloo

Title: Web Data Management in the RDF Age

Web data management has been a topic of interest for many years, during which a number of different modelling approaches have been tried. The latest of these approaches is to use RDF (Resource Description Framework), which seems to provide a real opportunity for querying at least some of the web data systematically. RDF has been proposed by the World Wide Web Consortium (W3C) for modeling Web objects as part of developing the “semantic web”. W3C has also proposed SPARQL as the query language for accessing RDF data repositories. The publication of Linked Open Data (LOD) on the Web has gained tremendous momentum over the last number of years, and this provides a new opportunity to accomplish web data integration. A number of approaches have been proposed for running SPARQL queries over RDF-encoded Web data: data warehousing, SPARQL federation, and live linked query execution. In this talk, I will review these approaches with particular emphasis on some of our research within the context of the gStore project (joint project with Prof. Lei Zou of Peking University and Prof. Lei Chen of Hong Kong University of Science and Technology), the chameleon-db project (joint work with Günes Aluç, Dr. Olaf Hartig, and Prof. Khuzaima Daudjee of University of Waterloo), and live linked query execution (joint work with Dr. Olaf Hartig).

Short bio:
M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science, and Associate Dean (Research) of the Faculty of Mathematics at the University of Waterloo. His research is in data management, focusing on large-scale data distribution and management of non-traditional data. He is a Fellow of the Association for Computing Machinery (ACM) and of the Institute of Electrical and Electronics Engineers (IEEE), an elected member of the Academy of Science of Turkey, and a member of Sigma Xi and the American Association for the Advancement of Science (AAAS). He currently holds a Cheriton Faculty Fellowship at the University of Waterloo.


C+J: Fact Checking and Analyzing the Web with FactMinder

“Fact Checking and Analyzing the Web with FactMinder” by François Goasdoué, Konstantinos Karanasos, Yannis Katsis, Julien Leblay, Ioana Manolescu and Stamatis Zampetakis has been accepted for presentation as a poster/demo at the Computation + Journalism Symposium 2014.


Alexandra Roatiş’ PhD thesis defense

Well done, Alexandra! On September 22, the OAK team had the pleasure of attending Alexandra’s thesis defense. Congratulations, new doctor!

Thesis directors
Mme. Ioana Manolescu, Research Director, Inria and Université Paris-Sud
M. François Goasdoué, Professor, Université Rennes 1
M. Dario Colazzo, Professor, Université Paris-Dauphine
Thesis reporters
M. Alon Halevy, Professor, Google Research
M. Frank van Harmelen, Professor, Vrije Universiteit Amsterdam
Thesis jury
M. Serge Abiteboul, Research Director, Inria and ENS Cachan
Mme. Christine Froidevaux, Professor, Université Paris-Sud
M. François Goasdoué, Professor, Université Rennes 1
M. Frank van Harmelen, Professor, Vrije Universiteit Amsterdam
Mme. Ioana Manolescu, Research Director, Inria and Université Paris-Sud
M. Philippe Rigaux, Professor, Conservatoire National des Arts et Métiers

Thesis abstract

The utility and relevance of data lie in the information that can be extracted from it. The high rate of data publication and its increased complexity, for instance the heterogeneous, self-describing Semantic Web data, motivate the interest in efficient techniques for data manipulation. In this thesis we leverage mature relational data management technology for querying Semantic Web data.

The first part focuses on query answering over data subject to RDFS constraints, stored in relational data management systems. The implicit information resulting from RDF reasoning is required to correctly answer such queries. We introduce the database fragment of RDF, going beyond the expressive power of previously studied fragments. We devise novel techniques for answering Basic Graph Pattern queries within this fragment, exploring the two established approaches for handling RDF semantics, namely graph saturation and query reformulation.
In particular, we consider graph updates within each approach and propose a method for incrementally maintaining the saturation. We experimentally study the performance trade-offs of our techniques, which can be deployed on top of any relational data management engine.

The second part of this thesis considers the new requirements for data analytics tools and methods emerging from the development of the Semantic Web. We fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data. We propose the first complete formal framework for warehouse-style RDF analytics. Notably, we define analytical schemas tailored to heterogeneous, semantic-rich RDF graphs, analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema as well as powerful aggregation and OLAP-style operations. Experiments on a fully-implemented platform demonstrate the practical interest of our approach.

Thank you for the nice Romanian reception! After the defense, Alexandra and her family invited the attendees to a casual reception, where everyone had the chance to taste delicious Romanian cuisine. And, to make things even better, Benjamin brought a bottle of his famous Liqueur OAK. Thank you all!


OAK at VLDB 2014

Quite a number of present and former OAKs attended VLDB 2014 in Hangzhou, China!

Asterios presented the Delta paper.

Katerina had a paper in the PhD workshop, while Ioana and Zoi also had a poster based on their RDF cloud survey. Finally, Ioana co-chaired the BeRSys workshop on the last day of the conference.

There were many interesting papers at the conference, notably in the Knowledge and Web sessions; Fabian Suchanek and Gerhard Weikum also gave their Yago tutorial again. The industrial session on joins was very interesting, too, with mostly parallel processing techniques for novel kinds of joins.

We also had some interesting restaurant and touristic experiences:

We ate some of that seafood, and survived.

We also went to visit surrounding temples, and much more!


OAK at SIGMOD 2014

Bogdan, Francesca, Ioana, Jesús and Stamatis attended SIGMOD 2014 in Snowbird, Utah!

First, we attended DanaC, which Asterios chaired.

Jesús presented his short paper on PAXQuery at DanaC.

And we managed to take some group pictures with current and former OAKs (still, Julien is missing)!

Later on, Ioana and Zoi gave their tutorial. This is a post-tutorial grin! 🙂


Ioana Ileana presented her paper with Bogdan, Alin Deutsch and Yannis Katsis.

Many other interesting papers were presented at the conference; in particular, there was an RDF session at PODS and several papers on hybrid cloud-and-non-cloud stores. (Admittedly, a narrow and subjective selection 😉)

Ah, and: hiking, beer, and rooftop pool visits were also part of the program…



ACM JDIQ: A Hybrid Approach to Answering Why-Not Questions on Relational Query Results

“A Hybrid Approach to Answering Why-Not Questions on Relational Query Results” by Melanie Herschel has been accepted for publication in the ACM Journal on Data and Information Quality.


SDSW 2014: Toward Social, Structured and Semantic Search

“Toward Social, Structured and Semantic Search” by Raphaël Bonaque, Bogdan Cautis, François Goasdoué and Ioana Manolescu has been accepted for publication in the “Surfacing the Deep and the Social Web” (SDSW) workshop, co-located with the 13th International Semantic Web Conference (ISWC 2014).


Team OAK @ Ecole thématique BDA, Oléron

The third edition of the MDD summer school was held in Oléron, with strong participation from the OAK team.



Nicole made sure that everything ran smoothly. François and Ioana tutored us on the Semantic Web and experimental evaluation in the morning, and gave a brief introduction to smoked whiskey in the evening. From big to small, each member took the opportunity to express their view on both topics 😉

Among others, the social activities featured:

  • a couple of birthdays
  • a boat trip
  • sailing
  • kayaking
  • and a lot of pool time fun

Don’t be fooled by the photo! The pool was rarely empty and the water slide was in high demand.


BDA 2014: How to deal with Cliques at Work

“How to deal with Cliques at Work” by Benjamin Djahandideh, François Goasdoué, Zoi Kaoudi, Ioana Manolescu, Jorge Quiané-Ruiz and Stamatis Zampetakis has been accepted for publication in BDA 2014.
