The OakSad associated team is an Inria structure associating Oak with the database team from UC San Diego. The team was created in 2013.
Publications
- Invisible Glue: Scalable Self-Tuning Multi-Stores, by Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, Ioana Ileana and Ioana Manolescu, accepted for publication at CIDR 2015: https://hal.inria.fr/hal-01087624
- XR query-view composition tech. report by Alin Deutsch (UCSD), François Goasdoué, Julien Leblay and Ioana Manolescu (OAK) is online (November 2013): http://hal.inria.fr/hal-00879511/en
- Complete Yet Practical Search for Minimal Query Reformulations Under Constraints, Ioana Ileana, Bogdan Cautis, Alin Deutsch, Yannis Katsis, SIGMOD 2014. https://hal.inria.fr/hal-01086494
- Reuse-based Optimization for Pig Latin, by Jesús Camacho-Rodríguez, Dario Colazzo, Melanie Herschel, Ioana Manolescu, Soudip Roy Chowdhury, BDA 2014. https://hal.inria.fr/hal-01086497
Events
- I. Ileana defended her PhD thesis, prepared with B. Cautis and A. Deutsch. Alin visited OAK on October 22-24.
- J. Camacho-Rodríguez visited UC Irvine and also A. Deutsch (at GraphSQL) in the summer of 2014. He presented the PigReuse work there.
- I. Manolescu, F. Bugiotti and S. Chowdhury attended the BIS 2014 workshop in Paris (June 17-19).
- A. Deutsch and I. Manolescu met at the Peter Buneman Forum in October 2013 in Edinburgh. Together with Dan Suciu (U. Washington), Bogdan Cautis and François Goasdoué, we have started a new project on applying belief database models to social, semantic-rich Web content analysis.
- Soudip Roy Chowdhury joins Oak (as an Inria post-doc sponsored by Inria DRI) in September 2013.
- Bogdan Cautis, an associated OakSad partner from Telecom ParisTech, obtains a full professor position at IUT Orsay and thus joins the Oak group in September 2013.
- Alin Deutsch visits Oak and B. Cautis (Telecom ParisTech partner) in July-August 2013. We advanced our research work on:
- Query-view composition algorithms for annotated documents (A. Deutsch, I. Manolescu and F. Goasdoué, with PhD student Julien Leblay)
- Factorization of common computations in large-scale parallel data processing (PigLatin and Map/Reduce). D. Colazzo, A. Deutsch, M. Herschel and I. Manolescu, with summer intern Varun Malhotra
- Provenance-directed chase&backchase for view-based query rewriting (Alin Deutsch, Yannis Katsis, Bogdan Cautis, with PhD student Ioana Ileana)
- The OakSad kick-off workshop took place in San Diego on May 23-24, 2013.
The following presentations were given:
- Melanie Herschel: The Nautilus Query Analyzer
When developing data transformations (a task omnipresent in applications like data integration, data migration, data cleaning, or scientific data processing), developers quickly face the need to verify the semantic correctness of the transformation. Declarative specifications of data transformations, e.g., SQL or ETL tools, increase developer productivity but usually provide limited or no means for inspection or debugging. In this situation, developers today have no choice but to manually analyze the transformation and, in case of an error, to (repeatedly) fix and test the transformation.
The goal of the Nautilus project is to semi-automatically support this analysis-fix-test cycle. This talk focuses on the first main component of Nautilus, namely the Nautilus Analyzer that helps developers in understanding and debugging their data transformations specified in SQL. More specifically, we present different methods developed in our group to understand why the query result is incomplete, i.e., why some expected data do not appear in the query output.
- Alexandra Roatis: Efficient Query Answering against Dynamic RDF Databases
A promising method for efficiently querying RDF data consists of translating SPARQL queries into efficient RDBMS-style operations. However, answering SPARQL queries requires handling RDF reasoning, which must be implemented outside the relational engines that do not support it.
We introduce the database (DB) fragment of RDF, going beyond the expressive power of previously studied RDF fragments. We devise novel sound and complete techniques for answering Basic Graph Pattern (BGP) queries within the DB fragment of RDF, exploring the two established approaches for handling RDF semantics, namely reformulation and saturation.
In particular, we focus on handling database updates within each approach and propose a method for incrementally maintaining the saturation; updates raise specific difficulties due to the rich RDF semantics. Our techniques are designed to be deployed on top of any RDBMS(-style) engine, and we experimentally study their performance trade-offs.
- Stamatis Zampetakis: CliqueSquare: efficient Hadoop-based RDF query processing
Large volumes of RDF data collections are being created, published and used lately in various contexts, from scientific data to domain ontologies and to open government data, in particular in the context of the Linked Data movement. Managing such large volumes of RDF data is challenging due to the sheer size and the heterogeneity. To tackle the size challenge, a single isolated machine is not an efficient solution anymore. The MapReduce paradigm is a promising direction providing scalability and massively parallel processing of large-volume data.
In this talk, I will present CliqueSquare, an ongoing work towards an efficient RDF data management platform based on Hadoop, an open source MapReduce implementation, and its file system, Hadoop Distributed File System (HDFS). CliqueSquare relies on a novel RDF data partitioning scheme enabling queries to be evaluated efficiently, by minimizing the number of MapReduce stages as well as the data transfers between nodes during query evaluation. Finally, I will present some preliminary experiments comparing our system against HadoopRDF, the state-of-the-art Hadoop-based RDF platform, which demonstrate the advantages of CliqueSquare in terms of query response times and network traffic.
Joint work with François Goasdoué, Zoi Kaoudi, Ioana Manolescu and Jorge Quiané-Ruiz.
- Ioana Manolescu: Delta: Scalable Data Dissemination under Capacity Constraints
In content-based publish-subscribe (pub/sub) systems, users express their interests as queries over a stream of publications. Scaling up content-based pub/sub to very large numbers of subscriptions is challenging: users are interested in low latency, that is, getting subscription results fast, while the pub/sub system provider is mostly interested in scaling, i.e., being able to serve large numbers of subscribers, with low computational resources utilization.
We present a novel approach for scalable content-based pub/sub in the presence of constraints on the available CPU and network resources, implemented within our pub/sub system Delta. We achieve scalability by off-loading some subscriptions from the pub/sub server, and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing views in a multi-level dissemination network, exploiting view-based rewriting and powerful linear programming capabilities to scale to many views, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.
Joint work with Asterios Katsifodimos (OAK) and Konstantinos Karanasos (IBM Almaden).
- Jesús Camacho-Rodríguez: PAXQuery: Massively Parallel Evaluation of XQuery
Increasing volumes of data are produced or exported into Web data formats. Among these, the W3C’s XML is the standard for structured documents (and in particular Web pages), leading to an interest in processing XML queries on large-scale distributed platforms such as those based on the MapReduce paradigm.
We consider a recent extension to MapReduce, namely PACTs (PArallelization ConTracts), providing much greater flexibility in writing complex data manipulation programs than MapReduce, in particular through the usage of many-input tasks (as opposed to the single-input MapReduce tasks).
In this talk, I will present our ongoing work on an architecture for an XML query processor built on top of the PACT programming model. The core technical contribution is a formal translation of complex, nested XQuery queries into efficient PACT algebraic plans, using a record-based data model and relational-style operators, so that PACT efficiency can be leveraged for XQuery evaluation. Then, I will outline a set of optimizations that can be applied to the PACT program resulting from the translation to make its execution more efficient.
Joint work with Dario Colazzo and Ioana Manolescu.
- The Berkeley-Inria-Stanford 2013 (BIS 2013) workshop took place on May 21-22, 2013 in Stanford. From OakSad, the attendees were: Alin Deutsch (UCSD), Jesús Camacho-Rodríguez, Melanie Herschel, Alexandra Roatis and Stamatis Zampetakis (OAK). Melanie presented OakSad.
- The OakSad team is created (January 25, 2013)
Ongoing research carried out within OakSad includes:
- query composition and optimization algorithms for XR. This involves Alin Deutsch (UCSD), François Goasdoué, Julien Leblay and Ioana Manolescu (OAK)
- common sub-expression factorization in large-scale parallel data processing workflows. This work involves Dario Colazzo (OAK), Alin Deutsch (UCSD), Melanie Herschel and Ioana Manolescu (OAK)
- efficient distributed data dissemination. This involves Yannis Papakonstantinou (UCSD), Asterios Katsifodimos and Ioana Manolescu (OAK), as well as Konstantinos Karanasos (formerly OAK, now a post-doc at IBM Almaden)