Dario Colazzo: Processing XML Queries and Updates on Map/Reduce Clusters

14.30, room 435, PCRI

Abstract
Very large XML documents are generated and processed in several contexts, in particular in those involving scientific data and logs. In order to process such large documents we have designed and implemented techniques based on data partitioning for the evaluation of XQuery queries and updates on Map/Reduce clusters.

The proposed technique applies when queries and updates are iterative, i.e., they iterate the same query/update operations on a sequence of subtrees of the input document. We have developed schema-less, static analysis techniques to i) recognize iterative queries/updates, and ii) extract path information to be used for data partitioning purposes. Our system exploits both dynamic and static data partitioning to distribute the processing load among the machines of a Map/Reduce cluster. To boost the I/O performance across the distributed file system, our system uses EXI compression at each stage of the computation, from data partitioning to query/update execution.

After an introduction to the main techniques behind our system, a demonstration will show its abilities in dealing with complex workloads and large documents.
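
To illustrate what "iterative" means in the abstract above, here is a minimal single-machine sketch in Python. It is not the system presented in the talk: the sample document, the `entry` partitioning path, and the helper names `partition`, `map_query` and `reduce_results` are all assumptions made for the example, and the map and reduce phases simply run in-process instead of on a cluster.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# Hypothetical input: a small log document (real inputs would be far larger).
DOC = """<log>
  <entry level="error"><msg>disk full</msg></entry>
  <entry level="info"><msg>started</msg></entry>
  <entry level="error"><msg>timeout</msg></entry>
</log>"""

def partition(xml_text, path):
    """Split the document into independent subtrees selected by `path`."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(sub, encoding="unicode") for sub in root.findall(path)]

def map_query(subtree_text):
    """The same query, applied to one subtree: messages of error entries."""
    entry = ET.fromstring(subtree_text)
    if entry.get("level") != "error":
        return []
    return [m.text for m in entry.findall("msg")]

def reduce_results(left, right):
    """Combine partial results, preserving document order."""
    return left + right

parts = partition(DOC, "entry")
errors = reduce(reduce_results, map(map_query, parts), [])
print(errors)
```

Because each subtree is processed without looking at the others, the partitions could in principle be handed to different workers; that independence is the property the static analysis mentioned in the abstract is meant to detect.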

Permanent link to this article: https://team.inria.fr/oak/2013/04/19/dario-colazzo-processing-xml-queries-and-updates-on-mapreduce-clusters/

Léon Leblay is born

Léon was born at 4 pm on March 29, and he is doing well, just like his parents. Congrats and welcome to our world!

Permanent link to this article: https://team.inria.fr/oak/2013/03/30/leon-leblay-is-born/

Bogdan Cautis: Social-aware search: instance optimality versus efficiency

14.00, room 445, PCRI

Abstract
We present in this talk our recent research on top-k query answering in social applications.

This problem requires a significant departure from socially agnostic techniques for information retrieval. One can now exploit the social links in order to obtain more relevant results, valid not only with respect to the keyword query but also with respect to the social context of the user who issued it. We propose a sound and complete algorithm, called TOPKS, which addresses important applicability issues of existing techniques. Moreover, we show that TOPKS is instance optimal in the case when the search relies exclusively on the social weight of the data. To further address the efficiency needs of online applications, for which the exact search, albeit optimal, may still be expensive, we also consider approximate algorithms. These rely on concise statistics about the social network or on approximate shortest-paths computations.

As a complementary direction for efficient, online answering, we also consider the materialization and exploitation of previous query results (views). We study social-aware query optimization based on views, presenting algorithms that address two important sub-problems. First, handling the possible differences in context between the various views and an input query leads to view results having uncertain scores, i.e., score ranges valid for the new context. As a consequence, current top-k algorithms are no longer directly applicable and need to be adapted to handle this uncertainty. Second, adapted view selection techniques are needed, which can leverage both the descriptions of queries and statistics over their results.

Extensive experiments on both synthetic and real-world data (from Delicious and Twitter) show that our techniques have the potential to scale and meet the requirements of real applications. They have been recently demonstrated in a prototype social-aware search called Taagle.

This is joint work with Silviu Maniu (now at Hong Kong University).

Short bio
Bogdan Cautis has been an Associate Professor at the Computer Science and Networks Department of Télécom ParisTech since October 2007. He received his Habilitation (HdR) in March 2012 from Université Pierre et Marie Curie and his Ph.D. in September 2007 from the University of Paris XI, where he worked in the Gemo research team of INRIA Futurs, advised by INRIA DR Serge Abiteboul and Tova Milo from Tel Aviv University. His recent research interests lie in the broad area of Web data management and information retrieval: data management on the Web, data extraction, social networks, search, recommender systems, XML and semi-structured databases.

Permanent link to this article: https://team.inria.fr/oak/2013/03/29/bogdan-cautis-social-aware-search-instance-optimality-versus-efficiency/

EDBT 2013 in Genova

The group was well represented at EDBT in Genoa, Italy. The proceedings of the conference can be found here.

Permanent link to this article: https://team.inria.fr/oak/2013/03/27/edbt-2013/

ACM TODS: Almost-Linear Inclusion for XML Regular Expression Types

Almost-Linear Inclusion for XML Regular Expression Types
by Dario Colazzo, Giorgio Ghelli, Luca Pardini and Carlo Sartiani
in ACM Transactions on Database Systems (TODS)

Permanent link to this article: https://team.inria.fr/oak/2013/03/16/acm-tods-almost-linear-inclusion-for-xml-regular-expression-types/

Roxana Horincar: Online Refresh Strategies for Content Based Feed Aggregation

14.00, room 445, PCRI

Abstract
With the rapid growth of data sources, services and devices connected to the Internet, web content available online is becoming more and more diverse and dynamic. In order to facilitate the efficient dissemination of evolving and temporary information, many web applications publish their new information as RSS and Atom documents, which are then collected and transformed by RSS aggregators like Google Reader or Yahoo! News. I address the particular issue of large-scale aggregation of highly dynamic information sources by focusing on the design of optimal refresh strategies for large collections of RSS feed documents.

First, I introduce two quality measures specific to RSS aggregation which reflect the information completeness and average freshness of the result feeds. Then, I propose a best-effort feed refresh strategy that achieves maximum aggregation quality compared with all other existing policies with the same average number of refreshes. This strategy is based on specific online change estimation models developed after a deep analysis of the temporal publication characteristics of a representative collection of real-world RSS feeds. The presented methods have been implemented and tested against synthetic and real-world RSS feed data sets.

Permanent link to this article: https://team.inria.fr/oak/2013/03/15/roxana-horincar-online-refresh-strategies-for-content-based-feed-aggregation/

EDBT 2013: Rehearsals

Alexandra and Jesús will do a rehearsal of the talks they will give at EDBT 2013. It will take place on Wednesday, March 13, at 11.00, room 445.

Please find the details of the talks below.

When: Wednesday, March 13, at 11.00

Where: PCRI building, room 445

-----

Authors: François Goasdoué, Ioana Manolescu and Alexandra Roatiş

Title: Efficient Query Answering against Dynamic RDF Databases

Abstract:
A promising method for efficiently querying RDF data consists of translating SPARQL queries into efficient RDBMS-style operations. However, answering SPARQL queries requires handling RDF reasoning, which must be implemented outside the relational engines that do not support it. We introduce the database (DB) fragment of RDF, going beyond the expressive power of previously studied RDF fragments. We devise novel sound and complete techniques for answering Basic Graph Pattern (BGP) queries within the DB fragment of RDF, exploring the two established approaches for handling RDF semantics, namely reformulation and saturation. In particular, we focus on handling database updates within each approach and propose a method for incrementally maintaining the saturation; updates raise specific difficulties due to the rich RDF semantics. Our techniques are designed to be deployed on top of any RDBMS(-style) engine, and we experimentally study their performance trade-offs.
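
As a rough illustration of the saturation approach named in the abstract above, the following Python sketch computes a fixpoint over just two standard RDFS entailment rules (subclass transitivity and type propagation). It is not the technique of the paper, whose DB fragment covers richer semantics and incremental maintenance; the `saturate` function and the `ex:` resources are assumptions made for the example.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def saturate(triples: Set[Triple]) -> Set[Triple]:
    """Add implied triples until nothing new can be derived."""
    closed = set(triples)
    while True:
        new = set()
        for (s, p, o) in closed:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closed:
                # Transitivity: s subClassOf o, o subClassOf o2 => s subClassOf o2
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Type propagation: s2 type s, s subClassOf o => s2 type o
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        if new <= closed:
            return closed
        closed |= new

# Hypothetical data: a two-level class hierarchy and one typed resource.
graph = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", TYPE, "ex:Student"),
}
saturated = saturate(graph)

# After saturation, a plain lookup answers the query "who is an ex:Agent?".
agents = sorted(s for (s, p, o) in saturated if p == TYPE and o == "ex:Agent")
print(agents)
```

Saturation makes implicit triples explicit up front so that queries need no reasoning at evaluation time; reformulation takes the opposite route and rewrites the query instead, leaving the data untouched. The cost of saturation shows up on updates, since derived triples have to be maintained.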

-----

Authors: Jesús Camacho Rodríguez, Dario Colazzo and Ioana Manolescu

Title: Web Data Indexing in the Cloud: Efficiency and Cost Reductions

Abstract:
An increasing part of the world’s data is either shared through the Web or directly produced through and for Web platforms, in particular using structured formats like XML or JSON. Cloud platforms are interesting candidates to handle large data repositories, due to their elastic scaling properties. Popular commercial clouds provide a variety of sub-systems and primitives for storing data in specific formats (files, key-value pairs etc.) as well as dedicated sub-systems for running and coordinating execution within the cloud.

We propose an architecture for warehousing large-scale Web data, in particular XML, in a commercial cloud platform, specifically, Amazon Web Services. Since cloud users bear monetary costs directly connected to their consumption of cloud resources, we focus on indexing content in the cloud. We study the applicability of several indexing strategies, and show that they lead not only to reducing query evaluation time, but also, importantly, to reducing the monetary costs associated with the exploitation of the cloud-based warehouse. Our architecture can be easily adapted to similar cloud-based complex data warehousing settings, carrying over the benefits of access path selection in the cloud.

Permanent link to this article: https://team.inria.fr/oak/2013/03/13/edbt-2013-rehearsals/

Datalyse project on Big Data Analytics

The Datalyse project on Big Data Analytics in the Cloud has been accepted in the “Cloud / Big Data” call of Investissements d’Avenir. The project partners are Business & Decision (coordinator), Les Mousquetaires, LIG, Inria Lille, and LIRMM. Datalyse will start on May 1st.

Permanent link to this article: https://team.inria.fr/oak/2013/02/22/datalyse-project-on-big-data-analytics/

Matteo Magnani: The skyline operator: recent research trends and applications

14.30, room 445, PCRI

Abstract
The skyline operator (aka Pareto front) extracts relevant records from multidimensional databases according to multiple criteria. This operator has received a lot of attention because of its ability to identify the best records in a database without requiring the user to specify complex parameters like the relative importance of each criterion (as is done in ranking methods). However, recent attempts to apply the operator to real data analysis tasks have revealed some weaknesses of the original definition.

In this presentation I will introduce the skyline operator, indicate some recent research trends related to these weaknesses and focus on the so-called aggregate skyline queries, where the skyline is executed on sets of records instead of single items. This operator can be used to express queries in the form: return the best groups depending on the features of their elements, and thus provides a powerful combination of grouping and skyline functionality.

I will conclude the presentation by showing an application of the skyline operator to complex data representing multiple social networks.
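
For readers unfamiliar with the skyline operator described in the abstract above, here is a minimal Python sketch of the basic (non-aggregate) definition. The hotel data and the convention that smaller values are better on every dimension are assumptions made for the example, and the quadratic scan is chosen for clarity, not efficiency.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse on every criterion and strictly
    better on at least one (here, smaller is better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def skyline(records: List[Record]) -> List[Record]:
    """Keep the records that no other record dominates."""
    return [r for r in records if not any(dominates(o, r) for o in records)]

# Hypothetical hotels as (price, distance to the beach); lower is better.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 2), (80, 1)]
```

No weights are involved: (70, 5) is dropped because (60, 2) is both cheaper and closer, while the three surviving hotels each represent a different trade-off between price and distance.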

Short bio
Matteo Magnani graduated in Computer Science from the University of Bologna in 2002. He studied at the University of Marne la Vallée (undergraduate level) and Imperial College London (postgraduate research level). In 2006 he obtained a PhD in Computer Science from Bologna, where in 2011 he also graduated in Violin. He has received a Rotary Prize for the best student of the Science Faculty (UniBO), a Best Paper Award at ASONAM 2011, and a Funniest Presentation award at SBP 2010, and his mother is very proud of him (or at least this is what she officially says). Until May 2012 he was a researcher (RTD) at the Dept. of Computer Science, University of Bologna, and he currently holds a position at research assistant professor level in the Data Intensive Systems group, Dept. of Computer Science, Aarhus University, Denmark.

His main research interests span Database and Information Management systems, specifically uncertain information management and multidimensional database queries, and Social Computing/Complex Network Science. He has written around 1.5 kg of papers on these topics (when printed on heavy A4 size sheets). He is currently the joint coordinator of the #sigsna research group on social network analysis, and has successfully attracted funding from Working Capital (Telecom Italia), PRIN and FIRB (MIUR – Italian Ministry for Education, University and Research) schemes.

Permanent link to this article: https://team.inria.fr/oak/2013/02/19/matteo-magnani-the-skyline-operator-recent-research-trends-and-applications/

Melanie Herschel: The Nautilus Analyzer – Understanding and Debugging Data Transformations

14.00, room 445, PCRI

Abstract
When developing data transformations – a task omnipresent in applications like data integration, data migration, data cleaning, or scientific data processing – developers quickly face the need to verify the semantic correctness of the transformation.
Declarative specifications of data transformations, e.g., SQL or ETL tools, increase developer productivity but usually provide limited or no means for inspection or debugging. In this situation, developers today have no choice but to manually analyze the transformation and, in case of an error, to (repeatedly) fix and test the transformation.

The goal of the Nautilus project is to semi-automatically support this analysis-fix-test cycle. This talk and demonstration focus on one main component of Nautilus, namely the Nautilus Analyzer that helps developers in understanding and debugging their data transformations. After a brief introduction to different algorithms implemented within Nautilus, the demonstration will show the capabilities of this component for data transformations specified in SQL on scenarios from different domains that are based on real-world data.

Permanent link to this article: https://team.inria.fr/oak/2013/02/15/melanie-herschel-the-nautilus-analyzer-understanding-and-debugging-data-transformations/