ACM PODS: “Skyline Queries with Noisy Comparisons”

“Skyline Queries with Noisy Comparisons” by Benoît Groz and Tova Milo has been accepted for publication in ACM PODS 2015.

Abstract: We study in this paper the computation of skyline queries – a popular tool for multicriteria data analysis – in the presence of noisy input. Motivated by crowdsourcing applications, we focus on a computation model where the input data items can only be compared through noisy comparisons, and present the first algorithms for skyline evaluation in this context.
Specifically, we aim at minimizing the number of comparisons required for computing or verifying a candidate skyline, while returning the correct answer with high probability.
We design output-sensitive algorithms, namely algorithms that take advantage of the potentially small size of the skyline, and analyze the delay (number of comparison rounds) of our solutions.
We also consider the problem of predicting the most likely skyline given some partial information in the form of noisy comparisons, and show that optimal prediction is computationally intractable.
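
To make the setting of the abstract concrete, here is a minimal Python sketch of a naive baseline. It is not one of the algorithms from the paper: it compares every pair of items and amplifies each noisy comparison by majority vote, so it is neither output-sensitive nor optimized for the number of comparison rounds. The names make_noisy_oracle, majority, dominates and naive_skyline are hypothetical, and the sketch assumes that larger values are better and that no two items share a value on any dimension.

```python
import random
from typing import Callable, List, Sequence, Tuple

Item = Tuple[float, ...]
# An oracle answers "is a strictly better than b on dimension dim?".
# It stands in for a crowd worker and may give the wrong answer.
Oracle = Callable[[Item, Item, int], bool]


def make_noisy_oracle(error_rate: float, rng: random.Random) -> Oracle:
    """Build an oracle that flips the true answer with probability error_rate."""
    def oracle(a: Item, b: Item, dim: int) -> bool:
        truth = a[dim] > b[dim]
        return (not truth) if rng.random() < error_rate else truth
    return oracle


def majority(oracle: Oracle, a: Item, b: Item, dim: int, votes: int) -> bool:
    """Ask the same comparison `votes` times (odd) and keep the majority answer."""
    yes = sum(oracle(a, b, dim) for _ in range(votes))
    return 2 * yes > votes


def dominates(oracle: Oracle, a: Item, b: Item, votes: int) -> bool:
    """a dominates b if a is judged better than b on every dimension."""
    return all(majority(oracle, a, b, d, votes) for d in range(len(a)))


def naive_skyline(items: Sequence[Item], oracle: Oracle, votes: int = 3) -> List[Item]:
    """Return the items that no other item is judged to dominate (all pairs compared)."""
    return [
        a for i, a in enumerate(items)
        if not any(j != i and dominates(oracle, b, a, votes)
                   for j, b in enumerate(items))
    ]


if __name__ == "__main__":
    items = [(1, 9), (4, 7), (3, 5), (6, 4), (2, 2), (8, 1)]
    # With error_rate=0.0 the oracle never lies, so the output is exact.
    exact = make_noisy_oracle(0.0, random.Random(0))
    print(naive_skyline(items, exact, votes=1))
    # prints [(1, 9), (4, 7), (6, 4), (8, 1)]
```

Raising votes lowers the chance that a single comparison is decided wrongly, at the price of more questions to the crowd; this trade-off between the number of comparisons and the probability of a correct answer is what the abstract refers to.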

Permanent link to this article:

Nicoleta Preda: ANGIE in wonderland

When: Friday, February 13, at 14.00
Where: PCRI building, room 445
Title: ANGIE in wonderland

In recent years, several important content providers such as Amazon,
Musicbrainz, IMDb, Geonames, Google, and Twitter, have chosen to
export their data through Web services. To unleash the potential of
these sources for new intelligent applications, the data has to be
combined across different APIs.

To this end, we have developed ANGIE, a framework that maps the
knowledge provided by Web services dynamically into a local knowledge
base. ANGIE represents Web services as views with binding patterns
over the schema of the knowledge base. In this talk, I will focus on
two problems related to our framework.

In the first part, the focus will be on the automatic integration of
new Web services. I will present a novel algorithm for inferring the
view definition of a given Web service in terms of the schema of the
global knowledge base. The algorithm also generates a declarative
script that can transform the call results into results of the view. Our
experiments on real Web services show the viability of our approach.

The second part will address the evaluation of conjunctive queries
under a budget of calls. Conjunctive queries may require an unbounded
number of calls in order to compute the maximal answers. However, Web
services typically allow only a fixed number of calls per session.
Therefore, we have to prioritize query evaluation plans. We are working
on distinguishing, among all plans that could return answers, those plans
that actually will. Finally, I will show an application for this new
notion of plans.

Short bio:
Nicoleta Preda obtained her Ph.D. in computer science from the University Paris-Sud under the supervision of Serge Abiteboul and Ioana Manolescu. Before joining the University of Versailles in 2010, she was a post-doctoral researcher in the database group led by Gerhard Weikum at the Max Planck Institute for Informatics. Her research interests include the enrichment of KBs with dynamic data, rule mining, and querying large repositories of semi-structured data. Nicoleta teaches classes on data integration, database systems, XML technologies, and Web services.


Paolo Papotti: Beyond declarative mapping and cleaning

When: Monday, February 2, at 14.00
Where: PCRI building, room 445
Title: Beyond declarative mapping and cleaning

In the “big data” era, data integration is a popular activity both in academia and in industry. Integrating hundreds of heterogeneous sources on a daily basis requires a great amount of manual work in order to have data that is polished enough to be useful in the final applications, such as querying and mining. The problem is even harder in practice, as data is often dirty in nature because of typos, duplicates, and so on, which can lead to poor results in the analytic tasks.

Over the last ten years, several successful systems have been proposed to tackle this challenge with a formal, declarative approach based on first order logic. However, despite the positive results, there is still a gap between these proposals and the leading commercial systems. The latter are harder to maintain, to debug, and to test, but provide the level of personalization and detail that are needed to solve “real-world” problems. In this talk, I will describe some of my results in tackling mapping and cleaning with a declarative approach, and how this experience has pushed me to explore a new way that can take the best of both worlds.

Short bio:
Paolo Papotti is a scientist in the Data Analytics center at Qatar
Computing Research Institute (QCRI). He holds a Ph.D. degree in
computer science from Roma Tre University (Italy, 2007), where he was also Assistant Professor before joining QCRI. He had visiting appointments at IBM Almaden (USA) and at UC Santa Cruz (USA). His research topics are in the general area of information integration and data quality.


PAXQuery demo accepted at SIGMOD 2015

The demonstration “PAXQuery: Parallel Analytical XML Processing”
by Juan A. M. Naranjo, Jesús Camacho-Rodríguez, Dario Colazzo and
Ioana Manolescu
has been accepted for publication at SIGMOD 2015.


CliqueSquare RDF platform on Hadoop available for download

We are pleased to announce the source code release of CliqueSquare, an RDF data management system based on Hadoop.

CliqueSquare is a system for storing and querying large RDF graphs relying on Hadoop’s distributed file system (HDFS) and Hadoop’s MapReduce open-source implementation. It provides a novel partitioning and storage scheme that permits 1-level joins to be evaluated locally using efficient map-only joins. In addition, CliqueSquare is equipped with a unique optimization algorithm based on graphs and cliques capable of generating highly parallelizable flat query plans relying on n-ary equality joins.
The system is described in an upcoming ICDE 2015 paper as well as an ICDE 2015 demonstration (see

CliqueSquare Features
* Scalable RDF storage using novel partitioning algorithms specially designed for Hadoop and HDFS that take into account the peculiarities of the RDF structure to reduce query-generated network traffic
* Scalable processing of SPARQL Basic Graph Pattern (BGP) queries relying on:
(i) novel optimization algorithms aiming to produce highly parallelizable query plans;
(ii) efficient MapReduce physical operators maximizing the usage of the Hadoop cluster.

Minimum system requirements
* Hadoop 1.2.1
* Linux / Mac OS
* Java 6

The initial release of CliqueSquare is available at:

Feature to be added soon: support for grouping and aggregation

Try it out and help us improve it by sending us your feedback:

Best regards,

The CliqueSquare Team

Team Website:


Puya – Hossein Vahabi: Social media and Blogging

When: Friday, January 16, at 14.00
Where: PCRI building, room 445
Title: Social media and Blogging

The talk will focus on social media and blogging. In particular, I’ll present three different works: 1. A novel approach for as-you-type network-aware top-k keyword search over social media; 2. A novel approach to harness the social community information to discover and model the evolution of topics in social networks using matrix co-factorization; 3. A novel approach to enhance user engagement in online social networks, micro-blogging, and other online platforms.

Short bio:
Puya – Hossein Vahabi is a researcher at Yahoo Labs (2014-Now) working on social media, stream data, and advertising. He obtained his Ph.D. in Computer Science and Engineering (2009 – 2012) at the IMT Lucca, Italy, with a thesis on “Recommendation Techniques for Web Search and Social Media”. The thesis was supervised by Prof. Ricardo Baeza-Yates (Vice President of Research for Europe and Latin America, leading the Yahoo Labs), and Dr. Fabrizio Silvestri (Senior Researcher at Yahoo Labs and Researcher at National Research Council of Italy). Before joining Yahoo, he was involved in several startups on social blogging and social video streaming. In 2010, during his Ph.D., he worked on recommender systems (and massive query log analysis) at Yahoo Labs for a year, and he was also a research associate at the National Research Council of Italy for three years.


Yanlei Diao: Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty

When: Friday, January 16, at 10.00
Where: PCRI building, room 455
Who: Y. Diao
Title: Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty

Data management is becoming increasingly important in large-scale scientific applications such as computational astrophysics, severe weather monitoring, and genomics. In this talk, I present our recent work to address two major challenges raised by those scientific applications. The first challenge regards “data uncertainty”, due to the fact that scientific measurements are inherently noisy and uncertain. In particular, we address uncertain data management under the array model, which has gained popularity for large-scale scientific data processing due to performance benefits. We propose a suite of storage and evaluation strategies to support array operations under data uncertainty. Results from Sloan Digital Sky Survey (SDSS) datasets show that our techniques outperform state-of-the-art methods by 1.7x to 4.3x for the Subarray operation and 1 to 2 orders of magnitude for Structure-Join.

As scientific data continues to grow in size and diversity, it is becoming harder for the user to express her data interests precisely in a formal language like SQL. We refer to this second problem as “query uncertainty”. This leads to a strong need for “interactive data exploration,” a service that efficiently navigates the user through a large data space to identify the objects of interest. We present our initial work on interactive data exploration, with results suggesting that it is possible to predict user interests modeled by conjunctive queries with a small number of samples, while providing interactive performance.

Short bio:
Yanlei Diao is Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on big data analytics, scientific analytics, data streams, uncertain data management, and RFID and sensor data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.
Yanlei Diao was a recipient of the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year), IBM Scalable Innovation Faculty Award, and NSF Career Award, and she was a finalist for the Microsoft Research New Faculty Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She is currently Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Area Chair of SIGMOD 2015, and member of the SIGMOD Executive Committee and SIGMOD Software Systems Award Committee. In the past, she has served as Associate Editor of PVLDB, as organizing committee member of SIGMOD, CIDR, DMSN, and the New England Database Summit, and on the program committees of many international conferences and workshops. Her research has been strongly supported by industry with awards from Google, IBM, Cisco, NEC labs, and the Advanced Cybersecurity Center.


IEEE TKDE: “PAXQuery: Efficient Parallel Processing of Complex XQuery”

“PAXQuery: Efficient Parallel Processing of Complex XQuery”

by Jesús Camacho-Rodríguez, Dario Colazzo and Ioana Manolescu
has been accepted for publication in IEEE TKDE.


EDBT 2015: “Optimizing Reformulation-Based Query Answering in RDF”

“Optimizing Reformulation-based Query Answering in RDF”

by Damian Bursztyn, François Goasdoué and Ioana Manolescu

has been accepted for publication in EDBT 2015.


Francesca Bugiotti recruited by Supélec

Francesca has obtained a permanent teaching position at Supélec, starting in early 2015.

