Sofian Maabout: Computation of Borders and Applications

14:00, Room 445 at PCRI

Given a set of objects O and a Boolean interest function q: 2^O -> {true, false}, the border of 2^O is the set of extremal (minimal or maximal) subsets o of O such that q(o) = true. This concept appears in many contexts, e.g., maximal frequent itemsets, approximate functional dependencies, and partial materialization of data cubes. We present a parallel algorithm for computing borders when q is anti-monotone and discuss its performance from both theoretical and experimental points of view. Some extensions will also be discussed, such as distributed data and MapReduce.
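To make the definition concrete, here is a minimal, brute-force sketch of computing the positive (maximal) border in Python. It is purely illustrative and is not the parallel algorithm presented in the talk.

```python
from itertools import combinations

def positive_border(items, q):
    """Return the maximal subsets S of `items` with q(S) == True,
    assuming q is anti-monotone: q(S) implies q(T) for every T ⊆ S."""
    satisfying = [frozenset(c)
                  for r in range(len(items) + 1)
                  for c in combinations(sorted(items), r)
                  if q(frozenset(c))]
    # A satisfying set is on the border iff it has no satisfying strict superset.
    return [s for s in satisfying if not any(s < t for t in satisfying)]

# Toy anti-monotone predicate: "the subset has at most two elements".
border = positive_border({"a", "b", "c"}, lambda s: len(s) <= 2)
# The border is exactly the three 2-element subsets.
```

Anti-monotonicity is precisely what lets real algorithms avoid this exponential enumeration; the sketch only exercises the definition.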

Kostas Stefanidis: Contextual Database Preferences

14:00, Room 445 at PCRI

As both the volume of data and the diversity of users accessing them increase, user preferences offer a useful means towards improving the relevance of the query results to the information needs of the specific user posing the query. In this talk, we will focus on enhancing preferences with context. Context may express conditions on situations external to the database or related to the data stored in the database. We will outline models for expressing both types of preferences. Then, given a user query and its surrounding context, we will consider the problem of selecting related preferences to personalize the query.
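As a toy illustration of selecting context-dependent preferences for a query, the sketch below uses a hypothetical representation (dictionaries for context conditions); the names and structure are illustrative and not taken from the talk.

```python
# Each contextual preference pairs a context condition with a preferred
# value; an empty condition means the preference always applies.
preferences = [
    {"context": {"season": "summer"}, "prefer": "comedy"},
    {"context": {"season": "winter"}, "prefer": "drama"},
    {"context": {}, "prefer": "documentary"},  # context-free default
]

def related_preferences(query_context, prefs):
    """Select the preferences whose context condition is satisfied by
    the context surrounding the query."""
    return [p for p in prefs
            if all(query_context.get(key) == value
                   for key, value in p["context"].items())]

selected = related_preferences({"season": "summer", "mood": "relaxed"},
                               preferences)
# `selected` keeps the "summer" preference and the context-free default.
```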

Jorge Quiane: Managing Very Large Datasets in a Cloudy World

15:45, Room 445 at PCRI

Nowadays, many enterprises and organizations face large volumes of data that have to be analyzed on a daily basis. In particular, scientific datasets are growing at unprecedented rates and are likely to reach the order of exabytes. These data management needs require applications to run over a large number of computing nodes. However, database management systems (DBMSs) have proven inefficient at dealing with very large datasets and at scaling out to a large number of computing nodes. In this context, MapReduce and Cloud computing are two alternative technologies that respond to this challenge. While MapReduce allows enterprises, organizations, and researchers to easily process very large volumes of data, the Cloud provides the computing infrastructure required to scale applications out to a large number of computing nodes. The beauty of these approaches is their ease-of-use and almost-free-admin cost. However, this simplicity comes at a price: the performance of MapReduce applications in the Cloud often does not match that of a well-configured parallel DBMS. In this talk, we present some of the main features that allow a DBMS to achieve orders of magnitude better performance than MapReduce applications. Then, we analyze how our Hadoop++ project allows MapReduce applications to match DBMS performance in the Cloud. We also discuss the design choices we made in the Hadoop++ project to preserve the ease-of-use and the almost-free-admin cost of MapReduce applications in the Cloud. Finally, we conclude by discussing some of the challenges the Cloud imposes on efficient data management.
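For readers unfamiliar with the programming model, here is a tiny single-process sketch of MapReduce in Python. It illustrates only the map/group/reduce contract, not Hadoop++ or any Cloud deployment.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Tiny single-process illustration of the MapReduce model: map each
    record to (key, value) pairs, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example.
counts = map_reduce(
    ["the cloud", "the map reduce"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {"the": 2, "cloud": 1, "map": 1, "reduce": 1}
```

In a real deployment, the mapper and reducer run in parallel across many nodes, and the grouping step becomes a distributed shuffle.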

Xiao Bai: Toward Distributed Search

14:00, Room 455 at PCRI

The rapidly increasing amount of data on the Web provides a huge source of information but makes efficient search more challenging. Distributing the search is appealing as a way to improve both efficiency and scalability. In this talk, we first present, in the context of social tagging systems, two gossip-based approaches that personalize query processing in a peer-to-peer manner. The off-line approach relies on the user’s past behavior to personalize the search, while the on-line approach also takes the current query into account to further improve result quality for queries reflecting the user’s emerging interests. We then present, for a multi-site search engine, two approaches that invalidate entries in the result cache to guarantee the freshness of the results served to users. The on-line invalidation approach invalidates an entry upon a cache hit, according to index changes in the local search site. The threshold-based approach triggers invalidations when the index changes in remote search sites. The joint use of both approaches provides a promising way to react to the index updates that may arise in future multi-site search engines.
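The two invalidation strategies can be sketched as follows; the class and its fields are hypothetical simplifications, not the engines' actual code.

```python
class ResultCache:
    """Illustrative sketch of the two invalidation strategies:
    on-line (check on hit) and threshold-based (flush on accumulated
    remote index changes)."""

    def __init__(self, threshold):
        self.entries = {}          # query -> (results, index_version)
        self.local_version = 0     # version of the local site's index
        self.remote_changes = 0    # index changes seen at remote sites
        self.threshold = threshold

    def put(self, query, results):
        self.entries[query] = (results, self.local_version)

    def get(self, query):
        # On-line invalidation: on a cache hit, drop the entry if the
        # local index has changed since it was stored.
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, version = hit
        if version < self.local_version:
            del self.entries[query]
            return None
        return results

    def on_remote_index_change(self):
        # Threshold-based invalidation: flush once enough remote
        # index changes have accumulated.
        self.remote_changes += 1
        if self.remote_changes >= self.threshold:
            self.entries.clear()
            self.remote_changes = 0

cache = ResultCache(threshold=2)
cache.put("q1", ["doc3", "doc7"])
hit = cache.get("q1")      # served from the cache
cache.local_version += 1   # the local index changed
stale = cache.get("q1")    # entry is invalidated on this hit
```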

DanaC 2012: RDF Data Management in the Amazon Cloud

RDF Data Management in the Amazon Cloud
by Francesca Bugiotti, François Goasdoué, Zoi Kaoudi, and Ioana Manolescu
in the DanaC 2012 workshop (co-located with EDBT/ICDT 2012)

DMC 2012: Building Large XML Stores in the Amazon Cloud

Building Large XML Stores in the Amazon Cloud, by Jesús Camacho-Rodríguez, Dario Colazzo and Ioana Manolescu, in the Data Management in the Cloud (DMC) Workshop (co-located with ICDE 2012)

PhD defense of Wael Khemiri

“Data-intensive interactive workflows for visual analytics”

Room 455, PCRI, 2 pm, December 12, 2011

The increasing amounts of electronic data of all forms, produced by humans (e.g. Web pages, structured content such as Wikipedia or the blogosphere, etc.) and/or automatic tools (loggers, sensors, Web services, scientific programs or analysis tools, etc.), lead to a situation of unprecedented potential for extracting new knowledge, finding new correlations, or simply making sense of the data.

Visual analytics aims at combining interactive data visualization with data analysis tasks. Given the explosion in volume and complexity of scientific data, e.g., associated to biological or physical processes or social networks, visual analytics is called to play an important role in scientific data management.

Most visual analytics platforms, however, are memory-based and are therefore limited in the volume of data they can handle. Moreover, each new algorithm (e.g. for clustering) must be integrated into the platform by hand. Finally, such platforms lack the capability to define and deploy well-structured processes in which users with different roles interact in a coordinated way, sharing the same data and possibly the same visualizations.

This work is at the convergence of three research areas: information visualization, database query processing and optimization, and workflow modeling. It provides two main contributions: (i) we propose a generic architecture for deploying a visual analytics platform on top of a database management system (DBMS); (ii) we show how to propagate data changes to the DBMS and to the visualizations through the workflow process. Our approach has been implemented in a prototype called EdiFlow and validated through several applications. These clearly demonstrate that visual analytics applications can benefit from the robust storage and automatic process deployment provided by a DBMS while obtaining good performance, and thus scalability. Conversely, EdiFlow could also be integrated into a data-intensive scientific workflow platform to enrich its visualization features.
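The change-propagation idea can be sketched with a simple observer pattern: visualizations register with a table and are notified on every update. The names are illustrative and do not reflect EdiFlow's actual API.

```python
class Table:
    """Minimal sketch of change propagation: when data in the
    DBMS-backed table changes, registered visualizations are
    notified so they can refresh."""

    def __init__(self):
        self.rows = []
        self.observers = []

    def register(self, callback):
        self.observers.append(callback)

    def insert(self, row):
        self.rows.append(row)
        for notify in self.observers:
            notify("insert", row)

refreshed = []
table = Table()
table.register(lambda op, row: refreshed.append((op, row)))
table.insert({"x": 1, "y": 2})
# refreshed == [("insert", {"x": 1, "y": 2})]
```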

Key-words: Visual analytics, scientific workflow systems, dynamic changes.

IEEE TKDE: Robust Module-based Data Management

F. Goasdoué and M.-C. Rousset: “Robust Module-based Data Management”

PhD defense of Marina Sahakyan

University Paris-Sud 11, building 650 (PCRI), room 435

“Main Memory XML Update Optimization: algorithms and experiments”

XML projection is one of the main optimization techniques adopted for reducing memory consumption in in-memory XQuery engines. The main idea behind this technique is quite simple: given a query Q over an XML document D, instead of evaluating Q on D, the query is evaluated on a smaller document D’ obtained from D by pruning out, at loading time, the parts of D that are irrelevant for Q. The queried document D’ is a projection of the original one and is often much smaller than D, because queries tend to be quite selective in general.
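A much-simplified sketch of the pruning step: an element is kept if its tag is needed by the query or if it has a kept descendant. Real XML projection infers the needed paths statically from Q (and, in this thesis, from the schema); here they are given explicitly.

```python
import copy
import xml.etree.ElementTree as ET

def project(element, needed_tags):
    """Prune the tree, keeping only elements that are needed for the
    query or that lie on a path to a needed element."""
    if element.tag in needed_tags:
        return copy.deepcopy(element)  # the whole subtree is relevant
    kept = [p for p in (project(child, needed_tags) for child in element)
            if p is not None]
    if kept:
        out = ET.Element(element.tag, element.attrib)
        out.extend(kept)
        return out
    return None

doc = ET.fromstring(
    "<lib><book><title>XML</title><price>10</price></book>"
    "<dvd><title>Film</title></dvd></lib>")
pruned = project(doc, {"title"})
# `pruned` keeps <lib>, both <title> elements and their ancestors,
# and drops <price>.
```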

While projection techniques have been extensively investigated for XML querying, we are not aware of applications to XML updating.

This thesis investigates the application of a projection-based optimization mechanism to XQuery Update Facility expressions in the presence of a schema. The work includes a study of the method and a formal development of the Merge algorithm, as well as experiments demonstrating its effectiveness.

RFIA 2012: Reformulation-based Query Answering in RDF Databases

F. Goasdoué, I. Manolescu and A. Roatis: “Reformulation-based Query Answering in RDF Databases”
