Patrick Valduriez named ACM Fellow

The prestigious distinction from the Association for Computing Machinery (ACM) was recently awarded to a French national for the third time. This is a major honour for Patrick Valduriez, Senior Researcher at Inria and leader of the Zenith joint project-team with LIRMM* in Montpellier.

As one of the most influential computing societies in the scientific and educational world, the ACM awards the title of ACM Fellow every year to a few of its members for their outstanding contributions to computer science, at the origin of fundamental knowledge and technological progress. It signifies international recognition at the highest level by one’s peers.

For more information:

http://www.inria.fr/en/centre/sophia/news/patrick-valduriez-named-acm-fellow

(*) The Montpellier Laboratory of Informatics, Robotics, and Microelectronics (cross-faculty research entity of the University of Montpellier 2 (UM2) and the National Center for Scientific Research (CNRS) – Institut des sciences informatiques et de leurs interactions (INS2I))

Permanent link to this article: https://team.inria.fr/zenith/patrick-valduriez-named-acm-fellow/

Workshop Mastodons@Montpellier: Large-Scale Data Management in Life Sciences

Friday, December 7, 2012, 9am–5pm

Location: Seminar room, LIRMM, 161 rue Ada, 34392 Montpellier

Contacts: Esther.Pacitti@lirmm.fr and Eric.Rival@lirmm.fr

Website (registration): http://www.lirmm.fr/~pacitti/Mastodons.html

Biology and its applications, from medicine to agronomy and ecology, are becoming sciences that produce massive data and require new computational approaches to analyze and share these data.

The new High-Throughput Sequencing technologies that appeared in 2005, also called Next-Generation Sequencing (NGS), are revolutionizing the way research questions in life science are posed and solved. They make it possible to capture genomic diversity within a species, gene expression in cells, and epigenetic marks on the genome. The sheer volume of sequences brings these sciences into the realm of “Big Data” and poses enormous challenges for exploiting these data.

In the plant domain, quantitative genetics methods identify the genes involved in phenotypic variations in response to environmental conditions. They produce large amounts of data (e.g., 10^5 data points per day) at different time intervals (from minutes to days), at different sites, and at different scales, from small tissue samples up to the whole plant.

This interdisciplinary workshop will bring together researchers working on data management, bioinformatics, ecophysiology, biology, and related areas, to deepen the discussions on large-scale data management and on aspects specific to data processing for high-throughput sequencing and plant phenotyping, in order to identify research perspectives for 2013.

Program

9am Welcome

Session: Large-Scale Data Management Techniques
9:30am Invited talk: Jens Dittrich (Saarland University): Efficient Big Data Processing in Hadoop MapReduce
10:30am P. Valduriez (INRIA & IBC, LIRMM): Parallel Techniques for Big Data Management

11am Break

Session: Large-Scale Phenotyping
11:30am F. Tardieu (INRA, Montpellier): Data Management in Plant Phenotyping: the roles of plants and crop models
12pm Godin (INRIA): Toward high-throughput imaging for studying organismal development

12:30pm Lunch

Session: Phenotyping Data Processing
2pm E. Pacitti, M. Servajean (INRIA & LIRMM): Challenges on Phenotyping Data Sharing and a Case Study
2:30pm F. Masseglia (INRIA & LIRMM), F. Tardieu (INRA): Data Mining: current approaches and questions in plant phenotyping

2:45pm Short break

Session: Large-Scale Sequencing Data
3pm E. Rivals (IBC & LIRMM – CNRS & UM2): Challenges in the Analysis of High Throughput Sequencing Data
3:30pm A. Chateau (IBC & LIRMM, UM2): Genome Assembly Verification

4pm Short break

4:15pm–5pm Discussion

Permanent link to this article: https://team.inria.fr/zenith/workshop-mastodonsmontpellier-gestion-de-donnees-a-grande-echelle-en-science-de-la-vie/

Zenith scientific seminar: Marta Mattoso, “Exploring Provenance in High Performance Scientific Computing”, December 6, 2012.

Marta Mattoso has been a Professor in the Department of Computer Science at the COPPE Institute of the Federal University of Rio de Janeiro (UFRJ) since 1994, where she leads the Distributed Database Research Group. She received her D.Sc. degree from UFRJ. Dr. Mattoso has been active in the database research community for more than twenty years; her current research interests include distributed and parallel databases and data management aspects of scientific workflows.

Title: Exploring Provenance in High Performance Scientific Computing

Abstract: Large-scale scientific computations are often organized as a composition of many computational tasks linked through data flows. After the completion of a computational scientific experiment, a scientist has to analyze its outcome, for instance by checking the inputs and outputs of the computational tasks that are part of the experiment. This analysis can be automated using provenance management systems that describe, for instance, the production and consumption relationships between data artifacts, such as files, and the computational tasks that compose the scientific application. Due to their exploratory nature, large-scale experiments often involve iterations that evaluate a large space of parameter combinations. In this case, scientists need to analyze partial results during execution and dynamically steer the next steps of the simulation. Features such as user steering of workflows to track, evaluate, and adapt the execution need to be designed to support iterative methods. In this talk we show examples of iterative methods, such as uncertainty quantification, reduced-order models, CFD simulations, and bioinformatics. We discuss challenges in gathering, storing, and querying provenance as structured data enriched with information about the runtime behavior of computational tasks in high performance computing environments. We also show how provenance can enable interesting and useful queries that correlate computational resource usage, scientific parameters, and data set derivation. We briefly describe how the provenance of many-task scientific computations is specified and coordinated by current workflow systems on large clusters and clouds.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-marta-mattosoexploring-provenance-in-high-performance-scientific-computing-december-6-2012/

Zenith scientific seminar: Duy Hoa Ngo, “Enhancing Ontology Matching by Using Machine Learning, Graph Matching and Information Retrieval Techniques”, December 3, 2012.

Hoa will give a talk about his thesis work on Ontology Matching. He will defend his thesis a few days later (date to be announced).

Title: Enhancing Ontology Matching by Using Machine Learning, Graph Matching and Information Retrieval Techniques

Abstract: In recent years, ontologies have attracted a lot of attention in the Computer Science community, especially in the Semantic Web field. They serve as explicit conceptual knowledge models and provide the semantic vocabularies that make domain knowledge available for exchange and interpretation among information systems. However, due to the decentralized nature of the semantic web, ontologies are highly heterogeneous. This heterogeneity mainly causes variation in meaning or ambiguity in entity interpretation and, consequently, prevents domain knowledge from being shared. Therefore, ontology matching, which discovers correspondences between semantically related entities of ontologies, becomes a crucial task in semantic web applications.

Several challenges to the field of ontology matching have been outlined in recent research. Among them, the selection of appropriate similarity measures and the tuning of their combination are known as fundamental issues the community must address. In addition, verifying the semantic coherence of the discovered alignment is also a crucial task. Furthermore, the difficulty of the problem grows with the size of the ontologies.

To deal with these challenges, in this thesis we propose a novel matching approach that combines techniques from the fields of machine learning, graph matching, and information retrieval in order to enhance ontology matching quality. Indeed, we make use of information retrieval techniques to design new, effective similarity measures for comparing labels and context profiles of entities at the element level. We also apply a graph matching method named similarity propagation at the structure level, which effectively discovers mappings by exploring structural information of entities in the input ontologies. To combine similarity measures at the element level, we transform the ontology matching task into a classification task in machine learning. In addition, we propose a dynamic weighted sum method to automatically combine the matching results obtained from the element- and structure-level matchers. In order to remove inconsistent mappings, we design a new, fast semantic filtering method. Finally, to deal with large-scale ontology matching tasks, we propose two candidate selection methods to reduce the computational space.

All these contributions have been implemented in a prototype named YAM++. To evaluate our approach, we use various tracks from the OAEI campaign, namely Benchmark, Conference, Multifarm, Anatomy, Library, and Large Biomedical Ontologies. The experimental results show that the proposed matching methods work effectively. Moreover, in comparison with other participants in OAEI campaigns, YAM++ proved highly competitive and achieved high rankings.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-duy-hoa-ngoenhancing-ontology-matching-by-using-machine-learning-graph-matching-and-information-retrieval-techniques-december-3-2012/

Time series clustering in agronomy: grouping plants to better study them.

M2R Computer Science internship topic, 2012–13.

Florent Masseglia, Inria-Lirmm, florent.masseglia@inria.fr

François Tardieu, Inra, francois.tardieu@supagro.inra.fr

Patrick Valduriez, Inria-Lirmm, patrick.valduriez@inria.fr

The more a plant is watered and lit, the more it grows… This “analysis” is not very informative, especially for agronomy research, which demands finer results on the data it produces. Unfortunately, such truisms dominate some analyses, because they are highly characteristic of reality. And this dominance is an obstacle to discovering finer, more instructive knowledge in these data, particularly in the field of phenotyping.

Phenotyping studies the relationships between the genotype (genetic makeup) and the phenotype (behavior) of plants under several environmental scenarios. In other words, it compares the development of several genetic varieties of a plant in the same environment. This comparison provides a better understanding of certain plant characteristics (production capacity, resistance to climatic conditions, etc.) as a function of variety.

To study these responses, each genotype is represented several times (e.g., 3 to 10 plants) in order to reduce the risk of statistical outliers. The set of plants sharing the same genotype is hereafter called an “accession”. The PhénoArch platform can analyze 1,650 plants, corresponding to 100–400 accessions depending on the number of experimental treatments. The platform collects information about the plants and their environment at regular intervals. Data from the PhénoArch platform take the form of time series (measurements taken at regular intervals) and may concern the environment (e.g., illumination, air temperature, humidity) or variables measured directly on the plants (e.g., growth, number of leaves, transpiration).

Analyzing these time series presents both a scientific challenge for phenotyping and technical challenges for computer science research.

Data cleaning

The platform data concern accessions, each represented by several plants. A first problem when analyzing these data is cleaning the data from plants with “deviant” behavior (one plant among the 3 to 6 representing an accession that behaves abnormally). A first set of tools would help detect these outliers at this stage. One option is to define a distance between the series in order to detect whether one of them deviates markedly from the rest. This detection would then serve as an “alarm” for the experts, to better target the data to examine in the subsequent analysis.
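As an illustration, such a distance-based alarm could look like the following Python sketch. This is a hypothetical illustration, not the internship deliverable: the Euclidean distance, the median baseline, and the factor-of-2 threshold are all assumptions.

```python
import statistics

def euclidean(a, b):
    # Euclidean distance between two equal-length series
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def flag_deviants(series, factor=2.0):
    """Indices of series whose mean distance to the other replicates exceeds
    `factor` times the median of those mean distances."""
    n = len(series)
    mean_to_others = [
        sum(euclidean(series[i], series[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    baseline = statistics.median(mean_to_others)
    return [i for i, m in enumerate(mean_to_others) if m > factor * baseline]

# Four growth series for the plants of one accession; the last one is abnormal.
replicates = [
    [1.0, 2.1, 3.0, 4.2],
    [1.1, 2.0, 3.1, 4.0],
    [0.9, 2.2, 2.9, 4.1],
    [1.0, 1.0, 1.1, 1.0],  # deviant plant
]
print(flag_deviants(replicates))  # → [3]
```

In practice the choice of distance (e.g., something warping-tolerant rather than pointwise Euclidean) is exactly one of the questions the state-of-the-art survey would address.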

Data analysis

Once cleaned, the plant (i.e., individual) data can be used to derive data characterizing an accession, as a form of generalization. In other words, from the time series of the 3 to 6 plants of an accession, one can obtain a single series (a kind of aggregated series for that accession). With one series per accession, one can then cluster the set of time series associated with these accessions.
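A minimal Python sketch of this aggregate-then-cluster pipeline follows. The pointwise-median aggregation, the deterministic two-means initialization, and the toy growth values are all illustrative assumptions, not part of the internship subject.

```python
import statistics

def aggregate(replicates):
    """One series per accession: pointwise median across its replicate series."""
    return [statistics.median(v) for v in zip(*replicates)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(series, iters=10):
    """Tiny 2-means over equal-length series, deterministically initialized
    with the series of smallest and largest overall sum."""
    centers = [min(series, key=sum), max(series, key=sum)]
    for _ in range(iters):
        groups = [[], []]
        for s in series:
            groups[0 if sqdist(s, centers[0]) <= sqdist(s, centers[1]) else 1].append(s)
        # recompute each center as the pointwise mean of its group
        centers = [
            [sum(v) / len(v) for v in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return [0 if sqdist(s, centers[0]) <= sqdist(s, centers[1]) else 1 for s in series]

# Aggregated growth series for four hypothetical accessions:
accessions = {
    "A": aggregate([[1.0, 2.0, 3.0, 4.2], [1.1, 2.1, 3.1, 4.1]]),  # fast growth
    "B": aggregate([[1.0, 2.2, 3.1, 4.2], [0.9, 2.0, 3.2, 4.0], [1.0, 2.1, 3.0, 4.3]]),
    "C": aggregate([[1.0, 1.2, 1.4, 1.5], [1.0, 1.1, 1.5, 1.6]]),  # slow growth
    "D": aggregate([[0.9, 1.1, 1.3, 1.6], [1.1, 1.3, 1.4, 1.5]]),
}
labels = two_means(list(accessions.values()))
print(dict(zip(accessions, labels)))  # fast accessions share one label, slow the other
```

Extending this from one phenotypic variable to several (step 3 below) is precisely where the method stops being trivial.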

The internship work consists of three main steps:

  1. State of the art. The student will survey the state of the art in time series analysis, covering discretization, regression, and clustering.

  2. Application of a state-of-the-art technique (chosen in consultation with the supervisors) to a real dataset from the PhénoArch platform, considering a single phenotypic variable (e.g., growth). The student will implement the selected technique in C/C++ or Java.

  3. Proposal of a method to take several variables into account in the clustering process.

Permanent link to this article: https://team.inria.fr/zenith/clustering-de-series-temporelles-en-agronomie-regrouper-les-plantes-pour-mieux-les-etudier/

Zenith scientific seminar: Tristan Allard, “Privacy-Preserving Data Publishing using Secure Devices”, November 16, 2012.


Tristan Allard will present part of his Ph.D. thesis work on Privacy-Preserving Data Publishing on November 16, 2012, at 10:30 am. Location: Galéra, Room 127.

Title: METAP: Revisiting Privacy-Preserving Data Publishing using Secure Devices.

Abstract: The goal of Privacy-Preserving Data Publishing (PPDP) is to generate a sanitized (i.e., harmless) view of sensitive personal data (e.g., a health survey), to be released to some agencies or simply to the public. However, traditional PPDP practices all assume that the process runs on a trusted central server. In this talk, I will argue that this trust assumption is far too strong, and give an overview of METAP, a generic fully distributed protocol designed to execute various forms of PPDP algorithms on an asymmetric architecture composed of low-power secure devices and a powerful but untrusted infrastructure. This work, currently under submission, is joint with Benjamin Nguyen and Philippe Pucheral.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-tristan-allardprivacy-preserving-data-publishing-using-secure-devices-november-16-2012/

Zenith scientific seminar: Imene Mami, “A Declarative Approach to Modeling and Solving the View Selection Problem”, November 9, 2012.

Imene will defend her Ph.D. thesis on November 15. She will give a talk about the view selection problem on November 9 at 10:30 am, room G.127.

Title: A Declarative Approach to Modeling and Solving the View Selection Problem

Abstract: View selection is important in many data-intensive systems, e.g., commercial database and data warehousing systems, to improve query performance. View selection can be defined as the process of selecting a set of views to be materialized in order to optimize query evaluation. To support this process, several related issues must be considered. Whenever a data source changes, the materialized views built on it have to be maintained in order to compute up-to-date query results. Besides view maintenance, each materialized view also requires additional storage space, which must be taken into account when deciding which and how many views to materialize.
The problem of choosing which views to materialize to speed up incoming queries, constrained by a storage overhead and/or maintenance costs, is known as the view selection problem. It is one of the most challenging problems in data warehousing and is known to be NP-complete. In a distributed environment, the view selection problem becomes more challenging still, since it includes the additional issue of deciding on which computer nodes the selected views should be materialized. The view selection problem in a distributed context is further constrained by storage capacities per computer node, maximum global maintenance costs, and the communication costs between the nodes of the network.
In this work, we deal with the view selection problem in a centralized context as well as in a distributed setting. Our goal is to provide a novel and efficient approach in these contexts. For this purpose, we designed a solution using constraint programming, which is known to be efficient for solving NP-complete problems and a powerful method for modeling and solving combinatorial optimization problems. The originality of our approach is that it provides a clear separation between the formulation and the resolution of the problem. Indeed, the view selection problem is modeled as a constraint satisfaction problem in an easy and declarative way. Then, its resolution is performed automatically by the constraint solver. Furthermore, our approach is flexible and extensible, in that it can easily model and handle new constraints and new heuristic search strategies for optimization purposes.
The main contributions of this thesis are as follows. First, we define a framework that enables a better understanding of the problems we address in this thesis. We also analyze the state of the art in materialized view selection, reviewing the existing methods and identifying their respective potentials and limits. We then design a solution using constraint programming to address the view selection problem in a centralized context. Our experimental results show that our approach provides the best balance between the computing time required to find the materialized views and the gain realized in query processing by materializing these views. Our approach is also guaranteed to find the optimal set of materialized views when no time limit is imposed. Finally, we extend our approach to the view selection problem under multiple resource constraints in a distributed context. Based on our extensive performance evaluation, we show that our approach outperforms a genetic algorithm designed for the distributed setting.
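To make the underlying trade-off concrete, here is a toy Python sketch of the optimization at the heart of view selection: pick the subset of candidate views that maximizes query-cost savings within a storage budget. This is only an illustrative analogue solved by exhaustive search; the thesis instead models the problem declaratively as a constraint satisfaction problem handed to a constraint solver, and the views, sizes, and savings below are made-up numbers.

```python
from itertools import combinations

# Candidate views: name -> (storage size, query-cost saving). Made-up numbers.
views = {
    "v1": (10, 25),
    "v2": (20, 40),
    "v3": (15, 30),
    "v4": (5, 10),
}

def best_selection(views, budget):
    """Exhaustive search over all subsets (fine for a handful of views)."""
    best, best_saving = set(), 0
    names = list(views)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            size = sum(views[v][0] for v in subset)
            saving = sum(views[v][1] for v in subset)
            if size <= budget and saving > best_saving:
                best, best_saving = set(subset), saving
    return best, best_saving

sel, sav = best_selection(views, budget=30)
print(sorted(sel), sav)  # → ['v1', 'v2'] 65
```

The exponential blow-up of this brute-force search is exactly why a constraint solver, with its pruning and search heuristics, is attractive for realistic numbers of candidate views.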

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-imene-mamia-declarative-approach-to-modeling-and-solving-the-view-selection-problem-november-9-2012/

Zenith scientific seminar: Florent Masseglia, “Mining Uncertain Data Streams”, October 17, 11am.

Florent Masseglia will present a recent work, done with Reza Akbarinia, about uncertain data stream mining on October 17 at 11am, room G.227.

Title: Mining Uncertain Data Streams.

Abstract: Dealing with uncertainty has gained increasing attention these past few years in both static and streaming data management and mining. There are many possible reasons for uncertainty, such as noise occurring when data are collected, noise injected for privacy reasons, the (often ambiguous) semantics of search engine results, etc. Thus, many sensitive domains now involve massive uncertain data (including scientific applications). The problem is even more difficult for uncertain data streams, where massive frequent updates need to be taken into account while respecting data stream constraints. In this context, discovering Probabilistic Frequent Itemsets (PFI) is very challenging since algorithms designed for deterministic data are not applicable.

In this talk, I will present our recent work with Reza Akbarinia on this topic. We propose FMU (Fast Mining of Uncertain data streams), the first solution for exact PFI mining in data streams with sliding windows. FMU allows updating the frequentness probability of an itemset whenever a transaction is added or removed from the observation window. Using these update operations, we are able to extract PFI in sliding windows with very low response times.
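For intuition about the incremental updates involved, the following simplified Python sketch maintains itemset supports over a sliding window as transactions enter and leave. It is a deterministic analogue only, not FMU: FMU performs the corresponding add/remove updates on frequentness *probabilities* of uncertain transactions, which is much harder. The window size, support threshold, and transactions are made-up.

```python
from itertools import combinations
from collections import Counter, deque

class SlidingWindowMiner:
    """Deterministic toy: exact itemset supports over a sliding window."""

    def __init__(self, window_size, min_support):
        self.window = deque()
        self.window_size = window_size
        self.min_support = min_support
        self.counts = Counter()

    def _itemsets(self, transaction):
        items = sorted(transaction)
        for r in range(1, len(items) + 1):
            yield from combinations(items, r)

    def add(self, transaction):
        # incremental update on arrival...
        self.window.append(transaction)
        for s in self._itemsets(transaction):
            self.counts[s] += 1
        # ...and on expiry of the oldest transaction
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            for s in self._itemsets(old):
                self.counts[s] -= 1

    def frequent(self):
        return {s for s, c in self.counts.items() if c >= self.min_support}

miner = SlidingWindowMiner(window_size=3, min_support=2)
for t in [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]:
    miner.add(t)
print(miner.frequent())  # supports computed over the last 3 transactions only
```

Enumerating all itemsets per transaction is exponential in transaction size; real stream miners (FMU included) rely on smarter update structures, which is what makes the exact-PFI result notable.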

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-florent-masseglia-mining-uncertain-data-streams-october-17-11am/

Zenith scientific seminar: Khalid Saleem, “Open Data Analytics – Research Perspectives”, September 19, 11am.

Before leaving our team, Khalid will give a synthetic presentation of his work about open data analytics during his stay, on September 19, at 11am, G.127.

Title: Open Data Analytics – Research Perspectives

Abstract: According to a survey, the Internet had grown to 98 petabytes by 2011, comprising web pages and raw data, with more than 2 billion web users. Although web page creation and access have been standardized over the years, the available data lack such standards. Different terminologies are used to tag the data: open, big, or linked data. The contributed data are web-scale and exhibit a very high degree of format variance, making it very difficult to formalize a standard access technique. Based on these atypical data characteristics, data scientists envisage a new era of data analytics, requiring better algorithms and applications to deliver timely benefits from this data.

The presentation explains scenarios that help in typifying the data available on the web (open, big, linked) in different domains (Government, Science, Enterprise, Society). Secondly, we outline open data characteristics and present a model framework identifying the research domains related to open data analytics. The model can help data scientists and application developers devise open-data-driven, real-time analytical tools. Examples involving open financial equity data will also be highlighted.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-khalid-saleem-open-data-analytics-research-perspectives/

Zenith scientific seminar: Imene Mami, “View Selection Under Multiple Resource Constraints in a Distributed Context”, August 27, 11:30am.

In a joint talk with Miguel Liroz (at 11am), Imene will present her recent work on view selection in a distributed context under resource constraints, room G.127 at 11:30.

Title: View Selection Under Multiple Resource Constraints in a Distributed Context

Abstract: The use of materialized views in commercial database systems and data warehousing systems is a common technique to improve query performance. In past research, the view selection issue has essentially been investigated in a centralized context. In this paper, we address the view selection problem in a distributed scenario. We first extend the AND-OR view graph to capture distributed features. Then, we propose a solution using constraint programming for modeling and solving the view selection problem under multiple resource constraints in a distributed context. Finally, we show experimentally that our approach provides better performance, evaluating the quality of the solutions in terms of cost savings.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-imene-mami-view-selection-under-multiple-resource-constraints-in-a-distributed-context-august-27-1130am/