IBC seminar: Themis Palpanas “Data Series Management: Fulfilling the Need for Big Sequence Analytics” 19 jan. 2018

IBC seminar, organized by Zenith
Monday 19 March 2018, 11h
Room 1/124, Bat. 5

Data Series Management: Fulfilling the Need for Big Sequence Analytics
Themis Palpanas
IUF and Université Paris Descartes

There is an increasingly pressing need, by several applications in diverse domains, for techniques able to index and mine very large collections of sequences, or data series. Examples of such applications come from a multitude of social and scientific domains, including biology, where high-throughput sequencing is generating massive sequence collections. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions, which are often not analyzed in their full detail due to their sheer size. However, no existing data management solution (such as relational databases, column stores, array databases, and time series management systems) offers native support for sequences and the corresponding operators necessary for complex analytics.
In this talk, we argue for the need to study the theory and foundations of the management of big data sequences, and to build corresponding systems that will enable scalable management and analysis of very large sequence collections. We describe recent efforts in designing techniques for indexing and mining truly massive collections of data series that will enable scientists to easily analyze their data. We discuss novel techniques that adaptively create data series indexes, allowing users to correctly answer queries before the indexing task is finished. Finally, we present our vision for the future of big sequence management research, including promising directions in terms of storage, distributed processing, and query benchmarks.
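As a concrete illustration of the kind of summarization many data series indexes rely on, here is a minimal sketch (not from the talk) of Piecewise Aggregate Approximation and a SAX-style symbolic word, which indexes of this family use as compact index keys; the breakpoints and alphabet size below are illustrative assumptions.

```python
import numpy as np

def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of equal-width segments."""
    chunks = np.array_split(np.asarray(series, dtype=float), segments)
    return np.array([c.mean() for c in chunks])

def sax(series, segments, breakpoints=(-0.67, 0.0, 0.67)):
    """Map a z-normalized series to a short symbolic word (alphabet size 4).

    The breakpoints split the Gaussian range into 4 roughly equiprobable
    regions; each PAA segment mean becomes one symbol (0..3)."""
    s = np.asarray(series, dtype=float)
    s = (s - s.mean()) / s.std()          # z-normalize first
    return tuple(np.searchsorted(breakpoints, paa(s, segments)))

# An 8-point series summarized as a 4-symbol word usable as an index key.
word = sax([1, 2, 3, 4, 10, 12, 11, 9], segments=4)
```

Series that are close under Euclidean distance tend to map to the same word, which is what lets an index prune most of a massive collection before any exact distance is computed.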

Short bio
Themis Palpanas is Senior Member of the Institut Universitaire de France (IUF), a distinction that recognizes excellence across all academic disciplines, and professor of computer science at the Paris Descartes University (France), where he is director of diNo, the data management group. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of Trento, and at IBM T.J. Watson Research Center, and visited Microsoft Research, and the IBM Almaden Research Center.
His interests include problems related to data science (big data analytics and machine learning applications). He is the author of nine US patents, three of which have been implemented in world-leading commercial data management products. He is the recipient of three Best Paper awards, and the IBM Shared University Research (SUR) Award.
He is currently serving on the VLDB Endowment Board of Trustees, as Editor-in-Chief of the BDR Journal, Associate Editor for VLDB 2019 and for the TKDE and IDA journals, as well as on the Editorial Advisory Board of the IS journal and the Editorial Board of the TLDKS Journal. He has served as General Chair for VLDB 2013, Associate Editor for VLDB 2017, Workshop Chair for EDBT 2016, ADBIS 2013 and ADBIS 2014, General Chair for the PDA@IOT International Workshop (in conjunction with VLDB 2014), and General Chair for the Event Processing Symposium 2009.

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-themis-palpanas-data-series-management-fulfilling-the-need-for-big-sequence-analytics-19-jan-2018/

Study day « Méthode, Intégrité Scientifique & Données », 16 February 2018, Montpellier

Zenith takes part in the study day « Méthode, Intégrité Scientifique & Données », Friday 16 February 2018, MSH SUD, Site Saint Charles 2, Montpellier.


Permanent link to this article: https://team.inria.fr/zenith/zenith-participe-a-la-journee-detude-methode-integrite-scientifique-donnees-vendredi-16-fevrier-2018-msh-sud-site-saint-charles-2-montpellier/

Zenith Seminar: Vitor Silva “A methodology for capturing and analyzing dataflow paths in computational simulations” 31 jan. 2018

Wednesday 31 January, 11h, Room 2/124

A methodology for capturing and analyzing dataflow paths in computational simulations
Vitor Silva, COPPE/UFRJ, Rio de Janeiro

Large-scale scientific applications are based on the execution of complex computational models in specific fields of science. They commonly generate and store huge volumes of scientific data, which may reside in raw data files or in-memory data structures. In this context, domain specialists often need to analyze part of these scientific data to validate their scientific hypotheses. Besides analyzing single data sources, they also need to relate scientific data from different data sources, and to perform analyses during the execution of the scientific application, since it may take days or weeks even in high performance computing environments. A solution is therefore needed that enables scientific and provenance data extraction (for dataflow monitoring) and online dataflow analysis. For this exploratory scientific data analysis scenario, we propose a methodology for capturing and analyzing dataflow paths from scientific applications, based on modeling the dataflow, the scientific data, and the queries.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-vitor-silva-a-methodology-for-capturing-and-analyzing-dataflow-paths-in-computational-simulations-31-jan-2018/

Zenith Seminar: Christophe Godin “Can we Manipulate Tree-forms like Numbers?” 7 dec. 2017

Can we manipulate tree-forms like numbers?
Christophe Godin, Inria

Thursday 7 December at 14h30

Seminar room, Bat. 4

Abstract: Tree-forms are ubiquitous in nature, and recent observation technologies make it increasingly easy to capture their details, as well as the dynamics of their development, in 3 dimensions, with unprecedented accuracy. These massive and complex structural data raise new conceptual and computational issues related to their analysis and to the quantification of their variability. Mathematical and computational techniques that usually apply successfully to traditional scalar or vectorial datasets fail on such structural objects: How to define the average form of a set of tree-forms? How to compare and classify tree-forms? Can we solve optimization problems efficiently in tree-form spaces? How to approximate tree-forms? Can their intrinsically exponential computational cost be circumvented? In this talk, I will present recent work with my colleague Romain Azais that approaches these questions from a new perspective, in which tree-forms show properties similar to those of numbers or real functions: they can be decomposed, approximated, averaged, and transformed into dual spaces where specific computations can be carried out more efficiently. I will discuss how these first results can be applied to the analysis and simulation of tree-forms in developmental biology.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-christophe-godin-can-we-manipulate-tree-forms-like-numbers-7-dec-2017/

Zenith seminar: Ji Liu “Efficient Uncertainty Analysis of Very Big Seismic Simulation Data” 6 dec. 2017

Efficient Uncertainty Analysis of Very Big Seismic Simulation Data
Ji Liu
Zenith Postdoc
Wednesday 6 December at 11h
Room: 02/124, Bat 5
In recent years, big simulation data is commonly generated from specific models in different application domains (astronomy, bioinformatics, social networks, etc.). In general, the simulation data corresponds to meshes that represent, for instance, a seismic soil area. Analyzing the uncertainty of the simulation data is very important in order to safely identify geological or seismic phenomena, e.g. seismic faults. To analyze the uncertainty, a Probability Density Function (PDF) of each point in the mesh is computed and then analyzed. However, this may be very time consuming (from several hours to even months) using a baseline approach based on parallel processing frameworks such as Spark. In this work, we propose new solutions to efficiently compute and analyze the uncertainty of very big simulation data using Spark. Our solutions use an original distributed architecture design. We propose three general approaches: data aggregation, machine learning prediction, and fast processing. We validate our approaches through extensive experiments using big data ranging from hundreds of GB to several TB. The experimental results show that our approach scales up very well and reduces the execution time by a factor of 33 (to the order of seconds or minutes) compared with a baseline approach.
This work is part of the HPC4E European project, joint work with LNCC, Brazil, co-authored with N. Moreno, E. Pacitti, F. Porto and P. Valduriez.
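As an illustration of the per-point PDF computation described above, here is a minimal single-machine sketch in plain numpy (the actual solutions use Spark and a distributed architecture not shown here); all function and variable names are ours.

```python
import numpy as np

def pointwise_pdf(realizations, bins=16):
    """Empirical PDF for each mesh point, from many simulation runs.

    realizations: array of shape (n_runs, n_points). Returns, for each
    mesh point, a density-normalized histogram (values, bin edges) over
    that point's simulated values across all runs."""
    r = np.asarray(realizations, dtype=float)
    pdfs = []
    for point_values in r.T:              # one column per mesh point
        hist, edges = np.histogram(point_values, bins=bins, density=True)
        pdfs.append((hist, edges))
    return pdfs

# 1000 runs of a toy 3-point "mesh" with different per-point distributions.
rng = np.random.default_rng(0)
runs = rng.normal(loc=[0.0, 1.0, 2.0], scale=0.5, size=(1000, 3))
pdfs = pointwise_pdf(runs)
```

On a real mesh this loop runs over millions of points, which is why a naive implementation takes hours and motivates the aggregation, prediction, and fast-processing approaches of the talk.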

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-ji-liu-efficient-uncertainty-analysis-of-very-big-seismic-simulation-data-6-dec-2017/

IBC & SciDISC seminar: Marta Mattoso “Human-in-the-loop to Fine-tune Data in Real Time” 14 dec. 2017

Human-in-the-loop to Fine-tune Data in Real Time
Marta Mattoso
COPPE/UFRJ, Rio de Janeiro

14 December 2017, 11h

Room 1/124, Bat.5

In long-lasting exploratory executions, one often needs to fine-tune several parameters of complex computational models, because they may significantly impact performance. Listing all possible combinations of parameters and exhaustively trying them all is nearly impossible, even on high performance computers. Because of the exploratory nature of those computations, it is hard to determine, before the execution, which parameters and which values will work best to validate the initial hypothesis, even for the most experienced users. For this reason, after the initial setup, the user starts the computation and fine-tunes specific parameters based on online intermediate data analysis. In this talk, we present the challenges in supporting the user with data analysis to monitor, evaluate and adjust executions in real time. One of the problems in these executions is that, after some hours, users can lose track of what has been tuned at early execution stages if the adaptations are not properly registered. We discuss using techniques from provenance data management and human-in-the-loop approaches to address the problem of adapting and tracking online parameter fine-tuning in several applications.
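To make the tracking problem concrete, here is a hypothetical sketch of a steering log that registers each runtime parameter adaptation together with its reason, so the user can later reconstruct what was tuned and when. It is not the authors' system; all names are illustrative.

```python
import time

class SteeringLog:
    """Record every runtime parameter adaptation (old value, new value,
    timestamp, reason) so nothing tuned early in the run is lost."""
    def __init__(self):
        self.events = []

    def tune(self, params, name, value, reason=""):
        self.events.append({"time": time.time(), "param": name,
                            "old": params.get(name), "new": value,
                            "reason": reason})
        params[name] = value

    def history(self, name):
        """All recorded adaptations of one parameter, in order."""
        return [e for e in self.events if e["param"] == name]

params = {"time_step": 1e-3, "tolerance": 1e-6}
log = SteeringLog()
log.tune(params, "time_step", 5e-4, reason="residual oscillating")
log.tune(params, "tolerance", 1e-5, reason="convergence too slow")
```

The same record can be joined with the workflow's provenance data to answer, hours later, "which intermediate results were produced before the time step was halved?".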

Permanent link to this article: https://team.inria.fr/zenith/marta-2017/

Post-doc position: Similarity Search in Large Scale Time Series

Title: Similarity Search in Large Scale Time Series

We are seeking a postdoctoral fellow in time series analytics, in collaboration with Safran ( https://www.safran-group.com/ ).


Nowadays, sensor technology is improving, and the number of sensors used for collecting information from different environments is increasing, e.g., from critical systems such as airplane engines. This massive use of sensors results in the production of large-scale data, usually in the form of time series. With such complex and massive sets of time series, fast and accurate similarity search is key to performing many data mining tasks such as shapelet extraction, motif discovery, classification, or clustering.

This postdoc position is proposed in the context of a collaboration between the Inria Zenith team and Safran (a multinational company specializing in aircraft and rocket engines). We are interested in correlation detection over multi-dimensional time series, e.g. generated by engine check tests. For instance, given a time slice (generated using a set of input parameters) of a very large time series, the objective is to quickly detect the time slice most similar to it, and thereby find the input parameter values that generate similar outputs.

One of the distinguishing features of our underlying application is the huge volume of data to be analyzed. To deal with such datasets, we intend to develop scalable solutions that take advantage of parallel frameworks (such as MapReduce, Spark or Flink) that allow us to build efficient parallel data mining systems on commodity machines. We will capitalize on our recent projects, where we developed parallel solutions for indexing and analyzing very large datasets, e.g. [YAMP2017, SAM2017, SAM2015, AHMP2015].

One possibility for scalable correlation detection in this project is to build on top of related work, including the matrix profile index [YZUB+2016] over time series generated by thousands of sensors. One of the tasks, in the context of this project, will be to develop distributed solutions for constructing and exploiting such indexes over large-scale time series coming from massively distributed sensors.
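To make the underlying primitive concrete, here is a brute-force, single-machine sketch of z-normalized nearest-subsequence search, the building block that matrix-profile-style indexes compute for every offset; it is illustrative only, not the distributed solution envisioned for the postdoc, and all names are ours.

```python
import numpy as np

def znorm(x):
    """Z-normalize so that similarity ignores offset and scale."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def nearest_slice(series, query, exclusion=None):
    """Offset and distance of the subsequence of `series` closest to
    `query` under z-normalized Euclidean distance. `exclusion` skips
    offsets near a known position, avoiding trivial self-matches."""
    q = znorm(query)
    m = len(q)
    best = (None, np.inf)
    for i in range(len(series) - m + 1):
        if exclusion is not None and abs(i - exclusion) < m:
            continue
        d = np.linalg.norm(znorm(series[i:i + m]) - q)
        if d < best[1]:
            best = (i, d)
    return best

# A slowly drifting periodic signal: the slice at offset 50 should find a
# close match roughly one period away, not at its own (excluded) position.
series = np.sin(np.linspace(0, 8 * np.pi, 400)) + 0.01 * np.arange(400)
start, dist = nearest_slice(series, series[50:100], exclusion=50)
```

This naive scan is O(n·m) per query; matrix-profile techniques and the distributed indexes targeted by this position exist precisely to avoid paying that cost over billions of points.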

[YAMP2017] Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. IEEE International Conference on Data Mining (ICDM),  2017.

[SAM2017] Saber Salah, Reza Akbarinia, Florent Masseglia. Data placement in massively distributed environments for fast parallel mining of frequent itemsets. Knowledge and Information Systems (KAIS), 53(1), 207-237, 2017.

[SAM2015] Saber Salah, Reza Akbarinia, Florent Masseglia, Fast Parallel Mining of Maximally Informative k-Itemsets in Big Data. IEEE International Conference on Data Mining (ICDM), 2015.

[AHMP2015] Tristan Allard, Georges Hébrail, Florent Masseglia, Esther Pacitti. Chiaroscuro: Transparency and Privacy for Massive Personal Time-Series Clustering.  ACM Conference on Management of Data (SIGMOD), pp. 779-794, 2015.

[YZUB+2016] C-C M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. Anh Dau, D. Furtado Silva, A. Mueen, E. Keogh. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. IEEE International Conference on Data Mining (ICDM), 2016.


This work will be done in the context of a collaboration between the Inria Zenith team and Safran. The Zenith project-team ( https://team.inria.fr/zenith/ ), headed by Patrick Valduriez, aims to propose new solutions related to scientific data and activities. Our research topics incorporate the management and analysis of massive and complex data, such as uncertain data, in highly distributed environments. Our team is located in Montpellier, a very active city in the south of France.

Safran ( https://www.safran-group.com/ ; https://en.wikipedia.org/wiki/Safran ) is a multinational company specializing in aircraft and rocket engines and aerospace component manufacturing.

Skills and profiles

Strong background in data mining

Strong skills in parallel data processing with Spark

A Ph.D. in computer science or mathematics

Duration, Salary and Location

Duration: 12 months

Annual Gross Salary: up to 42K€ depending on your experience.

Starting date: flexible but ideally as soon as possible.

This work will be done mainly in Montpellier, with regular visits to the Safran team in Paris.


Contacts: Florent Masseglia (florent.masseglia@inria.fr) and Reza Akbarinia (reza.akbarinia@inria.fr).

Permanent link to this article: https://team.inria.fr/zenith/similarity-search-in-large-scale-time-series/

PhD Position: Distributed Management of Scientific Workflows for High-Throughput Plant Phenotyping

Directors: Esther Pacitti (Zenith team, University of Montpellier and Inria, LIRMM), François Tardieu (UMR LEPSE, INRA) and Christophe Pradal (CIRAD)

Contact: christophe.pradal@cirad.fr

Funding: #Digitag-Inria (Inria PhD contract, Net salary/month: 1600€)

Keywords: Scientific Workflow, Distributed Computing, Cloud & Grid Computing, Phenotyping, Computer Vision, Reproducibility

Skills: We are looking for candidates strongly motivated by challenging research topics in a multi-disciplinary environment. The applicant should have a good background in computer science, including distributed computing, databases and computer vision. Basic knowledge of scientific workflows would be a plus. As regards software development, the C, Python or Java languages are preferred.


This work is part of a new project on Scientific Workflows for Plant Phenotyping using cloud and grid computing, in the context of the Digital Agriculture Convergence Lab (#DigitAg) and in collaboration with the PIA Phenome project. This PhD will be directed both by computer scientists (E. Pacitti, C. Pradal) and by a biologist (F. Tardieu) that will provide both the data and the use cases relevant in plant phenotyping.

In the context of climate change and of improving crop performance, plant scientists study traits of interest in order to discover their natural genetic variations and identify their genetic controls. One important category is the morphological traits, which determine the 3D plant architecture [8]. This geometric information is useful to compute in-silico light interception and radiation-use efficiency (RUE), which are essential components to understand the genetic controls of biomass production and yield [9].

During the last decade, high-throughput phenotyping platforms have been designed to acquire quantitative data that help understand plant responses to environmental conditions and the genetic control of these responses. Plant phenotyping consists in observing the physical and biochemical traits of plant genotypes in response to environmental conditions. Recently, projects such as the Phenome project have started to use high-throughput platforms to observe the dynamic growth of a large number of plants under different conditions, both in the field and on platforms. These widely instrumented platforms produce huge datasets (images of thousands of plants, data collected by various sensors…) that keep increasing with complex in-silico experiments. For example, the seven facilities of Phenome produce from 150 to 200 terabytes of data per year. These data are heterogeneous (images, time courses), multiscale (from the organ to the field) and come from different sites. Farmers and breeders who use sensors from precision agriculture are now able to capture huge amounts of diverse data (e.g. images). Thus, the major problem becomes the automatic analysis of these massive datasets and the reproducibility of the in-silico experiments.

We define a scientific workflow as a pipeline to analyze experiments in an efficient and reproducible way, allowing scientists to express multi-step computational tasks (e.g. upload input files, preprocess the data, run various analyses and simulations, aggregate the results, …). OpenAlea [6] is a scientific workflow system that provides methods and software for plant modeling at different scales. It has been in constant use by the plant community since 2004: the system has been downloaded 670 000 times and the web site has 10 000 unique visitors a month according to the OpenAlea web repository (https://openalea.gforge.inria.fr).

In the frame of Phenome, we are developing Phenomenal, a software package in OpenAlea dedicated to the analysis of phenotyping data in connection with ecophysiological models [9,10]. Phenomenal provides fully automatic workflows dedicated to the 3D reconstruction, segmentation and tracking of plant organs. OpenAlea radiative models are used to estimate light use efficiency and in-silico crop performance in a large range of contexts. To illustrate, Figure 1 shows the Phenomenal workflow that automatically reconstructs the 3D shoot architecture of plants from multi-view images acquired with the Phenoarch platform. This workflow has been tested on various annual and perennial plants such as maize, cotton, sorghum and young apple trees.

Executing such complex scientific workflows on huge datasets may take a lot of time. Thus, we have started to design an infrastructure, called InfraPhenoGrid, to distribute the computation of workflows using the EGI/France Grilles computing facilities [1]. EGI provides access to a grid with multiple sites around the world, each with one or more clusters. This environment is now well suited for data-intensive science, with different research groups collaborating at different sites. In this context, the goal is to address two critical issues in the management of plant phenotyping experiments: (i) scheduling distributed computation and (ii) allowing reuse and reproducibility of experiments [1,2].

Thesis subject

The proposed PhD thesis consists in scheduling the Phenomenal workflow on distributed resources and providing proofs of concept.

Scheduling distributed computation.

We shall adopt an algebraic approach, which is better suited for the optimization and parallelization of data-intensive scientific workflows [3]. The scheduling problem resembles scientific workflow execution in a multisite cloud [4,5]. The objective of the thesis is to go further and propose workflow parallelization and dynamic task allocation and data placement techniques to work with heterogeneous sites, as in EGI. To exchange and share intermediate data, we plan to use iRODS, an open-source data management software that federates distributed and heterogeneous data resources into a single logical file system [7]. In this context, the challenge is to deal with both task allocation and data placement among the different sites, while taking into account their heterogeneity, for instance, different transfer capabilities and cost models.
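As a toy illustration of the joint task-allocation and data-placement trade-off (not the algebraic approach of [3], nor the actual EGI scheduler), here is a greedy sketch where each task is assigned to the site with the lowest estimated finish time, counting the cost of transferring input data that is not already resident at that site; all names and cost models are illustrative assumptions.

```python
def greedy_placement(tasks, sites):
    """Greedy multisite allocation sketch.

    tasks: list of (name, work_units, input_name, input_size_MB)
    sites: dict site -> {"speed": units/s, "bandwidth": MB/s,
                         "data": set of resident input names}
    Returns a task -> site plan; placed data is cached at its site.
    """
    ready_at = {s: 0.0 for s in sites}      # when each site becomes free
    plan = {}
    for name, work, data, size in tasks:
        def finish(s):
            # pay a transfer only if the input is not already on site s
            transfer = 0.0 if data in sites[s]["data"] else size / sites[s]["bandwidth"]
            return ready_at[s] + transfer + work / sites[s]["speed"]
        best = min(sites, key=finish)
        ready_at[best] = finish(best)
        sites[best]["data"].add(data)       # data placement: now cached there
        plan[name] = best
    return plan

# Site A is slower but already holds the images; site B is fast but remote.
sites = {"A": {"speed": 10.0, "bandwidth": 100.0, "data": {"img1"}},
         "B": {"speed": 20.0, "bandwidth": 1.0,   "data": set()}}
plan = greedy_placement([("reconstruct", 50, "img1", 500),
                         ("segment",     50, "img1", 500)], sites)
```

Even this crude model shows why heterogeneity matters: the faster site loses to the site that already holds the data, which is exactly the interplay between task allocation and data placement the thesis will study.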

Allowing reuse and reproducibility of experiments

Modern scientific workflow systems are now equipped with modules that offer assistance for this. This is notably the case of provenance modules, able to trace the parameter settings chosen at runtime and the data sets used as input of (or produced by) each workflow task. However, workflow reproducibility and reuse depend on providing users with the means to interact with provenance information. The originality of the thesis lies in considering tools popular among data scientists, namely interactive notebooks (like RStudio or Jupyter), as a means for users to interact with provenance information directly extracted from workflow runs. Challenges are numerous and include providing users with simplified (sequential), yet correct (in terms of the data dependencies involved) provenance information, hiding the complexity of highly parallel executions.
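A minimal sketch of the kind of task-level provenance capture such modules perform: each task run is recorded with its name, arguments, and a digest of its output, so a notebook can later replay the lineage. This is illustrative only (OpenAlea's actual provenance module is richer); all names are ours.

```python
import functools
import hashlib

PROVENANCE = []   # append-only trace of task executions

def traced(task):
    """Record, for each run of a workflow task, its name, its arguments,
    and a short digest of its output."""
    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        result = task(*args, **kwargs)
        digest = hashlib.sha256(repr(result).encode()).hexdigest()[:12]
        PROVENANCE.append({"task": task.__name__,
                           "args": repr(args), "kwargs": repr(kwargs),
                           "output_digest": digest})
        return result
    return wrapper

@traced
def threshold(image, level):
    """Toy image-analysis task: binarize a 2D image at `level`."""
    return [[1 if px > level else 0 for px in row] for row in image]

mask = threshold([[0.1, 0.9], [0.7, 0.2]], level=0.5)
```

Presenting `PROVENANCE` as a simplified sequential trace, even when tasks actually ran in parallel, is precisely the kind of interface the thesis proposes to expose inside notebooks.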

The approaches proposed in this PhD will be implemented in OpenAlea. Image data from controlled and field phenotyping experiments will be provided by the Phenome project. The grid and cloud infrastructure for the experiments will be France Grilles (European Grid Infrastructure, EGI).


[1] C. Pradal, S. Artzet, J. Chopard, D. Dupuis, C. Fournier, M. Mielewczik, V. Nègre, P. Neveu, D. Parigot, P. Valduriez, S. Cohen-Boulakia: InfraPhenoGrid: A scientific workflow infrastructure for plant phenomics on the Grid. Future Generation Comp. Syst. 67: 341-353 (2017).

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal, C. Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems. doi: 10.1016/j.future.2017.01.012 (2017).

[3] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. In Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339 (2011).

[4] J. Liu, E. Pacitti, P. Valduriez, M. Mattoso: A Survey of Data-Intensive Scientific Workflow Management. J. Grid Comput. 13(4): 457-493(2015).

[5] J. Liu, E. Pacitti, P. Valduriez, D. de Oliveira, M. Mattoso: Multi-objective scheduling of Scientific Workflows in multisite clouds. Future Generation Computer Systems, 63: 76-95 (2016)

[6] C. Pradal, C. Fournier, P. Valduriez, S. Cohen-Boulakia: OpenAlea: scientific workflows combining data analysis and simulation. SSDBM: 11:1-11:6 (2015).

[7] A. Rajasekar, R. Moore, C. Y. Hou, C. A. Lee, R. Marciano, A. de Torcy et al. iRODS Primer: integrated rule-oriented data system. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1), 1-143. (2010).

[8] M. Balduzzi, B. M. Binder, A. Bucksch, C. Chang, L. Hong, A. Lyer-Pascuzzi, C. Pradal, E. Sparks. Reshaping plant biology: Qualitative and quantitative descriptors for plant morphology. Frontiers in Plant Science 8:117 (2017).

[9] L. Cabrera‐Bosquet, C. Fournier, N. Brichet, C. Welcker, B. Suard, F. Tardieu. High‐throughput estimation of incident light, light interception and radiation‐use efficiency of thousands of plants in a phenotyping platform. New Phytologist, 212(1), 269-281 (2016).

[10] S. Artzet, N. Brichet, L. Cabrera, T. W. Chen, J. Chopard, M. Mielewczik, C. Fournier, C. Pradal. Image workflows for high throughput phenotyping platforms. BMVA technical meeting: Plants in Computer Vision, London, United Kingdom (2016).

Permanent link to this article: https://team.inria.fr/zenith/phd-position-distributed-management-of-scientific-workflows-for-high-throughput-plant-phenotyping/

IBC seminar: Dennis Shasha “Reducing Errors by Refusing to Guess (Occasionally)” 1 June 2017

IBC Seminar
Thursday 1 June 2017, 15h, room 1/124, Bat. 5, Campus Saint Priest, Montpellier

Reducing Errors by Refusing to Guess (Occasionally)

Dennis Shasha

New York University
We propose a meta-algorithm to reduce the error rate of state-of-the-art machine learning algorithms by refusing to make predictions in certain cases, even when the underlying algorithms suggest predictions. Intuitively, our new Conjugate Prediction approach estimates the likelihood that a prediction will be in error and, when that likelihood is high, refuses to go along with that prediction. Unlike other approaches, we can probabilistically guarantee an error rate on the predictions we do make (the “decisive predictions”). Empirically, on seven diverse data sets from genomics, ecology, image recognition, and gaming, our method can probabilistically guarantee to reduce the error rate to 1/4 of that of the state-of-the-art machine learning algorithm, at a cost of between 11% and 58% refusals. Competing state-of-the-art methods refuse at roughly twice our rate (sometimes refusing all suggested predictions).
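As a rough illustration of selective ("refusing") prediction, here is a sketch that refuses low-confidence predictions and picks a refusal threshold on validation data. It is not the Conjugate Prediction method itself, which provides probabilistic guarantees this naive calibration does not; all names are ours.

```python
import numpy as np

def selective_predict(probs, threshold):
    """Refuse predictions whose top-class probability is below threshold.

    Returns (labels, decisive): refused entries get label -1, and
    `decisive` marks the predictions the model commits to."""
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)
    labels = np.where(conf >= threshold, probs.argmax(axis=1), -1)
    return labels, conf >= threshold

def calibrate(val_probs, val_labels, target_error):
    """Smallest threshold whose error on decisive validation predictions
    stays within target_error (a crude stand-in for a real guarantee)."""
    val_labels = np.asarray(val_labels)
    for t in np.linspace(0.5, 1.0, 51):
        labels, decisive = selective_predict(val_probs, t)
        if decisive.sum() == 0:
            return t
        if (labels[decisive] != val_labels[decisive]).mean() <= target_error:
            return t
    return 1.0

probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
labels, decisive = selective_predict(probs, threshold=0.7)
```

The trade-off in the abstract is visible even here: raising the threshold lowers the error on decisive predictions but increases the refusal rate.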

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-dennis-shasha-reducing-errors-by-refusing-to-guess-occasionally-1-june-2017/

IBC seminar: Fabio Porto “Simulation Data Management” 1 June 2017

IBC Seminar
Thursday 1 June 2017, 14h, room 1/124, Bat. 5, Campus Saint Priest, Montpellier

Simulation Data Management

Fabio Porto

LNCC, Rio de Janeiro, Brazil

Numerical simulation has attracted the interest of different areas, from engineering and biology to astronomy. Using simulations, scientists can analyse the behaviour of hard-to-observe phenomena, and practitioners can test techniques before engaging in dangerous actions, such as brain surgery. Simulations are CPU-intensive applications, normally running on supercomputers, such as the Santos Dumont machine at LNCC. They also produce a huge amount of data, distributed across hundreds of files and including different structures, such as a domain discretisation mesh, field values, and domain topology information. These data must be integrated and structured in a way that scientists can easily and efficiently interpret the simulation outcome, using analytical queries or scientific visualization tools. In this talk, we will present the main challenges involved in managing simulation data and highlight recent results in this area.

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-fabio-porto-simulation-data-management-1-june-2017/