First (Virtual) Workshop of the HPDaSc project
22 September 2020
9h-12h30 Rio de Janeiro, 14h-17h30 Montpellier
Workshop objective: focus on current joint work between Brazil and France and discuss progress
09:00-09:30 (BR) / 14:00-14:30 (FR) – Fabio Porto and Patrick Valduriez: Opening: project and workshop overview
09:30-09:50 (BR) / 14:30-14:50 (FR) – Rafael Pereira, Alexis Joly, Fabio Porto: Deep learning techniques on small data
With the introduction of machine learning techniques, many complex tasks can be solved by learning their solutions directly from data. In particular, deep learning techniques can learn complex, nonlinear solutions from different data modalities (images, text, etc.). However, many of these methods are data hungry and generalize poorly when optimized on small data. For classification, the usual methods are also defined only for the closed-set setting, where optimization is well defined only for the classes seen in the training set. In this presentation, we discuss methods that constrain the hypothesis space the model traverses during optimization in order to improve generalization on small data. We also present approaches that define classification for the open-set problem, so that a model remains useful for classifying new samples.
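To make the open-set idea concrete, here is a minimal sketch of one simple baseline: rejecting a sample as "unknown" when the highest softmax probability falls below a threshold. This is not necessarily the approach discussed in the talk; the function names and the threshold value of 0.7 are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def open_set_predict(logits, class_names, threshold=0.7):
    """Return (label, confidence); the label is "unknown" when the
    classifier is not confident enough about any known class.

    The threshold is a hypothetical value that would have to be tuned
    on held-out data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]


classes = ["oak", "pine", "birch"]
print(open_set_predict([5.0, 0.0, 0.0], classes))  # confident: known class
print(open_set_predict([1.0, 1.0, 1.0], classes))  # ambiguous: rejected
```

A closed-set classifier would always return one of the three known classes; the rejection step is what lets the model signal that a sample may belong to a class it has never seen.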
09:50-10:10 (BR) / 14:50-15:10 (FR) – Heraldo Borges, Esther Pacitti, Florent Masseglia, Reza Akbarinia, Eduardo Ogasawara: Spatial-Time Motifs Discovery
Discovering motifs in time series data has been widely explored, and various techniques have been developed to tackle this problem. However, the literature shows a clear gap when it comes to spatial-time series. This work addresses that gap by presenting an approach to discover and rank motifs in spatial-time series, called the Combined Series Approach (CSA). CSA is based on partitioning the spatial-time series into blocks. Inside each block, subsequences of spatial-time series are combined so that a hash-based motif discovery algorithm can be applied. Motifs are validated according to both temporal and spatial constraints. Motifs are then ranked according to their entropy, their number of occurrences, and the proximity of their occurrences. The approach was evaluated using both synthetic and seismic datasets. CSA outperforms traditional methods designed only for time series. CSA was also able to prioritize motifs that were meaningful both in the context of synthetic data and according to seismic specialists.
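The hash-and-rank idea can be illustrated on a single time series. The sketch below is a simplified assumption, not the CSA implementation: it discretizes each sliding window into a short word, uses the word as a hash key to group repeated subsequences, and ranks the resulting motifs by entropy and number of occurrences. Spatial blocks, spatial constraints, and proximity-based ranking are omitted, overlapping matches are not filtered, and the breakpoints, alphabet, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def discretize(values, breakpoints=(-0.5, 0.5), alphabet="abc"):
    """Map each value to a symbol according to fixed breakpoints."""
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in values)


def shannon_entropy(word):
    """Entropy (in bits) of the symbol distribution inside a word."""
    counts = Counter(word)
    n = len(word)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_motifs(series, window=3, min_occurrences=2):
    """Group sliding windows by their discretized word and keep repeats.

    Returns (word, start_positions) pairs, most informative first:
    higher entropy, then more occurrences.
    """
    buckets = defaultdict(list)
    for start in range(len(series) - window + 1):
        word = discretize(series[start:start + window])
        buckets[word].append(start)  # the word acts as the hash key
    motifs = [(w, pos) for w, pos in buckets.items() if len(pos) >= min_occurrences]
    motifs.sort(key=lambda m: (shannon_entropy(m[0]), len(m[1])), reverse=True)
    return motifs


series = [0.0, 1.0, -1.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
for word, positions in find_motifs(series):
    print(word, positions)
```

Ranking by entropy pushes flat, uninformative patterns (such as a constant run) to the bottom of the list, which is the intuition behind using entropy as a ranking criterion.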
10:10-10:30 (BR) / 15:10-15:30 (FR) – Gaetan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu and Patrick Valduriez: Cache-aware scheduling of scientific workflows in multisite cloud
We consider the efficient execution of scientific workflows in a multisite cloud, leveraging the heterogeneous resources available at multiple geo-distributed data centers. Since it is common for workflow users to reuse code or data from previous workflows, a promising approach for efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. However, caching intermediate data and scheduling workflows to exploit such caching in a multisite cloud with heterogeneous sites is complex. In particular, workflow scheduling must be cache-aware in order to decide whether to reuse cached data or re-execute workflows entirely. In this work, we propose a solution for cache-aware scheduling of scientific workflows in a multisite cloud. Our solution is based on a distributed and parallel architecture and includes new algorithms for adaptive caching, cache site selection, and dynamic workflow scheduling. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation in a three-site cloud with a real application in plant phenotyping shows that our solution can yield major performance gains, reducing total time by up to 42% when 60% of the input data is the same for each new execution.
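The reuse-or-re-execute decision can be sketched with a toy cost model. The code below is only an illustration under stated assumptions, not the scheduling algorithm implemented in OpenAlea: it assumes that the cost of reusing a cached output is the time to transfer it between sites, and it compares that with an estimated compute time. All names (CachedOutput, plan_task, the bandwidth table) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedOutput:
    site: str        # site where the intermediate data is stored
    size_gb: float   # size of the cached data


def plan_task(task_id, compute_seconds, cache, bandwidth_gb_per_s, local_site):
    """Decide whether to reuse a cached result or execute the task again.

    Assumed cost model: reuse costs only the inter-site transfer time,
    and is free when the data is already at the local site.
    """
    entry = cache.get(task_id)
    if entry is None:
        return "execute"  # nothing cached for this task
    if entry.site == local_site:
        transfer_seconds = 0.0
    else:
        transfer_seconds = entry.size_gb / bandwidth_gb_per_s[(entry.site, local_site)]
    return "reuse" if transfer_seconds < compute_seconds else "execute"


cache = {"segment_images": CachedOutput(site="siteA", size_gb=10.0)}
bandwidth = {("siteA", "siteB"): 0.5}  # GB/s, hypothetical value

# Transferring 10 GB at 0.5 GB/s takes 20 s.
print(plan_task("segment_images", 60.0, cache, bandwidth, "siteB"))  # cheaper to reuse
print(plan_task("segment_images", 5.0, cache, bandwidth, "siteB"))   # cheaper to recompute
```

A real scheduler would also account for site heterogeneity, queueing, and storage costs; the point here is only that the decision depends on where the cached data lives relative to where the task runs.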
10:30-10:50 (BR) / 15:30-15:50 (FR) – Break
10:50-11:10 (BR) / 15:50-16:10 (FR) – Debora Pina, Liliane Neves, Daniel de Oliveira, Patrick Valduriez, Marta Mattoso: Using provenance for data analyses in physics informed neural networks
Scientific applications in Computational Science and Engineering (CSE) have been using Deep Neural Networks (DNNs) with an architecture that produces scientific data respecting the conservation laws of Physics. These are known as Physics-Informed Neural Networks (PINNs), also called physics-constrained deep learning, among other related terms. Tuning hyperparameters is time-consuming and relies on the experience of the DNN specialist. Hyperparameter tuning involves training different configurations and evaluating the results at each trial. These evaluations often require associating different kinds of data, e.g. performance data, environment data, domain data and hyperparameters. Provenance data capture and storage can help in the data analyses needed for this fine-tuning. We present provenance data services to be invoked by Keras/TensorFlow pipelines. The provenance database is compatible with LP representations, helping interoperability and reproducibility. These services have been used in PINNs for forward and inverse problems governed by the Eikonal equation. The Eikonal equation often appears in problems including, but not limited to, geometric optics, shortest path problems, image segmentation, and seismic and medical imaging. While there are efficient and stable techniques for solving the Eikonal equation for regular or arbitrary geometries in several dimensions, it remains a big challenge to solve inverse problems governed by this equation, especially when it comes to uncertainty quantification. The provenance database supports several analyses without having to run the DNN under a specific framework or portal. Queries on the loss function values help in evaluating epochs for fine-tuning.
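The kind of query mentioned above can be sketched with a small relational store. This is not the provenance service presented in the talk: the table layout, function names, and loss values below are hypothetical, and SQLite stands in for the provenance database. In a real pipeline, a function like record_epoch might be called from a training callback at the end of each epoch.

```python
import sqlite3


def create_store():
    """Create an in-memory store relating hyperparameters to per-epoch loss."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE training_epoch ("
        "trial_id TEXT, learning_rate REAL, epoch INTEGER, loss REAL)"
    )
    return conn


def record_epoch(conn, trial_id, learning_rate, epoch, loss):
    """Capture one provenance record for a finished epoch."""
    conn.execute(
        "INSERT INTO training_epoch VALUES (?, ?, ?, ?)",
        (trial_id, learning_rate, epoch, loss),
    )


def best_epoch_per_trial(conn):
    """For each trial, return the epoch with the lowest loss, best trial first."""
    query = (
        "SELECT t.trial_id, t.learning_rate, t.epoch, t.loss "
        "FROM training_epoch t "
        "WHERE t.loss = (SELECT MIN(loss) FROM training_epoch "
        "                WHERE trial_id = t.trial_id) "
        "ORDER BY t.loss"
    )
    return conn.execute(query).fetchall()


conn = create_store()
# Made-up loss values for two hypothetical trials.
for epoch, loss in enumerate([0.9, 0.5, 0.6]):
    record_epoch(conn, "trial_a", 0.01, epoch, loss)
for epoch, loss in enumerate([0.8, 0.7, 0.4]):
    record_epoch(conn, "trial_b", 0.001, epoch, loss)

for row in best_epoch_per_trial(conn):
    print(row)
```

Because the records live in a database rather than inside the training framework, such comparisons across trials can be made without rerunning the DNN.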
11:10-11:30 (BR) / 16:10-16:30 (FR) – Anderson Chaves, Patrick Valduriez, Fabio Porto: SAVIME Extension for ML Models
SAVIME is a multidimensional array in-memory database management system. It has been developed to support analytical queries over scientific data. It offers an extremely efficient ingestion procedure, which practically eliminates the waiting time to analyze incoming data. It also supports dense and sparse arrays and non-integer dimension indexing. It offers a functional query language processed by a query optimizer that generates efficient query execution plans. In this talk, in addition to a brief introduction to SAVIME, we will describe its extension to support the execution of machine learning models on data extracted from the database, and its API for interfacing with Python, which transforms its TAR data structure into a NumPy array. We compare the costs of using SAVIME for running predictions with those of a direct invocation from Python scripts. There are plenty of opportunities for future work, including using ML models to improve SAVIME performance, and allowing SAVIME subTARs (array partitions) to be distributed in a cluster of machines, with the execution engine extended to process distributed subTARs. Finally, we want to investigate the convergence of Linear Algebra operators and Array operators, enabling model training and execution within SAVIME.
11:30-12:00 (BR) / 16:30-17:00 (FR) – Discussion – Workshop Closing
Zenith: Esther Pacitti, Patrick Valduriez, Alexis Joly, Florent Masseglia, Reza Akbarinia, Christophe Pradal, Baldwin Dumontier, Oleksandra Levchenko, Jean-Christophe Lombardo. PhD students: Gaetan Heidsieck, Benjamin Deneu, Lamia Djebour, Daniel Rosendo (Kerdata team), Alena Shilova (Cepage team).
LNCC: Fabio Porto, Kary Ocaña, Luiz Gadelha. PhD students: Anderson Chaves, Gabriel Machado, Maria Luiza Modelli. MSc student: Rafael Pereira.
COPPE/UFRJ: Marta Mattoso, Alvaro Coutinho. PhD students: Debora Pina, Liliane Kunstmann Neves, Gabriel Barros, Romulo Silva, Renan Souza (postdoc).
UFF: Daniel de Oliveira, Aline Paes, Yuri Frota. PhD students: Marcello Willians Messina, Carlos Gracioli, Raama Costa, Luiz Gustavo Dias, Maria Luiza Falci.
CEFET-RJ: Eduardo Ogasawara, Rafaelli Coutinho. PhD students: Heraldo Borges, Rebecca Salles, Lais Baroni. MSc student: Antonio Castro Jr.