RISC2: New European H2020 project (2021-2022) between Europe and Latin America in HPC

The RISC2 project is a coordination network for High Performance Computing (HPC) between Europe and Latin America, funded by the European H2020 FETHPC program and the partner countries. It is managed by the Barcelona Supercomputing Center and brings together eight main European HPC actors, including three Inria teams (Nachos, Seism and Zenith) and Atos Bull, as well as the main HPC actors from Brazil (including LNCC), Mexico, Argentina, Colombia, Uruguay, Costa Rica and Chile.

Permanent link to this article: https://team.inria.fr/zenith/risc2-new-european-h2020-project-2021-2022-between-europe-and-latin-america-in-hpc/

DEXA 2020 best paper award by Gaëtan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu, and Patrick Valduriez

The paper “Distributed Caching of Scientific Workflows in Multisite Cloud” by Gaëtan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu, and Patrick Valduriez received the best paper award at the 31st International Conference on Database and Expert Systems Applications (DEXA), Springer, Sep 2020. The work was done in collaboration with CIRAD and INRAE, in the context of the #Digitag project, and with Brazil, in the context of the HPDaSc Inria associated team.

Permanent link to this article: https://team.inria.fr/zenith/dexa-2020-best-paper-award/

SBBD 2020 tutorial by Patrick Valduriez “Principles of Distributed Database Systems: spotlight on NewSQL” 29 September 2020.

Tutorial at SBBD 2020
https://sbbd.org.br/2020/tutorial-3/
29 September 2020, 14h-16h30

Principles of Distributed Database Systems: spotlight on NewSQL
Patrick Valduriez
Inria, University of Montpellier, CNRS, LIRMM, France
LeanXcale, Spain

The first edition of the book Principles of Distributed Database Systems, co-authored with Prof. Tamer Özsu (University of Waterloo), appeared in 1991, when the technology was new and there were few products. In the Preface to the first edition, we quoted Michael Stonebraker, who claimed in 1988 that in the following 10 years, centralized DBMSs would be an “antique curiosity” and most organizations would move towards distributed DBMSs. That prediction has certainly proved to be correct, and most systems in use today are either distributed or parallel.

The fourth edition of this classic textbook [Özsu & Valduriez 2020] provides major updates, in particular, new chapters on big data platforms, NoSQL, NewSQL and polystores. In this tutorial, we introduce these major updates, with a focus on NewSQL.

NewSQL is the latest technology in the big data management landscape, enjoying rapid growth in the DBMS and BI markets. NewSQL combines the scalability and availability of NoSQL with the consistency and usability of SQL. By providing online analytics over operational data, NewSQL opens up new opportunities in many application domains where real-time decision making is critical. Important use cases are eAdvertisement (such as Google AdWords), IoT, performance monitoring, proximity marketing, risk monitoring, real-time pricing, real-time fraud detection, etc. NewSQL may also simplify data management by removing the traditional separation between NoSQL and SQL (ingest data fast, query it with SQL), as well as between operational database and data warehouse / data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical (HTAP) workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL). Today, only a few NewSQL systems have solved this problem.

A first in-depth presentation of NewSQL was given in a tutorial at IEEE Big Data 2019 with Prof. Ricardo Jimenez-Peris (CEO and founder at LeanXcale) [Valduriez & Jimenez-Peris 2019]. In this tutorial, we provide a taxonomy of NewSQL systems based on major dimensions including targeted workloads, capabilities and implementation techniques. We illustrate it with popular NewSQL systems such as Google Spanner, LeanXcale, CockroachDB, SAP HANA, MemSQL and Splice Machine, and put the spotlight on some of the more advanced systems. We also compare them with major NoSQL and SQL systems, and discuss integration within big data ecosystems and corporate information systems using polystores. Finally, we discuss current trends and research directions.

References

[Özsu & Valduriez 2020] Tamer Özsu, Patrick Valduriez. Principles of Distributed Database Systems, 4th Edition, Springer, 2020.

[Valduriez & Jimenez-Peris 2019] Patrick Valduriez, Ricardo Jimenez-Peris. NewSQL: principles, systems and current trends. IEEE Big Data Conference, Los Angeles, December 2019.

Permanent link to this article: https://team.inria.fr/zenith/2381-2/

New France-Brazil research partnership: Inria and LNCC sign Memorandum of Understanding, 2 July 2020.

Inria and LNCC, the Brazilian National Scientific Computing Laboratory, signed a Memorandum of Understanding to strengthen their collaboration in High Performance Computing, Big Data and Artificial Intelligence. The collaboration is headed by Frédéric Valentin (LNCC, Inria International Chair) and Patrick Valduriez.

Permanent link to this article: https://team.inria.fr/zenith/new-france-brazil-research-partnership-inria-and-lncc-sign-memorandum-of-understanding-2-july-2020/

Seminar by Khadidja Meguelati “Massively Distributed Clustering via Dirichlet Process Mixture” 9 March 2020

Zenith seminar: 9 March 2020, 14h
Campus St Priest, BAT5, 03.124
Massively Distributed Clustering via Dirichlet Process Mixture
Khadidja Meguelati
Zenith, Inria & LIRMM
Unsupervised classification (or clustering) aims to identify relevant classes in data. It is widely used in many applications such as marketing, pattern recognition, data analysis and image processing. Determining the optimal number of clusters in a dataset is a fundamental challenge that has opened many research directions, and multiple methods have been proposed to address it.
The Dirichlet Process Mixture (DPM) is used for clustering because it determines the number of classes automatically, but the computation time it requires is generally too high, which hinders its adoption and makes its centralized versions inefficient.
We address the problem of parallelizing the Dirichlet process mixture to improve its performance by exploiting massively distributed environments. According to the literature, distributing the DPM algorithm raises several problems, such as load balancing between compute nodes, communication costs, and taking full advantage of the properties of DPM.
We propose two new approaches for parallel clustering via DPM. First, we propose DC-DPM (Distributed Clustering via Dirichlet Process Mixture), a parallelized version that enables the clustering of millions of data points, which is a real challenge. Our experiments, on both synthetic and real data, illustrate the performance of our approach. By comparison, the centralized algorithm does not scale: its response time is more than 7 hours on a dataset of 100K points, whereas our approach takes less than 30 seconds.
Second, we address the problem of data dimensionality, which becomes an important challenge because of the numerical and theoretical obstacles it raises. We propose HD4C (Distributed Dirichlet Clustering for High-Dimensional Data), a parallel clustering solution that addresses dimensionality in two ways. First, it scales to massive data by exploiting distributed architectures. Second, it clusters high-dimensional data such as time series (as a function of time), hyperspectral data (as a function of wavelength), etc. We carried out extensive experiments on synthetic and real datasets to confirm the effectiveness of our solution.
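
As a rough illustration of why a Dirichlet process mixture does not need the number of clusters fixed in advance, here is a minimal sketch of the Chinese restaurant process, the standard prior on partitions underlying a DPM. It is not the DC-DPM or HD4C algorithm from the talk: it samples cluster labels from the prior only, with no data likelihood and no distribution across nodes. The function name crp_assignments and the parameter values are assumptions chosen for illustration.

```python
import random


def crp_assignments(n_points, alpha, seed=0):
    """Sample cluster labels from a Chinese restaurant process (CRP).

    Each new point joins existing cluster k with probability proportional
    to that cluster's current size, or opens a new cluster with probability
    proportional to the concentration parameter alpha (must be > 0).
    """
    rng = random.Random(seed)
    sizes = []   # sizes[k] = number of points currently in cluster k
    labels = []  # labels[i] = cluster index assigned to point i
    for _ in range(n_points):
        # The last weight stands for "open a new cluster".
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels, sizes


labels, sizes = crp_assignments(n_points=1000, alpha=2.0)
print(len(sizes), "clusters; sizes:", sorted(sizes, reverse=True))
```

The number of clusters is an outcome of the sampling rather than an input; a larger alpha tends to produce more clusters. A full DPM sampler would additionally weight each choice by how well the point fits the cluster, which is the costly step that motivates parallelization.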

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-khadidja-meguelati-clustering-massivement-distribue-via-melange-de-processus-de-dirichlet-9-march-2020/

Seminar by Patrick Valduriez “Innovation: startup strategies” 19 March 2020 ** postponed

**Postponed to June

Zenith seminar: 19 March 2020, 10h30
Campus Saint Priest, BAT5, 01.124

Innovation: startup strategies
Patrick Valduriez
Inria and LIRMM, Univ. Montpellier, France

Technological innovation as driven by startups is hard to formalize (and manage), as the context may be unknown or quickly changing. To be successful, the innovation process involves not only inventions (new methods) but also context, e.g. user behavior, and timing, e.g. market readiness. In this talk, I illustrate various innovation strategies based on startup success stories, in particular LeanXcale, which delivers a new-generation HTAP DBMS product. I also give hints on how to promote innovation within startups.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-patrick-valduriez-innovation-startup-strategies-19-march-2020/

The book “Principles of Distributed Database Systems – Fourth Edition” is now online.

The book Principles of Distributed Database Systems – Fourth Edition (700 pages, Springer), co-authored with Prof. Tamer Özsu (University of Waterloo), is now online, with a major revision of previous chapters and the addition of new material on big data, NoSQL, NewSQL, polystores, web data integration and blockchain.

The paper version is also available at various online stores (Amazon, …).

Permanent link to this article: https://team.inria.fr/zenith/the-book-principles-of-distributed-database-systems-fourth-edition-is-now-online/

Zenith winner at the Global PyTorch Summer Hackathon 2019

Antoine Liutkus and Fabian Stöter won second place at the Global PyTorch Summer Hackathon 2019, organized by Facebook, with the open-unmix software. See the demo here.

Permanent link to this article: https://team.inria.fr/zenith/zenith-second-at-the-global-pytorch-summer-hackaton-2019/

French-African online seminar by Patrick Valduriez “Blockchain 2.0: opportunities and risks”, 13 Nov. 2019

French-African online seminar of LIRIMA

Broadcast by the Agence universitaire de la Francophonie (AUF) and Inria

Salle Métivier, Inria Rennes, 13 Nov. 2019, 16h

Blockchain 2.0: opportunities and risks
Patrick Valduriez

Inria and LIRMM, Université de Montpellier

Permanent link to this article: https://team.inria.fr/zenith/seminaire-en-ligne-franco-africain-par-patrick-valduriez-blockchain-2-0-opportunites-et-risques-13-nov-2019/

Inaugural lecture by Esther Pacitti: “Data Processing: an evolutionary and multidisciplinary perspective”, CEFET/RJ, Rio de Janeiro on 12 August 2019

Inaugural lecture by Esther Pacitti
Graduate Program in Computer Science, CEFET/RJ, Rio de Janeiro
12 August 2019, 10:00, Auditorium 5, Maracanã campus

Data Processing: an evolutionary and multidisciplinary perspective
E. Pacitti
Inria and LIRMM, Montpellier, France

The inaugural lecture will address the growth in the amount and variety of data (images, audio, matrices, text, etc.) produced in various areas (social networks, agronomy, botany, medicine and others), which has also increased the technological and research challenges in processing this large volume of data, known as Big Data.

In the lecture, Professor Esther Pacitti will present a vision of the evolution of data processing methods from relational databases, distributed databases, and big data to data science. She will also present some specific applications in Agronomy, Botany and Seismology, and share research experiences in France and Brazil.

Permanent link to this article: https://team.inria.fr/zenith/inaugural-lecture-by-esther-pacitti-data-processing-an-evolutionary-and-multidisciplinary-perspective-cefet-rj-rio-de-janeiro-on-12-august-2019/