Seminar by Khadidja Meguelati “Massively Distributed Clustering via Dirichlet Process Mixture” 9 March 2020

Zenith seminar: 9 March 2020, 14h
Campus St Priest, BAT5, 03.124
Massively Distributed Clustering via Dirichlet Process Mixture
Khadidja Meguelati
Zenith, Inria & LIRMM
Unsupervised classification (or clustering) aims to identify meaningful classes in data. It is widely used in many applications such as marketing, pattern recognition, data analysis, and image processing. Determining the optimal number of clusters in a dataset is a fundamental challenge that has opened many research directions, and multiple methods have been proposed to address it.
The Dirichlet Process Mixture (DPM) is used for clustering because it determines the number of clusters automatically, but the computation time it requires is generally too high, which hinders its adoption and makes its centralized versions inefficient.
We address the problem of parallelizing the Dirichlet process mixture in order to improve its performance by exploiting massively distributed environments. According to the literature, distributing the DPM algorithm raises several issues, such as load balancing between computing nodes, communication costs, and taking full advantage of the properties of the DPM.
We propose two new approaches for parallel clustering via DPM. First, we propose DC-DPM (Distributed Clustering via Dirichlet Process Mixture), a parallelized version that makes it possible to cluster millions of data points, which is a real challenge. Our experiments, on both synthetic and real data, illustrate the performance of our approach. By comparison, the centralized algorithm does not scale: its response time is more than 7 hours on a dataset of 100K points, whereas our approach takes less than 30 seconds.
Second, we address the problem of data dimensionality, which becomes a major challenge because of the numerical and theoretical obstacles that arise in high dimension. We propose HD4C (Distributed Dirichlet Clustering for High-Dimensional Data), a parallel clustering solution that addresses dimensionality in two ways. First, it scales to massive data by exploiting distributed architectures. Second, it clusters high-dimensional data such as time series (as a function of time), hyperspectral data (as a function of wavelength), etc. We carried out extensive experiments on synthetic and real datasets to confirm the effectiveness of our solution.
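
To give an intuition for how a Dirichlet process mixture can leave the number of clusters open, here is a minimal sketch of the Chinese restaurant process, a standard way to describe the prior over partitions behind a DPM. It is not the DC-DPM or HD4C algorithm: it samples from the prior only, whereas a full DPM sampler also weights each choice by how well the point fits the cluster. The function name and the parameter `alpha` (the concentration parameter) are illustrative choices, not taken from the talk.

```python
import random


def chinese_restaurant_process(n_points, alpha, seed=0):
    """Sample a partition of n_points items from a CRP prior.

    alpha is the concentration parameter: larger values make
    new clusters more likely.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index chosen for each point
    cluster_sizes = []  # current number of points in each cluster
    for _ in range(n_points):
        # An existing cluster is chosen with weight equal to its size,
        # a brand-new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    _, sizes = chinese_restaurant_process(n_points=1000, alpha=1.0)
    print("number of clusters:", len(sizes))
```

The number of clusters is never fixed in advance; under this prior it grows roughly logarithmically with the number of points.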

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-khadidja-meguelati-clustering-massivement-distribue-via-melange-de-processus-de-dirichlet-9-march-2020/

Seminar by Patrick Valduriez “Innovation: startup strategies” 19 March 2020 ** postponed

**Postponed to June

Zenith seminar: 19 March 2020, 10h30
Campus Saint Priest, BAT5, 01.124

Innovation: startup strategies
Patrick Valduriez
Inria and LIRMM, Univ. Montpellier, France

Technological innovation as driven by startups is hard to formalize (and manage) as the context may be unknown or quickly changing. To be successful, the innovation process involves not only inventions (new methods) but also context, e.g. user behavior, and timing, e.g. market readiness. In this talk, I illustrate various innovation strategies based on startup success stories, in particular LeanXcale, which delivers a new generation HTAP DBMS product. I also give hints to promote innovation within startups.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-patrick-valduriez-innovation-startup-strategies-19-march-2020/

The book “Principles of Distributed Database Systems – Fourth Edition” is now online.

The book Principles of Distributed Database Systems – Fourth Edition (700 pages, Springer), co-authored with Prof. Tamer Özsu (University of Waterloo), is now online, with a major revision of previous chapters and the addition of new material on big data, NoSQL, NewSQL, polystores, web data integration, and blockchain.

The paper version is also available at various online stores (Amazon, …).

Permanent link to this article: https://team.inria.fr/zenith/the-book-principles-of-distributed-database-systems-fourth-edition-is-now-online/

Zenith winner at the Global PyTorch Summer Hackathon 2019

Antoine Liutkus and Fabian Stoter won second place at the Global PyTorch Summer Hackathon 2019, organized by Facebook, with the open-unmix software. See the demo here.

Permanent link to this article: https://team.inria.fr/zenith/zenith-second-at-the-global-pytorch-summer-hackaton-2019/

French-African online seminar by Patrick Valduriez “Blockchain 2.0: opportunities and risks”, 13 Nov. 2019

French-African online seminar of LIRIMA

Broadcast by the Agence universitaire de la Francophonie (AUF) and Inria

Salle Métivier, Inria Rennes, 13 Nov. 2019 at 16h

Blockchain 2.0: opportunities and risks
Patrick Valduriez

Inria and LIRMM, Université de Montpellier

Permanent link to this article: https://team.inria.fr/zenith/seminaire-en-ligne-franco-africain-par-patrick-valduriez-blockchain-2-0-opportunites-et-risques-13-nov-2019/

Inaugural lecture by Esther Pacitti: “Data Processing: an evolutionary and multidisciplinary perspective”, CEFET/RJ, Rio de Janeiro on 12 August 2019

Inaugural lecture by Esther Pacitti
Graduate Program in Computer Science, CEFET/RJ, Rio de Janeiro
12 August 2019, 10:00, Auditorium 5, Maracanã campus

Data Processing: an evolutionary and multidisciplinary perspective
E. Pacitti
Inria and LIRMM, Montpellier, France

The inaugural lecture will address the growth in the amount and variety of data (images, audio, matrices, text, etc.) produced in various areas (social networks, agronomy, botany, medicine, and others), which has increased the technological and research challenges of processing this large volume of data, known as Big Data.

In the lecture, Professor Esther Pacitti will present a vision of the evolution of data processing methods from relational databases, distributed databases, and big data to data science. She will also present some specific applications in agronomy, botany, and seismology, and share research experiences in France and Brazil.

Permanent link to this article: https://team.inria.fr/zenith/inaugural-lecture-by-esther-pacitti-data-processing-an-evolutionary-and-multidisciplinary-perspective-cefet-rj-rio-de-janeiro-on-12-august-2019/

New book on Data-Intensive Workflow Management, May 2019

Release of the new book:
Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments.
Synthesis Lectures on Data Management

by Daniel de Oliveira (Universidade Federal Fluminense, Brazil), Ji Liu & Esther Pacitti (University of Montpellier, Inria & CNRS, France)
May 2019, 179 pages, Morgan & Claypool Publishers.
(https://doi.org/10.2200/S00915ED1V01Y201904DTM060)

Permanent link to this article: https://team.inria.fr/zenith/new-book-on-data-intensive-workflow-management-may-2019/

Seminar by Patrick Valduriez at Inria Lille “The Case for Hybrid Transaction Analytical Processing”, 17 May 2019

Seminar by Patrick Valduriez (Inria) at Inria, Lille
17 May, 10:30 – Amphi B – Inria – Bat B

The Case for Hybrid Transaction Analytical Processing
P. Valduriez
Inria and LIRMM, Montpellier, France

Abstract. Hybrid Transaction Analytical Processing (HTAP) is poised to revolutionize data management. By providing online analytics over operational data, HTAP systems open up new opportunities in many application domains where real-time decision-making is critical. Important use cases are proximity marketing, real-time pricing, risk monitoring, real-time fraud detection, etc. HTAP also simplifies data management by removing the traditional separation between operational database and data warehouse/data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL, …).
In this talk, I will introduce HTAP systems and illustrate them with LeanXcale, a new-generation HTAP DBMS that provides ultra-scalable transactions, big data analytics, SQL/JSON support, and polystore capabilities.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-patrick-valduriez-at-inria-lille-the-case-for-hybrid-transaction-analytical-processing-17-may-2019/

Zenith seminar: Hervé Bredin “Neural speaker diarization” 13 May 2019

Zenith seminar: 13/05/2019, 10h30

Campus Saint Priest, BAT5-02.124

Neural speaker diarization

Hervé Bredin (CNRS, LIMSI)

Speaker diarization is the task of determining “who speaks when” in an audio stream. It is an enabling technology for multiple downstream applications such as meeting transcription or indexing of ever-growing audio-visual archives.

Speaker diarization workflows usually consist of four consecutive tasks: speech activity detection, speaker change detection, speech turn clustering, and re-segmentation.

Recent advances in deep learning led to major improvements in multiple domains such as computer vision or natural language processing, and speaker diarization is no exception to the rule. In this talk, I will discuss our recent progress towards end-to-end neural speaker diarization (including speech and overlap detection with recurrent neural networks, and triplet loss for speaker embedding).
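
As a rough illustration of the triplet loss mentioned above, the sketch below shows its generic form: an anchor embedding should be closer to an embedding from the same speaker (positive) than to one from a different speaker (negative), by at least a margin. This is a textbook formulation, not the code of the talk or of pyannote.audio; the Euclidean distance, the function names, and the default margin of 0.2 are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (a hypothetical default).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    same_speaker = [0.0, 1.0]
    other_speaker = [3.0, 4.0]
    # Well-separated triplet: the loss vanishes.
    print(triplet_loss([0.0, 0.0], same_speaker, other_speaker))
    # Swapped roles: the loss is positive and would drive training.
    print(triplet_loss([0.0, 0.0], other_speaker, same_speaker))
```

In training, such a loss is averaged over many triplets and minimized with respect to the parameters of the network that produces the embeddings.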

# References

“Tristounet: Triplet Loss for Speaker Turn Embedding.” Bredin 2017. ICASSP.

“Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks.” Yin 2017. Interspeech.

“Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization.” Yin 2018. Interspeech.

# Code

pyannote.audio: neural building blocks for speaker diarization (speech activity detection, speaker change detection, speaker embedding): github.com/pyannote/pyannote-audio

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-herve-bredin-neural-speaker-diarization-13-may-2019/

Zenith Seminar: Dennis Shasha (NYU) “Bounce Blockchain: a secure, energy-efficient permissionless blockchain”, 27 May 2019

Zenith seminar: 27/05/2019, 15h
Campus Saint Priest, BAT5-01.124

Bounce Blockchain: a secure, energy-efficient permissionless blockchain
Dennis Shasha
New York University & Inria International Chair in the Zenith team at LIRMM

Performing proof-of-work for the Bitcoin blockchain currently requires as much electricity as consumed by the country of Denmark. This enormous energy expenditure translates into higher costs for users (on the order of $US 6.00 or more per transaction) and is frankly ecologically irresponsible. Moreover, as of this writing, the Bitcoin blockchain can be subverted (i.e. forked) by a collusion attack of just a handful of data centers. This paper proposes the design of a cheap (less than $US 0.01 per transaction), essentially energy-free public blockchain called a Bounce Blockchain, which cannot be forked in any reasonable failure scenario, even most Byzantine failure scenarios.
The basic idea is to send one or more cubesats into orbit, each equipped with a hardware security module. Users would send their transactions to the cubesats, which would collect them into blocks, sign them, and send (bounce) them back to earth (and to one another). Bounce Blockchain provides scalability through sharding (transactions will be partitioned over cubesats).
Because modern hardware security modules are tamper-resistant (become inoperable if tampered with) or tamper-responsive (erase their keys if tampered with), take their keys from physical processes, and have been validated, socio-technical protocols can ensure that it is infeasible to forge the identity of a hardware security module in a cubesat with another cubesat. If, however, some cubesats are destroyed, the blockchain will continue to execute correctly though some transactions will be lost. New cubesats can be sent up in short order as they are quite cheap to launch. If, in spite of these assurances, some cubesats fail traitorously, the blockchain can survive through algorithms similar to Practical Byzantine Fault Tolerance techniques.
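
The abstract says that transactions are partitioned over cubesats but does not say how. One common way to shard, shown here purely as an assumption and not as the Bounce Blockchain design, is to hash a transaction identifier and take the result modulo the number of shards. The function name and the identifier format are hypothetical.

```python
import hashlib


def shard_for(transaction_id: str, n_cubesats: int) -> int:
    """Map a transaction to one of n_cubesats shards by hashing its id.

    Hash-based partitioning is an illustrative assumption; the talk
    abstract does not specify the partitioning rule.
    """
    if n_cubesats <= 0:
        raise ValueError("n_cubesats must be positive")
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % n_cubesats


if __name__ == "__main__":
    for tx in ("tx-0001", "tx-0002", "tx-0003"):
        print(tx, "-> cubesat", shard_for(tx, n_cubesats=4))
```

The same identifier always maps to the same shard, so any participant can recompute where a transaction belongs. A plain modulo scheme reassigns most transactions when the number of shards changes, which matters if cubesats are lost or added.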

Biography
Dennis Shasha is a Julius Silver Professor of computer science at the Courant Institute of New York University and an Associate Director of NYU Wireless. He works on meta-algorithms for machine learning to achieve guaranteed correctness rates; with biologists on pattern discovery for network inference; with physicists and financial people on algorithms for time series; on computational reproducibility; and on energy-efficient blockchains. Other areas of interest include database tuning as well as tree and graph matching. Because he likes to type, he has written six books of puzzles about a mathematical detective named Dr. Ecco, a biography about great computer scientists, and a book about the future of computing. He has also written five technical books about database tuning, biological pattern recognition, time series, DNA computing, resampling statistics, and causal inference in molecular networks. He has written the puzzle column for various publications including Scientific American, Dr. Dobb’s Journal, and currently the Communications of the ACM. He is a fellow of the ACM and an Inria International Chair in the Zenith team at LIRMM.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-dennis-shasha-nyu-bounce-blockchain-a-secure-energy-efficient-permission-less-blockchain-27-may-2019/