New book on Data-Intensive Workflow Management, May 2019

Release of the new book:
Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments.
Synthesis Lectures on Data Management

by Daniel de Oliveira (Universidade Federal Fluminense, Brazil), Ji Liu & Esther Pacitti (University of Montpellier, Inria & CNRS, France)
May 2019, 179 pages, Morgan & Claypool Publishers.
(https://doi.org/10.2200/S00915ED1V01Y201904DTM060)

Permanent link to this article: https://team.inria.fr/zenith/new-book-on-data-intensive-workflow-management-may-2019/

Seminar by Patrick Valduriez at Inria Lille “The Case for Hybrid Transaction Analytical Processing”, 17 May 2019

Seminar by Patrick Valduriez (Inria) at Inria, Lille
17 May, 10:30 – Amphi B – Inria – Bat B

The Case for Hybrid Transaction Analytical Processing
P. Valduriez
Inria and LIRMM, Montpellier, France

Abstract. Hybrid Transaction Analytical Processing (HTAP) is poised to revolutionize data management. By providing online analytics over operational data, HTAP systems open up new opportunities in many application domains where real-time decision making is critical. Important use cases include proximity marketing, real-time pricing, risk monitoring, and real-time fraud detection. HTAP also simplifies data management by removing the traditional separation between the operational database and the data warehouse or data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL, …).
In this talk, I will introduce HTAP systems and illustrate with LeanXcale, a new generation HTAP DBMS that provides ultra-scalable transactions, big data analytics, SQL/JSON support and polystore capabilities.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-patrick-valduriez-at-inria-lille-the-case-for-hybrid-transaction-analytical-processing-17-may-2019/

Zenith seminar: Hervé Bredin “Neural speaker diarization” 13 May 2019

Zenith seminar: 13/05/2019, 10h30

Campus Saint Priest, BAT5-02.124

Neural speaker diarization

Hervé Bredin (CNRS, LIMSI)

Speaker diarization is the task of determining “who speaks when” in an audio stream. It is an enabling technology for multiple downstream applications such as meeting transcription or indexing of ever-growing audio-visual archives.

Speaker diarization workflows usually consist of four consecutive tasks: speech activity detection, speaker change detection, speech turn clustering, and re-segmentation.

Recent advances in deep learning led to major improvements in multiple domains such as computer vision or natural language processing, and speaker diarization is no exception to the rule. In this talk, I will discuss our recent progress towards end-to-end neural speaker diarization (including speech and overlap detection with recurrent neural networks, and triplet loss for speaker embedding).
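
As a rough illustration of the triplet loss mentioned above, the sketch below uses the standard margin-based formulation on plain Python lists. The function names, the Euclidean distance, and the margin value are assumptions made for this example; they are not taken from the talk or from pyannote.audio.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (hypothetical helper, illustrative only).

    The loss is zero once the anchor is closer to the positive (same speaker)
    than to the negative (different speaker) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Training with such a loss pulls embeddings of the same speaker together and pushes those of different speakers apart, which is what makes the later clustering step workable.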

# References

“Tristounet: Triplet Loss for Speaker Turn Embedding.” Bredin 2017. ICASSP.

“Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks.” Yin 2017. Interspeech.

“Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization.” Yin 2018. Interspeech.

# Code

pyannote.audio: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding github.com/pyannote/pyannote-audio

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-herve-bredin-neural-speaker-diarization-13-may-2019/

Zenith Seminar: Dennis Shasha (NYU) “Bounce Blockchain: a secure, energy-efficient permissionless blockchain”, 27 May 2019

Zenith seminar: 27/05/2019, 15h
Campus Saint Priest, BAT5-01.124

Bounce Blockchain: a secure, energy-efficient permissionless blockchain
Dennis Shasha
New York University & Inria International Chair in the Zenith team at LIRMM

Performing proof-of-work for the Bitcoin blockchain currently requires as much electricity as the country of Denmark consumes. This enormous energy expenditure translates into higher costs for users (on the order of US$6.00 or more per transaction) and is frankly ecologically irresponsible. As of this writing, the Bitcoin blockchain can be subverted (i.e., forked) by a collusion attack of just a handful of data centers. This paper proposes the design of a cheap (less than US$0.01 per transaction), essentially energy-free public blockchain called a Bounce Blockchain, which cannot be forked in any reasonable failure scenario, including most Byzantine failure scenarios.
The basic idea is to send one or more cubesats into orbit, each equipped with a hardware security module. Users would send their transactions to the cubesats, which would collect them into blocks, sign them, and send (bounce) them back to earth (and to one another). Bounce Blockchain provides scalability through sharding (transactions will be partitioned over cubesats).
Because modern hardware security modules are tamper-resistant (become inoperable if tampered with) or tamper-responsive (erase their keys if tampered with), take their keys from physical processes, and have been validated, socio-technical protocols can ensure that it is infeasible to forge the identity of a hardware security module in a cubesat with another cubesat. If, however, some cubesats are destroyed, the blockchain will continue to execute correctly, though some transactions will be lost. New cubesats can be sent up in short order as they are quite cheap to launch. If, in spite of these assurances, some cubesats fail traitorously, the blockchain can survive through algorithms similar to Practical Byzantine Fault Tolerance techniques.
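
The sharding and block-signing steps described above can be sketched as follows. This is only a toy model under stated assumptions: the function names are hypothetical, transactions are assigned to cubesats by hashing their identifier, and an HMAC with a shared secret stands in for the signature a real hardware security module would produce (which would normally be an asymmetric signature). None of this is taken from the Bounce Blockchain design itself.

```python
import hashlib
import hmac
import json


def shard_for(tx_id, num_cubesats):
    """Assign a transaction to a cubesat by hashing its id (assumed scheme)."""
    digest = hashlib.sha256(tx_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cubesats


def sign_block(transactions, secret_key):
    """Collect transactions into a block and attach a keyed digest.

    HMAC is a stand-in for the HSM signature, used here for illustration only.
    """
    payload = json.dumps(transactions, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    return {"transactions": transactions, "signature": signature}


def verify_block(block, secret_key):
    """Recompute the keyed digest and compare it in constant time."""
    payload = json.dumps(block["transactions"], sort_keys=True).encode("utf-8")
    expected = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, block["signature"])
```

Any change to a transaction after signing makes `verify_block` fail, which is the property the ground stations would rely on when a block is bounced back.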

Biography
Dennis Shasha is a Julius Silver Professor of computer science at the Courant Institute of New York University and an Associate Director of NYU Wireless. He works on meta-algorithms for machine learning to achieve guaranteed correctness rates; with biologists on pattern discovery for network inference; with physicists and financial people on algorithms for time series; on computational reproducibility; and on energy-efficient blockchains. Other areas of interest include database tuning as well as tree and graph matching. Because he likes to type, he has written six books of puzzles about a mathematical detective named Dr. Ecco, a biography about great computer scientists, and a book about the future of computing. He has also written five technical books about database tuning, biological pattern recognition, time series, DNA computing, resampling statistics, and causal inference in molecular networks. He has written the puzzle column for various publications including Scientific American, Dr. Dobb’s Journal, and currently the Communications of the ACM. He is a fellow of the ACM and an Inria International Chair in the Zenith team at LIRMM.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-dennis-shasha-nyu-bounce-blockchain-a-secure-energy-efficient-permission-less-blockchain-27-may-2019/

Seminar by Patrick Valduriez at IBM Research, Rio de Janeiro, Brazil “The Case for Hybrid Transaction Analytical Processing”, 25 April 2019

Seminar by Patrick Valduriez (Inria) at IBM Research, Rio de Janeiro, Brazil
Chair: Renan Francisco Santos Souza (IBM Brazil)
Time: Thu, Apr 25, 2019 11:00 AM – 12:30 PM
Location: Corcovado

The Case for Hybrid Transaction Analytical Processing
P. Valduriez
Inria and LIRMM, Montpellier, France

Abstract. Hybrid Transaction Analytical Processing (HTAP) is poised to revolutionize data management. By providing online analytics over operational data, HTAP systems open up new opportunities in many application domains where real-time decision making is critical. Important use cases include proximity marketing, real-time pricing, risk monitoring, and real-time fraud detection. HTAP also simplifies data management by removing the traditional separation between the operational database and the data warehouse or data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL, …).
In this talk, I will introduce HTAP systems and illustrate with LeanXcale, a new generation HTAP DBMS that provides ultra-scalable transactions, big data analytics, SQL/JSON support and polystore capabilities.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-patrick-valduriez-at-ibm-research-rio-de-janeiro-brazil-the-case-for-hybrid-transaction-analytical-processing-25-april-2019/

Postdoc Database Engineer (2019)

Postdoc Database Engineer: query optimization

LeanXcale, Madrid, Spain

  • Career level: PostDoc
  • Keywords: Databases, Storage Engine, Query Engine, Query Optimizer
  • Supervisors: Ricardo Jiménez-Peris (LeanXcale) and Patrick Valduriez (Inria)

LeanXcale is a NewSQL company developing a scalable Hybrid Transaction Analytical Processing (HTAP) DBMS for both OLTP and OLAP workloads.

You will work with the R&D team on one or more of the subsystems of the LeanXcale database (storage engine, transactional engine, SQL query engine). Depending on the candidate's background, the work will focus on one or more of the following:

  • Extend the query engine with new functionality (add support for SQL features not yet supported, extend SQL with polyglot capabilities).
  • Work on query optimization (characterize cases where the optimizer does not select the optimal query plan and introduce rules and transformations so that it does).
  • Work on a new query optimizer using new technology.
  • Improve the functionality of the storage engine.
  • Characterize performance issues in any of the layers, redesign algorithms or subsystems to solve them, and validate the new design by means of micro-benchmarking and benchmarking.

Skills and profile:

  • Background in databases, query processing, query optimization, storage engine (either SQL or NoSQL)
  • A Ph.D. in computer science around a database topic (SQL or NoSQL)

Environment, salary, duration: The postdoc will be supervised by LeanXcale and Inria, while being located in the LeanXcale facilities in Madrid, Spain.

Net salary: up to 3300 euros per month, depending on experience.

Duration: 1 Year

Starting date: flexible but ideally as soon as possible

Contact: rjimenez@leanxcale.com or Patrick.valduriez@inria.fr

Permanent link to this article: https://team.inria.fr/zenith/postdoc-database-engineer-2019/

Zenith seminar: Youcef Djenouri “Urban traffic outlier detection”, 14 Feb 2019

Youcef Djenouri will visit the team from Feb 12 to Feb 19 to work with us on time series analytics.

He will give a talk on Feb 14 at 4pm in BAT5-02.022-JPN.

Title: “Urban traffic outlier detection”

Abstract:
In this talk, I present approaches to outlier detection in urban traffic analysis. We divide existing solutions into two main categories: flow outlier detection and trajectory outlier detection. The first category groups solutions that detect flow outliers and includes statistical, similarity-based, and pattern mining approaches. The second category contains solutions that derive trajectory outliers, including offline processing for trajectory outliers and online processing for sub-trajectory outliers. Solutions in each of these categories are described, illustrated, and discussed, and open perspectives and research trends are drawn. In this context, we can better understand the intuition, limitations, and benefits of existing urban traffic outlier detection algorithms. As a result, practitioners can receive some guidance for selecting the most suitable methods for their particular case.
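
To make the "statistical" family of flow outlier detectors concrete, here is a minimal z-score sketch. It is a generic textbook baseline, not a method from the talk; the function name, the per-interval vehicle counts, and the threshold of 2.0 are all assumptions chosen for illustration.

```python
import statistics


def flow_outliers(counts, threshold=2.0):
    """Return indices of intervals whose traffic count deviates strongly.

    `counts` is an assumed list of vehicle counts per time interval; an
    interval is flagged when its absolute z-score exceeds `threshold`.
    """
    mean = statistics.fmean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # perfectly constant flow: nothing to flag
    return [i for i, c in enumerate(counts) if abs(c - mean) / stdev > threshold]
```

With short series a single extreme value inflates the standard deviation, so the threshold has to be chosen with the series length in mind; robust statistics such as the median are a common alternative.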

About Youcef Djenouri: 
Youcef Djenouri received his Ph.D. degree in computer engineering from the University of Science and Technology Houari Boumediene, Algiers, Algeria, in 2014. From 2014 to 2015, he was a permanent Teacher-Researcher with the University of Blida, Algeria. In 2016, he worked on the BPM project supported by UNIST, South Korea. In 2017, he joined the University of Southern Denmark as a Postdoctoral Researcher, where he focused on urban traffic data analysis. He is now with the Norwegian University of Science and Technology, Trondheim, Norway, where he is funded by the European Research Consortium on Informatics and Mathematics. He works on topics related to artificial intelligence and data mining, with a focus on time series analysis, frequent pattern mining, parallel computing, and evolutionary algorithms. He has been granted short-term research visits to many renowned universities including ENSMEA, Poitiers; University of Poitiers; and University of Lorraine. He has published more than 50 journal and conference papers, two book chapters, and one tutorial paper. Some of his selected papers are published in leading journals and conferences including ACM Computing Surveys, IEEE Intelligent Systems, IEEE Access, Information Sciences, ICDM, and PAKDD.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-youcef-djenouri-urban-traffic-outlier-detection-14-feb-2019/

CIFRE PhD thesis, Ina and Inria: “Large-scale deep learning for building knowledge bases and enhancing the value of archives”

Topic

The growing number of audiovisual programs to archive imposes new productivity constraints on documentation. Developing automatic and semi-automatic tools to assist archivists is now essential to make the best use of the very large amount of available information. In recent years, techniques for indexing and analyzing visual or audio content have thus emerged, making it possible to model high-level information such as faces, speakers, monuments, logos, sets, song titles, etc. Modeling consists of building visual representations of the entities with which one wishes to annotate multimedia archives. Modeling processes rely on unsupervised, supervised, or sometimes weakly supervised learning methods.

With the rise of convolutional neural networks in recent years, ad hoc (“hand-crafted”) visual representations are gradually being replaced by deep learning representations learned from training data dedicated to the target annotation task. These supervised learning strategies, which go from the signal (pixels) to the classes or entities within a single formalism, have achieved very high performance for object recognition in images.

These methods nevertheless have two major limitations when it comes to large-scale professional documentation. First, they operate in a closed world, that is, with a fixed number of classes known in advance. At Ina, it is essential to operate in an open world because, at any time:

  • users may want to create new classes,
  • and the prediction system may be queried with images that do not belong to the training set, which is essential to detect.

Second, to date these methods cannot be used efficiently in active and incremental learning processes such as relevance feedback or annotation propagation. Yet these dynamic and interactive modes of operation are indispensable for deployment in a professional setting. Ina has dozens of archivists whose mission is to annotate video documents. It is essential that these archivists be able to interact with the recognition system and that the system be sufficiently responsive.

More formally, the core of the thesis will be to tackle the problems of multi-label active learning and novelty detection in the context of deep learning of visual representations. This will require overcoming obstacles related to scaling up methods based on deep models.
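
To illustrate what detecting out-of-training-set inputs can look like in the open-world setting described above, the sketch below applies a very simple baseline: reject a prediction when the largest softmax probability falls below a threshold. This is an assumed, generic baseline shown only for intuition; it is not the approach of the thesis, and the function names, class names, and threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_world(logits, class_names, threshold=0.7):
    """Return the predicted class, or None when the model is not confident.

    A None result flags the input as possibly novel (illustrative rule only).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None
    return class_names[best]
```

Inputs rejected this way are natural candidates to show to an archivist, which is where novelty detection meets the active learning loop.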

Supervision and context

The thesis will be supervised by Alexis Joly (HDR, Inria, https://scholar.google.fr/citations?user=kbpkTGgAAAAJ&hl=fr&oi=ao) and Olivier Buisson (Dr, Ina, https://scholar.google.fr/citations?user=rWunhTEAAAAJ&hl=fr). It continues more than 10 years of collaboration. Notably, two CIFRE theses were already defended in 2013 and 2016 under their co-supervision. In addition, an R&D platform named Snoop has been co-developed. It is currently being tested within Ina and is also used for the plant recognition application PlantNet (http://identify.plantnet-project.org).

The institutional partners in this thesis, the Zenith team at Inria and Ina, have solid experience in multimedia data analysis and scalability, and will bring complementary skills to the topic. Zenith's work focuses on the management, analysis, and retrieval of information in very large heterogeneous data. At Ina, the PhD student will join the Research and Innovation department, which covers all research topics related to audiovisual archiving.

Application

Send the following documents by email, in PDF format, to thcand@ina.fr:

  • CV,
  • cover letter focused on the topic,
  • at least two letters of recommendation,
  • transcripts and the list of courses taken in M2 and M1.

Position details

Start: during 2019, as soon as the CIFRE application is accepted by the ANRT.

Salary: €36,000 gross over 13 months.

Location: Ina (Institut national de l’audiovisuel), Bry-sur-Marne.

Permanent link to this article: https://team.inria.fr/zenith/these-cifre-ina-et-inria-apprentissage-profond-deep-learning-a-large-echelle-pour-la-creation-de-bases-de-connaissances-et-la-valorisation-darchives/

Zenith seminar: Renan Souza “Providing Online Data Analytical Support for Humans in the Loop of Computational Science and Engineering Applications”, 15 Jan 2019

Zenith seminar: 15/01/19, 15h – BAT5-02.124

Providing Online Data Analytical Support for Humans in the Loop of Computational Science and Engineering Applications

Renan Souza (IBM Research Brazil and UFRJ, Rio de Janeiro)

Abstract. Computational scientists and engineers analyze complex and big data during the execution of long-lasting data processing workflows on parallel machines. Depending on the results, they may need to steer the workflows by adapting predefined input data or settings. Being able to analyze the resulting data online, knowing that certain results may have been directly influenced by specific actions they took, is of paramount importance for result interpretability, reuse, and reproducibility. However, three major challenges hinder such analysis: online analytical support, user steering tracking, and efficient performance. In this talk, I will focus on online analytical support, particularly for problems that require integrated data analysis by multi-workflows. Multi-workflows are distributed and parallel workflows that process data in heterogeneous data stores (e.g., DBMSs with various data models or raw data files) and share data dependencies. Such heterogeneity makes online analytical support even more challenging. We propose a solution to capture workflow provenance and domain data online to provide an integrated view over the data stores. We explore a real case study composed of four workflows that preprocess data for a Deep Learning classifier for Oil and Gas exploration. We show that our solution allows users to run online integrated data analysis of the multi-workflow data. Also, for certain scenarios, our solution is two orders of magnitude faster than a state-of-the-art solution.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-renan-souza-15-jan-2019/

Post-doc position: A/B testing guided clustering

Amadeus ( https://amadeus.com/en ) and the Zenith team of Inria ( https://team.inria.fr/zenith/ ) are seeking a postdoctoral fellow in A/B testing, clustering and time series analytics.

Title: A/B testing guided clustering

Description:

The post-doc position is part of a new partnership between Amadeus and Inria. It is linked to Amadeus’ development of intelligent, evolving flight recommendation search for online travel agencies (OTAs). The general principle is to choose recommendations by optimizing several criteria simultaneously (price, duration of the trip, number of stops, etc.). Each flight recommendation is associated with a score defined as a linear combination of the criteria, each multiplied by a weight. The weights therefore define how important each criterion is. To adapt the importance of the criteria to the profile of the user, user queries are segmented by means of unsupervised classification (clustering). Weight values are optimized independently on each segment by maximizing the estimated reservation probability of the returned flight recommendations. Thus, a set of weights is associated with each user profile, called a segment. Large volumes of data are used during the weight creation process, especially during the segmentation phase. The ability of the flight recommendation search system to increase the conversion rate is evaluated using A/B test campaigns.
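
The per-segment scoring described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the segment names, the criterion names, the normalization of criteria to the range 0 to 1, and the negative weights (so that cheaper, shorter, more direct flights score higher). It is not Amadeus' actual scoring function.

```python
def score(criteria, weights):
    """Score of one recommendation: linear combination of criteria and weights."""
    return sum(weights[name] * value for name, value in criteria.items())


def rank(recommendations, segment_weights, segment):
    """Order recommendations by descending score for the given user segment."""
    weights = segment_weights[segment]
    return sorted(
        recommendations,
        key=lambda rec: score(rec["criteria"], weights),
        reverse=True,
    )


# Hypothetical weights for two hypothetical segments.
SEGMENT_WEIGHTS = {
    "business": {"price": -0.2, "duration": -0.7, "stops": -0.1},
    "leisure": {"price": -0.8, "duration": -0.1, "stops": -0.1},
}

# Hypothetical flights with criteria normalized to [0, 1].
FLIGHTS = [
    {"id": "A", "criteria": {"price": 0.9, "duration": 0.2, "stops": 0.0}},
    {"id": "B", "criteria": {"price": 0.3, "duration": 0.8, "stops": 1.0}},
]
```

With these made-up numbers, the expensive direct flight A ranks first for the "business" segment while the cheaper flight B ranks first for "leisure", which shows how the same criteria yield different orderings once the weights depend on the segment.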

The expected work in this postdoc position comprises two complementary topics:
1. optimizing the planning of A/B test campaigns,
2. developing incremental methods for adapting the flight search segmentation based on the results of A/B tests.

The objective of the first topic is to improve the use of A/B tests in order to draw conclusions as quickly and as safely as possible, and to know at each stage the uncertainty about the results of the A/B test.
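
One simple way to quantify "the uncertainty about the results" at a given stage is a confidence interval on the difference in conversion rates between the two variants. The sketch below uses the usual normal approximation for two proportions; the function name and its parameters are assumptions for this example, and the method is a textbook baseline rather than the approach to be developed in the position.

```python
import math


def conversion_diff_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Difference in conversion rate (B minus A) with an approximate interval.

    conv_* are numbers of bookings, n_* numbers of searches; z=1.96 gives a
    roughly 95% interval under the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * std_err, diff + z * std_err
```

Checking such a fixed-level interval repeatedly while data accumulates inflates the false positive rate, which is one reason the planning of test campaigns is a research question in its own right.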

The second topic is directly related to the first, since it uses the A/B test results obtained on each segment to improve the segmentation. The initial idea is to develop an incremental clustering algorithm in which phases of search segmentation and A/B testing alternate.

About Amadeus
Amadeus builds the critical solutions that help airlines and airports, hotels and railways, search engines, travel agencies, tour operators and other travel players to run their operations and improve the travel experience, billions of times a year, all over the world.

About Zenith
The Zenith project-team, headed by Patrick Valduriez, aims to propose new solutions related to scientific data and activities. Our research topics incorporate the management and analysis of massive and complex data, such as uncertain data, in highly distributed environments.

Skills and profile:

– Background in data mining / data analytics
– A Ph.D. in computer science or mathematics

Environment, salary, duration:

The postdoc will be supervised by Amadeus and Inria, while being located in the Amadeus facilities in Sophia Antipolis.

Net salary: up to 3300 euros per month, depending on experience.
Duration: 1 Year
Starting date: flexible but ideally as soon as possible.

Contact:

Nicolas Maillot ( nicolas.maillot@amadeus.com )
Florent Masseglia ( florent.masseglia@inria.fr )

 

Permanent link to this article: https://team.inria.fr/zenith/post-doc-position-a-b-testing-guided-clustering/