Software

Data Search

Pl@ntNet (2012 -)

Pl@ntNet is a participatory platform and information system dedicated to the collection of botanical data through deep learning-based plant identification. It includes three main front-ends: an Android app (the most advanced and most used one), an iOS app (currently being redeveloped) and a web version. The main feature of the application is to return a ranked list of the most likely species given an image or a set of images of an individual plant. In addition, Pl@ntNet’s search engine returns the images of the dataset that are most similar to the queried observation, allowing interactive validation by users. The back office running on the server side of the platform is based on the Snoop visual search engine (software developed jointly by ZENITH and INA) and on NewSQL technologies for data management. The application is distributed in more than 200 countries (20M downloads) and currently identifies about 30K plant species.

The PlantGame (2016 -)

The Plant Game is a participatory game whose purpose is the production of big taxonomic data to improve our knowledge of biodiversity. One major contribution is the active training of users, based on innovative sub-task creation and assignment processes that adapt to the increasing skills of each user. Thousands of players are registered and together produce tens of new validated plant observations per day on average. The accuracy of the produced taxonomic tags is very high (about 95%), which is impressive given that a majority of users are beginners when they start playing.

Data Analytics

Chiaroscuro (2015 -)

Chiaroscuro is a complete solution for clustering personal data with strong privacy guarantees. The execution sequence produced by Chiaroscuro is massively distributed on personal devices, coping with arbitrary connections and disconnections. Chiaroscuro builds on our novel data structure, called Diptych, which allows the participating devices to collaborate privately by combining encryption with differential privacy. Our solution yields a high clustering quality while minimizing the impact of the differentially private perturbation.
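As a rough illustration of how a differentially private perturbation enters a clustering step, the sketch below computes a noisy centroid for one cluster with the standard Laplace mechanism. It is not Chiaroscuro's protocol or the Diptych data structure: the encryption layer and the distribution over devices are left out, and the function names, the clipping range [0, 1] and the even split of the privacy budget are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_centroid(values, epsilon, rng):
    """Differentially private mean of 1-D values (hypothetical helper).

    Values are clipped to [0, 1], so adding or removing one record changes
    the sum by at most 1 and the count by exactly 1 (sensitivity 1 each).
    """
    clipped = [min(max(v, 0.0), 1.0) for v in values]
    eps_each = epsilon / 2.0  # assumed: half the budget for the sum, half for the count
    noisy_sum = sum(clipped) + laplace_noise(1.0 / eps_each, rng)
    noisy_count = len(clipped) + laplace_noise(1.0 / eps_each, rng)
    # Guard against a tiny or negative noisy count.
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(0)
cluster = [0.2, 0.4, 0.6] * 100  # toy data, true mean 0.4
estimate = private_centroid(cluster, epsilon=1.0, rng=rng)
```

With a few hundred points per cluster the noise has little effect on the centroid, which is the kind of trade-off between privacy and clustering quality the paragraph above refers to. A smaller `epsilon` gives stronger privacy and a noisier estimate.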

Imitates (2016-2018)

Time series indexing is central to many scientific and business applications. Depending on the domain, the number and size of the series can grow explosively. Such data remain very difficult to handle, and indexing them is often a necessary first step. Imitates is a Spark machine learning library that implements two algorithms developed by Zenith. Both algorithms allow indexing massive amounts of time series (billions of series, several terabytes of data). A demo is available.

Scientific Workflow Management

OpenAlea  (2012 – )

OpenAlea is an open source project primarily aimed at the plant research community. It is a distributed collaborative effort to develop Python libraries and tools that address the needs of current and future work in plant architecture modeling. It includes modules to analyze, visualize and model the functioning and growth of plant architecture. It was formerly developed in the Inria VirtualPlants team. OpenAlea is used heavily by INRA for the analysis of phenotyping data.

DfAnalyzer: a tool for runtime analysis of scientific data flows (2018 -)

DfAnalyzer is a tool for monitoring, debugging, steering, and analyzing dataflows while they are being generated by scientific applications. It works by capturing strategic domain data and registering provenance and execution data, so that they can be queried at runtime. DfAnalyzer provides lightweight dataflow monitoring components to be invoked by high-performance applications. It can be plugged into scripts or Spark applications, in the same way users already plug in visualization library components.

Distributed Data Management

CloudMdsQL Polystore (2015-2018)

The CloudMdsQL (Cloud Multi-datastore Query Language) polystore transforms queries expressed in a common SQL-like query language into an optimized query execution plan to be executed over multiple cloud data stores (SQL, NoSQL, HDFS, etc.) through a query engine. The compiler/optimizer is implemented in C++ and uses the Boost.Spirit framework for parsing context-free grammars. CloudMdsQL has been validated on relational, document and graph data stores in the context of the CoherentPaaS European project.

SAVIME – Simulation And Visualization IN-Memory (2017 -)

SAVIME is a multi-dimensional array DBMS for scientific applications. It supports a novel data model called TARS (Typed ARray Schema), which extends the basic array data model with typed arrays. In TARS, the support of application dependent data characteristics is provided through the definition of TAR objects, ready to be manipulated by TAR operators. This approach provides much flexibility for capturing internal data layouts through mapping functions, which makes data ingestion independent of how simulation data has been produced, thus minimizing ingestion time.

Triton End-to-End Graph Mapper (2017 -)

Triton is a server for managing graph data and applications for mobile social networks. The server is built on top of the OrientDB graph database system and a distributed middleware. It provides an End-to-end Graph Mapper (EGM) for modeling the whole application as (i) a set of graphs representing the business data, the in-memory data structure maintained by the application and the user interface (a tree of graphical components), and (ii) a set of standardized mapping operators that map these graphs to each other.

Tools

Hadoop_g5k (2013)

Hadoop_g5k is a tool that makes it easier to manage Hadoop and Spark clusters and prepare reproducible experiments on the Grid'5000 platform. Hadoop_g5k offers a set of scripts to be used from the command line and a Python API to interact with the clusters. It is actively used within the G5k community, where it facilitates the preparation and execution of experiments on the platform.

FP-Hadoop (2012-2013)

FP-Hadoop makes the reduce side of Hadoop MapReduce more parallel and deals efficiently with data skew on the reduce side. FP-Hadoop adds a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed in parallel by intermediate reduce workers. Our experiments with FP-Hadoop on synthetic and real benchmarks have shown excellent performance gains compared to native Hadoop, e.g. more than 10 times faster in reduce time and 5 times faster in total execution time.
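The idea of an intermediate reduce phase can be sketched in a few lines of plain Python. This is a toy model, not FP-Hadoop code: the function names, the fixed `block_size` (FP-Hadoop constructs blocks dynamically) and the use of a thread pool as a stand-in for intermediate reduce workers are assumptions, and the sketch only works for reduce functions that are associative.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_blocks(values, block_size):
    """Cut the value list of one key into fixed-size blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def reduce_with_intermediate_phase(grouped, reduce_fn, block_size=4, workers=4):
    """grouped maps each key to its list of intermediate (map output) values."""
    # Intermediate reduce phase: every block is an independent task, so the
    # values of a heavily skewed key are spread over several workers.
    tasks = [(key, block)
             for key, values in grouped.items()
             for block in split_into_blocks(values, block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda t: (t[0], reduce(reduce_fn, t[1])), tasks))
    # Final reduce phase: merge the partial results of each key, in block order.
    result = {}
    for key, partial in partials:
        result[key] = partial if key not in result else reduce_fn(result[key], partial)
    return result


# One "hot" key holds almost all values, one "cold" key holds a single value.
skewed = {"hot": list(range(1, 101)), "cold": [5]}
totals = reduce_with_intermediate_phase(skewed, lambda a, b: a + b)
```

Without the intermediate phase, all 100 values of the hot key would be handled by a single reducer. Here they are cut into 25 blocks that can be processed concurrently before a short final merge.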

SON – Shared-data Overlay Network (2012-2014)

SON is a development tool for P2P networks using web services, JXTA and OSGi. A SON application is developed by designing and implementing a set of components. Each component includes technical code that provides the component services and a code component that provides the component logic (in Java). The complex aspects of asynchronous distributed programming are kept out of the code components: the component generator produces them automatically from an abstract description of the services of each component.

museval (2018-)

museval is a Python package to evaluate source separation results on the MUSDB18 dataset, also released by Zenith. The package was first proposed as part of the MUS task of the Signal Separation Evaluation Campaign (SiSEC 2018). It includes the official reference implementation of the new BSSEval version 4 objective metrics, which are widely used in the community to assess performance.

VersionClimber (2018-)

VersionClimber is an automated system that helps update the package and data infrastructure of a software application based on priorities indicated by the user (e.g. the user cares more about having a recent version of one package than of another). The system performs a systematic and heuristically efficient exploration (using bounded upward compatibility) of the version search space in a sandbox environment (virtualenv or conda env), and finally delivers the lexicographically maximum configuration with respect to the user-specified priority order. It works on Linux and Mac OS, and in the cloud.
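To make "lexicographically maximum configuration" concrete, here is a minimal brute-force sketch. It is not VersionClimber's search strategy (it ignores bounded upward compatibility and simply tries every combination), and the function name, the package names `a` and `b`, and the `works` callback standing in for a build-and-test run in a sandbox are all hypothetical.

```python
from itertools import product


def lexicographic_max_config(packages, works):
    """Return the best working configuration, or None if nothing works.

    packages: list of (name, versions) pairs, highest-priority package first.
    works:    callable taking {name: version} and returning True when the
              application builds and passes its tests with those versions.
    """
    names = [name for name, _ in packages]
    newest_first = [sorted(versions, reverse=True) for _, versions in packages]
    # product() varies the last package fastest, so configurations come out in
    # descending lexicographic order: the first one that works is the maximum.
    for combo in product(*newest_first):
        config = dict(zip(names, combo))
        if works(config):
            return config
    return None


# Toy scenario: the newest versions of "a" and "b" cannot be combined.
packages = [("a", [(1, 0), (2, 0)]), ("b", [(1, 0), (1, 5), (2, 0)])]


def works(config):
    return not (config["a"] == (2, 0) and config["b"] == (2, 0))


best = lexicographic_max_config(packages, works)
```

Because `a` comes first in the priority order, the result keeps `a` at its newest version and steps `b` back one release, rather than the other way round.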

Permanent link to this article: https://team.inria.fr/zenith/software/

Data search

Pl@ntnet (2012 -) Pl@ntNet is a citizen science platform that uses deep learning and big data to help people identify plants with their mobile phones. It is used in more than 200 countries by 25M users and allows up to 2M identifications per day of about 50K plant species. Pl@ntNet includes  an Android app, an …
