Data Search

Pl@ntNet (2012 -)

Pl@ntNet is a participatory platform and information system dedicated to the production of botanical data through deep learning-based plant identification. It includes three main front-ends: an Android app (the most advanced and most widely used), an iOS app (currently being redeveloped) and a web version. The main feature of the application is to return a ranked list of the most likely species given an image or a set of images of an individual plant. In addition, Pl@ntNet’s search engine returns the images in the dataset that are most similar to the queried observation, allowing interactive validation by users. The back office running on the server side of the platform is based on the Snoop visual search engine (software developed by ZENITH) and on NewSQL technologies for data management. The application is distributed in more than 180 countries (10M downloads) and currently identifies about 20K plant species.

The Plant Game (2016 -)

The Plant Game is a participatory game whose purpose is the production of big taxonomic data to improve our knowledge of biodiversity. One major contribution is the active training of users, based on innovative sub-task creation and assignment processes that adapt to the increasing skills of each user. Thousands of players are registered and together produce tens of new validated plant observations per day on average. The accuracy of the produced taxonomic tags is very high (about 95%), which is impressive given that a majority of users are beginners when they start playing.

Snoop (2012 -)

Snoop is a C++ framework dedicated to large-scale content-based image retrieval. Its main features are (i) the extraction and efficient indexing of visual features (hand-crafted or learned through deep learning), (ii) the search for similar images through approximate k-nearest neighbors, and (iii) the supervised recognition of trained visual concepts. The framework can be used either as a set of C++ libraries or as a set of web services through a RESTful API. Snoop is the visual search engine used by the Pl@ntNet applications, which reach a very large audience.
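To give an idea of what approximate k-nearest-neighbor search over visual features involves, here is a minimal Python sketch based on random-hyperplane hashing, one common approximation technique. It is not Snoop's implementation or API: the class name HyperplaneIndex, its parameters and the toy three-dimensional vectors are all hypothetical, and the source does not say which indexing scheme Snoop uses.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class HyperplaneIndex:
    """Toy approximate k-NN index (hypothetical, for illustration only)."""

    def __init__(self, dim, n_bits=4, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the hash key.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = {}   # hash key -> list of item ids
        self.vectors = {}   # item id -> feature vector

    def _key(self, v):
        # The side of each hyperplane the vector falls on.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                     for plane in self.planes)

    def add(self, item_id, v):
        self.vectors[item_id] = v
        self.buckets.setdefault(self._key(v), []).append(item_id)

    def search(self, query, k=3):
        # Only the query's bucket is scanned: this is what makes the
        # search approximate (true neighbors in other buckets are missed).
        candidates = self.buckets.get(self._key(query), [])
        ranked = sorted(candidates,
                        key=lambda i: cosine(query, self.vectors[i]),
                        reverse=True)
        return ranked[:k]


index = HyperplaneIndex(dim=3)
index.add("a", [1.0, 0.0, 0.2])
index.add("b", [0.9, 0.1, 0.2])
index.add("c", [-1.0, 0.0, -0.2])
nearest = index.search([1.0, 0.0, 0.2], k=2)
```

Scanning a single bucket trades recall for speed; practical systems usually combine several hash tables or probe neighboring buckets to recover missed neighbors.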

Data Analytics

Chiaroscuro (2015 -)

Chiaroscuro is a complete solution for clustering personal data with strong privacy guarantees. The execution sequence produced by Chiaroscuro is massively distributed on personal devices, coping with arbitrary connections and disconnections. Chiaroscuro builds on our novel data structure, called Diptych, which allows the participating devices to collaborate privately by combining encryption with differential privacy. Our solution yields a high clustering quality while minimizing the impact of the differentially private perturbation.
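The differentially private perturbation mentioned above can be illustrated with the standard Laplace mechanism applied to one coordinate of a cluster mean. This is only a sketch under simplifying assumptions: it is centralized, it omits the encryption and the Diptych data structure entirely, it treats the number of values as public, and the function names and parameters are hypothetical rather than taken from Chiaroscuro.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially private mean of values clipped to [lower, upper].

    Assumes the number of values n is public, so that replacing one
    value changes the mean by at most (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
# One coordinate of a cluster centroid, computed over four devices' values.
noisy_centroid = private_mean([0.2, 0.4, 0.6, 0.8], 0.0, 1.0,
                              epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy but a larger noise scale, which is the trade-off between privacy and clustering quality that the description refers to.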

Imitates (2016-2018)

Time series indexing is at the center of many scientific works and business needs. The number and size of the series may well explode depending on the domain concerned. These data are still very difficult to handle and, often, a necessary step in handling them is indexing. Imitates is a Spark Machine Learning Library that implements two algorithms developed by Zenith. Both algorithms allow indexing massive amounts of time series (billions of series, several terabytes of data). A demo is available.
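The text above does not describe the two algorithms, so the following Python sketch should not be read as Imitates' method. It only illustrates one widely used idea behind time series indexing, symbolic aggregate approximation (SAX): each series is z-normalized, averaged over fixed-length segments, and mapped to a short word that serves as an index key. The function names, the 4-letter alphabet and the sample series are assumptions made for the example.

```python
import math
from bisect import bisect_right

# Quartile breakpoints of the standard normal distribution (4-letter alphabet).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"


def sax_word(series, n_segments):
    """Return the SAX word of a series.

    Assumes len(series) is a multiple of n_segments.
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    # z-normalize; a constant series maps to all zeros.
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    seg = n // n_segments
    # Piecewise aggregate approximation: one average per segment.
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, v)] for v in paa)


def build_index(named_series, n_segments):
    """Group series ids by SAX word, so similar shapes share a bucket."""
    index = {}
    for name, series in named_series.items():
        index.setdefault(sax_word(series, n_segments), []).append(name)
    return index


index = build_index(
    {
        "rising_1": [0, 0, 0, 0, 10, 10, 10, 10],
        "rising_2": [1, 1, 1, 1, 9, 9, 9, 9],
        "falling": [10, 10, 10, 10, 0, 0, 0, 0],
    },
    n_segments=2,
)
```

At query time, a series is converted to its word and only the matching bucket is scanned, which is what makes lookups over very large collections tractable.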

Data Integration

CloudMdsQL Polystore (2015-2018)

The CloudMdsQL (Cloud Multi-datastore Query Language) polystore transforms queries expressed in a common SQL-like query language into an optimized query execution plan to be executed over multiple cloud data stores (SQL, NoSQL, HDFS, etc.) through a query engine. The compiler/optimizer is implemented in C++ and uses the Boost.Spirit framework for parsing context-free grammars. CloudMdsQL has been validated on relational, document and graph data stores in the context of the CoherentPaaS European project.

WebSmatch – Web Schema Matching (2011-2014)

WebSmatch is a flexible, open environment for discovering and matching complex schemas from heterogeneous Web data sources. It provides three basic functions: (1) metadata extraction from data sources; (2) schema matching; and (3) schema clustering to group similar schemas together. WebSmatch is delivered through Web services, to be used directly by data integrators or by other tools with RIA clients. It is implemented in Java and delivered as open source software (under LGPL). WebSmatch has been used by Data Publica and CIRAD to integrate public and private data sources.

Scientific Workflow Management

OpenAlea  (2012 – )

OpenAlea is an open source project primarily aimed at the plant research community. It is a distributed collaborative effort to develop Python libraries and tools that address the needs of current and future work in plant architecture modeling. It includes modules to analyze, visualize and model the functioning and growth of plant architecture. It was formerly developed in the Inria VirtualPlants team.

DfAnalyzer: a tool for runtime analysis of scientific data flows (2018 -)

DfAnalyzer is a tool for monitoring, debugging, steering, and analyzing dataflows while they are being generated by scientific applications. It works by capturing strategic domain data and registering provenance and execution data to enable queries at runtime. DfAnalyzer provides lightweight dataflow monitoring components to be invoked by high-performance applications. It can be plugged into scripts or Spark applications in the same way users already plug in visualization library components.
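The idea of capturing domain data and provenance while an application runs, and querying it before the run ends, can be sketched in a few lines of Python. Nothing below is DfAnalyzer's actual API: the traced decorator, the in-memory PROVENANCE list, the query helper and the solve_step task are hypothetical stand-ins chosen for illustration.

```python
import time
from functools import wraps

# In-memory stand-in for a provenance store (hypothetical).
PROVENANCE = []


def traced(task_name):
    """Record the inputs, output and duration of each call to a task."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            PROVENANCE.append({
                "task": task_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "elapsed_s": time.time() - start,
            })
            return result
        return wrapper
    return decorator


@traced("solve_step")
def solve_step(mesh_size, tolerance):
    # Placeholder computation standing in for a simulation step.
    return {"residual": tolerance / mesh_size}


def query(task, predicate):
    """Return the records of a task that satisfy a predicate."""
    return [r for r in PROVENANCE if r["task"] == task and predicate(r)]


solve_step(10, 1.0)
solve_step(100, 1.0)
# Runtime query: which steps have already reached a small residual?
converged = query("solve_step", lambda r: r["output"]["residual"] < 0.05)
```

Because records are appended as each task finishes, a user can inspect intermediate results and decide to adjust parameters while the application is still running.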

SciFloware (2013 -)

SciFloware is a middleware for the execution of scientific workflows in a distributed and parallel way. It capitalizes on our experience with the Shared-Data Overlay Network and an innovative algebraic approach to the management of scientific workflows. SciFloware provides a development environment and a runtime environment for scientific workflows, interoperable with existing systems. We validate SciFloware with workflows for analyzing biological data provided by our partners CIRAD, INRA and IRD.

Distributed Data Management

SAVIME – Simulation And Visualization IN-Memory (2017 -)

SAVIME is a multi-dimensional array DBMS for scientific applications. It supports a novel data model called TARS (Typed ARray Schema), which extends the basic array data model with typed arrays. In TARS, support for application-dependent data characteristics is provided through the definition of TAR objects, ready to be manipulated by TAR operators. This approach provides much flexibility for capturing internal data layouts through mapping functions, which makes data ingestion independent of how the simulation data was produced, thus minimizing ingestion time.

Triton End-to-End Graph Mapper (2017 -)

Triton is a server for managing graph data and applications for mobile social networks. The server is built on top of the OrientDB graph database system and a distributed middleware. It provides an End-to-end Graph Mapper (EGM) for modeling the whole application as (i) a set of graphs representing the business data, the in-memory data structure maintained by the application, and the user interface (a tree of graphical components), and (ii) a set of standardized mapping operators that map these graphs to each other.

Hadoop_g5k (2013)

Hadoop_g5k is a tool that makes it easier to manage Hadoop and Spark clusters and prepare reproducible experiments on the Grid'5000 platform. Hadoop_g5k offers a set of command-line scripts and a Python API to interact with the clusters. It is currently used within the G5k community, facilitating the preparation and execution of experiments on the platform.

FP-Hadoop (2012-2013)

FP-Hadoop makes the reduce side of Hadoop MapReduce more parallel and deals efficiently with data skew on the reduce side. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel. Our experiments with FP-Hadoop on synthetic and real benchmarks have shown excellent performance gains compared to native Hadoop, e.g. reduce time more than 10 times shorter and total execution time more than 5 times shorter.
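The intermediate reduce idea can be sketched in plain Python, outside Hadoop. In this illustration the values of each key are cut into fixed-size blocks, the blocks are reduced in parallel, and a final reduce combines the partial results, so a single heavily loaded key no longer has to be processed by one worker. The sketch assumes the reduce function is associative and commutative (such as a sum). The function names, the fixed block size and the thread pool are choices made for the example, not FP-Hadoop's implementation, which builds blocks dynamically.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(values, block_size):
    """Cut the list of values of one key into fixed-size blocks."""
    return [values[i:i + block_size]
            for i in range(0, len(values), block_size)]


def reduce_with_intermediate_phase(grouped, reduce_fn,
                                   block_size=1000, workers=4):
    """Reduce each key in two steps: per-block partials, then a final reduce.

    Assumes reduce_fn is associative and commutative, and that it can be
    applied to a list of partial results as well as to raw values.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for key, values in grouped.items():
            blocks = split_into_blocks(values, block_size)
            # Intermediate reduce: blocks of one key are handled in parallel.
            partials = list(pool.map(reduce_fn, blocks))
            # Final reduce: combine the (few) partial results.
            results[key] = reduce_fn(partials)
    return results


# Skewed input: one "hot" key holds almost all the values.
grouped = {"hot": list(range(10000)), "cold": [1, 2, 3]}
totals = reduce_with_intermediate_phase(grouped, sum)
```

Threads are used here only to show the structure of the phase; in CPython they do not speed up CPU-bound reducers, whereas in a cluster the intermediate reduce workers run on separate machines.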

SON – Shared-data Overlay Network (2012-2014)

SON is a development platform for P2P networks using web services, JXTA and OSGi. The development of a SON application is done through the design and implementation of a set of components. Each component includes a technical code that provides the component services and a code component that provides the component logic (in Java). The complex aspects of asynchronous distributed programming are separated from code components and automatically generated from an abstract description of services for each component by the component generator.

