Data Search

  • Pl@ntNet

    The Pl@ntNet iPhone app is an image sharing and retrieval application for identifying plants. It is developed in the context of the Pl@ntNet project by scientists from four French research organisations (INRIA, Cirad, INRA, IRD) and members of the Tela Botanica social network, with the financial support of Agropolis Fondation.
    Among other features, this free app helps identify plant species from photographs through a visual search engine that builds on several research results of Zenith on large-scale content-based retrieval and high-dimensional data hashing.
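
To give a feel for what high-dimensional data hashing means in a visual search engine, here is a minimal sketch of random-hyperplane hashing. It is not the Pl@ntNet code: the function names, the code length and the brute-force ranking are assumptions chosen for illustration only.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, hyperplanes):
    """Map a feature vector to a binary code: one bit per hyperplane side."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def search(query, index, hyperplanes, k=1):
    """Return the names of the k indexed items whose codes are closest to the query code."""
    q = hash_vector(query, hyperplanes)
    ranked = sorted(index.items(), key=lambda item: hamming(q, item[1]))
    return [name for name, _ in ranked[:k]]
```

Vectors that point in similar directions tend to fall on the same side of most hyperplanes, so comparing short binary codes can stand in for comparing the original high-dimensional descriptors.
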
  • ThePlantGame: crowdsourced plant identification

    The Plant Game is a participatory game whose purpose is to produce large volumes of taxonomic data to improve our knowledge of biodiversity. The interest of the game is twofold: (i) to train and progress in botany while having fun, and (ii) to participate in a large citizen science project on biodiversity. The game relies on substantial scientific contributions that go beyond classical crowdsourcing models and algorithms, which do not scale to classification problems with thousands of complex classes such as plant species. The most remarkable of these contributions is the active training of users, based on innovative sub-task creation and assignment processes that adapt to the increasing skills of each user. The first public version of the game was released in July 2015. About 1000 players are now registered, and they produce about 35 new validated plant observations per day on average. The accuracy of the produced taxonomic tags is about 94%, which is notable given that a majority of users are beginners when they start playing.

  • PlantRT: Gossip-Based Recommendation

Many fields of science are currently massive producers of diverse data items. PlantRT focuses on plant observations produced at a large scale by botanists. The objective of our prototype is to support the study of the evolution of, and correlations between, plant families. Each image or observation is produced on a personal basis, by each citizen or botanist involved in the project, and can be stored on different types of sites (e.g. personal computers, smartphones, servers, clouds). Moreover, the emergence of distributed recommender systems promotes the sharing, discovery, and linking of the data produced by each citizen involved. PlantRT is a distributed gossip-based platform for content sharing that enables keyword search over plant observations and recommendation based on GPS position. It takes into account the diversity of user profiles and data, promoting, for example, the discovery of new plant species of the same family or from the same geographical area.
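
The gossip principle behind this kind of platform can be sketched as follows. This is a generic illustration, not the PlantRT protocol: the Jaccard similarity on keyword profiles, the one-sided exchange and all names (`gossip_round`, `recommend`, `view_size`) are assumptions.

```python
import random

def jaccard(a, b):
    """Similarity between two keyword profiles (sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gossip_round(profiles, views, view_size, rng):
    """One gossip round: each peer contacts one peer of its view, merges both
    views and keeps the view_size peers most similar to itself."""
    for peer in sorted(profiles):
        if not views[peer]:
            continue
        partner = rng.choice(sorted(views[peer]))
        candidates = (views[peer] | views[partner] | {partner}) - {peer}
        ranked = sorted(candidates,
                        key=lambda p: (-jaccard(profiles[peer], profiles[p]), p))
        views[peer] = set(ranked[:view_size])
    return views

def recommend(peer, profiles, views):
    """Recommend the keywords known to a peer's neighbours but not to the peer."""
    suggested = set()
    for neighbour in views[peer]:
        suggested |= profiles[neighbour]
    return suggested - profiles[peer]
```

No peer needs a global view of the network: repeated local exchanges are enough for each peer to drift towards neighbours with similar observations, which then serve as its source of recommendations.
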

  • Pl@ntNet-Identify

    Pl@ntNet-Identify is a web application dedicated to the image-based identification of plants. It has been developed jointly by Zenith, the AMAP UMR lab (CIRAD) and the Inria team IMEDIA. It allows users to submit one or several query pictures of a plant and browse the matching species in a large collection of social image data, i.e. plant images collected by the members of a social network. It also allows users to enrich the knowledge of the application by uploading their own pictures to the reference collection. The dataset now includes more than 17K images posted by about 100 members of the Telabotanica social network. In 2012, about 5000 identification sessions were recorded. The client side of the application is implemented in JavaScript, whereas the server side (visual search engine) is mostly implemented in C++.

  • Pl@ntNet-DataManager

    Pl@ntNet-DataManager is software dedicated to managing and sharing distributed heterogeneous botanical data. It is developed jointly by Zenith, the AMAP UMR team (CIRAD) and the Telabotanica non-profit organization. It allows scientists to define data structures dedicated to their own datasets, and to share parts of their structures and data with collaborators in a decentralized way. Pl@ntNet-DataManager offers innovative features such as partial or complete P2P (master-master) synchronization between distant databases and a user-friendly data structure editor. It also provides full-text search, querying, CSV import/export, SQL export, image management, and geolocation. DataManager is built on NoSQL technology (CouchDB database), JavaScript (Node.js), HTML5 and CSS3, and may be deployed on a server or run on a local machine (standalone version for Linux, Windows, Mac). It is used by researchers and engineers of the Pl@ntNet project (CIRAD, INRA, INRIA, IRD, Tela-Botanica) to manage taxonomic reference lists, herbarium data and geolocated plant observations.

  • SnoopIm

    SnoopIm is a content-based search engine for discovering and retrieving small visual patterns or objects in large collections of pictures (such as logos on clothes, road signs in the background, paintings on walls, etc.) and for deriving statistics from them (frequency, visual cover, size variations, etc.). Query objects can be selected either from the indexed collection of photos or from an external picture (by simply providing its URL). The web application supports online search by multiple users and has a cache feature that speeds up the processing of previously seen queries. It is implemented in JavaScript on top of a C++ library developed in collaboration with INA. The software is used at INA by archivists and sociologists in the context of the Transmedia Observatory project.

  • MultiSite-Rec

    Recommender systems are used as a means to supply users with content that may be of interest to them. They have become a popular research topic, and many of their aspects and dimensions have been studied to make them more accurate and effective. In practice, recommender systems suffer from the cold-start problem. However, users rely on many online services (e.g. the Google search engine, Facebook, Twitter) that can provide information about their interests and about the content of items. These services can be valuable data sources that help a recommender system model user preferences and item content, and thus make it more precise. Moreover, these data sources are distributed and geographically distant from each other, which raises many research problems and challenges for the design of a distributed recommendation algorithm. MultiSite-Rec is a distributed collaborative filtering recommender system that exploits and combines these multiple heterogeneous data sources to improve recommendation quality.

Data Analytics

  • Chiaroscuro

The advent of on-body and at-home sensors connected to personal devices leads to the generation of fine-grained, highly sensitive personal data at an unprecedented rate. However, despite the promises of large-scale analytics, there are obvious privacy concerns that prevent individuals from sharing their personal data. Chiaroscuro is a complete solution for clustering personal data with strong privacy guarantees. The execution sequence produced by Chiaroscuro is massively distributed on personal devices and copes with arbitrary connections and disconnections. Chiaroscuro builds on our novel data structure, called Diptych, which allows the participating devices to collaborate privately by combining encryption with differential privacy. Our solution yields a high clustering quality while minimizing the impact of the differentially private perturbation. Our study shows that Chiaroscuro is both correct and secure.
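
As a rough illustration of what a differentially private perturbation does to a clustering step, the sketch below releases a noisy one-dimensional centroid using the standard Laplace mechanism. It does not reproduce Chiaroscuro or Diptych: encryption and the distributed execution are left out, and the function names, the clipping bounds and the even split of the privacy budget are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_centroid(values, lower, upper, epsilon, rng):
    """Release a noisy mean of the values assigned to one cluster.

    Values are clipped to [lower, upper] (assumed not both zero) so that one
    individual can change the sum by at most max(|lower|, |upper|) and the
    count by at most 1. Half of the budget epsilon goes to each quantity.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    half = epsilon / 2.0
    sum_sensitivity = max(abs(lower), abs(upper))
    noisy_sum = sum(clipped) + laplace_noise(sum_sensitivity / half, rng)
    noisy_count = len(clipped) + laplace_noise(1.0 / half, rng)
    return noisy_sum / max(noisy_count, 1.0)
```

A smaller `epsilon` gives stronger privacy but a noisier centroid, which is the trade-off between clustering quality and perturbation mentioned above.
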

  • LogMagnet

LogMagnet is software for analyzing streaming data, in particular log data. Log data usually arrive as lines describing the activities of humans or machines. For humans, this may be behavior on a Web site or the usage of an application. For machines, such logs may contain the activities of software and hardware components (say, for each node of a computing cluster, the calls to system functions or some hardware alerts). Analyzing such data is often both difficult and crucial. LogMagnet summarizes this data and provides a first analysis in the form of a clustering. The summary can then be exploited as easily as the original data.

  • Imitates

    Time series indexing is central to many scientific projects and to the business needs of many industrial players. The number and size of the series may well explode depending on the domain concerned. Our team, for instance, is working on seismic data with a few hundred to several hundred thousand measurements per series, representing up to tens of terabytes of data. Such data are still very difficult to handle, and indexing them is often a necessary first step. Imitates is an Inria ADT, with 2 years of funding for an experienced engineer (already recruited). The goal of this work is to implement two algorithms designed by our team in the Spark Machine Learning Library. Both algorithms allow indexing massive amounts of time series (billions of series, several terabytes of data). Finally, part of this project will be dedicated to implementing a demonstrator that visualizes massively distributed data and exploits the indexing techniques implemented in the context of this ADT.
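
The two algorithms are not named here, so the following is only a generic sketch of the summarization step that many time series indexes build on: z-normalization, Piecewise Aggregate Approximation (PAA) and conversion to a short symbolic (SAX-style) word. The four-letter alphabet and the function names are assumptions, and the code is plain Python rather than Spark.

```python
import bisect
import statistics

# Breakpoints splitting a standard normal distribution into 4 equiprobable regions.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def z_normalize(series):
    """Rescale a series to zero mean and unit variance (constant series map to zeros)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]

def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width chunk."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of the segment count")
    width = len(series) // segments
    return [statistics.fmean(series[i:i + width])
            for i in range(0, len(series), width)]

def sax_word(series, segments):
    """Summarize a series as a word with one symbol per PAA segment."""
    reduced = paa(z_normalize(series), segments)
    return "".join(ALPHABET[bisect.bisect_right(BREAKPOINTS, v)] for v in reduced)
```

Series with the same word can be grouped under the same index entry, so a query only has to be compared in full against the series whose summaries are close to its own.
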

  • FP-Hadoop

    When processing parallel jobs with MapReduce, there are important cases where a high percentage of the processing on the reduce side is done by a few nodes, or even a single node, while the others remain idle. In particular, this happens when most of the intermediate values correspond to a single key, or when the number of keys is smaller than the number of reduce workers.
    FP-Hadoop is a system that makes the reduce side of MapReduce more parallel and deals efficiently with the problem of data skew on the reduce side. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed in parallel by intermediate reduce workers according to a scheduling strategy. With the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.
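
The idea of the IR phase can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions, not FP-Hadoop code: blocks have a fixed size instead of being constructed dynamically, a thread pool stands in for the intermediate reduce workers, and the `combine` function is assumed to be associative so that partial results can be reduced again.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(values, block_size):
    """Cut the values of one key into fixed-size blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]

def reduce_with_ir(intermediate, combine, block_size=4, workers=4):
    """Reduce (key, value) pairs with an intermediate reduce phase.

    Blocks are reduced in parallel first; the final reduce only has to
    combine one partial result per block, even for a single hot key.
    """
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # IR phase: every block is an independent task, whatever its key.
    tasks = [(key, block)
             for key, values in grouped.items()
             for block in split_into_blocks(values, block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda task: (task[0], combine(task[1])), tasks))

    # Final reduce: combine the partial results of each key.
    by_key = defaultdict(list)
    for key, partial in partials:
        by_key[key].append(partial)
    return {key: combine(parts) for key, parts in by_key.items()}
```

For example, `reduce_with_ir([("hot", 1)] * 1000 + [("cold", 5)], sum)` spreads the 1000 values of the skewed key over 250 block tasks instead of sending them all to one reducer.
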

Data Integration

  • CloudMdsQL Compiler

The CloudMdsQL (Cloud Multi-datastore Query Language) compiler transforms queries expressed in a common SQL-like query language into an optimized query execution plan to be executed over multiple cloud data stores (SQL, NoSQL, HDFS, etc.) through a query engine. The compiler/optimizer is implemented in C++ and uses the Boost.Spirit framework for parsing context-free grammars. CloudMdsQL is being validated on relational, document and graph data stores in the context of the CoherentPaaS European project.

  • AgroLD (Agronomic Linked Data)

The aim of the Agronomic Linked Data (AgroLD) project is to provide a portal where bioinformaticians and domain experts can exploit homogenized data models to generate research hypotheses efficiently. AgroLD is an RDF knowledge base designed to integrate data from various publicly available plant-centric data sources and ontologies, using the Web Ontology Language (OWL) and the SPARQL query language.

  • WebSmatch (Web Schema Matching)

Started in the context of an Action de Développement Technologique (ADT) in 2010-2013, WebSmatch is a flexible, open environment for discovering and matching complex schemas from many heterogeneous data sources over the Web. It provides three basic functions: (1) metadata extraction from data sources; (2) schema matching (both 2-way and n-way); and (3) schema clustering to group similar schemas together. WebSmatch is delivered through Web services, to be used directly by data integrators or other tools, with RIA clients. It is implemented in Java, delivered as open source software (under LGPL) and protected by a deposit at APP (Agence de Protection des Programmes). WebSmatch is used by Datapublica and CIRAD to integrate public and private data sources. It is the basis of our work in the Xdata project and IBC.
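
To make the 2-way schema matching function more concrete, here is a deliberately simple sketch that pairs attributes by name similarity. It is not WebSmatch's matcher (WebSmatch itself is written in Java and its matching techniques are not described here); the normalization rule, the threshold and the function names are assumptions, and only the Python standard library is used.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase an attribute name and drop separators such as '_' or '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def name_similarity(a, b):
    """String similarity in [0, 1] between two normalized attribute names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_schemas(source, target, threshold=0.5):
    """Map each source attribute to its most similar target attribute,
    keeping only the pairs whose similarity reaches the threshold."""
    matches = {}
    for attr in source:
        best = max(target, key=lambda cand: name_similarity(attr, cand), default=None)
        if best is None:
            continue
        score = name_similarity(attr, best)
        if score >= threshold:
            matches[attr] = (best, score)
    return matches
```

With `source = ["customer_name", "zip", "birth"]` and `target = ["CustomerName", "postal_code", "zip_code"]`, the first two attributes find a counterpart while `birth` stays unmatched. A real matcher would also look at data types, instances and schema structure.
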

Distributed Data Management

  • SON (Shared-data Overlay Network)

SON is an open source development platform for P2P networks using web services, JXTA and OSGi. SON combines three powerful paradigms: components, SOA and P2P. Components communicate by asynchronous message passing to provide weak coupling between system entities. To scale up and ease deployment, we rely on a decentralized organization based on a DHT for publishing and discovering services or data. In terms of communication, the infrastructure is based on JXTA virtual communication pipes, a technology that has been extensively used within the Grid community. Using SON, a P2P application is developed by designing and implementing a set of components. Each component includes technical code that provides the component services and component code that provides the component logic (in Java). The complex aspects of asynchronous distributed programming (the technical code) are separated from the component code and generated automatically by the component generator from an abstract description of the services (provided or required) of each component.
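
The role of the DHT in publishing and discovering services can be sketched with a toy consistent-hashing ring. This single-process model is an illustration only: it does not use JXTA or OSGi, and the `ToyDHT` class, its methods and the peer names are hypothetical.

```python
import bisect
import hashlib

def ring_position(name):
    """Place a peer or a key on the hash ring."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)

class ToyDHT:
    """In-memory stand-in for a DHT: each key is stored on the first peer
    found clockwise from the key's position on the ring."""

    def __init__(self, peers):
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]
        self._store = {p: {} for p in peers}

    def responsible_peer(self, key):
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]

    def publish(self, service_name, endpoint):
        """Store a service description on the peer responsible for its name."""
        peer = self.responsible_peer(service_name)
        self._store[peer][service_name] = endpoint
        return peer

    def discover(self, service_name):
        """Look a service up on the responsible peer; None if unknown."""
        peer = self.responsible_peer(service_name)
        return self._store[peer].get(service_name)
```

Because every peer computes the same responsible peer for a given name, a service published by one component can be discovered by another without any central registry.
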

  • SciFloware

SciFloware is a middleware for the execution of scientific workflows in a distributed and parallel way. It capitalizes on our experience with the Shared-Data Overlay Network and an innovative algebraic approach to the management of scientific workflows. SciFloware provides a development environment and a runtime environment for scientific workflows, interoperable with existing systems. We validate SciFloware with workflows for analyzing biological data provided by our partners CIRAD, INRA and IRD.

  • Hadoop_g5k: Hadoop and Spark clusters in Grid5000

Grid5000 (G5k) is a scientific instrument that supports large-scale, reproducible experiments in the context of research on distributed systems, providing access to more than 1000 nodes and 8000 cores. Apache Hadoop, Apache Spark and their related projects are the most popular frameworks used in big data, which makes them natural targets for experiments in G5k. However, their management and configuration may be difficult, especially given the dynamic nature of clusters within Grid5000 reservations. We have developed Hadoop_g5k, a tool that makes it easier to manage Hadoop and Spark clusters and to prepare reproducible experiments on the G5k platform. Hadoop_g5k offers a set of scripts to be used from the command line and a Python API to interact with the clusters. It is in active use within the G5k community, where it facilitates the preparation and execution of experiments on the platform.


F-ParSketch: Fully Parallel Sketches for Time Series Indexing in Massively Distributed Environments

This page accompanies the paper about F-ParSketch (*). It gives links to the code and documentation of ParSketch & F-ParSketch, Sketch, and iSAX2+.

(*) “ParSketch: Massively distributed indexing of time series”, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia and Dennis Shasha.