Large-scale Scientific Data Sharing and Analysis – application to plant phenotyping data
Recent progress in agronomy, bioinformatics, physics and environmental science has resulted in overwhelming amounts of experimental data produced through observation and simulation. New observational instruments (e.g. satellites, sensors, the Large Hadron Collider) and simulation tools are creating a huge data overload. For example, climate modeling data are growing so fast that collections of hundreds of exabytes are expected by 2020. Such data must be processed, i.e. cleaned, transformed and analyzed, in order to draw conclusions, prove scientific theories and produce knowledge. The goal of scientific data management is to make scientific data easier to access, reproduce and share for scientists of different disciplines and institutions.
The datasets generated this way are complex, in particular because of the heterogeneous methods used to produce them, the uncertainty of the captured data and, above all, their inherently multi-scale nature (spatial and temporal scales). This results in data with hundreds of attributes, dimensions or descriptors. Processing and analyzing such massive sets of complex data is therefore a major challenge, which calls for solutions that combine new data management techniques with large-scale parallelism in cluster, grid or cloud environments.
Furthermore, current scientific issues require integrated datasets and involve scientists from different disciplines (e.g. biologists, soil scientists and geologists working on an environmental project), in some cases from different organizations located in different countries. But because each discipline or organization tends to produce and manage its own data, in specific formats and with its own processes, it is increasingly difficult to share distributed data.
This raises two major challenges for data management. The first is sharing these datasets among scientists of different disciplines who want to collaborate; the second is analyzing the data.
This project also involves the Mab team of Lirmm and their partners. Here we present the scientific activities related to plant phenotyping that the Zenith team carries out with its partners. The two teams share the same events.