High-Performance Data Management for Precision Agriculture

  • Advisors: François Tessier (KerData team), Gabriel Antoniu (KerData team)
  • Main contact: francois.tessier (at) inria.fr
  • Expected start date: October 1st, 2021
  • Application deadline: as early as possible, no later than April 30th, 2021

Description

Feeding the world’s growing population is a decisive challenge, especially in view of climate change, which adds uncertainty to food production. Sustainable and precision agriculture is one of the answers that can partly address this issue. Precision agriculture consists of using new technologies to improve crop management by taking into account environmental parameters such as temperature, soil moisture or weather conditions. These techniques now need to scale up to improve their accuracy. Over the past few years, precision agriculture workflows have emerged that run across the digital continuum, that is, all the computing resources from the edge to High-Performance Computing (HPC) and Cloud infrastructures. This scale-up brings new problems, particularly with regard to data movement.

CybeleTech [1,2] is a French company that aims to develop the use of digital technologies in agriculture. Its core products are based on the numerical simulation of plant growth through dedicated biophysical models, and on machine learning methods that extract knowledge about these processes from large databases. To feed its models, CybeleTech collects data from sensors installed on open agricultural plots or in crop greenhouses. Plant growth models take weather variables as input, and the accuracy of the estimated agronomic indices relies heavily on the accuracy of these variables. To this end, CybeleTech wishes to collect precise meteorological information from large forecasting centers such as the European Centre for Medium-Range Weather Forecasts (ECMWF) [3]. Gathering this data is not trivial, since it involves large data movements between two distant sites under severe time constraints.

The objective of this thesis is to propose new data management techniques and data movement algorithms to accelerate the execution of these hybrid geo-distributed workflows running on large-scale systems in the area of precision agriculture.

ECMWF’s production workflow uses ensemble simulations to refine its weather forecasts. To date, these simulations generate approximately 60 TB per hour, and the center expects this volume to grow by 40% per year. Structured datasets called “products” are then generated from this output data and disseminated to different clients, such as public institutions or private companies, at a rate of 1 PB per month. This substantial volume is also tending to increase considerably. CybeleTech’s goal is to retrieve and ingest this data in order to refine its precision farming models. This cross-site workflow will use a dedicated architecture as described in the European project EUPEX. In particular, this architecture will feature nodes equipped with ephemeral storage resources (NVMe, NVDIMM) and high-bandwidth memory (HBM), called Datanodes, which act as a gateway for data dissemination.
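To give a rough sense of scale, the figures above can be turned into average transfer rates and a growth projection. The short Python sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, and the use of decimal units (1 TB = 10^12 bytes), a 30-day month and steady compound growth are assumptions made here, not values provided by ECMWF.

```python
# Back-of-the-envelope sketch. Decimal units, a 30-day month and steady
# compound growth are assumptions of this example, not ECMWF figures.
TB = 1e12  # bytes in one terabyte (decimal)
PB = 1e15  # bytes in one petabyte (decimal)


def projected_volume(base: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth (annual_growth=0.40 means 40%)."""
    return base * (1.0 + annual_growth) ** years


def sustained_gbit_per_s(n_bytes: float, seconds: float) -> float:
    """Average rate, in Gbit/s, needed to move `n_bytes` within `seconds`."""
    return n_bytes * 8 / seconds / 1e9


if __name__ == "__main__":
    # Hourly output volume if the 40% annual growth held for several years.
    for years in (0, 1, 3, 5):
        volume = projected_volume(60, 0.40, years)
        print(f"year +{years}: {volume:.1f} TB per hour")

    # Average rates implied by the generation and dissemination figures.
    generation = sustained_gbit_per_s(60 * TB, 3600)
    dissemination = sustained_gbit_per_s(1 * PB, 30 * 86400)
    print(f"generation:    {generation:.1f} Gbit/s on average")
    print(f"dissemination: {dissemination:.2f} Gbit/s on average")
```

These are averages only. Actual traffic is likely to be bursty around forecast production times, which is part of what makes the time constraints mentioned above severe.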

Among the challenges raised here, this thesis will focus on how ECMWF will aggregate the data on these nodes, how the data will be processed (generation, filtering, transformation) and how it will be sent to CybeleTech. Efficient and scalable data management will require new techniques and algorithms that adapt to the specific data structures involved.

The selected candidate will contribute to the definition, implementation and validation of a multi-tier storage architecture meeting the I/O requirements of the target use cases proposed by CybeleTech. In particular, an important goal is to provide workflows and applications at the Exascale level with fast and scalable access to data while minimizing data movement across the storage hierarchy.

This PhD thesis will focus on an in-memory tier for distributed data management across the Datanodes, that is, the nodes hosting ephemeral storage resources. To that end, this research track could build on Damaris [4,5,6], a mature KerData project that leverages various storage capabilities to provide scalable asynchronous I/O and non-intrusive in-situ and in-transit data processing on the Datanodes. Emphasis will be put on exploring data aggregation algorithms [7] coupled with new techniques for hybrid (both stream-based and batch-based) in-transit data analysis on shared Datanodes. This research will be implemented and tested in Damaris to support workflows and Big Data analytics.

Requirements of the candidate

– An excellent Master’s degree in computer science or equivalent
– Strong knowledge of distributed systems
– Knowledge of storage and (distributed) file systems
– Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
– Strong programming skills (Python, C/C++)
– Work experience in Big Data management, Cloud computing or HPC is an advantage
– Very good oral and written communication skills in English
– Open-mindedness, strong integration skills and team spirit

How to apply?

Send an email with a cover letter, a CV, the contact details of at least two referees (internship supervisor, teacher in a related field, …) and copies of degree certificates to Dr. François Tessier and Dr. Gabriel Antoniu. Incomplete applications will be neither considered nor answered.

References

[1] The CybeleTech company. URL: https://www.cybeletech.com/en/home/
[2] L’agribusiness pour la prospérité de l’entreprise. Le Parisien, October 20, 2016. URL: https://www.leparisien.fr/economie/business/l-agribusiness-pour-la-prosperite-de-l-entreprise-17-10-2016-6216093.php
[3] European Centre for Medium-Range Weather Forecasts. URL: https://www.ecmwf.int/
[4] The Damaris project. URL: https://project.inria.fr/damaris/
[5] Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O, M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf, in Proc. of the IEEE Cluster 2012 conference. September 2012 (Beijing, China).
[6] Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework, M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro, in Proc. of the IEEE LDAV 2013 conference. October 2013 (Atlanta, GA, USA).
[7] TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers, F. Tessier, V. Vishwanath, E. Jeannot, in Proc. of the IEEE Cluster 2017 conference. September 2017 (Honolulu, HI, USA).
