Advisors: Gabriel Antoniu (KerData team), Alexandru Costan (KerData team), Patrick Valduriez (Zenith team)
Main contacts: gabriel.antoniu (at) inria.fr, alexandru.costan (at) inria.fr, patrick.valduriez (at) inria.fr, alexis.joly (at) inria.fr
Expected start date: October 1st, 2019
Funding: secured (Inria), in the context of the HPC-Big Data Inria Project Lab
Application deadline: as early as possible, no later than June 30, 2019
Big Data Analytics (BDA) refers to the process of examining and extracting relevant knowledge from data sets that are so large, so varied in format and generated at such high speed that traditional storage and processing systems can no longer extract knowledge from them efficiently in an acceptable time. Potentially coming from a very large variety of sources (e.g., sensors from the Internet of Things, social networks, business applications), such data are curated, (sometimes partially) stored, processed and fed into analysis engines that build representations through data-driven models. These models in turn enable descriptive, predictive and prescriptive analytics that yield valuable insights for decision making.
High-Performance Computing (HPC) refers to the use of parallel processing techniques on high-end machines (supercomputers) to solve complex problems from science and industry which require extreme amounts of computation. The major focus is on performance, therefore HPC typically relies on extreme-scale aggregation of the fastest available hardware and software technologies for processing, communication and storage. Examples of representative HPC application areas include physics, biochemistry, materials science, environment and industrial design (e.g., for car manufacturing). Such applications typically rely on modelling and simulation of the evolution of a complex system, thanks to mathematical models expressing physics-based laws underlying that system.
Why HPC and BDA need each other. As HPC supercomputers evolve towards what is called the exascale, they enable simulations at increasingly high precision, which generate overwhelming amounts of data at an ever faster rate. The analysis of HPC-generated data is thus becoming a Big Data Analytics problem, which makes it natural to apply Big Data Analytics techniques to extract relevant insights from that data. At the same time, Big Data Analytics makes increasing use of high-performance computational capacities to perform extremely fast knowledge extraction (e.g., real-time high-frequency analysis) and to enable timely and precise decisions. HPC and BDA thus depend on each other more and more clearly, which has recently motivated intense efforts to identify how the two areas could leverage each other in a converged way.
Why Artificial Intelligence is a catalyst for HPC-Big Data convergence. Big Data Analytics increasingly relies on Machine Learning (ML), a subfield of Artificial Intelligence typically used for data classification and feature extraction. While traditional ML deals with tractable feature extraction, Deep Learning (DL) has recently attracted very high interest as a particularly efficient approach when classical machine learning is intractable. DL relies on neural network representations with a high number of layers, able to learn very complex representations and subsequently use them for predictions (inference). It uses dense linear algebra kernels and allows for lower-precision representation and arithmetic, for which general-purpose GPU (GPGPU) accelerators (increasingly available on HPC systems) are a relevant infrastructure. Consequently, DL-based Big Data Analytics generates workloads that naturally fit HPC systems, thereby acting as a catalyst for HPC-Big Data convergence.
The overall challenge: overcome diverged cultures, methodologies and tools. HPC’s initial focus was on computational performance for tightly-coupled workloads requiring fast computation and frequent communication. The associated software stacks and programming models (e.g., MPI, OpenMP) were therefore developed and optimized accordingly. In contrast, traditional Big Data workloads are loosely coupled and can typically be divided into a huge number of independent jobs. BDA frameworks followed a scale-out model where a very high number of standard-performance processing units can be aggregated and used together without substantial communication. This favored the usage of cloud systems as reference infrastructures for BDA. Map-Reduce (consisting of two stages – map and reduce – each of which can be executed efficiently by a large set of highly parallel tasks) emerged as the dominant programming model. It was further generalized to other operators – not only map and reduce – and to multi-stage processing – not only two stages – in frameworks like Spark and Flink.
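As an illustration of the two-stage model described above, here is a minimal single-process sketch in Python. It is not taken from any particular framework: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the word-count task are assumptions chosen for the example. A real BDA framework would distribute the map and reduce calls over many nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit one (key, value) pair per word of a single input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between the two stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(records):
    # Each map_phase call is independent, so it could run on a different node.
    mapped = [pair for record in records for pair in map_phase(record)]
    # Each key is reduced independently, so reducers can also run in parallel.
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = map_reduce(["big data", "Big compute", "data data"])
print(counts)
```

Because map tasks share no state and reduce tasks only need the values of their own key, the model needs little communication, which is what makes the scale-out approach practical.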
HPC and BDA thus underwent divergent evolutions motivated by different optimization goals. The major challenge posed by HPC-Big Data convergence comes precisely from the difficulty of "putting together" the inherited methodologies and tools as they are, since they were developed with diverging targets. For instance, preliminary experiments have shown that Big Data Analytics frameworks perform inefficiently on HPC systems and do not exploit the large optimization potential offered by the underlying high-performance hardware.
Focus of the thesis: the data processing level. In the high-performance computing (HPC) area, the need to get fast and relevant insights from massive amounts of data generated by extreme-scale computations led to the emergence of in situ and in transit processing approaches. They allow data to be visualized and processed in real time, interactively, as they are produced, as opposed to the traditional approach of transferring data off-site after the end of the computation for offline analysis. In the Big Data area, the search for fast, real-time analysis took a different form: stream-based processing, in support of intelligent, ML-based data analytics.
Thesis goal. This PhD thesis aims to propose an approach enabling HPC-Big Data convergence at the data processing level, by exploring alternative solutions to build a unified framework for extreme-scale data processing. The architecture of such a framework will leverage the extreme scalability demonstrated by the in situ/in transit data processing approaches that originated in the HPC area, in conjunction with the Big Data processing approaches that emerged in the BDA area (batch-based, streaming-based and hybrid). The high-level goal of this framework is to enable the use of a large spectrum of Big Data analytics techniques at extreme scales, to support precise real-time predictions and fast decision making.
Target use cases. The thesis will start by analyzing the needs of a concrete use-case scenario available to the project: the Pl@ntNet project from the Zenith team. It exhibits challenging data-analysis requirements in terms of data volumes and data processing velocity. The goal is to enable the online computation and visualization of species distribution models from the Pl@ntNet data stream. The platform currently generates millions of observations each month, but today the analysis of that data is only done occasionally, as an offline process. For instance, all plant observations that occurred in 2016 are crawled and analyzed by ecologists who apply various niche modeling approaches on top of that static data. The ultimate objective, however, is to allow more dynamic and more timely monitoring of species. For instance, one would want to study the flourishing dynamics of a species in real time, based on the analysis and visualization of the observations of the last few weeks or days.
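To make the idea of analyzing "the observations of the last few weeks or days" concrete, the following Python sketch keeps per-species observation counts over a sliding time window. It is only an illustration under stated assumptions: the class name SlidingWindowCounter, the (timestamp, species) record layout, the species names and the requirement that observations arrive in timestamp order are all hypothetical, and do not describe the actual Pl@ntNet data format or pipeline.

```python
from collections import Counter, deque
from datetime import datetime, timedelta


class SlidingWindowCounter:
    """Per-species counts over the most recent `window` of time.

    Assumes observations are added in non-decreasing timestamp order.
    """

    def __init__(self, window):
        self.window = window
        self.events = deque()    # (timestamp, species), oldest first
        self.counts = Counter()  # species -> count inside the window

    def add(self, timestamp, species):
        self.events.append((timestamp, species))
        self.counts[species] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop observations that are at least one full window old.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_species = self.events.popleft()
            self.counts[old_species] -= 1
            if self.counts[old_species] == 0:
                del self.counts[old_species]


w = SlidingWindowCounter(timedelta(days=7))
start = datetime(2019, 5, 1)
w.add(start, "Bellis perennis")
w.add(start + timedelta(days=3), "Papaver rhoeas")
w.add(start + timedelta(days=8), "Bellis perennis")
print(dict(w.counts))
```

Each new observation updates the counts incrementally instead of re-reading a static yearly snapshot, which is the difference between the stream-based analysis targeted here and the current offline process.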
In a second phase, we will analyze and address the requirements of a second use case, machine learning on coherent diffraction data, made available by the group of Tom Peterka at Argonne National Lab, with which the KerData team is collaborating.
Enabling techniques. In the process of designing the unified data processing framework, we will leverage in particular techniques for data processing already investigated by the participating teams as proof-of-concept software, validated in real-life environments:
- The Damaris framework for scalable, asynchronous I/O and in situ and in transit visualization and processing (developed at Inria, https://project.inria.fr/damaris/). Damaris has already demonstrated its scalability up to 16,000 cores on some of the top supercomputers of the Top500 list (including Titan, Jaguar and Kraken). Developments are currently in progress in a contractual framework between Total and Inria to use Damaris for in situ visualization of extreme-scale simulations at Total. For the purpose of this work, Damaris will have to be extended to support Big Data analytics plugins for data processing (e.g., based on the Flink and Spark engines and on their higher-level machine-learning libraries).
- The KerA approach for low-latency storage for stream processing (currently under development at Inria, in collaboration with UPM, in the framework of a contractual partnership between Inria and Huawei Munich). By eliminating storage redundancies between data ingestion and storage, KerA increased the throughput of stream processing in preliminary experiments. KerA is now a subject of interest for exploitation plans by Huawei.
The resulting framework will be integrated in a state-of-the-art data processing ecosystem (Spark or Flink). It will make it possible to apply advanced Big Data analytics tools (e.g., ML-based) in situ/in transit using stream-based techniques, to combine the results with historical data and thereby to derive insights from data in real time. These insights can further be used to steer the simulation.
Location and Mobility
The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique and will be co-advised by the Zenith team in Montpellier (south of France), where the student is expected to be hosted for long visits. It will include collaborations with two other IPL partners: the DataMove team in Grenoble and Argonne National Lab (which provides one of the target applications, and where the student is expected to be hosted for a 3-month internship). Rennes is the capital city of Brittany, in the western part of France. It is easy to reach thanks to the high-speed train line to Paris. Rennes is a dynamic, lively city and a major center for higher education and research: 25% of its population are students.
- Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf. Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O. In Proc. CLUSTER – IEEE International Conference on Cluster Computing, Sep 2012, Beijing, China. URL: https://hal.inria.fr/hal-00715252
- Matthieu Dorier, Robert Sisneros, Tom Peterka, Gabriel Antoniu, Dave Semeraro. Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework. Proc. LDAV – IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct 2013, Atlanta, USA. URL: https://hal.inria.fr/hal-00859603
- Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae. Towards a Unified Storage and Ingestion Architecture for Stream Processing. Second Workshop on Real-time & Stream Analytics in Big Data, co-located with the 2017 IEEE International Conference on Big Data, Dec 2017, Boston, USA. To appear. URL: https://hal.inria.fr/hal-01649207
- The Pl@ntNet project: https://plantnet.org
This PhD will be done in the context of the Inria Project Lab (IPL) HPC-BigData: High Performance Computing and Big Data. The goal of this IPL is to gather teams from the HPC, Big Data and Machine Learning (ML) areas to work at the intersection of these domains. External partners include: ATOS/Bull, Argonne National Lab (ANL), Laboratoire de Biochimie Théorique (LBT), CNRS, ESI-Group, Grid’5000.
Requirements of the candidate
- An excellent Master's degree in computer science or equivalent
- Strong knowledge of computer networks and distributed systems
- Knowledge of storage and (distributed) file systems
- Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
- Strong programming skills (e.g., C/C++, Java, Python)
- Very good communication skills in oral and written English
- Open-mindedness, strong integration skills and team spirit
- Work experience in the areas of Big Data management, Cloud computing or HPC is an advantage
For any questions about the subject, please contact Dr. Gabriel Antoniu, Dr. Alexandru Costan or Dr. Patrick Valduriez (see Contact information above).
Please email a cover letter, a CV, contact details for at least two professional references and copies of degree certificates to Dr. Gabriel Antoniu, Dr. Alexandru Costan and Dr. Patrick Valduriez. Incomplete applications will not be considered or answered. Then formally apply using this link.