OBJECTIVES

Objective 1 – Provide a unified data processing architecture for hybrid workflows including simulations and analytics

The goal is to enable the integration of traditional data processing techniques for Big Data analytics (batch-based and stream-based) with data processing techniques for simulations at extreme scales (in situ/in transit processing), in support of hybrid workflows combining simulations and analytics. These include ensemble runs, which require executing thousands of instances of the same parallel simulation with different parameters for sensitivity analysis, uncertainty quantification, or data assimilation. This will be achieved by enhancing batch and stream analytics with support for faster data ingestion, data persistence, and stateful data management, and by adding Big Data analytics capabilities to in situ processing frameworks. We will enhance the Damaris software developed by KerData for in situ processing, a successful result of the currently active Inria-ANL collaboration, to support in situ data analytics. We will consider use cases such as Machine Learning of Coherent Diffraction Data, proposed by ANL. The overall performance will be evaluated in terms of data access latency, and makespan and throughput of the overall data processing, under various extreme conditions (high throughput of simulated data, large numbers of real-time simulation requests).
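The in situ approach described above can be illustrated with a minimal sketch: analysis callbacks are registered with a running simulation and applied to each iteration's output while it is still in memory, instead of after a round-trip through the file system. All names below (`InSituPipeline`, `register_analysis`, `on_iteration`) are illustrative and do not correspond to Damaris's actual API.

```python
# Hypothetical sketch of an in situ analytics hook, loosely inspired by the
# callback model of in situ frameworks such as Damaris. Illustrative only.
import statistics


class InSituPipeline:
    """Runs analysis callbacks on simulation output without writing to disk."""

    def __init__(self):
        self._callbacks = []

    def register_analysis(self, fn):
        # Analytics code is plugged in by the workflow, not by the simulation.
        self._callbacks.append(fn)

    def on_iteration(self, step, field):
        # Called by the simulation at the end of each iteration; the raw
        # field data stays in memory and is handed to each analysis.
        return {fn.__name__: fn(step, field) for fn in self._callbacks}


def mean_energy(step, field):
    # A trivial stand-in for a real analytics kernel.
    return statistics.fmean(field)


pipeline = InSituPipeline()
pipeline.register_analysis(mean_energy)
results = pipeline.on_iteration(step=0, field=[1.0, 2.0, 3.0])
```

The key design point is that the simulation only sees the `on_iteration` entry point; what analytics run, and where (in situ on the same node, or in transit on staging nodes), is decided by the workflow configuration.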

Objective 2 – Efficient Data Management for hybrid workflows using transient storage systems

Whereas traditional HPC systems usually separate computational resources from storage (parallel file systems), upcoming HPC infrastructures aiming to support hybrid HPC and BDA workflows will provide local storage devices (such as SSDs or NVRAM) co-located with the compute nodes, to better support collocated computation and data analytics. This local storage allows users to dynamically deploy new types of distributed storage systems along with the applications, in a cloud-inspired fashion. Such a storage system, deployed only for the duration of an application execution, is called a transient storage system. We will explore ways to answer the following research question: “How can we efficiently manage the dynamic deployment of transient storage systems?” Optimizing the related data transfers for deployment and termination of transient storage systems (while ensuring data persistence) can speed up application deployment and termination.
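The lifecycle just described can be sketched in a few lines: a store lives on node-local devices for the duration of one run, and at teardown only the data explicitly marked persistent is flushed back to the parallel file system. This is a toy model under assumed semantics, not a real transient storage system; all names (`TransientStore`, `put`, `teardown`) are hypothetical.

```python
# Minimal sketch of a transient storage lifecycle. The dict `backing_pfs`
# stands in for the parallel file system; `local` stands in for node-local
# SSD/NVRAM contents. Illustrative names only.
class TransientStore:
    def __init__(self, backing_pfs):
        self.backing_pfs = backing_pfs   # survives the application run
        self.local = {}                  # exists only during the run
        self.persist_keys = set()

    def put(self, key, value, persistent=False):
        # Writes always land on fast local storage; persistence is a label,
        # not an immediate transfer.
        self.local[key] = value
        if persistent:
            self.persist_keys.add(key)

    def get(self, key):
        return self.local[key]

    def teardown(self):
        # Flush only persistent data, then release node-local storage.
        for key in self.persist_keys:
            self.backing_pfs[key] = self.local[key]
        self.local.clear()


pfs = {}
store = TransientStore(pfs)
store.put("checkpoint", b"state", persistent=True)
store.put("scratch", b"tmp")    # intermediate data, discarded at teardown
store.teardown()
```

The deployment/termination transfers mentioned in the research question correspond to the (here trivial) `teardown` flush: deciding what, when, and over which links to move this data is precisely the optimization target.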

Objective 3 – ML-based resource management for hybrid exascale workflows and adaptive data services

Resource management is critical to ensure efficient execution and resource usage in the complex context of exascale workflows, which can combine heavy computations, assimilation of observation data, on-line data analytics, and visualization. The problem is extremely difficult, being multi-criteria and relying on uncertain and partial data. Machine learning is emerging as an interesting approach to tackle this issue, with promising early results, for instance in batch scheduling. We will investigate novel ML-based approaches to integrate dynamic resource management at various levels of the exascale software stack:

  • New OS-level containerization approaches providing modern HPC applications with dynamic access to a wide range of low-level controls, to improve resource isolation and limit performance interference between application components. In particular, we will target the use case of data-intensive applications that use all available memory for dynamic data structures in their heaps, relying in particular on the workflows studied in Objective 1.
  • Autonomic data services that can adapt to data usage patterns and to the state of the system, making resources available for more effective use, lowering the risk of data loss, and providing more predictable performance. While some initial efforts have been made on autonomic data services, this type of adaptation is almost absent from HPC environments. We will explore solutions to key challenges in enabling composed autonomic data services. We will investigate methods for scalable monitoring, the generation of surrogate models for assessing performance adaptations, and means for automating reconfiguration. Prior work on understanding and predicting I/O behavior and on job scheduling using machine learning will be leveraged to develop appropriate telemetry and collection facilities that provide the data needed as input to achieve the target service level.
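To make the ML-based scheduling idea concrete, here is a deliberately simple sketch: a per-application running average of past runtimes (standing in for a real learned model) feeds a shortest-predicted-job-first ordering of a batch queue. All names (`RuntimePredictor`, `schedule`) are hypothetical; a production scheduler would also weigh resource availability, priorities, and predicted interference.

```python
# Toy illustration of ML-assisted batch scheduling. The "model" is just a
# running average of observed runtimes per application; it is a placeholder
# for any learned runtime predictor. Illustrative only.
from collections import defaultdict


class RuntimePredictor:
    def __init__(self):
        self.history = defaultdict(list)

    def observe(self, app, runtime):
        # Telemetry from completed runs is the training data.
        self.history[app].append(runtime)

    def predict(self, app, default=60.0):
        # Fall back to a default estimate when an app has never been seen.
        runs = self.history[app]
        return sum(runs) / len(runs) if runs else default


def schedule(queue, predictor):
    # Shortest-predicted-job-first: order jobs by predicted runtime.
    return sorted(queue, key=predictor.predict)


pred = RuntimePredictor()
pred.observe("simulation", 120.0)
pred.observe("simulation", 100.0)
pred.observe("analytics", 15.0)
order = schedule(["simulation", "analytics"], pred)
```

The same observe/predict/act loop generalizes to the autonomic data services above: monitoring produces observations, a surrogate model predicts the effect of a candidate adaptation, and an automated reconfiguration step acts on the prediction.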
