2020 Scientific Progress
Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality – With the demise of Moore’s Law, future hardware performance improvements are expected to come from specialization, such as fixed-function units that operate on narrow data types. One example is the NVIDIA Volta V100, which offers FP32/FP16 mixed-precision 4×4 matrix multiply units. NVIDIA GPUs featuring Tensor Cores will be deployed on all large-scale US Department of Energy (DOE) installations under procurement in the 2020-2022 time frame. This announcement has already motivated work on algorithm-specific refactoring for mixed-precision iterative refinement solvers [1].
We have developed a methodology for precision tuning of full applications. Precision-tuning techniques must select a search space composed of either variables or instructions and provide a scalable search strategy. We argue for an instruction-based search space; our method exploits dynamic program information based on call stacks, as well as the iterative nature of scientific codes combined with temporal locality. We applied the methodology to scientific codes written in a combination of Python, CUDA, C++ and Fortran, tuning calls to the exp functions of the math library. Iterative search refinement always reduces the search complexity and the number of steps to solution, and dynamic program information increases search efficacy. Using this approach, we obtain application runtime improvements of up to 27% [2]. We plan to use this work to optimize a cardiac application developed at Inria Bordeaux.
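To make the idea of an instruction-based search space concrete, the following is a minimal sketch in Python. It is not the method of [2]: it uses a plain greedy search, emulates FP32 by rounding through the standard `struct` module, and runs on a toy iterative kernel. The names `to_fp32`, `exp_at`, `kernel` and `tune`, the call-site labels, and the tolerance are all hypothetical and chosen only for illustration.

```python
import math
import struct

def to_fp32(x):
    """Round an FP64 Python float to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def exp_at(site, x, lowered):
    """Evaluate exp() in emulated FP32 if this call site was lowered, else in FP64."""
    if site in lowered:
        return to_fp32(math.exp(to_fp32(x)))
    return math.exp(x)

def kernel(lowered, steps=100):
    """Toy iterative code with two exp call sites (labels are hypothetical)."""
    total = 0.0
    for i in range(steps):
        x = (i + 1) / steps
        # exp(tiny) - 1 cancels catastrophically once exp is rounded to FP32.
        growth = (exp_at("step>growth>exp", 1e-9 * x, lowered) - 1.0) * 1e9
        # This term is scaled down, so FP32 rounding barely changes the result.
        damping = 1e-6 * exp_at("step>damping>exp", -x, lowered)
        total += growth + damping
    return total

def tune(sites, tolerance=1e-9):
    """Greedy search: keep a site in FP32 only if the relative error stays in bounds."""
    reference = kernel(frozenset())  # all call sites in FP64
    lowered = set()
    for site in sites:
        candidate = lowered | {site}
        error = abs(kernel(candidate) - reference) / abs(reference)
        if error <= tolerance:
            lowered = candidate
    return lowered

if __name__ == "__main__":
    print(tune(["step>growth>exp", "step>damping>exp"]))
```

In this toy, the search keeps only the damping call site in FP32: lowering the growth site rounds exp of a tiny argument to exactly 1.0, which destroys the result. A search over call sites rather than variables can separate two calls to the same function that have very different sensitivity.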
Hiding the latency of MPI communications – The Message Passing Interface (MPI) defines multiple functions to perform communications over distributed architectures. Among these operations, the nonblocking ones allow communications to progress asynchronously, thus enabling the overlap of communication with computation. Despite the potential performance gain, developers still prefer blocking communications because nonblocking operations are more complex to use. During the first year of his PhD, Van Man Nguyen developed a method that automatically transforms blocking MPI calls into their nonblocking counterparts and optimizes their overlap potential through extensive code motion [3]. The transformations have been validated through a static analysis integrated in the PARCOACH tool [4]. We plan to use this work to optimize applications used at LBNL.
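The kind of transformation described above can be sketched on a toy program representation. The Python model below is an assumption-laden illustration, not the algorithm of [3]: it handles straight-line code with a single communication, ignores control flow and message-matching order, and abbreviates the MPI calls to pseudo-calls with hypothetical argument lists. `Stmt`, `conflicts`, `widen_overlap` and `example` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Stmt:
    text: str                       # abbreviated pseudo-call, not a full MPI signature
    reads: frozenset = frozenset()  # buffers the statement reads
    writes: frozenset = frozenset() # buffers the statement writes

def conflicts(stmt, buf, is_send):
    """True if stmt must stay ordered with respect to a pending transfer of buf."""
    if is_send:
        return buf in stmt.writes   # a pending send only forbids writes to buf
    return buf in stmt.reads or buf in stmt.writes  # a pending receive forbids any access

def widen_overlap(program, index, buf, is_send=True):
    """Split the blocking call at `index` into start/wait and move them apart."""
    start = index
    while start > 0 and not conflicts(program[start - 1], buf, is_send):
        start -= 1                  # hoist the nonblocking start upward
    end = index
    while end + 1 < len(program) and not conflicts(program[end + 1], buf, is_send):
        end += 1                    # sink the wait downward
    op = "MPI_Isend" if is_send else "MPI_Irecv"
    begin = Stmt(f"{op}({buf}, &req)")
    wait = Stmt("MPI_Wait(&req)")
    overlapped = program[start:index] + program[index + 1:end + 1]
    return program[:start] + [begin] + overlapped + [wait] + program[end + 1:]

def example():
    program = [
        Stmt("fill(a)", writes=frozenset({"a"})),
        Stmt("b = f(c)", reads=frozenset({"c"}), writes=frozenset({"b"})),
        Stmt("MPI_Send(a)", reads=frozenset({"a"})),
        Stmt("d = g(b)", reads=frozenset({"b"}), writes=frozenset({"d"})),
        Stmt("a = h(d)", reads=frozenset({"d"}), writes=frozenset({"a"})),
    ]
    return [s.text for s in widen_overlap(program, 2, "a")]

if __name__ == "__main__":
    print("\n".join(example()))
```

In the example, the send of `a` starts right after `fill(a)` and completes just before `a` is overwritten, so the two independent computations in between can overlap with the transfer.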
Identifying the Benefit of Communication Aggregation in HPC Applications – Communication aggregation is an important optimization in applications that communicate at fine granularity. We are studying the benefit of aggregation in technologically important applications, using proxies that represent the patterns of communication and computation arising in production applications. The work is part of Scott Baden’s Inria International Chair.
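Why aggregation helps at fine granularity can be illustrated with the standard latency-bandwidth cost model, in which each message costs a fixed latency plus a per-byte term. The Python sketch below is not one of the proxies used in this study. The `Aggregator` class, the `cost` function, the 1024-byte threshold, and the latency and bandwidth constants are all hypothetical values chosen for illustration.

```python
from collections import defaultdict

class Aggregator:
    """Buffer small payloads per destination and send them in batches."""

    def __init__(self, send, threshold_bytes):
        self.send = send                      # callback: send(dest, payload_bytes)
        self.threshold = threshold_bytes
        self.buffers = defaultdict(bytearray)

    def put(self, dest, payload):
        buf = self.buffers[dest]
        buf += payload                        # a real implementation would also frame each item
        if len(buf) >= self.threshold:
            self.flush(dest)

    def flush(self, dest):
        buf = self.buffers.pop(dest, None)
        if buf:
            self.send(dest, bytes(buf))

    def flush_all(self):
        for dest in list(self.buffers):
            self.flush(dest)

def cost(message_sizes, alpha=1e-6, beta=1e-9):
    """Latency-bandwidth estimate: each message costs alpha + beta * size seconds."""
    return sum(alpha + beta * n for n in message_sizes)

def demo():
    fine, coarse = [], []
    agg = Aggregator(lambda dest, data: coarse.append(len(data)), threshold_bytes=1024)
    for i in range(1000):
        payload = bytes(8)                    # one 8-byte update
        fine.append(len(payload))             # unaggregated: one message per update
        agg.put(i % 4, payload)               # aggregated: batched per destination
    agg.flush_all()
    return fine, coarse

if __name__ == "__main__":
    fine, coarse = demo()
    print(len(fine), cost(fine))
    print(len(coarse), cost(coarse))
```

Both variants move the same number of bytes, but the aggregated one pays the per-message latency 8 times instead of 1000 times. Under these assumed constants, latency dominates the unaggregated cost.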
Correctness of MPI 3.0 one-sided communication – Célia Tassadit Ait Kaci is currently developing an online dynamic analysis to detect data races in MPI RMA programs.
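To illustrate what such an analysis looks for, the Python sketch below flags conflicting one-sided accesses within a single fence-delimited epoch. It uses a deliberately simplified rule: two accesses race if they touch overlapping bytes of the same target window and at least one of them is a put. This is neither the full MPI memory model nor the analysis under development. `Access`, `overlap` and `EpochChecker` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    origin: int   # rank issuing the operation
    target: int   # rank whose window is accessed
    kind: str     # "put" (remote write) or "get" (remote read)
    offset: int   # first byte touched in the target window
    length: int   # number of bytes touched

def overlap(a, b):
    """True if both accesses touch at least one common byte of the same window."""
    return (a.target == b.target
            and a.offset < b.offset + b.length
            and b.offset < a.offset + a.length)

class EpochChecker:
    """Check each new access against those already seen in the current epoch."""

    def __init__(self):
        self.pending = []   # accesses issued since the last fence
        self.races = []     # conflicting pairs found so far

    def record(self, access):
        for earlier in self.pending:
            if overlap(earlier, access) and "put" in (earlier.kind, access.kind):
                self.races.append((earlier, access))
        self.pending.append(access)

    def fence(self):
        """A fence closes the epoch: earlier accesses can no longer conflict."""
        self.pending.clear()

def demo():
    checker = EpochChecker()
    checker.record(Access(0, 2, "put", 0, 8))
    checker.record(Access(1, 2, "get", 4, 8))   # overlaps the put: reported
    checker.fence()
    checker.record(Access(1, 2, "get", 0, 8))
    checker.record(Access(0, 2, "get", 0, 8))   # two reads: not a race
    checker.record(Access(0, 2, "put", 16, 4))  # disjoint bytes: not a race
    return checker.races

if __name__ == "__main__":
    for first, second in demo():
        print(first, second)
```

The check runs as each access is recorded, which is the sense in which the sketch is online. A practical tool would also need to track local loads and stores, accumulate operations, and the other RMA synchronization modes.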
[1] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18. IEEE Press, 2018.
[2] Hugo Brunie, Costin Iancu, Khaled Z. Ibrahim, Philip Brisk, and Brandon Cook. Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality. Supercomputing 2020. To appear.
[3] Van Man Nguyen, Emmanuelle Saillard, Julien Jaeger, Denis Barthou, and Patrick Carribault. Automatic Code Motion to Extend MPI Nonblocking Overlap Window. First Workshop on Compiler-assisted Correctness Checking and Performance Optimization for HPC, 2020. To appear.
[4] Van Man Nguyen, Emmanuelle Saillard, Julien Jaeger, Denis Barthou, and Patrick Carribault. PARCOACH Extension for Static MPI Nonblocking and Persistent Communication Validation. Fourth International Workshop on Software Correctness for HPC Applications, 2020. To appear.