Prospiel Associate Team

Team members

France

  • Caroline Collange, Inria ALF, Principal investigator
  • Laure Gonnord, Inria ROMA
  • Fabrice Rastello, Inria CORSE
  • Maroua Maalej, Inria ROMA
  • Diogo Sampaio, Inria CORSE

Brazil

  • Fernando Pereira, UFMG, Principal investigator
  • Mariza Bigonha, UFMG
  • Anolan Milanés, CEFET-MG
  • Andrei Rimsa, CEFET-MG
  • Rubens Emílio, UFMG
  • Gabriel Poesia, UFMG

Scientific Program

The PROSPIEL project aims at optimizing parallel applications for high performance on new throughput-oriented architectures: GPUs and many-core processors. Traditionally, code optimization is driven by a program analysis performed either statically at compile-time, or dynamically at run-time. Static program analysis is fully reliable but often over-conservative. Dynamic analysis provides more accurate data, but faces strong execution time constraints and does not provide any guarantee.

By combining profiling-guided specialization of parallel programs with runtime checks for correctness, PROSPIEL seeks to capture the advantages of both static analysis and dynamic analysis. The project relies on the polytope model, a mathematical representation for parallel loops, as a theoretical foundation. It focuses on analyzing and optimizing performance aspects that become increasingly critical on modern parallel computer architectures: locality and regularity.
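As a minimal illustration of the kind of loop the polytope model captures (the code below is a generic sketch, not taken from the project's benchmarks), consider a nest whose bounds and array subscripts are affine functions of the loop counters and of a size parameter:

    /* Generic sketch: an affine loop nest.  The iteration domain is the
     * polytope { (i, j) | 0 <= i < n, 0 <= j <= i }, and the accesses
     * a[i][j], b[i] and b[j] are affine in the counters and in the
     * parameter n, so a polyhedral analysis can reason exactly about
     * which iterations touch which memory cells. */
    void lower_triangle(int n, double a[n][n], const double b[n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j <= i; j++)
                a[i][j] = b[i] * b[j];
    }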

Research activities

Sparse static analyses for memory

Alias analysis is one of the most fundamental techniques that compilers use to optimize languages with pointers. However, in spite of all the attention that this topic has received, the current state-of-the-art approaches inside compilers still face challenges regarding precision and speed. In particular, pointer arithmetic, a key feature in C and C++, is yet to be handled satisfactorily.

The collaborative work of Vitor Paisante, Maroua Maalej, Leonardo Barbosa, Laure Gonnord and Fernando Pereira, published at CGO 2016, presents a new alias analysis algorithm to solve this problem. The key insight of our approach is to combine alias analysis with classic range analysis on the interval domain. This combination lets us disambiguate fields within arrays and structs, effectively achieving more precision than traditional algorithms. To validate our technique, we have implemented it on top of the LLVM compiler. Tests on a large suite of benchmarks show that we can disambiguate several kinds of C idioms that current state-of-the-art analyses cannot deal with. In particular, we can disambiguate 1.35x more queries than the alias analysis currently available in LLVM. Furthermore, our analysis is very fast: it processes over one million assembly instructions in 10 seconds.
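A hypothetical example of the kind of C idiom that range information helps with is sketched below (the function and names are illustrative, not drawn from the benchmarks): two pointers are derived from the same base by pointer arithmetic, and interval reasoning on the indices shows that they touch disjoint regions.

    #include <stddef.h>

    /* Hypothetical idiom: two pointers derived from the same base by
     * pointer arithmetic.  A classic alias analysis sees "same allocation"
     * and gives up; range analysis proves that lo[i] covers indices
     * [0, mid) and hi[j] covers [mid, n), so the stores never overlap. */
    void split_init(int *v, size_t n)
    {
        size_t mid = n / 2;
        int *lo = v;         /* writes v[0 .. mid-1] */
        int *hi = v + mid;   /* writes v[mid .. n-1] */

        for (size_t i = 0; i < mid; i++)
            lo[i] = 0;
        for (size_t j = 0; j + mid < n; j++)
            hi[j] = 1;
    }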

Our CGO 2017 paper contributes further to addressing the pointer aliasing challenge. We start from an obvious, yet unexplored, observation: if one pointer is strictly less than another, they cannot alias. Motivated by this remark, we use the abstract interpretation framework to build strict less-than relations between pointers. To this end, we construct a program representation that bestows the Static Single Information (SSI) property onto our dataflow analysis. SSI gives us an efficient sparse algorithm which, once seen as a form of abstract interpretation, is correct by construction. We have implemented our static analysis in LLVM. It runs in time linear in the number of program variables and, depending on the benchmark, it can be as much as six times more precise than the pointer disambiguation techniques already in place in that compiler.
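A small, hypothetical sketch of the observation (the function below is illustrative only): in the loop, both pointers advance in lockstep and the relation p < q holds at every point where both are live, so the two accesses cannot refer to the same location.

    /* Hypothetical example: p starts at a and q at a + 1, and both are
     * advanced by one element per iteration, so the strict less-than
     * relation p < q holds throughout the loop and the load *q and the
     * store *p can be proven not to alias. */
    void shift_left(int *a, int n)
    {
        int *p = a;          /* p = a         */
        int *q = a + 1;      /* q = a + 1 > p */
        for (int i = 0; i < n - 1; i++) {
            *p = *q;         /* never the same cell, since p < q */
            p++;
            q++;
        }
    }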

Our research report proposes an adaptation of the two previous analyses in a more general framework in which we use three different checks to verify whether two pointers may alias. All these checks emerge naturally from the data structures that we build in the effort to find less-than relations between variables. Our new alias analysis increases the ability of LLVM to disambiguate pointers by a significant margin. As an example, we increase by almost 2.5x the number of pairs of pointers that we distinguish in SPEC’s hmmer and bzip2. In SPEC’s gcc, this growth reaches 3.5x. Furthermore, contrary to several algebraic pointer disambiguation techniques, our implementation scales up to programs with millions of assembly instructions.

Heterogeneous scheduling

Douglas Teixeira, Fernando Pereira and co-authors proposed a compiler analysis to place each parallel computation in a program on the CPU or the GPU. Four factors are decisive in determining how a program will fare on the graphics processing unit: (i) available parallelism; (ii) data size; (iii) memory access divergences; and (iv) control flow divergences. We proposed a suite of static analyses to estimate each one of these metrics for each loop in a program. Equipped with these numbers, our compiler adds OpenACC directives to each loop, which is transformed independently of the others. The result of this transformation is a new program bearing code that will be sent to the GPU and code that will remain on the CPU. We have implemented our static analyses in LLVM, and have used ipmacc, an open-source OpenACC compiler, to transform code automatically.
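The result of the transformation can be pictured as below (a hand-written sketch, not the output of the actual tool): a loop judged profitable for the GPU receives an OpenACC directive, while unprofitable loops are simply left as plain CPU code.

    /* Hypothetical sketch of the kind of annotation the compiler emits.
     * This loop has ample parallelism, a large trip count and regular,
     * non-divergent accesses, so it is offloaded with an OpenACC
     * directive; a small or divergent loop would stay on the CPU. */
    void saxpy(int n, float alpha, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }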

Code optimizations for GPUs

The increasing popularity of Graphics Processing Units (GPUs) has brought renewed attention to old problems related to the Single Instruction, Multiple Data (SIMD) execution model. One of these problems is the reconvergence of divergent threads. A divergence happens at a conditional branch when different threads disagree on the path to follow upon reaching this split point. Divergences may impose a heavy burden on the performance of parallel programs.

Douglas de Couto, Caroline Collange and Fernando Pereira have proposed a compiler-level optimization to mitigate the performance loss due to branch divergence on GPUs. This optimization consists in merging function call sites located on different paths that sprout from the same branch. We show that our optimization adds negligible overhead to the compiler. It does not slow down programs in which it is not applicable, and substantially accelerates those in which it is. As an example, we have been able to speed up the well-known SPLASH Fast Fourier Transform benchmark by 11%.
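In outline, the transformation rewrites the divergent pattern below into the merged form that follows it (a hypothetical, CPU-side illustration; the actual optimization operates on GPU code at the compiler level).

    /* Hypothetical illustration of call-site merging.  On a GPU, threads
     * taking different sides of the branch would execute the two calls
     * to f one after the other; merging the call sites lets all threads
     * enter f together, and only the argument differs. */
    int f(int x) { return x * 2; }

    int before(int cond, int a, int b)
    {
        if (cond)
            return f(a);      /* call site 1 */
        else
            return f(b);      /* call site 2 */
    }

    int after(int cond, int a, int b)
    {
        int arg = cond ? a : b;   /* select the argument under divergence */
        return f(arg);            /* single, convergent call site         */
    }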

Runtime pointer disambiguation

Fabrice Rastello, Fernando Pereira and co-authors introduced compiler techniques to disambiguate pointers at runtime. We provide different ways to determine at runtime when two memory locations can overlap. We then produce two versions of a code region: one that is aliasing-free, hence easy to optimize, and another that is not. Our checks let us safely branch to the optimizable region. We have applied these ideas to Polly-LLVM, a loop optimizer built on top of the LLVM compilation infrastructure. Our experiments indicate that our method is precise, effective and useful: we can disambiguate every pair of pointers in the loop-intensive PolyBench benchmark suite. The result of this precision is code quality: the binaries we generate are 10% faster than those that Polly-LLVM produces without our optimization, at the -O3 optimization level of LLVM.
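The idea can be sketched as follows (hypothetical source-level code, not the Polly-LLVM implementation, which versions the loop internally): a cheap runtime test for overlap guards a clone of the loop that the compiler may optimize as if the pointers never aliased.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of runtime pointer disambiguation by code
     * versioning.  The check runs once; when [dst, dst+n) and
     * [src, src+n) do not overlap, control branches to an alias-free
     * clone of the loop that can be vectorized aggressively. */
    void scale_copy(float *dst, const float *src, size_t n, float k)
    {
        if ((uintptr_t)(dst + n) <= (uintptr_t)src ||
            (uintptr_t)(src + n) <= (uintptr_t)dst) {
            /* alias-free clone: safe to treat as restrict-qualified */
            for (size_t i = 0; i < n; i++)
                dst[i] = k * src[i];
        } else {
            /* conservative version kept for the overlapping case */
            for (size_t i = 0; i < n; i++)
                dst[i] = k * src[i];
        }
    }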

Function Call Re-Vectorization

This work was done during the internship of Rubens Emilio in Rennes in collaboration with Sylvain Collange and Fernando Pereira.
SPMD programming languages for SIMD hardware, such as C for CUDA, OpenCL or ISPC, have helped increase the programmability of SIMD accelerators and graphics processing units. However, SPMD languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving the SPMD abstraction, we introduce the notion of Function Call Re-Vectorization (CREV). CREV allows changing the dimension of vectorization during the execution of an SPMD kernel, and exposes it as a nested parallel kernel call. CREV affords programmability close to dynamic parallelism, a feature that allows the invocation of kernels from inside kernels, but at much lower cost. In our PPoPP 2017 paper, we present a formal semantics of CREV and an implementation of it in the ISPC compiler. To validate our idea, we have used CREV to implement some classic algorithms, including string matching, depth-first search and Bellman-Ford, with minimal effort. These algorithms, once compiled by ISPC to Intel vector instructions, are as fast as state-of-the-art implementations, yet much simpler. As an example, our straightforward implementation of string matching beats the Knuth-Morris-Pratt algorithm by 12%.

Multi-dimensional symbolic range analysis

Maroua Maalej, Laure Gonnord and Caroline Collange are continuing ongoing work on symbolic range analysis, generalizing it to identify common patterns of GPU programs. In particular, we are working on a compiler analysis for GPU languages like CUDA and OpenCL that computes precise parameterized bounds on the addresses used in multi-dimensional array accesses. An original and promising direction we are exploring is to decouple dimensions: we compute modulo relations between variables for each dimension separately, then combine this information to derive accurate modulo range relations between variables.
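A hypothetical instance of the pattern we target, written as plain C for illustration (in a real CUDA or OpenCL kernel, x and y would be thread indices): the flattened subscript of a 2D access combines a per-dimension range fact with a per-dimension modulo fact.

    /* Hypothetical 2D access pattern.  Per-dimension reasoning gives
     * 0 <= x < width and 0 <= y < height, and the flattened index
     * y * width + x is congruent to x modulo width; combining the two
     * facts shows that distinct (x, y) pairs write distinct cells of
     * out, and that every access stays within width * height elements. */
    void fill_grid(float *out, int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                out[y * width + x] = (float)(x + y);
    }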

Qubit allocation for quantum computers

In May 2016, IBM Research made a quantum processor available in the cloud to the general public. The possibility of programming an actual quantum device has elicited much enthusiasm. Yet, quantum programming still lacks the compiler support that modern programming languages enjoy today. To use the IBM machine, programmers must design low-level circuits. In particular, they must map logical qubits onto physical qubits that need to obey connectivity constraints. This task resembles the early days of programming, in which software was built in machine languages. In collaboration with Vinícius Fernandes dos Santos, professor at UFMG, and Marcos Yukio Siraichi, MSc student at UFMG, Fernando Pereira and Sylvain Collange have formally introduced the qubit allocation problem and provided an exact solution to it. This optimal algorithm handles the simple quantum machinery available today; however, it cannot scale up to the more complex architectures scheduled to appear. Thus, we also provide a heuristic solution to qubit allocation, which is faster than the current solutions already implemented to deal with this problem.
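As a toy illustration of the connectivity constraint only (hypothetical code, unrelated to our actual allocation algorithms): given a coupling graph listing which pairs of physical qubits may interact, a candidate mapping of logical to physical qubits is feasible only if every two-qubit gate of the circuit lands on a coupled pair.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical toy model of the qubit allocation constraint.
     * coupling[i][j] is true when physical qubits i and j can interact;
     * map[q] is the physical qubit assigned to logical qubit q. */
    #define PHYS 5

    static const bool coupling[PHYS][PHYS] = {
        /* made-up linear connectivity: 0-1, 1-2, 2-3, 3-4 */
        {0,1,0,0,0}, {1,0,1,0,0}, {0,1,0,1,0}, {0,0,1,0,1}, {0,0,0,1,0}
    };

    struct cnot { int control, target; };   /* logical qubit indices */

    static bool mapping_is_valid(const int *map,
                                 const struct cnot *circuit, int ngates)
    {
        for (int g = 0; g < ngates; g++) {
            int p = map[circuit[g].control];
            int q = map[circuit[g].target];
            if (!coupling[p][q])
                return false;   /* this gate needs swaps or a new mapping */
        }
        return true;
    }

    int main(void)
    {
        struct cnot circuit[] = { {0, 1}, {1, 2}, {0, 2} };
        int map[] = { 1, 0, 2 };   /* logical 0 -> physical 1, etc. */
        printf("mapping %s\n",
               mapping_is_valid(map, circuit, 3) ? "valid" : "invalid");
        return 0;
    }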

Publications

  • Péricles Alves, Fabian Gruber, Johannes Doerfert, Alexandros Lamprineas, Tobias Grosser, Fabrice Rastello, Fernando Pereira. Runtime Pointer Disambiguation. OOPSLA 2015.
  • Douglas do Couto, Caroline Collange, Fernando Pereira. Fusion of calling sites. SBAC-PAD 2015.
  • Vitor Paisante, Maroua Maalej, Leonardo Barbosa, Laure Gonnord, Fernando Magno Quintão Pereira. Symbolic Range Analysis of Pointers. CGO 2016, pp. 791-809.
  • Rubens Emílio Alves Moreira, Caroline Collange, Fernando Magno Quintão Pereira. Definição semântica de blocos everywhere para programação SIMD. Brazilian Symposium on Programming Languages (SBLP), 2016.
  • Maroua Maalej, Vitor Paisante, Fernando Magno Quintão Pereira, Laure Gonnord. Combining Range and Inequality Information for Pointer Disambiguation. Research Report RR-9076, ENS Lyon; CNRS; Inria, December 2016.
  • Laure Gonnord, Maroua Maalej, Fernando Pereira, Vitor Paisante. Pointer Disambiguation via Strict Inequalities. CGO 2017.
  • Rubens Emílio, Caroline Collange, Fernando Pereira. Function Call Re-Vectorization. PPoPP 2017.