Prospiel Associate Team

Team members

France

  • Caroline Collange, Inria ALF, Principal investigator
  • Laure Gonnord, Inria ROMA
  • Fabrice Rastello, Inria CORSE
  • Maroua Maalej, Inria ROMA
  • Diogo Sampaio, Inria CORSE

Brazil

  • Fernando Pereira, UFMG, Principal investigator
  • Mariza Bigonha, UFMG
  • Anolan Milanés, CEFET-MG
  • Andrei Rimsa, CEFET-MG
  • Rubens Emílio, UFMG
  • Gabriel Poesia, UFMG

Scientific Program

The PROSPIEL project aims at optimizing parallel applications for high performance on new throughput-oriented architectures: GPUs and many-core processors. Traditionally, code optimization is driven by a program analysis performed either statically at compile-time, or dynamically at run-time. Static program analysis is fully reliable but often over-conservative. Dynamic analysis provides more accurate data, but faces strong execution time constraints and does not provide any guarantee.

By combining profiling-guided specialization of parallel programs with runtime checks for correctness, PROSPIEL seeks to capture the advantages of both static analysis and dynamic analysis. The project relies on the polytope model, a mathematical representation for parallel loops, as a theoretical foundation. It focuses on analyzing and optimizing performance aspects that become increasingly critical on modern parallel computer architectures: locality and regularity.

Research activities

Range Analysis of Pointers

Pointer arithmetic, a key feature of C and C++, is still not handled satisfactorily by existing alias analyses. Maroua Maalej, Laure Gonnord and Fernando Pereira designed a new alias analysis algorithm to solve this problem. The key insight of our approach is to combine alias analysis with classic range analysis on the interval domain. This combination lets us disambiguate fields within arrays and structs, effectively achieving more precision than traditional algorithms. To validate our technique, we have implemented it on top of the LLVM compiler. Tests on a large suite of benchmarks show that we can disambiguate several kinds of C idioms that current state-of-the-art analyses cannot deal with. In particular, we are two orders of magnitude more precise than LLVM's implementation of scalar-evolution-based points-to analysis, our most similar competitor. Furthermore, our analysis is very fast: we can analyze over one million assembly instructions in 10 seconds.
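As an illustration, the hypothetical C fragment below (our own sketch, not taken from the benchmarks) shows the kind of idiom that range information disambiguates: the two loops write disjoint slices of the same array, a fact that a purely symbolic alias analysis typically cannot prove.

    #include <stdio.h>

    /* Hypothetical example: the two loops write disjoint slices of buf,
     * because the index ranges [0, mid) and [mid, n) never overlap.
     * Combining alias analysis with range analysis on the interval
     * domain is enough to prove the stores independent. */
    void split_init(int *buf, int mid, int n) {
        for (int i = 0; i < mid; i++)
            buf[i] = 0;              /* writes buf[0 .. mid-1] */
        for (int j = mid; j < n; j++)
            buf[j] = 1;              /* writes buf[mid .. n-1] */
    }

    int main(void) {
        int buf[8];
        split_init(buf, 4, 8);
        for (int k = 0; k < 8; k++)
            printf("%d ", buf[k]);
        printf("\n");
        return 0;
    }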

Heterogeneous scheduling

Douglas Teixeira, Fernando Pereira and co-authors proposed a compiler analysis to place each parallel computation in a program on the CPU or the GPU. Four factors are decisive in determining how a program will fare on the graphics processing unit: (i) available parallelism; (ii) data size; (iii) memory access divergences; and (iv) control flow divergences. We proposed a suite of static analyses to estimate each one of these metrics for each loop in a program. Equipped with these numbers, our compiler adds OpenACC directives to each loop, which is transformed independently of the others. The result of this transformation is a new program bearing code that will be sent to the GPU and code that will remain on the CPU. We have implemented our static analyses in LLVM, and have used ipmacc, an open-source OpenACC compiler, to transform code automatically.
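The sketch below (our own example, not output produced by the tool) illustrates the kind of annotation the compiler inserts: a loop with abundant parallelism, a known data size, and no divergent memory accesses or branches receives an OpenACC directive and is offloaded to the GPU.

    #include <stdio.h>

    #define N 1024

    /* Illustrative example: a loop with abundant parallelism and
     * regular accesses, annotated with an OpenACC directive so that an
     * OpenACC compiler offloads it to the GPU. Without OpenACC support
     * the pragma is simply ignored. */
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(N, 3.0f, x, y);
        printf("y[0] = %f\n", y[0]);   /* expect 5.000000 */
        return 0;
    }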

Code Optimizations for GPUs

The increasing popularity of Graphics Processing Units (GPUs) has brought renewed attention to old problems related to the Single Instruction, Multiple Data (SIMD) execution model. One of these problems is the reconvergence of divergent threads. A divergence happens at a conditional branch when different threads disagree on the path to follow upon reaching this split point. Divergences may impose a heavy burden on the performance of parallel programs.
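The C fragment below is a conceptual sketch of ours, with each loop iteration standing in for one GPU thread: threads whose element is negative take one path and the others take the other, so a warp containing both kinds of elements must execute the two paths one after the other.

    /* Conceptual sketch: read each iteration as one GPU thread. If,
     * within a warp, some elements are negative and others are not,
     * the threads disagree at the branch and the hardware serializes
     * the two paths: that is a divergence. */
    void clamp_or_double(float *x, int n) {
        for (int tid = 0; tid < n; tid++) {   /* conceptually one thread per tid */
            if (x[tid] < 0.0f)                /* divergent if signs differ in a warp */
                x[tid] = 0.0f;
            else
                x[tid] = 2.0f * x[tid];
        }
    }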

Douglas do Couto, Caroline Collange and Fernando Pereira have proposed a compiler-level optimization to mitigate the performance loss due to branch divergence on GPUs. This optimization consists in merging function call sites located on the different paths that sprout from the same branch. We show that our optimization adds negligible overhead to the compiler. It does not slow down programs in which it is not applicable, and substantially accelerates those in which it is. As an example, we have been able to speed up the well-known SPLASH Fast Fourier Transform benchmark by 11%.
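The before/after sketch below (our own illustration, not code produced by the optimization) shows the idea: when both sides of a divergent branch call the same function, the branch is reduced to selecting the arguments and the two call sites are merged into one, so the threads reconverge before the call instead of executing it twice.

    #include <stdio.h>

    static float dist(float a, float b) { return a > b ? a - b : b - a; }

    /* Before: one call site on each side of the divergent branch. */
    float before(float x, float y, int flag) {
        if (flag)
            return dist(x, 0.0f);
        else
            return dist(0.0f, y);
    }

    /* After: the branch only selects the arguments; a single shared
     * call site lets divergent threads reconverge before calling. */
    float after(float x, float y, int flag) {
        float a = flag ? x : 0.0f;
        float b = flag ? 0.0f : y;
        return dist(a, b);
    }

    int main(void) {
        printf("%g %g\n", before(3.0f, 5.0f, 1), after(3.0f, 5.0f, 0));
        return 0;
    }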

Runtime Pointer Disambiguation

Fabrice Rastello, Fernando Pereira and co-authors introduced compiler techniques to disambiguate pointers at runtime. We provide different ways to determine at runtime when two memory locations can overlap. We then produce two versions of a code region: one that is aliasing-free, hence easy to optimize, and another that is not. Our checks let us safely branch to the optimizable region. We have applied these ideas on Polly-LLVM, a loop optimizer built on top of the LLVM compilation infrastructure. Our experiments indicate that our method is precise, effective and useful: we can disambiguate every pair of pointers in the loop-intensive PolyBench benchmark suite. The result of this precision is code quality: the binaries we generate are 10% faster than those that Polly-LLVM produces without our optimization, at the -O3 optimization level of LLVM.
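The fragment below is a hand-written sketch of this versioning scheme (the function names and the byte-interval test are ours, not the code Polly-LLVM generates): a cheap runtime check decides whether two arrays overlap, and branches either to an alias-free clone, which the compiler may optimize aggressively, or to a conservative fallback.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Alias-free clone: the restrict qualifiers license aggressive
     * optimization such as vectorization. */
    static void add_noalias(float *restrict dst, const float *restrict src,
                            size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }

    /* Conservative fallback, used when the regions may overlap. */
    static void add_mayalias(float *dst, const float *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }

    /* Versioning sketch: the runtime check tests whether the byte
     * intervals accessed through dst and src are disjoint, and branches
     * to the optimizable version if so. */
    void add(float *dst, const float *src, size_t n) {
        uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
        size_t bytes = n * sizeof(float);
        if (d + bytes <= s || s + bytes <= d)
            add_noalias(dst, src, n);    /* disjoint: fast path */
        else
            add_mayalias(dst, src, n);   /* possible overlap: safe path */
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
        add(a, b, 4);
        printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);
        return 0;
    }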

Publications

  • Péricles Alves, Fabian Gruber, Johannes Doerfert, Alexandros Lamprineas, Tobias Grosser, Fabrice Rastello, Fernando Pereira. Runtime Pointer Disambiguation. OOPSLA 2015.
  • Douglas do Couto, Caroline Collange, Fernando Pereira. Fusion of calling sites. SBAC-PAD 2015.
