My research interests are GPU-inspired throughput-oriented parallel architectures, quantum computers and emerging computer architectures in general. My primary focus is on computer architecture and compiler issues. I am also interested in software optimization and computer arithmetic.
INRIA – Centre de recherche Rennes – Bretagne Atlantique
Campus de Beaulieu
35042 RENNES Cedex
E-mail: caroline.collange [at] inria.fr
I aim to improve both the efficiency and the programmability of throughput-oriented processors through architecture changes and compiler techniques.
Compiler optimization for quantum computers
The first generation of usable quantum computers is now a reality. It enables experimental computer science in the quantum computing research field.
We propose compiler analyses and transformations to optimize quantum programs for existing and near-term quantum computers. With only a few tens or even hundreds of qubits, these computers are still too small to afford general error correction techniques. Instead, software and compilers have to cope with noise. Compiler optimization is critical, not only to minimize the time to solution, but more importantly to maximize the accuracy of results.
- Qubit allocation is the process of mapping the logical qubits of a quantum program onto physical qubits while respecting hardware constraints. It is the quantum equivalent of register allocation. Our CGO 2018 paper formally introduces the qubit allocation problem and provides an exact solution to it. This optimal algorithm handles the simple quantum machines available today; however, it does not scale to the more complex architectures expected to appear. We therefore also provide a heuristic solution to qubit allocation, which is faster than the solutions currently implemented for this problem.
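To make the qubit allocation problem concrete, here is a minimal Python sketch. It is not the algorithm from the CGO 2018 paper: it assumes a simplified cost model (the number of two-qubit gates whose operands land on non-adjacent physical qubits), an undirected coupling graph, and a fixed mapping with no swaps. The function name `allocate_qubits`, the 4-qubit line device and the example program are all hypothetical. The search is exhaustive, which also illustrates why exact approaches stop scaling as devices grow.

```python
from itertools import permutations

def allocate_qubits(num_logical, cnots, coupling):
    """Exhaustively search for the mapping with the fewest non-adjacent CNOTs.

    num_logical: number of logical qubits, numbered 0..num_logical-1
    cnots:       list of (control, target) pairs over logical qubits
    coupling:    set of undirected physical edges, each a frozenset {p, q}
    """
    physical = sorted({p for edge in coupling for p in edge})
    best_mapping, best_cost = None, None
    for perm in permutations(physical, num_logical):
        # perm[l] is the physical qubit assigned to logical qubit l
        cost = sum(1 for c, t in cnots
                   if frozenset((perm[c], perm[t])) not in coupling)
        if best_cost is None or cost < best_cost:
            best_mapping, best_cost = perm, cost
    return best_mapping, best_cost

# Hypothetical 4-qubit device whose coupling graph is a line: 0 - 1 - 2 - 3
line = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
# Hypothetical program: logical qubit 0 interacts with logical qubits 1 and 2
program = [(0, 1), (0, 2)]

mapping, cost = allocate_qubits(3, program, line)
print(mapping, cost)  # logical qubit 0 ends up on an interior physical qubit, cost 0
```

The number of candidate mappings grows factorially with the number of qubits, so a search like this is only practical for very small devices.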
General-purpose SIMT: bridging the gap between SMT CPU and SIMT GPU architectures
I design hardware schemes that allow the execution of existing scalar instruction sets on GPU-like architectures. Existing mechanisms that perform dynamic vectorization in the Single Instruction Multiple Thread (SIMT) model on current GPUs rely on explicit annotations in the instruction set and on hardware-based stack structures. I have shown that an alternative constant-space mechanism can enable SIMT execution on conventional scalar instruction sets. It allows individual threads to be managed, suspended, resumed or migrated independently, lifting the main barrier separating SIMT architectures from general-purpose multithreaded processors.
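As an illustration of stack-free SIMT execution, the Python sketch below simulates one simple policy that needs only constant space per thread. Each thread keeps its own program counter, and at every step the threads sharing the lowest PC execute together. This min-PC policy is an assumption chosen for illustration and is not necessarily the mechanism referred to above. The toy instruction set (`addi`, `beqz`, `jmp`, `halt`) and the function `run_min_pc` are hypothetical. The policy relies on reconvergence points being laid out at higher addresses than the branches that lead to them.

```python
def run_min_pc(program, num_threads, init_regs):
    pcs = [0] * num_threads            # one PC per thread: no divergence stack
    regs = [init_regs(t) for t in range(num_threads)]
    done = [False] * num_threads
    trace = []                         # (pc, number of threads executing together)
    while not all(done):
        # Pick the lowest PC among live threads; all threads at that PC run in lockstep.
        pc = min(pcs[t] for t in range(num_threads) if not done[t])
        group = [t for t in range(num_threads) if not done[t] and pcs[t] == pc]
        trace.append((pc, len(group)))
        op, *args = program[pc]
        for t in group:
            if op == "addi":
                dst, src, imm = args
                regs[t][dst] = regs[t][src] + imm
                pcs[t] = pc + 1
            elif op == "beqz":
                reg, target = args
                pcs[t] = target if regs[t][reg] == 0 else pc + 1
            elif op == "jmp":
                pcs[t] = args[0]
            elif op == "halt":
                done[t] = True
    return regs, trace

# if (p != 0) r += 20; else r += 10;  then r += 1 for every thread
program = [
    ("beqz", "p", 3),       # 0: threads with p == 0 jump to the else-branch
    ("addi", "r", "r", 20), # 1: then-branch
    ("jmp", 4),             # 2
    ("addi", "r", "r", 10), # 3: else-branch
    ("addi", "r", "r", 1),  # 4: reconvergence point
    ("halt",),              # 5
]

regs, trace = run_min_pc(program, 4, lambda t: {"p": t % 2, "r": 0})
print([r["r"] for r in regs])  # [11, 21, 11, 21]
print(trace)                   # [(0, 4), (1, 2), (2, 2), (3, 2), (4, 4), (5, 4)]
```

The trace shows the four threads splitting into two groups of two after the branch and running together again at instruction 4, with no annotation in the program and no stack in the simulator state.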
My ongoing research along this direction includes:
- DITVA: Dynamic Inter-Thread Vectorization.
DITVA starts from an SMT in-order core and incorporates a SIMT execution mode, retaining full compatibility with existing SPMD binaries.
SBAC-PAD 2016 paper.
- Simty: General-Purpose SIMT made simple.
Simty is an open-source fully-synthesizable RTL design of a general-purpose SIMT core implementing the RISC-V instruction set. Simty aims at defining the RISC of general-purpose SIMT: a streamlined resource-efficient SIMT pipeline suitable as the building block for highly scalable, easy-to-program parallel architectures.
Enabling composable, safe and efficient warp-synchronous SIMD programming on SIMT GPUs
Warp-synchronous programming has evolved from an obscure programmer trick to a common programming technique to express explicit SIMD computations in CUDA or OpenCL programs, supported by new hardware primitives like warp vote and shuffle instructions.
Warp-synchronous programming is extensively used in highly-tuned libraries like CUB.
However, warp-synchronous programming still lacks clearly defined semantics, documentation, and vendor support, and its use raises code composability issues.
I seek to document warp-synchronous patterns and propose lightweight compiler extensions to improve expressivity and code composability.
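A typical warp-synchronous pattern is the shuffle-based reduction. The sketch below models it in plain Python rather than CUDA, so it runs without a GPU. The list `values` stands for one register across the lanes of a warp, and the hypothetical helper `shfl_down` imitates the behavior of a shuffle-down primitive, where each lane reads the register of the lane `delta` positions above it and lanes without a source keep their own value. It is a model of the data movement only and ignores synchronization and active-mask issues.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    """Model of a warp shuffle-down: lane i reads the value held by lane i + delta.
    Lanes whose source lane is out of range keep their own value."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def warp_reduce_sum(values):
    """Tree reduction across lanes in log2(n) shuffle steps."""
    vals = list(values)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    offset = n // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds the value it received, as all lanes execute in lockstep.
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the full sum

print(warp_reduce_sum(range(WARP_SIZE)))  # 496
```

Only lane 0 ends up with the complete sum; the upper lanes hold partial results. This is the kind of implicit contract that makes such code hard to compose without clearly defined semantics.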
- Warp-synchronous programming: SIMD algorithms on SIMT hardware.
Tutorial on warp-synchronous programming with CUDA 9: Warp-synchronous programming with Cooperative Groups.
- Lightweight Dynamic Parallelism.
Low-footprint runtime and architecture support for small-scale nested and transposed parallelism calls.