My research interests are GPU-inspired throughput-oriented parallel architectures, quantum computers and emerging computer architectures in general. My primary focus is on computer architecture and compiler issues. I am also interested in software optimization and computer arithmetic.
INRIA – Centre de recherche Rennes – Bretagne Atlantique
Campus de Beaulieu
35042 RENNES Cedex
E-mail: caroline.collange [at] inria.fr
I aim at improving both the efficiency and the programmability of throughput-oriented processors through architecture changes and compiler techniques.
Compiler optimization for quantum computers
The first generation of usable quantum computers is now a reality. It enables experimental computer science in the quantum computing research field.
We propose compiler analyses and transformations to optimize quantum programs for existing and near-term quantum computers. With only a few tens or even hundreds of qubits, these computers are still too small to afford general error correction techniques. Instead, software and compilers have to cope with noise. Compiler optimization is critical, not only to minimize the time to solution, but more importantly to maximize the accuracy of results.
- Qubit allocation is the process of mapping the logical qubits of a quantum program onto physical qubits while respecting hardware constraints. It is the quantum equivalent of register allocation. Our OOPSLA 2019 paper shows how to model qubit allocation as the combination of two known problems, Subgraph Isomorphism and Token Swapping, and derives efficient parameterized heuristics. When evaluated on “Tokyo”, a quantum architecture with 20 qubits, our technique outperforms state-of-the-art approaches in terms of the quality of the solutions that it finds and the amount of memory that it uses, while maintaining practical runtime.
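As a rough illustration of the qubit allocation problem, the sketch below routes two-qubit gates on a coupling graph by inserting SWAPs along a shortest path. It is a naive greedy baseline written for this page, not the algorithm from the OOPSLA 2019 paper; the names `shortest_path` and `route`, the 4-qubit line topology, and the `("swap", p, q)` / `("cx", p, q)` encoding are all illustrative assumptions.

```python
from collections import deque

def shortest_path(coupling, src, dst):
    """Breadth-first search over the undirected coupling graph."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(coupling[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    raise ValueError("no path between physical qubits")

def route(gates, coupling, mapping):
    """Insert SWAPs so that each two-qubit gate acts on adjacent physical qubits.

    gates:    list of (logical_a, logical_b) pairs
    coupling: dict physical qubit -> set of neighbouring physical qubits
    mapping:  dict logical qubit -> physical qubit (the initial allocation)
    Returns the list of physical operations and the final mapping.
    """
    mapping = dict(mapping)  # do not mutate the caller's allocation
    physical_ops = []
    for a, b in gates:
        path = shortest_path(coupling, mapping[a], mapping[b])
        # Walk logical qubit `a` along the path until it sits next to `b`.
        for p, q in zip(path[:-2], path[1:-1]):
            physical_ops.append(("swap", p, q))
            inverse = {phys: log for log, phys in mapping.items()}
            if p in inverse:
                mapping[inverse[p]] = q
            if q in inverse:
                mapping[inverse[q]] = p
        physical_ops.append(("cx", mapping[a], mapping[b]))
    return physical_ops, mapping

# Hypothetical 4-qubit line topology: 0 - 1 - 2 - 3
line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
ops, final_mapping = route([("a", "b")], line, {"a": 0, "b": 3, "c": 1})
print(ops)            # [('swap', 0, 1), ('swap', 1, 2), ('cx', 2, 3)]
print(final_mapping)  # {'a': 2, 'b': 3, 'c': 0}
```

Each inserted SWAP adds noise on a real device, which is why the choice of initial mapping and swap sequence matters far more than this greedy baseline suggests.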
General-purpose SIMT: bridging the gap between SMT CPU and SIMT GPU architectures
I propose hardware schemes that allow the execution of existing scalar instruction sets on GPU-like architectures. Existing mechanisms that perform dynamic vectorization in the Single Instruction Multiple Thread (SIMT) model on current GPUs rely on explicit annotations in the instruction set and on hardware-based stack structures. I have shown that an alternative constant-space mechanism can enable SIMT execution on conventional scalar instruction sets. It allows individual threads to be managed, suspended, resumed or migrated independently, lifting the main barrier separating SIMT architectures from general-purpose multithreaded processors.
Our ongoing research along this direction includes:
- SIMT-X: Out-of-order SIMT.
SIMT-X is a general-purpose CPU microarchitecture which enables GPU-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution, and the programming convenience of homogeneous multi-thread processors.
- DITVA: Dynamic Inter-Thread Vectorization.
DITVA starts from an SMT in-order core and incorporates an SIMT execution mode, retaining full compatibility with existing SPMD binaries.
SBAC-PAD 2016 paper.
- Simty: General-Purpose SIMT made simple.
Simty is an open-source fully-synthesizable RTL design of a general-purpose SIMT core implementing the RISC-V instruction set. Simty aims at defining the RISC of general-purpose SIMT: a streamlined resource-efficient SIMT pipeline suitable as the building block for highly scalable, easy-to-program parallel architectures.
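To make the idea of stack-less SIMT execution on a scalar instruction set more concrete, here is a toy simulation. It keeps one program counter per thread and, at each step, runs the instruction at the smallest PC for every thread currently at that PC. This min-PC policy is an assumption chosen for simplicity and is not claimed to be the exact mechanism used in SIMT-X, DITVA or Simty; the four-opcode instruction set and the `run_simt` function are likewise hypothetical.

```python
# Toy scalar program with no divergence annotations.
PROGRAM = [
    ("branch_if_odd", 3),  # 0: threads with an odd id jump to 3
    ("addi", 10),          # 1: even path
    ("jump", 4),           # 2: skip the odd path
    ("muli", 2),           # 3: odd path
    ("addi", 1),           # 4: both paths meet again here
    ("halt",),             # 5
]

def run_simt(program, num_threads):
    """Run all threads in lockstep groups selected by minimum PC.

    State is one PC and one register per thread: no divergence stack.
    Returns the final registers and a trace of (pc, active thread count).
    """
    pcs = [0] * num_threads
    regs = list(range(num_threads))  # register initialised to the thread id
    halted = [False] * num_threads
    trace = []
    while not all(halted):
        # Pick the smallest PC among live threads; the active mask is
        # recomputed by comparison instead of being popped from a stack.
        current = min(pc for pc, h in zip(pcs, halted) if not h)
        active = [t for t in range(num_threads)
                  if not halted[t] and pcs[t] == current]
        trace.append((current, len(active)))
        op = program[current]
        for t in active:
            if op[0] == "branch_if_odd":
                pcs[t] = op[1] if t % 2 else current + 1
            elif op[0] == "jump":
                pcs[t] = op[1]
            elif op[0] == "addi":
                regs[t] += op[1]
                pcs[t] = current + 1
            elif op[0] == "muli":
                regs[t] *= op[1]
                pcs[t] = current + 1
            elif op[0] == "halt":
                halted[t] = True
    return regs, trace

regs, trace = run_simt(PROGRAM, 4)
print(regs)   # [11, 3, 13, 7]
print(trace)  # [(0, 4), (1, 2), (2, 2), (3, 2), (4, 4), (5, 4)]
```

The trace shows the four threads splitting into two groups of two after the branch and running together again at PC 4. Because every thread owns its PC, any single thread could in principle be suspended or moved without disturbing the others.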
Enabling composable, safe and efficient warp-synchronous SIMD programming on SIMT GPUs
Warp-synchronous programming has evolved from an obscure programmer trick to a common programming technique to express explicit SIMD computations in CUDA or OpenCL programs, supported by new hardware primitives like warp vote and shuffle instructions.
Warp-synchronous programming is extensively used in highly-tuned libraries like CUB.
However, warp-synchronous programming still lacks clearly defined semantics, documentation, and vendor support, and its use raises code composability issues.
I seek to document warp-synchronous patterns and propose light-weight compiler extensions to improve expressivity and code composability.
- Warp-synchronous programming: SIMD algorithms on SIMT hardware.
Tutorial on warp-synchronous programming with CUDA 9: Warp-synchronous programming with Cooperative Groups.
- Lightweight Dynamic Parallelism.
Low-footprint runtime and architecture support for small-scale nested and transposed parallelism calls.
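As an illustration of the warp-synchronous style discussed in this section, the sketch below simulates in plain Python a warp-level sum built on a shuffle-down primitive. The helper `shfl_down` is a hypothetical model of the behaviour of CUDA's `__shfl_down_sync` across a full 32-lane warp (lanes whose source falls outside the warp keep their own value); it is not CUDA code and ignores participation masks.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    """Model of a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    n = len(values)
    return [values[lane + delta] if lane + delta < n else values[lane]
            for lane in range(n)]

def warp_reduce_sum(values):
    """Tree reduction across one warp; lane 0 ends up holding the total."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(values, offset)
        # Every lane adds in lockstep, with no shared memory and no barrier.
        values = [own + other for own, other in zip(values, shifted)]
        offset //= 2
    return values[0]

total = warp_reduce_sum(list(range(WARP_SIZE)))
print(total)  # 496
```

The reduction takes five shuffle steps for 32 lanes. Only lane 0 is guaranteed to hold the complete sum, which is one of the implicit conventions that makes such code hard to compose without documented semantics.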
- Anita Tino, Caroline Collange, André Seznec. SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores
ACM Transactions on Architecture and Code Optimization. ACM, 10.1145/3392032, 2020. On HAL
- Marcos Siraichi, Vinicius Fernandes Dos Santos, Caroline Collange, Fernando Magno Quintão Pereira. Qubit allocation as a combination of subgraph isomorphism and token swapping. Proceedings of the ACM on Programming Languages (OOPSLA), 2019 On HAL
- Niloofar Charmchi, Caroline Collange, André Seznec. Compressed cache layout aware prefetching. SBAC-PAD 2019 – International Symposium on Computer Architecture and High Performance Computing, 2019. On HAL
- Niloofar Charmchi, Caroline Collange. Toward compression-aware prefetching. COMPAS 2019 – Conférence d’informatique en Parallélisme, Architecture et Système. 2019. On HAL
- Caroline Collange. Ordinateurs quantiques : ouvrons la boîte. COMPAS 2019 – Conférence d’informatique en Parallélisme, Architecture et Système, 2019. On HAL
High-Performance Computing Advanced (HPCA)
0. GPU microarchitecture: Revisiting the SIMT execution model
1. Warp-synchronous programming with Cooperative Groups
2. Advanced CUDA programming: asynchronous execution, memory models, unified memory