Job Offers

Postdoc


PhDs

 


 

Marie Skłodowska-Curie Individual Fellowships

You already have a PhD and are interested in joining DataMove for two years? The Marie Skłodowska-Curie Individual Fellowships open call from the EU can provide such support with very good financial conditions. Get in touch with Bruno Raffin or any DataMove permanent researcher and, if there is a good match, we will help you prepare your application file. Deadline: September 2019 (call not yet open).


Internships / Stages

 




Postdoc: High Performance Deep Reinforcement Learning

  • Requirement: PhD in Computer Science
  • Location: Grenoble or Lille
  • Hosting Teams:
    • Sequel (INRIA Lille): https://team.inria.fr/sequel/
    • DataMove (INRIA Grenoble): https://team.inria.fr/datamove
  • Contact: Bruno.Raffin@inria.fr and Philippe.Preux@inria.fr
  • Period: starting in 2019
  • Duration: 24 months

The goal of reinforcement learning is to learn a task by interacting with simulations while maximizing a reward (a game score, for instance).
Recently, researchers have successfully introduced deep neural networks into this loop, making it possible to address more complex problems. This is often referred to as
Deep Reinforcement Learning (DRL). DRL has, for instance, managed to play many ATARI games. The most visible success of
DRL is probably AlphaGo Zero, which outperformed the best human players (and itself) after being trained without any data from human games, solely through reinforcement learning. The process requires an advanced infrastructure for the training phase. For instance, AlphaGo Zero trained for more than 70 hours using 64 GPU workers and 19 CPU parameter servers, playing 4.9 million games of generated self-play with 1,600 simulations for each Monte Carlo Tree Search.
The general workflow is the following. To speed up the learning process and enable a wide but thorough exploration of the parameter space, the learning neural network interacts in parallel with several actor instances, each one consisting of a simulation of the task being learned and a neural network interacting with this simulation through the best winning strategy it knows. Periodically, the actor neural networks are updated from the learner neural network.
This workflow has evolved through various research works combining parallelization, asynchronism and novel learning strategies (GORILA, A3C, IMPALA, …).
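As a toy illustration of this actor/learner loop, here is a minimal, purely sequential Python sketch. All names, the one-parameter "policy" and the reward function are hypothetical stand-ins, not any of the systems cited above: several actors generate experience against their own simulation instance, the learner updates its parameter from the best-rewarded actions, and periodically pushes it back to the actors.

```python
import random

class ToySimulation:
    """Stand-in for the task simulation: the reward is higher when the
    action is close to a hidden target value (hypothetical example)."""
    def __init__(self, target=0.7):
        self.target = target

    def step(self, action):
        return -abs(action - self.target)  # reward

class Actor:
    """Holds a frozen copy of the learner's policy parameter and
    generates experience by interacting with its own simulation."""
    def __init__(self, sim):
        self.sim = sim
        self.theta = 0.0  # local copy of the policy parameter

    def rollout(self):
        action = self.theta + random.uniform(-0.1, 0.1)  # explore
        return action, self.sim.step(action)

def train(num_actors=4, steps=500, sync_every=10, lr=0.5, seed=0):
    random.seed(seed)
    actors = [Actor(ToySimulation()) for _ in range(num_actors)]
    theta = 0.0  # the learner's policy parameter
    for t in range(steps):
        # every actor produces one experience (in parallel in the real workflow)
        batch = [a.rollout() for a in actors]
        # move theta toward the best-rewarded action observed
        best_action, _ = max(batch, key=lambda ar: ar[1])
        theta += lr * (best_action - theta)
        # periodically push the learned policy back to the actors
        if t % sync_every == 0:
            for a in actors:
                a.theta = theta
    return theta
```

The loop runs sequentially here for clarity; GORILA- or A3C-style systems execute the rollouts asynchronously across many workers.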

The goal of this postdoc is to push forward the scalability of these approaches, and to propose novel learning strategies to
learn more complex tasks more rapidly (multiple heterogeneous tasks at once, non-deterministic games, simulations of complex industrial or living systems).
This work will be performed in close collaboration between the SequeL INRIA team, specialized in DRL, and the DataMove team, specialized in HPC.

DataMove has developed the Melissa solution to manage large ensembles of parallel simulations and aggregate their data on-line in a parallel server. Melissa has run thousands of simulations on up to 30,000 cores. So far Melissa has been used to compute advanced statistics, but we expect this framework to be a sound base for a DRL workflow. The SequeL team has strong activities in reinforcement learning, deep or not, ranging from theoretical aspects to applications. Among other projects, SequeL has collaborated with Mila (Montréal) to design and develop the Guesswhat?! experiment. As early as 2006, SequeL worked on Go and designed the first Go program (Crazy Stone) able to challenge a human expert player.

We are looking for a candidate with a PhD in deep learning, reinforcement learning or high performance computing (a combination of these areas of expertise would be ideal) for a 24-month contract at INRIA. The candidate will have the possibility to join either the SequeL team in Lille or the DataMove team in Grenoble.

The postdoc will have access to large supercomputers equipped with multiple GPUs for experiments. We expect this work to lead to international publications sustained by advanced software prototypes.



Internship: High Performance Deep Reinforcement Learning

  • Level: Master Level Research Internship (M2)
  • Location: University Grenoble Alpes Campus, Saint Martin d'Hères (close to Grenoble)
  • Duration: At least 4 months, possibility to pursue as a PhD.
  • Contact: Bruno.Raffin@inria.fr
  • Incomes: Gratifications de stage (about 500 euros/month)
  • Period: 2018-2019

1 Context

Deep learning algorithms need high computational power to deal with increasingly large
datasets. The goal of reinforcement learning is to learn a task by interacting with simulations
while maximizing a reward (winning a game, for instance). Recently, researchers have successfully introduced deep neural networks into this loop, making it possible to address more complex problems. This is often referred to as Deep Reinforcement Learning (DRL). DRL has, for instance, managed to play many ATARI games. Very recently, AlphaGo Zero was a leap forward in AI as it outperformed the best human players (and itself) after being trained without any data from human games, solely through reinforcement learning. The process requires an advanced infrastructure for the training phase. To speed up the learning process and enable a wide exploration of the parameter space, the neural network interacts in parallel with several instances of the simulation. For instance, AlphaGo Zero trained for more than 70 hours using 64 GPU workers and 19 CPU parameter servers, playing 4.9 million games of generated self-play with 1,600 simulations for each Monte Carlo Tree Search.
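The search loop at the heart of such training can be sketched with the UCB1 rule that Monte Carlo Tree Search uses to decide which move to simulate next. This is a deliberately stripped-down, hypothetical example (a flat search over three moves with a random playout function), not AlphaGo Zero's actual network-guided search:

```python
import math
import random

def ucb_select(counts, values, c=1.0):
    """UCB1 rule used in the selection phase of Monte Carlo Tree Search:
    balance exploitation (mean value) against exploration (visit count)."""
    total = sum(counts.values())
    best_move, best_score = None, float("-inf")
    for move in counts:
        if counts[move] == 0:
            return move  # always try untried moves first
        score = (values[move] / counts[move]
                 + c * math.sqrt(math.log(total) / counts[move]))
        if score > best_score:
            best_move, best_score = move, score
    return best_move

def mcts_root(simulate, moves, n_simulations=1600, seed=0):
    """Run n_simulations playouts from the root (AlphaGo Zero runs 1,600
    per search) and return the most visited move."""
    random.seed(seed)
    counts = {m: 0 for m in moves}
    values = {m: 0.0 for m in moves}
    for _ in range(n_simulations):
        m = ucb_select(counts, values)
        counts[m] += 1
        values[m] += simulate(m)  # playout result in [0, 1]
    return max(counts, key=counts.get)

# hypothetical playout: move "b" wins 80% of random games
win_rate = {"a": 0.2, "b": 0.8, "c": 0.4}
def payoff(move):
    return 1.0 if random.random() < win_rate[move] else 0.0
```

With these (made-up) win rates, the search concentrates most of its visits on move "b".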

The goal of this internship is to work on developing a high performance infrastructure for deep reinforcement learning. We will first target the Go game following the architecture proposed for AlphaGo Zero.

This work will be performed in close collaboration with the Sequel INRIA team specialized in DRL.

2 Work

The candidate is expected to have expertise in high performance computing and, if possible, some knowledge of deep learning and experience with TensorFlow programming.

The candidate will first work on developing a pipeline to enable the chained execution of the different steps (self-play, learning, test and update of the self-play neural networks) in a
continuous and constantly improving process. We will rely on TensorFlow first. In a second step, we will work on parallelizing these steps, combining classical approaches like mini-batch training and developing novel ones when necessary. This parallelization will leverage the efficient use of GPUs and multi-core CPUs. The third step will focus on adapting this learning process to more complex games, for instance twisting the Go game to introduce stochastic behaviors (imagine a stone placement being changed by a random process).

The elementary code blocks (neural network, Go game implementation, Monte Carlo tree search) will be provided by the SequeL team.
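The four chained steps above can be illustrated with a toy, self-contained loop. Everything here is a hypothetical stand-in for illustration: a one-feature linear "network" replaces the real model, and "self-play games" are random positions with a hidden perfect-play value:

```python
import random

def self_play(best, n_games, rng):
    """Toy self-play: each 'game' is a position x whose outcome follows a
    hidden perfect-play value 2x - 1 (hypothetical stand-in for Go)."""
    return [(x, 2.0 * x - 1.0) for x in (rng.random() for _ in range(n_games))]

def learn(weights, data, lr=0.1):
    """One SGD pass over a linear value model v(x) = w0 + w1 * x,
    standing in for neural-network training."""
    w0, w1 = weights
    for x, y in data:
        err = (w0 + w1 * x) - y
        w0 -= lr * err
        w1 -= lr * err * x
    return (w0, w1)

def evaluate(weights, data):
    """Mean squared error of a candidate on held-out games."""
    w0, w1 = weights
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in data) / len(data)

def pipeline(iterations=20, seed=0):
    rng = random.Random(seed)
    best = (0.0, 0.0)                        # current self-play network
    for _ in range(iterations):
        data = self_play(best, 64, rng)      # 1. self-play
        candidate = learn(best, data)        # 2. learning
        held_out = self_play(best, 64, rng)  # 3. test the candidate
        if evaluate(candidate, held_out) < evaluate(best, held_out):
            best = candidate                 # 4. update the self-play net
    return best
```

The point of the sketch is the control flow: each iteration chains self-play, learning, test and conditional promotion, which is exactly the pipeline to be parallelized.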

3 Location

The internship will take place at the DataMove team located in the IMAG building on the campus of Saint Martin d’Heres (Univ. Grenoble Alpes) near Grenoble. The DataMove team is a friendly and stimulating environment gathering Professors, Researchers, PhD and Master students all leading research on High Performance Computing.

The city of Grenoble is a student-friendly city surrounded by the Alps, offering a high quality of life and all kinds of mountain-related outdoor activities.



 

Internship: On-Line Learning from Multiple Simulations with TensorFlow

  • Level: Master Level Research Internship (M2)
  • Location: University Grenoble Alpes Campus, Saint Martin d'Hères (close to Grenoble)
  • Duration: At least 4 months, possibility to pursue as a PhD.
  • Contact: Bruno.Raffin@inria.fr
  • Incomes: Gratifications de stage (about 500 euros/month)
  • Period: 2018-2019

1 Context

Learning with deep neural networks usually takes place using data stored on disk that are presented several times until convergence. Here we are looking at a different context, where the data are produced on-line by some large simulation. The goal can be to learn the behavior of a simulation so as to replay it through a neural network significantly faster than the initial simulation, or deep reinforcement learning, where the simulations play a game, chess or Go for instance, and the neural network learns playing strategies. This approach has been successfully used, for instance, by AlphaGo Zero. But such scenarios often run at medium scale, with hundreds of nodes. The goal of this internship is to set up an environment that enables on-line learning at very large scale using a supercomputer.

2 Work

After a training period to master the different concepts and tools, the work will consist of coupling TensorFlow with the Melissa framework to manage large numbers of simulations, designing a neural network to learn from the data produced by the various instances of the simulation, and running various experiments at different scales to evaluate the results.
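The on-line learning side of this coupling can be sketched as a single-pass SGD loop that consumes each sample once, as it arrives, instead of replaying a dataset from disk. The stream below is a hypothetical stand-in for what Melissa would deliver, and the linear surrogate stands in for the neural network:

```python
import random

def simulation_stream(n_timesteps, rng):
    """Stand-in for data arriving from a running simulation: each time
    step yields one fresh (input, output) pair, never stored on disk
    (the response 3x + 1 is a made-up simulation behavior)."""
    for _ in range(n_timesteps):
        x = rng.random()
        yield x, 3.0 * x + 1.0

def online_fit(stream, lr=0.05):
    """Single-pass SGD on a linear surrogate y ~ w * x + b: every sample
    is seen exactly once, as produced, then discarded."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b
```

The single-pass constraint (no epochs, no replay buffer on disk) is the key difference from the usual file-based training setting described above.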

3 What you will learn during this internship

  • TensorFlow programming
  • Expertise in running code on a parallel machine
  • How to conduct experiments, analyse the results and be critical about the findings
  • How to be creative and find the right solution when facing a problem

4 Location

The internship will take place in the DataMove team, located in the IMAG building on the campus of Saint Martin d'Hères (Univ. Grenoble Alpes) near Grenoble. The DataMove team is a friendly and stimulating environment gathering Professors, Researchers, PhD and Master students
all leading research on High Performance Computing.

The city of Grenoble is a student-friendly city surrounded by the Alps, offering a high quality of life and all kinds of mountain-related outdoor activities.


Internship: Rust for Performance: Parallel Data Structure for Stream Data Storage.

  • Level: Master Level Research Internship (M2)
  • Location: University Grenoble Alpes Campus, Saint Martin d'Hères (close to Grenoble)
  • Duration: At least 4 months, possibility to pursue as a PhD.
  • Contact: Bruno.Raffin@inria.fr and Frederic.Wagner@inria.fr
  • Incomes: Gratifications de stage (about 500 euros/month)
  • Period: 2018-2019

1 Context

The analysis of large multidimensional spatiotemporal datasets poses challenging questions regarding storage requirements and query performance. Several data structures have recently been proposed to address these problems; they rely on indexes that pre-compute different aggregations from a dataset known a priori. Consider now the problem of handling streaming datasets, in which data arrive as one or more continuous data streams. Such datasets introduce challenges to the data structure, which now has to support dynamic updates (insertions/deletions) and rebalancing operations to perform self-reorganizations. We developed a Packed-Memory Quadtree (PMQ) that efficiently supports dynamic data insertions and outperforms classical data structures like R-trees or B-trees. The PMQ ensures good performance by controlling the density of the data, which are internally stored in a large array. The PMQ is based on the Packed Memory Array [3,4,5] and shares its amortized cost of O(log^2(N)) per insertion. But today all processors are multi-core: data structures need to be parallelized to leverage the available compute power. The goal of this internship is to design, develop and test a parallel version of the PMQ using the Rust programming language (or C++).
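To give the flavor of the underlying structure, here is a deliberately simplified, sequential Packed Memory Array sketch in Python (the internship targets Rust or C++). A real PMA rebalances windows hierarchically to obtain the O(log^2(N)) amortized bound; this flat version only keeps the core idea of a sorted array with gaps, where inserts shift elements locally and the whole array is re-spread when it gets too dense:

```python
class SimplePMA:
    """Minimal Packed Memory Array sketch: sorted keys in an array with
    gaps (None). Illustrative only, not the hierarchical PMQ/PMA."""
    def __init__(self, capacity=8, max_density=0.7):
        self.slots = [None] * capacity
        self.n = 0
        self.max_density = max_density

    def items(self):
        """Keys in sorted order (gaps skipped)."""
        return [k for k in self.slots if k is not None]

    def _rebalance(self, capacity):
        """Re-spread all keys evenly over `capacity` slots."""
        keys = self.items()
        self.slots = [None] * capacity
        if keys:
            gap = capacity / len(keys)
            for i, k in enumerate(keys):
                self.slots[int(i * gap)] = k

    def insert(self, key):
        # grow (and even out the gaps) when the array gets too dense
        if (self.n + 1) / len(self.slots) > self.max_density:
            self._rebalance(2 * len(self.slots))
        # index of the first occupied slot holding a key greater than `key`
        j = len(self.slots)
        for i, k in enumerate(self.slots):
            if k is not None and k > key:
                j = i
                break
        # nearest gap on the left: shift the run between it and j one slot left
        g = j - 1
        while g >= 0 and self.slots[g] is not None:
            g -= 1
        if g >= 0:
            for i in range(g, j - 1):
                self.slots[i] = self.slots[i + 1]
            self.slots[j - 1] = key
        else:
            # no gap on the left: shift the run starting at j one slot right
            g = j
            while self.slots[g] is not None:
                g += 1
            while g > j:
                self.slots[g] = self.slots[g - 1]
                g -= 1
            self.slots[j] = key
        self.n += 1
```

Because shifts stay between the insertion point and the nearest gap, most inserts touch only a small contiguous region of memory, which is what makes the structure cache-friendly and a good candidate for parallelization.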

2 Work

The work will consist of:

  • Understanding the PMQ and PMA algorithms
  • Designing and implementing a parallel PMQ
  • Running experiments on a parallel machine to evaluate the performance gain compared to the sequential version

3 What you will learn during this internship

  • Advanced algorithms and data structures
  • The Rust programming language and its modern programming approach
  • Parallel task programming
  • How to conduct experiments on a parallel machine, analyze the results and be critical about the findings
  • How to be creative and find the right solution when facing a problem

4 Location

The internship will take place in the DataMove team, located in the new IMAG building
of the University Grenoble Alpes near Grenoble. The DataMove team is a friendly and stimulating environment gathering Professors, Researchers, PhD and Master students
all leading research on High Performance Computing.

The city of Grenoble is a student-friendly city surrounded by the Alps, offering a high quality of life and all kinds of mountain-related outdoor activities.

5 References:

  1. Rust (https://www.rust-lang.org/en-US/)
  2. Rayon library for task programming with Rust (https://github.com/rayon-rs/rayon)
  3. Bender M. A., Demaine E. D., Farach-Colton M.: Cache-oblivious B-trees. SIAM J. Comput. 35, 2 (2005), 341–358.
  4. Bender, M. A., Fineman, J. T., Gilbert, S., & Kuszmaul, B. C. (2005). Concurrent Cache-Oblivious B-Trees (pp. 228–237). Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, New York, NY, USA: ACM.
  5. Marie Durand, Bruno Raffin and François Faure. A Packed Memory Array to Keep Moving Particles Sorted. 9th Workshop on Virtual Reality Interaction and Physical Simulation (VRIPHYS), Darmstadt, December 2012.
  6. Julio Toss, Cicero Pahins, Bruno Raffin, João Luiz Dihl Comba. Packed-Memory Quadtree: a cache-oblivious data structure for visual exploration of streaming spatiotemporal big data. Computers and Graphics, Elsevier, 2018, pp. 1-18.

Internship: Apache Flink Stream Processing for Scientific Simulations.

  • Level: Master Level Research Internship (M2)
  • Location: University Grenoble Alpes Campus, Saint Martin d'Hères (close to Grenoble)
  • Duration: At least 4 months, possibility to pursue as a PhD.
  • Contact: Bruno.Raffin@inria.fr and emilioj@udc.es
  • Incomes: Gratifications de stage (about 500 euros/month)
  • Period: 2018-2019

1 Context

Flink (1) is a fast-growing Big Data framework from the Apache foundation. Flink is a stream processing framework for distributed applications, designed for Big Data applications that need to analyze
a continuous flow of data coming from sensors (a flow of tweets, for instance). The goal of this internship is to study how Flink could be used in a different context: analyzing the stream of data produced by a massively parallel scientific simulation.

Parallel simulations produce large amounts of data. The traditional approach consists in writing these data to disk, then reading them back later for analysis. But this approach is becoming prohibitively slow.
An alternative consists in analyzing the data on-line, as soon as they are produced by the simulation and before they are written to disk. The data are thus analyzed and reduced in size before going to disk, significantly reducing the disk performance bottleneck. The data produced by the simulation can be seen as a stream: each process of the simulation (a parallel simulation can run thousands of processes) produces a new result at each time step (a numerical simulation typically computes thousands of time steps). A natural idea is thus to use a stream processing tool like Flink to process this massive parallel data stream on-line. This is a challenging issue that you will study during this internship. We already have an early prototype based on Flink. The goal of this internship is more specifically to investigate how to program performant parallel stream analyses with Flink.
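To give a flavor of the kind of keyed, windowed reduction one would express with Flink's keyBy/window operators, here is a framework-free Python sketch over a hypothetical stream of (rank, timestep, value) records, one per simulation process and time step:

```python
from collections import defaultdict

def simulation_stream():
    """Stand-in for the simulation output: one (rank, timestep, value)
    record per simulation process and time step (4 processes, 6 steps)."""
    for t in range(6):
        for rank in range(4):
            yield rank, t, float(10 * rank + t)

def windowed_mean(records, window=3):
    """Per-rank tumbling-window mean over `window` consecutive time
    steps: the kind of keyed, windowed reduction that Flink expresses
    with keyBy/window/reduce operators."""
    acc = defaultdict(lambda: (0.0, 0))  # rank -> (sum, count)
    for rank, t, v in records:
        s, c = acc[rank]
        acc[rank] = (s + v, c + 1)
        if (t + 1) % window == 0:        # window closes for this rank
            s, c = acc.pop(rank)
            yield rank, t, s / c
```

In a real Flink job the same logic would be distributed across parallel operator instances, which is precisely where the performance questions studied in this internship arise.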

2 Work

The main work steps are the following:

  • Get your hands on the existing prototype and run it on a parallel machine
  • Learn Flink's stream programming model
  • Develop high performance analytics codes. Experience shows that Flink's programming model requires a different way of thinking to come up with good algorithms.
  • Run various tests at scale to evaluate the performance of the new prototype.

The internship will take place in the DataMove team, located in the IMAG building on the campus of Saint Martin d'Hères (Univ. Grenoble Alpes). You will be co-advised by Bruno Raffin and Emilio Padron (this work results from a joint collaboration between INRIA and the University of A Coruña, Spain). You will have access to large computers (Grid'5000 and CIMENT) for experiments. Beyond this internship, you will have the opportunity to pursue a PhD on this topic (funding not guaranteed yet). This work has good potential to produce a publication in an international conference.

For further information contact Bruno.Raffin@inria.fr and emilioj@udc.es

3 What you will learn during this internship

  • A Big Data stream processing framework (Flink)
  • A Big Data store (Cassandra)
  • Expertise in running code on large parallel machines (hundreds of nodes)
  • How to conduct experiments, analyse the results and be critical about the findings
  • How to be creative and find the right solution when facing a problem

4 References:

  1. Apache Flink (https://flink.apache.org/)
  2. Apache Cassandra (http://cassandra.apache.org)
  3. Omar A. Mures, Emilio J. Padron and Bruno Raffin. Leveraging the Power of Big Data Tools for Large Scale Molecular Dynamics Analysis. JP2016.
  4. Matthieu Dreher, Bruno Raffin. A Flexible Framework for Asynchronous In Situ and In Transit Analytics for Scientific Simulations. 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2014). (http://hal.inria.fr/hal-00941413/en)
  5. Tiankai Tu, Charles A. Rendleman, David W. Borhani, Ron O. Dror, Justin Gullingsrud, Morten Ø. Jensen, John L. Klepeis, Paul Maragakis, Patrick Miller, Kate A. Stafford, et al. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. High Performance Computing, Networking, Storage and Analysis (SC 2008). IEEE, 2008, pp. 1-12.
  6. C. Docan, M. Parashar, and S. Klasky. DataSpaces: an Interaction and Coordination Framework for Coupled Simulation Workflows. Cluster Computing, vol. 15, no. 2, pp. 163-181, 2012.
  7. Synergistic Challenges in Data-Intensive Science and Exascale Computing. US Department of Energy, March 2013 (http://science.energy.gov/~/media/40749FD92B58438594256267425C4AD1.ashx).
  8. Fang Zheng, Hongfeng Yu, Can Hantas, Matthew Wolf, Greg Eisenhauer, Karsten Schwan, Hasan Abbasi, Scott Klasky. GoldRush: Resource Efficient In Situ Scientific Data Analytics Using Fine-Grained Interference Aware Execution. Proceedings of the ACM/IEEE Supercomputing Conference (SC'13), November 2013.

Internship: Data Assimilation at Large Scale

  • Level: Master Level Internship (M2)
  • Location: University Grenoble Alpes Campus, Saint Martin d'Hères (close to Grenoble)
  • Duration: At least 4 months
  • Contact: Bruno.Raffin@inria.fr 
  • Incomes: Gratifications de stage (about 500 euros/month)
  • Period: 2018-2019

1 Context

Ensemble runs consist in running the same simulation many times with different parameter sets to get a robust analysis of the simulation behavior in the parameter space. The simulation can be a high resolution, large scale parallel code or a smaller scale, lightweight version called a meta-model or surrogate model. This process is commonly used for Uncertainty Quantification, Sensitivity Analysis or Data Assimilation. It may require from thousands to millions of runs of the same simulation, making it an extremely compute-intensive process that will fully benefit from Exascale machines.

Existing approaches show limited scalability. They either rely on intermediate files, where each simulation run writes its outputs to files that are next processed for result analysis. This makes for a flexible process, but writing these data to the file system and reading them back for the analysis step is a strong performance bottleneck at scale. The alternative approach consists in aggregating all simulation runs into the same large monolithic MPI job. Results are processed as soon as available, avoiding intermediate files. However, by not taking advantage of the loose synchronization between the different runs, this over-constrains the execution regarding compute resource allocation, application flexibility and fault tolerance. Recent approaches like the Melissa framework adopt an elastic architecture. Simulations or groups of simulations (depending on the level of synchronization needed between the runs) are submitted to the machine batch scheduler independently. Once these jobs start, they dynamically connect to a parallel data processing server. This server gets the data from the running simulations and processes them in parallel, on-line, as soon as they are available, thus avoiding intermediate files. The computed partial results can be fed back to the simulations (needed for data assimilation, for instance). They can also be used to support an adaptive sampling process, where the parameter sets of the next simulation runs are defined according to these partial results. Such a framework fully benefits from the loose synchronization between simulation runs: simulations are submitted and allocated independently, enabling better use of the machine resources and efficient fault tolerance mechanisms.
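The kind of on-line statistics such a server computes can be sketched with Welford's streaming mean/variance update, which needs only constant memory per quantity tracked (illustrative only; Melissa's actual statistics are more advanced):

```python
class OnlineStats:
    """Welford's streaming mean/variance update: a parallel server can
    maintain such statistics per field and per grid point, updating them
    as each simulation time step arrives, with no raw data kept."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # unbiased sample variance
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

Because the update consumes each value once and then discards it, the statistics can be computed over thousands of runs without any intermediate files.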

2 Work

The goal of this internship is to set up a prototype of a data assimilation process using the Melissa framework. We will take a simple parallel simulation with associated observation data and a classical data assimilation algorithm like the Ensemble Kalman Filter (EnKF), and study how to perform data assimilation efficiently at large scale in the elastic and fault-tolerant framework provided by Melissa. So far Melissa is only able to aggregate data from multiple simulations, without support for adjusting on-line the parameters of the running simulations according to some error measure against the observation data.
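For reference, a single EnKF analysis step for a directly observed scalar state can be sketched as follows. This is a minimal illustration with stochastic observation perturbations; a real Melissa-based setup would apply the update field-wise across a distributed ensemble of parallel runs:

```python
import random

def enkf_analysis(ensemble, y_obs, obs_var, rng):
    """One stochastic EnKF analysis step for a scalar state observed
    directly (H = identity): each member is nudged toward a perturbed
    observation using the Kalman gain K = P / (P + R), with P estimated
    from the ensemble itself."""
    n = len(ensemble)
    mean = sum(ensemble) / n
    var = sum((x - mean) ** 2 for x in ensemble) / (n - 1)  # ensemble P
    gain = var / (var + obs_var)
    return [x + gain * (y_obs + rng.gauss(0.0, obs_var ** 0.5) - x)
            for x in ensemble]
```

With a standard-normal prior ensemble and an observation y = 2.0 of variance 0.25, the gain is about 0.8, so the updated ensemble mean moves close to 1.6 and the ensemble spread shrinks, which is the behavior a correct analysis step must show.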

This internship requires good programming skills but also some taste for applied maths.

3 What you will learn during this internship

  • Parallel programming
  • Expertise in running code on large parallel machines (hundreds of nodes)
  • How to conduct experiments, analyse the results and be critical about the findings
  • How to be creative and find the right solution when facing a problem

4 Location

The internship will take place in the DataMove team, located in the IMAG building on the campus of Saint Martin d'Hères (Univ. Grenoble Alpes). The DataMove team is a friendly and stimulating environment gathering Professors, Researchers, PhD and Master students all leading research on High Performance Computing.

The city of Grenoble is a student-friendly city surrounded by the Alps, offering a high quality of life and all kinds of mountain-related outdoor activities.


Internship: Parallel File System Simulation for the Study of Large Computing Infrastructures

The field of high performance scientific computing (HPC) sees the size of its infrastructures grow continuously. Today, the most powerful ones comprise several million compute cores. At these scales, a crucial problem is mastering the data flows (memory/core, communication between nodes, I/O operations to the file systems). These flows directly impact the performance of applications as well as the overall effective efficiency of the infrastructures.

A classical way to study such infrastructures is to use simulation to explore the different alternatives for operating and optimizing them, as well as to evaluate new scheduling methods for resource usage (application scheduling, file placement).

For this internship, we propose to extend the Batsim infrastructure simulator [1]
to take into account input/output operations to the file systems. This software is built on top of the SimGrid simulation engine [2]. After an in-depth bibliographical study, the work will consist of proposing an extension to Batsim that is realistic enough to reproduce different application and architectural scenarios (types of file systems).
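As a toy model of the file-system contention such an extension must capture, consider jobs that all write concurrently while sharing the file-system bandwidth fairly (progressive filling). This is an illustrative sketch, not Batsim's or SimGrid's actual model:

```python
def io_finish_times(sizes, bandwidth):
    """Toy fair-sharing I/O model: all jobs start writing `sizes[i]` bytes
    at t = 0 and the file-system bandwidth is split evenly among the jobs
    still active (progressive filling). Returns each job's finish time."""
    order = sorted((s, i) for i, s in enumerate(sizes))
    finish = [0.0] * len(sizes)
    t = 0.0
    written = 0.0           # bytes already written by every active job
    active = len(sizes)
    for size, i in order:
        # time for the next job to complete at the current per-job share
        t += (size - written) * active / bandwidth
        finish[i] = t
        written = size
        active -= 1
    return finish
```

For instance, two jobs writing 50 and 100 bytes over a 10 bytes/s file system finish at t = 10 and t = 15: they share the bandwidth until the small job completes, after which the large one gets the full bandwidth.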
