Internship: Policy Disagreement Kernels for Bayesian RL

Supervision: Riad Akrour (riad -dot- akrour -at- inria -dot- fr)

When: Spring and Summer 2022

Context

Exploration is the basis for learning improved policies in RL. In practice, it is typically conducted with stochastic policies: the overwhelming majority of the deep RL literature explores this way, including Deep Deterministic Policy Gradient and its derivatives, which add noise to the actions of a deterministic policy. In contrast, in robotics, and more generally when policies are implemented on physical systems, methods such as Bayesian optimization are preferred because they perform policy search directly in the space of deterministic policy parameters. This avoids running stochastic policies on physical systems, which can lead to jerky and damaging behaviors.

On the other hand, direct policy search methods are oblivious to the sequential nature of RL and scale poorly with the number of policy parameters. Trajectory kernels [3], which compute policy disagreement in the state-action space instead of the parameter space, are a first step toward bridging the gap between Bayesian optimization and Bayesian RL. This approach can be improved by i) using off-policy policy evaluation to compute partial quantities contributing to the policy return, and ii) adapting entropy regularization (for example in the context of Bayesian optimization [1]) to deterministic policies, ensuring that the next policy to evaluate lies within a trust region where trajectory kernels are valid. To summarize, we are interested in researching an RL algorithm that scales to large policy parameterizations (e.g. neural networks), is sample efficient, and explores using only deterministic policies. We think that such an algorithm can be found at the intersection of Bayesian optimization and RL.
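To make the idea of measuring policy disagreement in the state-action space more concrete, here is a minimal sketch in Python. It is not the trajectory kernel of [3]: it assumes a hypothetical deterministic policy (`linear_policy`), a set of synthetic states standing in for states visited in past rollouts, and the mean squared distance between actions as the disagreement measure. All function names and the `lengthscale` parameter are illustrative choices.

```python
import numpy as np


def linear_policy(theta, states):
    """Hypothetical deterministic policy: action = tanh(states @ theta)."""
    return np.tanh(states @ theta)


def disagreement(theta_a, theta_b, states):
    """Mean squared distance between the actions of two policies on shared states."""
    diff = linear_policy(theta_a, states) - linear_policy(theta_b, states)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def disagreement_kernel(thetas, states, lengthscale=1.0):
    """Gram matrix with entries exp(-D(i, j) / (2 * lengthscale**2))."""
    n = len(thetas)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = disagreement(thetas[i], thetas[j], states)
            gram[i, j] = np.exp(-d / (2.0 * lengthscale ** 2))
    return gram


rng = np.random.default_rng(0)
# Synthetic states; in practice these would come from observed trajectories.
states = rng.normal(size=(50, 3))
# Four candidate policies, each mapping 3-dimensional states to 2-dimensional actions.
thetas = [rng.normal(size=(3, 2)) for _ in range(4)]
gram = disagreement_kernel(thetas, states)
```

Because the disagreement here is a squared Euclidean distance between the action vectors of the two policies, the resulting matrix is a squared-exponential kernel over behaviors and can be used as the covariance of a Gaussian process over policy returns. Two policies with very different parameters but similar actions on the visited states are treated as close, which is the property that distinguishes such kernels from kernels defined in parameter space.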

Goals

Research an RL algorithm at the intersection of Bayesian optimization and RL. A starting point will be provided to the intern but they will have to make further design choices of their own.
Perform comparative evaluations against direct policy search, deep RL and model-based RL methods (e.g. the work of [2]) on RL benchmarks.
Deploy the algorithm on real-world problems in collaboration with industry partners of Inria Scool, notably showing that the RL method improves over the existing engineered one.

Bibliography

[1] Riad Akrour, Dmitry Sorokin, Jan Peters, and Gerhard Neumann. Local Bayesian Optimization of Motor Skills. In: International Conference on Machine Learning (ICML). 2017.
[2] Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning. In: Advances in Neural Information Processing Systems (NeurIPS). 2020.
[3] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. In: Journal of Machine Learning Research 15.1 (2014), pp. 253–282.