Markov Decision Processes and Reinforcement Learning.
by Nicolas Gast (Inria Grenoble) and Bruno Gaujal (Inria Grenoble).
Course description
In this course we will present stochastic dynamic programming, also known as the theory of Markov Decision Processes (MDPs), as a model for the optimization of uncertain systems that evolve over time. We introduce the various optimality criteria (finite horizon, total discounted reward, and average gain) as well as the corresponding Bellman optimality equations. We also present several algorithms to solve these equations (backward induction, value iteration, and policy iteration). Some famous MDPs will also be discussed, such as optimal stopping, inventory control, and Markovian bandits.
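To give a concrete feel for the Bellman optimality equation and value iteration mentioned above, here is a minimal sketch on a made-up 2-state, 2-action MDP (the transition matrix, rewards, and discount factor are illustrative assumptions, not examples from the course):

```python
import numpy as np

# Toy MDP (hypothetical numbers): P[a, s, s'] is the probability of moving
# from state s to s' under action a; R[s, a] is the expected reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality operator:
    # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s'),  V(s) = max_a Q(s,a)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the final value function
print(V, policy)
```

Value iteration converges geometrically at rate gamma because the Bellman operator is a contraction; the greedy policy extracted at the end is optimal for this toy instance.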
In the second part of the course, we will explain the mathematical foundations of reinforcement learning. Our starting point will be learning in bandit problems. We will then extend this approach to MDPs. We will discuss several algorithms (Q-learning, Upper Confidence Reinforcement Learning, Posterior Sampling Reinforcement Learning) and study their theoretical guarantees.
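As a preview of the reinforcement-learning part, the sketch below shows tabular Q-learning with epsilon-greedy exploration on the same kind of toy 2-state environment (the environment and the step-size, exploration, and horizon parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: P[a, s, s'] transition probabilities,
# R[s, a] expected rewards (same illustrative numbers as a small MDP).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
n_states, n_actions = 2, 2
gamma, alpha, eps = 0.9, 0.1, 0.1  # discount, step size, exploration rate

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50_000):
    # epsilon-greedy: explore with probability eps, otherwise act greedily
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s2 = int(rng.choice(n_states, p=P[a, s]))
    r = R[s, a]
    # Q-learning update: move Q(s,a) toward r + gamma * max_{a'} Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print(Q)
```

With a constant step size the estimates keep fluctuating around the optimal Q-values; the course covers the stochastic-approximation conditions under which Q-learning converges.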
The course will mainly focus on theory, but we will also hold exercise sessions where we put the course's algorithmic content into practice.
Course Outline
1. Markov Decision Processes
Bellman Optimality equation
Algorithms (Value Iteration, Policy Iteration)
2. Reinforcement Learning
The bandit problem and regret
Q-learning and optimist algorithms
Follow The Leader and adversarial bandits
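As a taste of the bandit material in the outline above, here is a minimal UCB1 sketch on a 2-armed Bernoulli bandit (the arm means, horizon, and seed are made-up illustration values):

```python
import math
import random

random.seed(1)

means = [0.3, 0.6]   # hypothetical Bernoulli arm means (arm 1 is optimal)
T = 10_000           # horizon
counts = [0, 0]      # number of pulls of each arm
sums = [0.0, 0.0]    # cumulative reward of each arm

for t in range(1, T + 1):
    if 0 in counts:
        arm = counts.index(0)  # pull each arm once to initialize
    else:
        # UCB1 index: empirical mean + exploration bonus sqrt(2 log t / n_i)
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(2)]
        arm = ucb.index(max(ucb))
    reward = 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

regret = max(means) * T - sum(sums)
print(counts, regret)
```

The exploration bonus shrinks as an arm is pulled more, so the suboptimal arm is pulled only O(log T) times, which is the source of the logarithmic regret bounds studied in the course.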
Lecture notes & shared documents
Stochastic Bandits (lecture notes by Victor Boone)
Q-learning and stochastic-approximation (draft of lecture notes by Nicolas Gast)
Q_learning examples.ipynb (notebook by Nicolas Gast)
connect4: illustration for MCTS (by Nicolas Gast)
Large state spaces and approximations (last course by Nicolas Gast)
Exercises
Here are some exercises on MDPs (lectures 1 and 2)
Complexity of Policy Iteration
Prerequisites
Basic knowledge of probability, linear algebra, and algorithms.
Evaluation
Written exam + project.
- Written exam: October 24 (10:15 – 12:15): all documents allowed (but no electronic devices).
- Project defense: November 14 (all day). Project to be handed in by November 9.
Bibliography
– Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin L. Puterman (2005)
– Bandit Algorithms, by Tor Lattimore and Csaba Szepesvári (2020)
– Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2018)