Markov Decision Processes and Reinforcement Learning

by Nicolas Gast (Inria Grenoble) and Bruno Gaujal (Inria Grenoble).

Course description

In this course we will present stochastic dynamic programming, also known as the theory of Markov Decision Processes (MDPs), as a model for the optimization of uncertain systems that evolve over time. We introduce the various optimality criteria (finite horizon, total discounted reward and average gain) as well as the corresponding Bellman optimality equations. We also present several algorithms to solve these equations (backward induction, value iteration and policy improvement). Several classical MDPs will also be discussed, such as optimal stopping, inventory control and Markovian bandits.
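
To make the Bellman optimality equation and value iteration concrete, here is a minimal sketch on a made-up two-state, two-action MDP; the transition probabilities P, rewards R and discount factor gamma below are illustrative choices only, not taken from the course material.

    import numpy as np

    # Illustrative 2-state, 2-action MDP: P[s, a, s'] is a transition
    # probability and R[s, a] an expected immediate reward (made-up numbers).
    gamma = 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[1.0, 0.0],
                  [2.0, 0.5]])

    # Value iteration: apply the Bellman optimality operator until convergence,
    # (TV)(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ].
    V = np.zeros(2)
    for _ in range(1000):
        Q = R + gamma * (P @ V)      # Q[s, a], shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new

    print("value function:", V_new, "greedy policy:", Q.argmax(axis=1))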

In the second part of the course, we will explain the mathematical foundations of reinforcement learning. Our starting point will be learning in bandit problems. We will then extend this approach to MDPs. We will discuss several algorithms (Q-learning, Upper Confidence Reinforcement Learning, Posterior Sampling Reinforcement Learning) and study their theoretical guarantees.
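
As a preview of the reinforcement-learning part, here is a small tabular Q-learning sketch on a toy chain environment; the ChainEnv class, its step/reset interface and the hyper-parameters are illustrative choices, not the course's implementation.

    import numpy as np

    class ChainEnv:
        """Toy chain with states 0..n-1: action 1 moves right, action 0 moves left;
        reaching the last state ends the episode with reward 1."""
        def __init__(self, n=5):
            self.n = n
        def reset(self):
            self.s = 0
            return self.s
        def step(self, a):
            self.s = min(self.s + 1, self.n - 1) if a == 1 else max(self.s - 1, 0)
            done = (self.s == self.n - 1)
            return self.s, float(done), done

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy action choice
                a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
                s_next, r, done = env.step(a)
                # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

    Q = q_learning(ChainEnv(), n_states=5, n_actions=2)
    print(Q.argmax(axis=1))   # mostly 1 ("move right") for the non-terminal states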

The course will mainly focus on theory, but we will also hold exercise sessions where we put the course's algorithmic content into practice.

Course Outline

1. Markov Decision Processes

     Bellman Optimality equation

     Algorithms (Value Iteration, Policy Iteration).

2. Reinforcement Learning

      The bandit problem and regret

      Q-learning and optimistic algorithms

      Follow The Leader and adversarial bandits.

Lecture notes & shared documents

Stochastic Bandits (lecture notes by Victor Boone)

Q-learning and stochastic-approximation (draft of lecture notes by Nicolas Gast)

Q_learning examples.ipynb (notebook by Nicolas Gast)

connect4: illustration for MCTS (by Nicolas Gast)

Large state spaces and approximations (last lecture, by Nicolas Gast)

Exercises

Here are some exercises on MDPs (lectures 1 and 2):

Large discounts

The Intern Problem

Complexity of Policy Iteration

Gittins index

Prerequisites

Basic knowledge of probability, linear algebra, algorithmics.

Evaluation

Written exam + project.

If you want to verify your results, we implemented an algorithm for the reward vector [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4], with a cost of 0 if we do not obtain anything. This non-optimized Python program takes about 3 seconds to compute the solution. It hopefully runs without errors and gives the following (a rough sketch of one possible recursion is given after the list):
– If the first throw is [1, 1, 2, 4, 4, 4, 5, Worm], the optimal action is to keep the “4”s and the expected value is 1.50.
– If we already have three dice of value 3 and the throw is [1, 3, 4, 4, Worm], the optimal action is to keep the Worm and the expected value is 1.00. Keeping the two “4”s gives an expected value of 0.85.
– If we already have “3, 5, 5, 5, W”, the optimal action is to continue (it gives 1.73 vs 1 if we stop).
– If we already have “2, 2, 3, 4, W, W”, the optimal action is to stop (it gives 1 vs 0.89 if we continue).
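
Such expected values can be computed by a backward-induction recursion over the state (values already kept, current total, dice remaining). Below is a rough sketch of one such recursion. The rules it encodes are assumptions reconstructed from the examples above (eight dice with faces 1-5 plus a Worm worth 5 points; at each throw one must keep all dice showing a value not kept before, then stop or re-roll the rest; stopping scores the reward indexed by the final total, capped at 36, provided a Worm has been kept and the total is at least 21, and 0 otherwise). This is not the program mentioned above, and its outputs may differ if the actual rules differ.

    from functools import lru_cache
    from itertools import combinations_with_replacement
    from math import factorial

    # Assumed (reconstructed) rules, not the course's program: 8 dice with faces
    # 1..5 and a Worm (encoded as 6, worth 5 points); each throw we must keep all
    # dice showing one value not kept before, then stop or re-roll the rest.
    REWARD = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]   # totals 21..36
    FACES = (1, 2, 3, 4, 5, 6)
    POINTS = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}

    def stop_value(kept, total):
        """Value of stopping: requires a kept Worm and a total of at least 21."""
        if 6 not in kept or total < 21:
            return 0.0
        return float(REWARD[min(total, 36) - 21])

    @lru_cache(maxsize=None)
    def roll_value(kept, total, n_dice):
        """Expected value of re-rolling n_dice dice in state (kept, total)."""
        if n_dice == 0:
            return stop_value(kept, total)
        expected = 0.0
        for throw in combinations_with_replacement(FACES, n_dice):
            weight = factorial(n_dice)              # orderings of this throw
            for f in set(throw):
                weight //= factorial(throw.count(f))
            expected += weight * throw_value(kept, total, throw)
        return expected / len(FACES) ** n_dice

    def throw_value(kept, total, throw):
        """Best value over the legal 'keep one value' actions (0 if busted)."""
        best = 0.0
        for f in set(throw):
            if f in kept:
                continue                            # value already kept: illegal
            k = throw.count(f)
            new_kept = tuple(sorted(set(kept) | {f}))
            best = max(best,
                       stop_value(new_kept, total + k * POINTS[f]),
                       roll_value(new_kept, total + k * POINTS[f], len(throw) - k))
        return best

    # First example above: initial throw [1, 1, 2, 4, 4, 4, 5, Worm].
    print(throw_value((), 0, (1, 1, 2, 4, 4, 4, 5, 6)))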
 

Bibliography

Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin L. Puterman (2005)
Bandit Algorithms, by Tor Lattimore and Csaba Szepesvári (2020)
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2018) 

 
