Markov Decision Processes and Reinforcement Learning

by Nicolas Gast (Inria Grenoble) and Bruno Gaujal (Inria Grenoble).

Markov Decision Processes and Reinforcement Learning.

Course description

In this course we will present stochastic dynamic programming, also known as the theory of Markov Decision Processes (MDP) as a model for optimization of uncertain systems that evolve over time. We introduce the various optimality criteria (finite horizon, total discounted reward and average gain) as well as the corresponding Bellman Optimality equations. We also present several algorithms to solve these equations (Backward induction, value iteration and policy improvement). Certain famous MDPs will also be discussed, such as the optimal stopping times, inventory control, Markovian bandits.

In the second part of the course, we will explain the mathematical foundations of reinforcement learning. Our starting point will be learning in bandit problems. We will then extend this approach to MDPs. We will discuss several algorithms (Q-learning, Upper Confidence Reinforcement Learning, Posterior Sampling Reinforcement Learning) and study their theoretical guarantees.

The course will be mainly focused on theory but we will also hold exercice sessions where we will put to practice the course’s algorithmic content.

Course Outline

1. Markov Decision Processes

Bellman Optimality equation

Algorithms (Value Iteration, Policy Iteration).

2. Reinforcement Learning

The bandit problem and regret

Q-learning and optimist algorithms

Follow The Leader and adversarial bandits.

Supplementary material: Lecture notes & shared documents

Stochastic Bandits (lecture notes by Victor Boone)

Q-learning and stochastic-approximation (draft of lecture notes by Nicolas Gast)

Q_learning examples.ipynb (notebook by Nicolas Gast)

connect4: illustration for MCTS (by Nicolas Gast)

Large state spaces and approximations (last course by Nicolas Gast)

Exercices

Here are some exercices on MDPs (lectures 1 and 2)

Large discounts

The Intern Problem

Complexity of Policy Iteration

Gittins index

Prerequisites

Basic knowledge of probability, linear algebra, algorithmics.

Evaluation (2023)

Sited exam + project.

Sited exam: October 24 (10h15 – 12h15): all documents allowed (but no electronic device).
Project defense: November 14 (all day). Project to be handled November 9.
- Subject of the project: solve the game “Pickomino”

If you want to verify, we implemented an algorithm for the reward [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4] and a cost of “0” if we do not obtain anything. This non-optimized program in Python takes about 3 seconds to compute the solution. It hopefully works without errors and gives:

– If the first throw is [1, 1, 2, 4, 4, 4, 5, Worm], the optimal action is to keep the “4” and the expected value is 1.50.

– If we already have 3 dices of value 3 and the throw is [1, 3, 4, 4, Worm], the optimal action is to keep the work and the expected value is 1.00. Keeping the two “4” gives an expected value of 0.85.

– If we already have “3, 5, 5, 5, W”, the optimal action is to continue (it gives 1.73 vs 1 if we stop).

– if we already have “2, 2, 3, 4, W, W”, the optimal action is to stop (it gives 1 vs 0.89 if we continue).

Evaluation (2024)

Exam in class (2h00) + home project. Exact dates to come.

Home project: black jack. (description in French)

On prend les règles de https://fr.wikipedia.org/wiki/Blackjack_(jeu) avec K valeurs de cartes (de 1 à K).

On adopte le « cas général » de la « décision des joueurs » sans les « règles spécifiques ».

Partie I: on suppose que la distribution des cartes est « i.i.d. » (c’est à dire qu’on obtient à chaque coup une carte de valeur k avec probabilité uniforme 1/K).

1/ Montrer que la stratégie du croupier « stand on 17 » pour le cas K=13 n’est pas optimale si le croupier veut maximiser les gains du casino et si le croupier est autorisé à utiliser une stratégie qui dépend des cartes du joueur. Calculer la stratégie optimale du croupier en un contre un.

2/ Calculer la stratégie (politique) optimale d’un joueur (qui joue avant le croupier).

Partie II: on suppose maintenant qu’il y a N cartes par valeurs et que les tirages des cartes se font sans remise.

Sachant que le croupier utilise la stratégie calculée en 1/, calculer la stratégie (politique) optimale du joueur (pour N=4, K=13 ou pour une valeur de K ou N plus petite si votre implémentation ne le permet pas).

Partie III: plusieurs options. Choisir un problème parmi les trois suivants:

a) On revient au cas iid mais on suppose maintenant que la loi de probabilité des cartes est inconnue. Proposer une méthode d’apprentissage et évaluer sa performance (par exemple: sa vitesse d’apprentissage, son regret,…).

b) Étudier le cas à plusieurs joueurs qui se liguent contre le croupier.

c) Développer une approche d’apprentissage par renforcement (approximation des valeurs?) pour résoudre le problème de la partie II avec des paramètres réalistes (7 jeux de 54 cartes soit N=28).

Le projet est à faire par groupe de deux. L’évaluation aura deux volets:

– un rapport dans lequel vous décrivez vos solutions et vos expériences. On fera attention à décrire les algorithmes employés et à votre méthodologie d’expérimentation (ex: par exemple: intervalles de confiance quand c’est nécessaire). Comme d’habitude, la qualité de la rédaction et du soin apporté aux figures (ex: titre sur les axes, légendes) est importante.

– une soutenance orale lors du dernier cours. Les modalités exactes (temps de présentation) seront précisées plus tard.

Bibliography

– Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin L. Puterman (2005)
– Bandit Algorithms, by Tor Lattimore and Csaba Szepesvári (2016)
– Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2018)