Markov Decision Processes and Stochastic Games

Course description

In this course we will present stochastic dynamic programming, also known as the theory of Markov Decision Processes (MDPs), as a model for the optimization of uncertain systems that evolve over time. We introduce the various optimality criteria (finite horizon, total discounted reward, and average gain) as well as the corresponding Bellman optimality equations, and we present several algorithms to solve these equations (backward induction, value iteration, and policy improvement). Several famous MDPs will also be discussed, such as optimal stopping, inventory control, and Markovian bandits.
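For concreteness, here is a minimal value-iteration sketch in Python (NumPy) for a finite discounted MDP. The encoding is an assumption made for illustration only: states and actions are integers, P[a, s, t] is the probability of moving from s to t under action a, and r[s, a] is the immediate reward.

    import numpy as np

    def value_iteration(P, r, gamma=0.95, eps=1e-6):
        """Iterate the Bellman optimality operator until the standard
        stopping rule guarantees an eps-optimal greedy policy."""
        S, A = r.shape
        V = np.zeros(S)
        while True:
            # Q[s, a] = r[s, a] + gamma * sum_t P[a, s, t] * V[t]
            Q = r + gamma * np.einsum("ast,t->sa", P, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma):
                return V_new, Q.argmax(axis=1)  # optimal values, greedy policy
            V = V_new

Backward induction (for the finite horizon case) and policy improvement follow the same pattern, replacing the fixed-point loop by a finite recursion or by alternating policy evaluation and greedy improvement steps.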

In the second part we will discuss stochastic games, which can be seen as a generalization of MDPs to several decision makers.

In the third part, we will present and analyze the bandit problem, as a first foray into reinforcement learning.

Lecture notes

Lectures on MDPs are based on Puterman’s book (see bibliography).

UCB Algorithm
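As a quick complement to the notes, here is a minimal sketch of UCB1 for rewards in [0, 1]; the arm interface (a list of zero-argument sampling functions) is a made-up convention for illustration, not necessarily the variant used in the lectures.

    import math

    def ucb1(arms, horizon):
        """Play `horizon` rounds of UCB1; arms are callables returning rewards in [0, 1]."""
        n = len(arms)
        counts = [0] * n   # number of pulls of each arm
        means = [0.0] * n  # empirical mean reward of each arm
        for t in range(1, horizon + 1):
            if t <= n:
                a = t - 1  # pull each arm once to initialize
            else:
                # index = empirical mean + exploration bonus sqrt(2 ln t / counts)
                a = max(range(n),
                        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
            x = arms[a]()
            counts[a] += 1
            means[a] += (x - means[a]) / counts[a]  # incremental mean update
        return means, counts

For instance, ucb1([lambda: float(random.random() < 0.3), lambda: float(random.random() < 0.6)], 10_000) (with random imported) should concentrate most pulls on the second arm.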

Exercises

Here are some exercises on MDPs:

New for 2025: the Wheel of Fortune and Inventory

Large discounts

The Intern Problem

Complexity of Policy Iteration

Gittins index

Homework (2023)

The homework is due on November 13. Please send your work to bruno.gaujal@inria.fr.

The theoretical part is individual, while the computer program must be written in pairs.

Gods Love Dinosaurs

Homework (2024)

Optimal play at 2048 (further details in my lectures)

Homework (2025)

To come

Bibliography

– Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin L. Puterman (Wiley, 2005)
– Bandit Algorithms, by Tor Lattimore and Csaba Szepesvári (Cambridge University Press, 2020)