Course description
In this course we will present stochastic dynamic programming, also known as the theory of Markov Decision Processes (MDPs), as a model for the optimization of uncertain systems that evolve over time. We will introduce the various optimality criteria (finite horizon, total discounted reward, and average gain) as well as the corresponding Bellman optimality equations, and we will present several algorithms to solve these equations (backward induction, value iteration, and policy iteration). Several classical MDPs will also be discussed, such as optimal stopping, inventory control, and Markovian bandits.
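To fix ideas: for the discounted criterion, the Bellman optimality equation reads V*(s) = max_a [ r(s,a) + γ Σ_{s'} p(s'|s,a) V*(s') ], and value iteration simply iterates this operator to its fixed point. Below is a minimal Python sketch, not course material: the transition tensor P, the reward matrix R, and the toy two-state example are illustrative placeholders.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P: transition tensor, shape (A, S, S); P[a, s, t] = P(t | s, a).
    # R: reward matrix, shape (S, A); R[s, a] = expected immediate reward.
    S, A = R.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality operator:
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * (P @ V).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new

# Toy two-state, two-action MDP (illustrative only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
V, policy = value_iteration(P, R)

Since gamma < 1, the operator is a contraction, so the loop is guaranteed to terminate and the greedy policy with respect to the fixed point is optimal.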
In the second part we will discuss stochastic games, which can be seen as a generalization of MDPs to several decision makers.
In the third part, we will present and analyze the bandit problem, as a first foray into reinforcement learning.
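As a preview of that third part, here is a minimal sketch of the classical UCB1 index policy on Bernoulli arms (one of the algorithms analyzed in Lattimore and Szepesvári's book); the arm means in the example are illustrative placeholders.

import math
import random

def ucb1(means, horizon=10000):
    # means: true success probabilities of the Bernoulli arms
    # (unknown to the learner, used here only to simulate rewards).
    K = len(means)
    counts = [0] * K      # number of pulls per arm
    sums = [0.0] * K      # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            a = t - 1     # pull each arm once to initialize the indices
        else:
            # UCB1 index: empirical mean plus an exploration bonus.
            a = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                                  + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        total += reward
    return total

print(ucb1([0.3, 0.5, 0.7]))  # total reward should approach 0.7 * horizon

The exploration bonus shrinks as an arm is pulled more often, which is what yields the logarithmic regret bound studied in the lectures.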
Lecture notes
Lectures on MDPs are based on Puterman's book (see the bibliography below).
Exercises
Here are some exercises on MDPs:
New for 2025: the Wheel of Fortune and Inventory
Complexity of Policy Iteration
Homework (2023)
The homework is due on November 13. Please send your work to bruno.gaujal@inria.fr.
The theoretical part is individual, while the computer program must be written in pairs.
Homework (2024)
Optimal play at 2048 (further details in my lectures)
Homework (2025)
To come
Bibliography
– Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin L. Puterman (2005)
– Bandit Algorithms, by Tor Lattimore and Csaba Szepesvári (2020)