Seminar Polaris-tt Learning in finite-horizon MDP with UCB (Romain Cravic)
– February 7, 2023
Most of you probably know Markov Decisions Processes (MDP). They are very useful to handle situations where an agent interacts with an environnement that may involve randomness. Concretely, at each time the MDP has a current state and the agent chooses an action : This couple state-action induces a (random) reward and a (random) state transition. If the probability distributions for rewards and transitions are known, at least theoretically, designing optimal behaviors for the agent is easy. What about the case where these distributions are unknown at the early stage of the process ? How to LEARN optimal behaviors efficiently ? A popular way to handle this issue is to use the optimism paradigm, inspired from UCB algorithms designed for stochastic bandits problems. In this talk, I will expose the main ideas of two possible approaches, UCRL algorithm and optimistic Q-learning algorithm, that use optimism to well perform in finite-horizon
Strictly Necessary Cookies
Strictly Necessary Cookie should be enabled at all times so that we can save your preferences for cookie settings.
If you disable this cookie, we will not be able to save your preferences. This means that every time you visit this website you will need to enable or disable cookies again.