## Events in January–February 2023


**December 8, 2022**
-- Seminar: Mario Bravo (room 106)

**December 8, 2022 @ Bâtiment IMAG (406)**
-- GLSI / CtrlA seminar: Quentin Guilloteau (Datamove)

**December 15, 2022 @ Tour IRMA, campus Saint Martin d'Hères**
-- PhD defense, Chen Yan: Quasi-optimal policies for restless bandits.

Thesis supervised by Nicolas Gast and Bruno Gaujal. The defense will take place at 2:00 pm in amphithéâtre 1 of the **Tour IRMA** (51 rue des mathématiques, 38610 Gières). A reception will follow the defense in **room 406 of the bâtiment IMAG**.

**Jury:**

-- David Alan Goldberg, Associate Professor, Cornell University (Reviewer)
-- Bruno Scherrer, Chargé de recherche, Inria Nancy (Reviewer)
-- Jérôme Malick, Directeur de recherche, CNRS (Examiner)
-- Nguyễn Kim Thắng, Professor, Université Grenoble Alpes (Examiner)
-- Benjamin Legros, Associate Professor, EM Normandie (Examiner)

**Abstract:**

Bandits are one of the most basic examples of decision-making under uncertainty. A Markovian restless bandit can be seen as the following sequential allocation problem: at each decision epoch, one or several arms are activated (pulled); every arm generates an instantaneous reward that depends on its state and its activation; the state of each arm then changes in a Markovian fashion, according to an underlying transition matrix. Both the rewards and the transition matrices are known, and the new state is revealed to the decision maker before its next decision. The word *restless* emphasizes that arms that are not activated can also change state, which generalizes the simpler rested bandits. In principle, the above problem can be solved by dynamic programming, since it is a Markov decision process. The challenge we face is the *curse of dimensionality*: the number of possible states and actions grows exponentially with the number of arms of the bandit.
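The restless bandit model just described can be sketched in a few lines. All parameters below (number of arms, transition matrices, rewards, which arms to activate) are made up for illustration; a real policy would choose the activated set by some index rule.

```python
import random

# Toy sketch of one decision epoch of a restless bandit.
# Each of N arms is a 2-state Markov chain whose transitions and
# rewards depend on whether the arm is activated (action 1) or not.
random.seed(0)

N = 5  # number of arms
M = 2  # number of arms activated per decision epoch

# P[a][s] = distribution over next states under action a in state s.
# Note that passive arms (a = 0) also move: this is what "restless" means.
P = {0: [[0.9, 0.1], [0.3, 0.7]],
     1: [[0.5, 0.5], [0.1, 0.9]]}
# r[a][s] = instantaneous reward for action a in state s
r = {0: [0.0, 0.1],
     1: [0.2, 1.0]}

def step(states, activated):
    """Activate the chosen arms, collect rewards, and move every arm."""
    reward = 0.0
    nxt = []
    for i, s in enumerate(states):
        a = 1 if i in activated else 0
        reward += r[a][s]
        nxt.append(random.choices([0, 1], weights=P[a][s])[0])
    return reward, nxt

states = [0] * N
activated = {0, 1}  # placeholder for an index-based choice of M arms
reward, states = step(states, activated)
```

The joint state space of this toy example already has 2^N elements, which is the exponential blow-up the abstract refers to.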
Consequently, the focus is on designing policies that resolve the tension between computational efficiency and close-to-optimal performance.

In this thesis, we construct computationally efficient policies with provable performance bounds, which may differ depending on certain properties of the problem. We first investigate the classical **Whittle index policy** (WIP) on infinite-horizon problems, and prove that if it is asymptotically optimal under the global attractor assumption, then it almost always converges to the optimal value exponentially fast. Applying WIP requires the additional technical assumption of indexability; to get around this, we next study the **LP-index policy**, which is well-defined for any problem and shares the same exponential speed of convergence as WIP under similar assumptions.

In the infinite horizon, we always need the global attractor assumption for asymptotic optimality. We next study the problem under a finite horizon, so that this assumption is no longer a concern. Instead, LP-compatibility and **non-degeneracy** are required for asymptotic optimality and a faster convergence rate. We construct the finite-horizon LP-index policy, as well as the **LP-update policy**, which amounts to solving new LP-index policies during the evolution of the process. This LP-update policy is then generalized to the broader framework of **weakly coupled MDPs**, together with a generalization of the non-degeneracy condition. When this condition is satisfied on the weakly coupled MDPs, it allows a more efficient implementation of the LP-update policy, as well as a faster convergence rate.

**February 7, 2023 @ Bâtiment IMAG (406)**
-- Polaris seminar: Learning in finite-horizon MDPs with UCB (Romain Cravic)

Most of you probably know Markov Decision Processes (MDPs). They are very useful for handling situations where an agent interacts with an environment that may involve randomness.
Concretely, at each time step the MDP has a current state and the agent chooses an action: this state-action pair induces a (random) reward and a (random) state transition. If the probability distributions of rewards and transitions are known, designing optimal behaviors for the agent is, at least theoretically, easy. What about the case where these distributions are unknown at the early stages of the process? How can we LEARN optimal behaviors efficiently? A popular way to handle this issue is the optimism paradigm, inspired by the UCB algorithms designed for stochastic bandit problems. In this talk, I will present the main ideas of two possible approaches, the UCRL algorithm and the optimistic Q-learning algorithm, which use optimism to perform well in finite-horizon MDPs.