## Second year

In the second year we continued working on Task 1 and the problem of trading off scientific knowledge and user learning. Building on the previous year's activity, we proposed a novel formalization of this problem as a multi-armed bandit (MAB) problem, where the objective is to maximize a convex combination of the accuracy of the estimates and the so-called regret (i.e., the cumulative teaching performance of the algorithm). We developed a novel algorithm (called *Forcing*) and proved that its performance approaches the performance of an optimal oracle allocation over time at a rate of *O(1/sqrt(n))* (where *n* is the number of time steps). This result is very interesting since it shows that a strategy as simple as *Forcing* is able to achieve small *regret* even for the considerably more complex problem that we consider. We then tested it empirically on a series of synthetic problems, where we validated our theoretical findings. Finally, we used a dataset collected on the serious game *TreeFrog*, designed to teach how to solve number line problems. In this scenario, our trade-off can be seen as the problem of finding a balance between entertaining the players (e.g., by providing simple exercises that allow them to advance in the game) and testing different exercises so as to return accurate estimates of their difficulty, information that can later be reused to rank, discard, or improve the existing exercises. We compared *Forcing* with a number of standard MAB baselines. The results showed that not only was the level of entertainment quite high (as measured by the "reward"), but the accuracy in estimating the difficulty of the different arms was also satisfactory (e.g., for ranking them). See more details about this activity in [discover_bandit_submission].
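The forced-exploration idea behind this kind of strategy can be sketched as follows. This is a toy simplification, not the algorithm analyzed in the paper: the Gaussian reward noise, the `sqrt(t)/k` forcing threshold, and the arm means are illustrative assumptions.

```python
import random

def forcing_bandit(means, horizon, seed=0):
    """Toy forcing-style bandit (illustrative, not the paper's algorithm):
    pull any arm whose pull count falls below sqrt(t)/k (forced exploration,
    which keeps every estimate accurate), otherwise play the empirically
    best arm (exploitation, which keeps cumulative reward high)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    reward = 0.0
    for t in range(1, horizon + 1):
        under = [i for i in range(k) if counts[i] < (t ** 0.5) / k]
        if under:
            arm = min(under, key=lambda i: counts[i])   # forced exploration
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        r = means[arm] + rng.gauss(0, 0.1)  # noisy reward around the true mean
        counts[arm] += 1
        sums[arm] += r
        reward += r
    estimates = [sums[i] / counts[i] for i in range(k)]
    return reward, estimates, counts

reward, est, counts = forcing_bandit([0.2, 0.5, 0.8], 2000)
```

Because the forcing threshold grows with *t*, every arm keeps being sampled (so all difficulty estimates converge), while the vast majority of pulls still go to the best arm.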

Another activity started during the second year is related to transfer learning. We focused on the so-called *option framework*, where a reinforcement learning agent is provided with a set of macro-actions (or options) that can be used to replace basic/primitive actions with the objective of speeding up the learning process. In the field of education, a macro-action is a specific sequence of exercises or activities that are considered to be effective when proposed together. Overall, the objective is to design algorithms that are able to automatically discover such sequences, build a library, and reuse them across different problems and for different students. If these sequences are indeed effective, this would allow us to reduce the number of samples (i.e., interactions with students) needed to optimize the teaching strategy. While many different heuristics have been proposed in the literature, very little theoretical understanding is available of the impact of such macro-actions on the performance of learning algorithms. For this reason, we first considered the case where a set of such macro-actions is provided by a domain expert, and we investigated when and how macro-actions may actually improve the learning performance. We developed a novel theoretical analysis showing the impact of macro-actions on the critical aspects of the problem that determine the performance of a learning algorithm, i.e., the number of states, the number of actions, the diameter of the Markov decision process, and the number of decision steps. This is the first result of its kind, and it gives us a much better understanding of the type of macro-actions that can speed up learning without compromising its overall performance. See more details about this activity in [regret_options].
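The way macro-actions reduce the number of decision steps can be illustrated with a deliberately minimal example. The chain environment below is hypothetical and much simpler than the MDPs covered by the analysis; it only shows that bundling a fixed sequence of primitive actions into one option shrinks the number of decisions the agent must make.

```python
# Hypothetical chain environment: the agent starts in state 0 and must
# reach state n by repeatedly moving one step to the right.

def decisions_with_primitives(n_states):
    """One decision per primitive action: n_states decisions to reach the goal."""
    state, decisions = 0, 0
    while state < n_states:
        state += 1          # primitive action: move right by one
        decisions += 1
    return decisions

def decisions_with_option(n_states, option_len):
    """An option executes option_len primitive moves after a single decision,
    so the number of decision steps drops by roughly a factor of option_len."""
    state, decisions = 0, 0
    while state < n_states:
        state = min(state + option_len, n_states)  # one decision, many moves
        decisions += 1
    return decisions

assert decisions_with_primitives(12) == 12
assert decisions_with_option(12, 4) == 3
```

In a learning setting, fewer decision steps means fewer points where the agent must estimate which action is best, which is one of the quantities the theoretical analysis tracks (along with the number of states and actions and the MDP diameter).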

## First year

In the first year we focused on Task 1. In collaboration with Yun-En Liu (University of Washington and former collaborator of Emma Brunskill) and Akram Erraqabi (intern from Ecole Polytechnique for 6 months at Inria Lille), we focused on the problem of trading off scientific knowledge and user learning. In fact, the objective of a good automatic method for allocating educational modules is twofold. On the one hand, we want to learn general educational principles and assess the effectiveness of each educational module. On the other hand, we are interested in teaching the students with the most suitable learning modules at each step. These two contrasting needs raise the issue of trading off an effective exploration of the learning modules, needed to obtain accurate estimates of their effectiveness, against the exploitation of the best ones to maximize the teaching performance.

Leveraging previous work (Yun-En Liu et al., "Trading Off Scientific Knowledge and User Learning with Multi-Armed Bandits", Educational Data Mining, 2014), we have cast this problem as a multi-armed bandit (MAB) problem, where the objective is to maximize a convex combination of the accuracy of the estimates and the so-called regret (i.e., the cumulative teaching performance of the algorithm). Solving this trade-off is completely novel in the MAB literature and it introduces new challenges. In fact, standard approaches, such as the popular optimism in the face of uncertainty principle at the basis of a large number of MAB algorithms, may completely fail in this scenario. At the moment, we have designed a series of novel MAB algorithms, including simple baselines such as the epsilon-greedy algorithm. After testing them on a number of synthetic scenarios, we have identified a very promising algorithm based on the construction of confidence intervals on the average performance of each learning module and its variance. The resulting algorithm achieves very good performance across a variety of problems. While formalizing the problem and developing novel algorithms, Yun-En Liu provided a dataset collected on a game designed to teach how to solve number line problems, and we are now testing our new algorithm on these data. Another main challenge we are currently studying is providing theoretical guarantees for our novel algorithm. For the extreme cases where we are only interested in minimizing the regret or in maximizing the accuracy of the estimates, theoretical analyses already exist. Nonetheless, dealing with the combination of the two objectives is considerably more difficult: it requires studying how estimation errors propagate through the behavior of the algorithm and how the resulting exploration strategy can reduce such errors. See more details about this activity [here].
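One way such a confidence-interval-based selection rule can look is sketched below. This is an assumed form for illustration only, not the algorithm developed in this project: the scoring rule (a convex combination of the empirical mean and the confidence-interval width), the weight, and the noise model are all hypothetical.

```python
import math
import random

def ci_tradeoff_bandit(means, horizon, weight=0.5, seed=0):
    """Hypothetical sketch of a CI-based trade-off rule: score each arm by
    a convex combination of its empirical mean (teaching performance) and
    its confidence-interval width (estimation uncertainty), then pull the
    highest-scoring arm. Wide intervals attract exploratory pulls; once
    intervals shrink, the mean term dominates and the best arm is exploited."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                  # pull each arm once
        else:
            def score(i):
                mean = sums[i] / counts[i]
                width = math.sqrt(2 * math.log(t) / counts[i])  # CI half-width
                return weight * mean + (1 - weight) * width
            arm = max(range(k), key=score)
        r = means[arm] + rng.gauss(0, 0.1)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

total, counts = ci_tradeoff_bandit([0.2, 0.5, 0.8], 2000)
```

Varying `weight` moves the behavior between the two extreme cases mentioned above: `weight = 1` recovers pure reward maximization, while `weight = 0` allocates pulls to keep all interval widths (and hence all estimation errors) balanced.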