PhD Position “Validation of large-scale complex data through active and socialized crowdsourcing”

Advisors: Alexis Joly & Esther Pacitti


Citizen science has the potential to leverage the interest and talent of non-specialists to improve science. In a typical citizen-science/crowdsourcing environment, contributors label items. When there are only a few possible labels (e.g. how oval the shape of a galaxy is), it is straightforward to train contributors by giving a few examples with known answers. Current research in crowdsourcing usually focuses on such micro-tasking, designing algorithms that solve optimization problems from the job requester’s perspective, with simple models of worker behavior. However, the participants are people with varying expertise, skills, interests and incentives, as well as rich capabilities for learning and collaborating, in particular in the context of social networks. The goal of this PhD will be to study more nuanced crowdsourcing approaches that place special emphasis on the participants, in particular through assignment and recommendation algorithms that progressively expand the expertise and fields of interest of the users. We will study domain-specific applications that involve complex classification tasks with a large number of classes and expert annotations (for instance plant species recognition). Classical crowdsourcing algorithms, based on Bayesian inference of the most probable labels according to the confusion matrix of each worker, are particularly inefficient in such contexts. The problem is that the very high number of classes makes it impossible to train a complete confusion matrix for each participant, as this would require them to answer millions of questions. Furthermore, the brute-force approach consisting of a quiz across the full list of classes is not tractable for most contributors, who are competent on only a fraction of the objects of interest.
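To make the limitation concrete, here is a minimal sketch of the classical confusion-matrix approach mentioned above: a Dawid & Skene-style EM aggregation that jointly estimates each worker's confusion matrix and each item's most probable label. All names and the toy setup are illustrative; the point is that the `conf` array grows as workers × classes², which is exactly what becomes intractable with thousands of classes.

```python
# Illustrative sketch of confusion-matrix-based label aggregation
# (Dawid & Skene-style EM). conf[w, k, l] estimates
# P(worker w answers l | true class is k).
import numpy as np

def dawid_skene(labels, n_classes, n_iters=20):
    """labels: (n_items, n_workers) array with entries in 0..n_classes-1,
    or -1 where a worker did not label that item."""
    n_items, n_workers = labels.shape

    # Initialize item posteriors with a majority vote.
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for w in range(n_workers):
            if labels[i, w] >= 0:
                post[i, labels[i, w]] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: class priors and per-worker confusion matrices
        # (smoothed to avoid zero probabilities).
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for w in range(n_workers):
            for i in range(n_items):
                if labels[i, w] >= 0:
                    conf[w, :, labels[i, w]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors given the confusion matrices.
        log_post = np.zeros((n_items, n_classes)) + np.log(prior)
        for w in range(n_workers):
            for i in range(n_items):
                if labels[i, w] >= 0:
                    log_post[i] += np.log(conf[w, :, labels[i, w]])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return post.argmax(axis=1)
```

Even in this toy form, each participant would need enough answers to populate a full `n_classes × n_classes` slice of `conf`, which is the scaling problem the thesis targets.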
To bridge this gap, it is necessary to design new models and algorithms that take into account the need to actively and collaboratively train the users, so that they can jointly solve complex classification tasks through simple and personalized sub-problems. We will initially focus on (i) automatically reducing the hypothesis space using machine learning tools, and (ii) actively specializing the participants on complementary subparts of the problem using probabilistic models and recommendation algorithms.
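A hedged sketch of how (i) and (ii) could fit together, assuming a pretrained classifier supplies class probabilities and each worker's known expertise is tracked as a set of classes (both assumptions, and all names, are hypothetical):

```python
# Hypothetical sketch: reduce the hypothesis space with a classifier's
# predictions, then route the sub-problem to the worker whose expertise
# best covers the remaining candidate classes.
import numpy as np

def candidate_classes(class_probs, mass=0.95):
    """Keep the smallest set of classes covering `mass` probability."""
    order = np.argsort(class_probs)[::-1]
    cum = np.cumsum(class_probs[order])
    k = int(np.searchsorted(cum, mass)) + 1
    return set(order[:k].tolist())

def route_task(class_probs, worker_skills):
    """worker_skills: dict mapping worker id -> set of known classes."""
    cand = candidate_classes(class_probs)
    # Pick the worker whose expertise overlaps most with the candidates.
    return max(worker_skills, key=lambda w: len(worker_skills[w] & cand))
```

In a full system the routing criterion would also weigh the value of *expanding* a worker's expertise (asking about classes just outside their current set), not only exploiting it; this sketch shows only the exploitation step.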


  1. Raykar, V. C. et al. (2010). Learning from crowds. The Journal of Machine Learning Research.
  2. Venanzi, M. et al. (2014). Community-based Bayesian aggregation models for crowdsourcing. WWW 2014.
  3. Roy, S. B., Lykourentzou, I., Thirumuruganathan, S., Amer-Yahia, S., & Das, G. (2015). Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 1-25.
