PhD Thesis in Natural Language Processing: Online hate speech against migrants

Supervisors: Irina Illina (MdC), Dominique Fohr (CR CNRS)

Team: Multispeech, LORIA-INRIA


Duration: 3 years

Deadline to apply: April 1st, 2019

Required skills: background in statistics and natural language processing, programming skills (Perl, Python), and experience with neural network tools. Candidates should email a detailed CV together with their diploma.

Motivations and context

According to the 2017 International Migration Report, the number of international migrants worldwide has grown rapidly in recent years, reaching 258 million in 2017, of whom 78 million were in Europe. A key reason why EU leaders have found it difficult to take a decisive and coherent approach to the refugee crisis is the high level of public anxiety about immigration and asylum across Europe. At least three social factors underlie this attitude (Berri et al., 2015): the increase in the number and visibility of migrants; the economic crisis, which has fed feelings of insecurity; and the role of the mass media. The last factor has a major influence on the political attitudes of both the general public and the elite. Refugees and migrants tend to be framed negatively, as a problem. This translates into a significant increase in hate speech towards migrants and minorities. The Internet seems to be fertile ground for hate speech (Knobel, 2012).

The goal of this PhD Thesis is to develop a methodology to automatically detect hate speech in social network data (Twitter, YouTube, Facebook).

Our methodology for hate speech classification will build on recent approaches to text classification with neural networks and word embeddings. In this context, fully connected feed-forward networks (Iyyer et al., 2015; Nam et al., 2014), Convolutional Neural Networks (CNN) (Kim, 2014; Johnson and Zhang, 2015) and Recurrent/Recursive Neural Networks (RNN) (Dong et al., 2014) have been applied. On the one hand, approaches based on CNNs and RNNs capture rich compositional information and have outperformed previous state-of-the-art results in text classification; on the other hand, they are computationally intensive and require careful hyperparameter selection and/or regularization (Dai and Le, 2015).


The goal of this PhD Thesis is to develop a new methodology to automatically detect hate speech, based on machine learning and neural networks. Manual detection of this material is infeasible because of the sheer volume of content to be analyzed. In recent years, research has been conducted to develop automatic methods for hate speech detection in the social media domain. These typically employ semantic content analysis techniques built on Natural Language Processing (NLP) and Machine Learning (ML) methods (Schmidt and Wiegand, 2017). Although current methods have reported promising results, their evaluations are largely biased towards detecting content that is non-hate, as opposed to detecting and classifying real hateful content (Zhang and Luo, 2018). Current machine learning methods use only certain task-specific features to model hate speech. We propose to develop an innovative multi-feature approach that combines several sources of information (explicit hate speech, implicit hate speech, contextual conditions affecting the prevalence of hate speech, etc.), so that the weaknesses of individual features are compensated by the strengths of the others.

The student will work within the framework of a French-German project (ANR project).


Berri, M., Garcia-Blanco, I., and Moore, K. (2015). “Press coverage of the Refugee and Migrant Crisis in the EU: A Content Analysis of five European Countries”. Report prepared for the United Nations High Commissioner for Refugees, Cardiff School of Journalism, Media and Cultural Studies.

Dai, A. M. and Le, Q. V. (2015). “Semi-supervised sequence learning”. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3061-3069. Curran Associates, Inc.

Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., and Xu, K. (2014). “Adaptive recursive neural network for target-dependent twitter sentiment classification”. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, Baltimore, MD, USA, Volume 2: pages 49-54.

Iyyer, M., Manjunatha, V., Boyd-Graber, J., and Daumé, H. (2015). “Deep unordered composition rivals syntactic methods for text classification”. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1681-1691.

Johnson, R. and Zhang, T. (2015). “Effective use of word order for text categorization with convolutional neural networks”. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103-112.

Knobel, M. (2012). L’Internet de la haine. Racistes, antisémites, néonazis, intégristes, islamistes, terroristes et homophobes à l’assaut du web. Paris: Berg International.

Schmidt, A. and Wiegand, M. (2017). “A Survey on Hate Speech Detection using Natural Language Processing”. In Workshop on Natural Language Processing for Social Media.

Zhang, Z. and Luo, L. (2018). “Hate speech detection: a solved problem? The Challenging Case of Long Tail on Twitter”.