Supervisors: Irina Illina, MdC, Dominique Fohr, CR CNRS
Team: Multispeech, LORIA-INRIA
Contact: firstname.lastname@example.org, email@example.com
Duration: 5-6 months
Deadline to apply: March 1st, 2019
Required skills: background in statistics and natural language processing, and programming skills (Perl, Python). Candidates should email a detailed CV together with their diploma.
Motivations and context
According to the 2017 International Migration Report, the number of international migrants worldwide has grown rapidly in recent years, reaching 258 million in 2017, of whom 78 million live in Europe. A key reason why EU leaders have struggled to take a decisive and coherent approach to the refugee crisis has been the high level of public anxiety about immigration and asylum across Europe. There are at least three social factors underlying this attitude (Berri et al., 2015): the increase in the number and visibility of migrants; the economic crisis, which has fed feelings of insecurity; and the role of mass media. The last factor has a major influence on the political attitudes of the general public and the elite. Refugees and migrants tend to be framed negatively as a problem. This translates into a significant increase in hate speech towards migrants and minorities. The Internet seems to be fertile ground for hate speech (Knobel, 2012).
The goal of this master internship is to develop a methodology to automatically detect hate speech in social network data (Twitter, YouTube, Facebook).
In text classification, text documents are usually represented in a vector space and then assigned to predefined classes through supervised machine learning. Each document is represented as a numerical vector, which is computed from the words of the document. How to represent the terms numerically in an appropriate way is a basic problem in text classification and directly affects classification accuracy. Developments in Neural Networks (Mikolov et al., 2013a) led to a renewed interest in the field of distributional semantics, more specifically in learning word embeddings (representations of words in a continuous space). Computational efficiency was one major factor that popularized word embeddings. Word embeddings capture syntactic as well as semantic properties of words (Mikolov et al., 2013b). As a result, they have outperformed several other word vector representations on different tasks (Baroni et al., 2014).
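As a rough illustration of the representation step described above, the following minimal Python sketch builds a document vector by averaging word embeddings and assigns a class by cosine similarity to per-class mean vectors. Everything in it is an assumption made for illustration: the 3-dimensional embedding values, the toy sentiment labels, and the helper names (document_vector, train_centroids, classify) are invented, and a nearest-centroid rule stands in for whatever supervised classifier would actually be used.

```python
from math import sqrt

# Toy 3-dimensional embeddings, invented for illustration only.
# Real embeddings would be learned from a large corpus.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.9, 0.3],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_vector(text, embeddings=EMBEDDINGS):
    """Represent a document as the average embedding of its known words."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        # No known word: fall back to the zero vector.
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim
    return mean_vector(vectors)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def train_centroids(labelled_docs):
    """Map each class label to the mean vector of its training documents."""
    grouped = {}
    for text, label in labelled_docs:
        grouped.setdefault(label, []).append(document_vector(text))
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def classify(text, centroids):
    """Return the label whose centroid is most similar to the document."""
    vec = document_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical two-document training set.
TRAIN = [("good movie", "positive"), ("awful movie", "negative")]
CENTROIDS = train_centroids(TRAIN)
```

With these toy values, classify("great movie", CENTROIDS) picks the "positive" centroid even though "great" never appears in the training set, because its embedding lies close to that of "good".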
Our methodology for hate speech classification will build on recent approaches to text classification with Neural Networks and word embeddings. In this context, fully connected feed-forward networks (Iyyer et al., 2015; Nam et al., 2014), Convolutional Neural Networks (CNN) (Kim, 2014; Johnson and Zhang, 2015) and Recurrent/Recursive Neural Networks (RNN) (Dong et al., 2014) have been applied. On the one hand, the approaches based on CNN and RNN capture rich compositional information and have outperformed previous state-of-the-art results in text classification; on the other hand, they are computationally intensive and require careful hyperparameter selection and/or regularization (Dai and Le, 2015).
The goal of this Master internship is to develop a new methodology to automatically detect hate speech, based on machine learning and Neural Networks. Manual detection of this material is infeasible because the volume of content to be analyzed is huge. In recent years, research has been conducted to develop automatic methods for hate speech detection in the social media domain. These typically employ semantic content analysis techniques built on Natural Language Processing (NLP) and Machine Learning (ML) methods (Schmidt and Wiegand, 2017). Although current methods have reported promising results, their evaluations are largely biased towards detecting content that is non-hate, as opposed to detecting and classifying real hateful content (Zhang and Luo, 2018). Current machine learning methods use only certain task-specific features to model hate speech. We propose to develop an innovative approach that combines several sources of information (explicit hate speech, implicit hate speech, contextual conditions affecting the prevalence of hate speech, etc.) into a multi-feature model, so that the weaknesses of individual features are compensated by the strengths of the others.
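The evaluation bias mentioned above can be made concrete with a small Python sketch that compares overall accuracy with per-class precision, recall and F1 on an imbalanced label set. The data are invented for illustration (95 non-hate and 5 hate items, with a hypothetical classifier that finds only one of the hateful items), and the function names are assumptions, not part of any existing toolkit.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_scores(gold, predicted, label):
    """Return (precision, recall, F1) for one class, treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented, heavily imbalanced example: 95 non-hate items and 5 hate items.
GOLD = ["non-hate"] * 95 + ["hate"] * 5
# Hypothetical classifier output: only 1 of the 5 hateful items is detected.
PREDICTED = ["non-hate"] * 95 + ["hate"] + ["non-hate"] * 4

OVERALL_ACCURACY = accuracy(GOLD, PREDICTED)
HATE_SCORES = per_class_scores(GOLD, PREDICTED, "hate")
NON_HATE_SCORES = per_class_scores(GOLD, PREDICTED, "non-hate")
```

On this toy data the overall accuracy is 0.96, while recall on the hate class is only 0.2 (F1 of about 0.33). A score dominated by the majority class can therefore look promising even when most hateful content is missed, which is why reporting per-class results matters for this task.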
References
Baroni, M., Dinu, G., and Kruszewski, G. (2014). “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 238-247.
Berri, M., Garcia-Blanco, I., and Moore, K. (2015). “Press coverage of the refugee and migrant crisis in the EU: a content analysis of five European countries”. Report prepared for the United Nations High Commissioner for Refugees, Cardiff School of Journalism, Media and Cultural Studies.
Dai, A. M. and Le, Q. V. (2015). “Semi-supervised sequence learning”. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3061-3069. Curran Associates, Inc.
Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., and Xu, K. (2014). “Adaptive recursive neural network for target-dependent twitter sentiment classification”. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, Baltimore, MD, USA, Volume 2: pages 49-54.
Iyyer, M., Manjunatha, V., Boyd-Graber, J., and Daumé, H. (2015). “Deep unordered composition rivals syntactic methods for text classification”. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1681-1691.
Johnson, R. and Zhang, T. (2015). “Effective use of word order for text categorization with convolutional neural networks”. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103-112.
Kim, Y. (2014). “Convolutional neural networks for sentence classification”. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751.
Knobel, M. (2012). L’Internet de la haine. Racistes, antisémites, néonazis, intégristes, islamistes, terroristes et homophobes à l’assaut du web. Paris: Berg International.
Mikolov, T., Yih, W.-t., and Zweig, G. (2013a). “Linguistic regularities in continuous space word representations”. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746-751.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). “Distributed representations of words and phrases and their compositionality”. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.
Nam, J., Kim, J., Loza Mencía, E., Gurevych, I., and Fürnkranz, J. (2014). “Large-scale multi-label text classification – revisiting neural networks”. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-14), Part 2, volume 8725, pages 437-452.
Schmidt, A. and Wiegand, M. (2017). “A survey on hate speech detection using natural language processing”. Workshop on Natural Language Processing for Social Media.
Zhang, Z. and Luo, L. (2018). “Hate speech detection: a solved problem? The challenging case of long tail on Twitter”. arxiv.org/pdf/1803.03662