Online abusive language detection and the role of topic models in a cross-corpora set-up

Speaker: Tulika Bose

Date and place: October 1, 2020 at 10:30 - C005 + VISIO-CONFERENCE

Abstract:

The proliferation of abusive language in social media in recent years is alarming. It calls for proactive, automated mechanisms to help detect and deal with it. In this context, it is important to analyze the topics raised in social media comments, as certain topics involve a high degree of abuse. Moreover, research on the automatic identification of abusive language in social media involves the use of a variety of corpora across the literature, which differ in their sampling strategies, targets of abuse, and topics discussed. The state-of-the-art supervised models that report high performance on such tasks are generally tailored to a specific corpus. In practical scenarios, however, the temporal and contextual shift in social media content requires a high degree of generalisability in abuse detection models. Unsupervised topic models inherently possess the ability to infer a mixture of latent topics on unseen samples. Moreover, they can leverage the large amounts of unannotated data available. In this talk, I will discuss how topic models can be used to reveal the distribution of abusive topics present in social media comments, and present an analysis of how well they generalize in the case of cross-corpora abusive language detection. An experimental analysis of the performance reveals that a larger variety of topics in the training corpus plays an important role in ensuring better generalisability across unseen corpora.
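As a minimal sketch of the property the abstract relies on (not the speaker's actual set-up), the following example trains an LDA topic model on a tiny hypothetical corpus and then infers topic mixtures for unseen comments; all documents, topic counts, and parameters here are illustrative assumptions.

```python
# Sketch: an LDA model fitted on one corpus can infer a mixture of latent
# topics for documents it has never seen -- the behaviour exploited in the
# cross-corpora abuse detection analysis. Corpora below are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical "training corpus" of social media comments
train_docs = [
    "politicians argue about immigration policy",
    "the match was great and the striker scored",
    "new immigration law sparks online debate",
    "fans celebrate the football championship win",
]
# Hypothetical "unseen corpus" with similar topics but different wording
unseen_docs = [
    "heated debate over border policy",
    "the goalkeeper saved the penalty",
]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# transform() infers a per-document distribution over latent topics,
# even for documents outside the training corpus.
topic_mix = lda.transform(vectorizer.transform(unseen_docs))
for doc, mix in zip(unseen_docs, topic_mix):
    print(doc, "->", [round(p, 2) for p in mix])
```

Each row of `topic_mix` is a probability distribution over the latent topics, which is what allows comparing topic coverage between a training corpus and unseen test corpora.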