ANR – SoSweet : A sociolinguistics of Twitter — social links and linguistic variations

The SoSweet project focuses on the synchronic variation and diachronic evolution of the variety of French used on Twitter.
The Web has entered all areas of our social life. Since language is central to our social interactions, it is legitimate to ask how the Web has become a factor acting on language. The question is all the more timely as the recent rise of novel digital services opens up new areas of expression, which support new linguistic behaviors. In particular, social media such as Twitter provide channels of communication through which speakers and writers use their language in ways that differ from standard written and oral forms. The result is the emergence of new varieties of language.

A characteristic of these varieties is that they exhibit large variability among communities of speakers and high rates of innovation. A scientific description must take this variability into account and explain how social forces and technical constraints regulate its dynamics. The main goal of SoSweet is to provide a detailed account of the links between linguistic variation and social structure on Twitter, both synchronically and diachronically. Through this specific example, and while remaining aware of its biases, we aim to provide a more detailed understanding of the dynamic links between individuals, social structure, and language variation and change.

Traditional methods are not suited to these questions. First, Twitter requires redefining fundamental concepts such as “addressee” or the distinction between public and private communication. Second, while sociolinguistic studies are typically based on small samples, we will base our analysis on a corpus of 500 million tweets combined with the social network of the 10 million users who authored these tweets, complemented by socio-demographic data. This volume of data leads us to rely heavily on computational methods from several fields. The SoSweet project will therefore adopt a strongly interdisciplinary position, at the intersection of social media linguistics, sociolinguistics, natural language processing (NLP) and network science.

Existing NLP tools are designed for standard forms of language and exhibit a drastic loss of accuracy when applied to social media varieties. Defining appropriate tools requires descriptions of these varieties, yet producing those descriptions itself requires tools. We will address this circularity through interdisciplinary work, developing linguistic descriptions and NLP tools simultaneously. For its part, network science provides us with tools for studying massive data from complex networks of users, through graph theory and computational modeling.

The scientific program of SoSweet has been designed to favor interdisciplinary work: each of the four work packages (management, data collection and enrichment, variation and evolution analysis, outreach) involves all partners. The project will last 48 months. It involves four teams, each a leader in its own field of research. The principal investigator, Icar, specializes in corpus linguistics and computer-mediated interaction. Icar will carry out the task of unifying linguistic evidence (empirical and theoretical) with social cues (extracted from a massive network of sociological relations). Lidilem is in charge of adapting the sociolinguistic framework to the case of variation and communication on Twitter. Alpage, which specializes in natural language processing, takes care of linguistic enrichment, providing the other partners with normalized and structurally enriched forms of text. Alpage is also responsible for providing a distributional analysis of our corpus, by means of various forms of word clustering, in order to define sociolinguistic variants in the tweets. Inria DANTE, which specializes in the exploration of massive graph structures, will lead the crucial network analysis and will work on jointly integrating the sociological network and the linguistic distributional network of lexical relations.