ALMAnaCH is an Inria research team (Inria Paris research center) whose pluridisciplinary research domain brings together Natural Language Processing and Computational Humanities.
ALMAnaCH is a follow-up to the ALPAGE project-team, which came to an end in December 2016. ALPAGE was created in 2007 in collaboration with Paris-Diderot University and had held the status of a UMR-I since 2009. This joint team, which brought together computational linguists from Inria and computational linguists from Paris-Diderot with a strong background in linguistics, proved successful. However, the context is changing, with the recent emergence of digital humanities and, more importantly, of computational humanities. This presents both an opportunity and a challenge for Inria computational linguists. It provides them with new types of data to which their tools, resources and algorithms can be applied, leading to new results in the human sciences. Computational humanities also provide computational linguists with new and challenging research problems which, if solved, open up new ways of studying the human sciences.
The scientific positioning of ALMAnaCH therefore extends that of ALPAGE. We remain committed to developing state-of-the-art NLP software and resources that can be used by academics and in industry, including recent approaches based on deep learning. At the same time, we will continue our work on language modelling in order to provide a better understanding of languages, an objective that will be reinforced and addressed in the broader context of computational humanities, with an emphasis on language evolution and, as a result, on ancient languages. Finally, we will remain dedicated to having an impact on the industrial and social world, via multiple types of collaboration with companies and other institutions (startup creation, industrial contracts, expertise, etc.).
Scientific context and objectives
One of the main challenges in computational linguistics is to model and to cope with language variation. Language varies with respect to domain and genre (news wires, scientific literature, poetry, oral transcripts…), sociolinguistic factors (age, background, education; variation attested for instance on social media), geographical factors (dialects) and other dimensions (disabilities, for instance). But language also constantly evolves at all time scales. Addressing this variability is still an open issue for NLP. Commonly used approaches, which often rely on supervised and semi-supervised machine learning methods, require huge amounts of annotated data. They still suffer from the high level of variability found for instance in user-generated content, non-contemporary texts, as well as in domain-specific documents (e.g. financial, legal).
ALMAnaCH will tackle the challenge of language variation in two complementary directions, supported by a third, transverse research strand on language resources.
Research strand 1: Automatic Context-augmented Linguistic Analysis
We will focus on linguistic representations that are less affected by language variation. This requires us to stay at a state-of-the-art level in key NLP tasks such as shallow processing, part-of-speech tagging and (syntactic) parsing, which are core expertise domains of ALMAnaCH members. It will also require us to improve the generation of semantic representations (semantic parsing), and to begin to explore tasks such as machine translation, which now relies on neural architectures also used for some of the above-mentioned tasks. It will further involve integrating both linguistic and non-linguistic contextual information to improve automatic linguistic analysis, an emerging and promising line of research in NLP. We will have to identify, model and take advantage of each type of contextual information available. Addressing these issues will enable the development of new lines of research related to conversational content. Applications include improved information and knowledge extraction algorithms. We will especially focus on challenging datasets such as domain-specific texts (e.g. financial, legal) as well as historical documents, in the larger context of the development of digital humanities.
Research strand 2: Computational Modelling of Linguistic Variation
Language variation must be better understood and modelled in all its forms. In this regard, we will put a strong emphasis on four types of language variation and their mutual interaction. The first three are sociolinguistic variation in synchrony and short-term diachrony (including non-canonical spelling and syntax in user-generated content), complexity-based variation in relation with language-related disabilities, and diachronic variation (computational exploration of language change and language history, with a focus on French, from Old French to all forms of Modern French, as well as on Indo-European languages in general). The fourth stems from the noise introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) systems, especially in the context of historical documents, which bears some similarities with that of non-canonical input in user-generated content (e.g. erroneous characters). This noise constitutes a more transverse kind of variation stemming from the way language is graphically encoded, which we call language-encoding variation. Other types of language variation could also become important research topics for ALMAnaCH in the future. These include dialectal variation (e.g. work on Arabic varieties, which we have already started, focusing on Maghrebi Arabizi, the Arabic variants used on social media by people from Maghreb countries, written in a non-fixed Latin-script transcription) as well as the study and exploitation of paraphrases in a broader context than the above-mentioned complexity-based variation.
Both research strands above rely on the availability of language resources (corpora, lexicons), which is the focus of our third research strand.
Research strand 3: Modelling and development of Language Resources
Language resource development is not only a technical challenge and a necessary preliminary step to create evaluation datasets for NLP systems as well as training datasets for systems using machine learning. It is also a research field in itself, which concerns, among other challenges, (i) the development of semi-automatic and automatic algorithms to speed up the work (e.g. automatic extraction of lexical information, low-resource learning for the development of pre-annotation algorithms, transfer methods to leverage existing tools and/or resources for other languages, etc.) and (ii) the development of formal models to represent linguistic information in the best possible way, thus requiring expertise at least in NLP and in typological and formal linguistics. Language resource development involves the creation of raw corpora from original sources as well as the (manual, semi-automatic or automatic) development of lexical resources and annotated corpora. Such endeavours are domains of expertise of the ALMAnaCH team. This research strand 3 will benefit the whole team and beyond, and will both benefit from and feed the work of the other research strands.