Presentation

ALMAnaCH is an Inria project-team (Inria Paris research centre) whose multidisciplinary research area combines Natural Language Processing and Computational Humanities.

General presentation

ALMAnaCH follows on from the ALPAGE project-team, which came to an end in December 2016. ALPAGE was created in 2007 in collaboration with Université Paris-Diderot and obtained UMR-I status in 2009. This joint team, composed of computational linguists from Inria and Paris-Diderot, was a success. However, the context is changing, with the recent emergence of digital humanities and, more importantly, of computational humanities. This is both an opportunity and a challenge for Inria's computational linguists. It opens up new types of data to which their tools, resources and algorithms can be applied, leading to new results in the humanities. Computational humanities also provide computational linguists with new and challenging research problems which, once solved, offer new ways of studying the humanities.

ALMAnaCH's scientific positioning thus extends that of ALPAGE. We continue to develop state-of-the-art software and resources for natural language processing (NLP) that can be used both in research and in industry, in particular by implementing recent approaches based on deep learning. In parallel, we pursue our work on linguistic modelling in order to better understand languages, an objective that will be reinforced and addressed in the broader context of computational humanities, with an emphasis on language evolution and, consequently, on ancient languages. Finally, we remain committed to having an industrial and societal impact, through multiple forms of collaboration with companies and other institutions (startup creation, industrial contracts, expert assessments, etc.).

Scientific context and objectives

One of the main challenges in computational linguistics is to model and to cope with language variation. Language varies with respect to domain and genre (news wires, scientific literature, poetry, oral transcripts…), sociolinguistic factors (age, background, education; variation attested for instance on social media), geographical factors (dialects) and other dimensions (disabilities, for instance). But language also constantly evolves at all time scales. Addressing this variability is still an open issue for NLP. Commonly used approaches, which often rely on supervised and semi-supervised machine learning methods, require huge amounts of annotated data. They still suffer from the high level of variability found for instance in user-generated content, non-contemporary texts, as well as in domain-specific documents (e.g. financial, legal).
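As a toy illustration of why supervised approaches struggle with this variability, consider a most-frequent-tag baseline trained on a small canonical corpus: tokens spelled non-canonically, as in user-generated content, simply fall outside its vocabulary. The corpus, tag set and tokens below are invented for the sake of the example.

```python
from collections import Counter, defaultdict

# Tiny invented "canonical" training corpus of (token, POS-tag) pairs.
train = [
    ("you", "PRON"), ("are", "VERB"), ("great", "ADJ"),
    ("you", "PRON"), ("are", "VERB"), ("late", "ADJ"),
]

# Most-frequent-tag baseline: for each known token, keep its most common tag.
counts = defaultdict(Counter)
for token, tag_ in train:
    counts[token][tag_] += 1
model = {token: c.most_common(1)[0][0] for token, c in counts.items()}

def tag(tokens):
    # Unknown tokens get the "UNK" pseudo-tag: the model has no evidence for them.
    return [(t, model.get(t, "UNK")) for t in tokens]

# Canonical input is fully covered...
print(tag(["you", "are", "great"]))
# ...but the same message in non-canonical spelling is entirely out of vocabulary.
print(tag(["u", "r", "gr8"]))
```

The sketch deliberately ignores context and smoothing; its point is only that annotated training data in one variety gives no direct coverage of another.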

ALMAnaCH will tackle the challenge of language variation in two complementary directions, supported by a third, transverse research strand on language resources.

Research strand 1: Automatic Context-augmented Linguistic Analysis

We will focus on linguistic representations that are less affected by language variation. This obviously requires us to stay at a state-of-the-art level in key NLP tasks such as shallow processing, part-of-speech tagging and (syntactic) parsing, which are core expertise domains of ALMAnaCH members. This will also require us to improve the generation of semantic representations (semantic parsing), and to begin to explore tasks such as machine translation, which now relies on neural architectures also used for some of the above-mentioned tasks. This will also involve the integration of both linguistic and non-linguistic contextual information to improve automatic linguistic analysis. This is an emerging and promising line of research in NLP. We will have to identify, model and take advantage of each type of contextual information available. Addressing these issues will enable the development of new lines of research related to conversational content. Applications include improved information and knowledge extraction algorithms. We will especially focus on challenging datasets such as domain-specific texts (e.g. financial, legal) as well as historical documents, in the larger context of the development of digital humanities.
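A minimal sketch of how a piece of non-linguistic contextual information can steer an otherwise ambiguous analysis: here an invented document-level domain label is used to pick a word sense. The sense lexicon and domain labels are hypothetical, not an actual ALMAnaCH resource.

```python
# Hypothetical sense lexicon: ambiguous token -> domain -> preferred sense.
SENSES = {
    "bank": {"finance": "bank/INSTITUTION", "geography": "bank/RIVERSIDE"},
    "bond": {"finance": "bond/SECURITY", "chemistry": "bond/LINK"},
}

def analyse(tokens, domain):
    """Resolve each token to a sense, letting document-level context
    (the domain label) break ties; other tokens pass through unchanged."""
    out = []
    for t in tokens:
        options = SENSES.get(t)
        out.append(options[domain] if options and domain in options else t)
    return out

print(analyse(["the", "bank", "issued", "a", "bond"], domain="finance"))
print(analyse(["the", "bank", "of", "the", "river"], domain="geography"))
```

Real context-augmented models learn such interactions rather than hard-coding them, but the principle is the same: the analysis function takes the context as an explicit input.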

Research strand 2: Computational Modelling of Linguistic Variation

Language variation must be better understood and modelled in all its forms. In this regard, we will put a strong emphasis on four types of language variation and their mutual interaction: sociolinguistic variation in synchrony and short-term diachrony (including non-canonical spelling and syntax in user-generated content); complexity-based variation in relation with language-related disabilities; diachronic variation (computational exploration of language change and language history, with a focus on the evolution from Old French to all forms of Modern French, as well as on Indo-European languages in general); and a fourth, more transverse kind of variation stemming from the way language is graphically encoded, which we call language-encoding variation. Indeed, the noise introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) systems, especially in the context of historical documents, bears some similarities with that of non-canonical input in user-generated content (e.g. erroneous characters). Other types of language variation could also become important research topics for ALMAnaCH in the future. These include dialectal variation (e.g. work on Arabic varieties, which we have already started, focusing on Maghrebi Arabizi, the Arabic variants used on social media by people from Maghreb countries, written in a non-fixed Latin-script transcription) as well as the study and exploitation of paraphrases in a broader context than the above-mentioned complexity-based variation.
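The parallel between OCR/HTR noise and user-generated non-canonical input can be made concrete at the character level: both typically lie a few character edits away from a canonical form. A small sketch using a standard edit-distance computation over invented example strings:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertion/deletion/substitution).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Invented examples: an OCR-style confusion ("rn" misread as "m") and a
# user-generated respelling both stay close to the canonical form.
print(levenshtein("modern", "modem"))  # OCR-style character noise
print(levenshtein("great", "gr8"))     # UGC-style respelling
```

This is only an illustration of the surface similarity; actual correction and normalisation models exploit much richer signals (character confusion statistics, language models, document context).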

Both research strands above rely on the availability of language resources (corpora, lexicons), which is the focus of our third research strand.

Research strand 3: Modelling and development of Language Resources

Language resource development is not only a technical challenge and a necessary preliminary step to create evaluation datasets for NLP systems as well as training datasets for systems using machine learning. It is also a research field in itself, which concerns, among other challenges, (i) the development of semi-automatic and automatic algorithms to speed up the work (e.g. automatic extraction of lexical information, low-resource learning for the development of pre-annotation algorithms, transfer methods to leverage existing tools and/or resources for other languages, etc.) and (ii) the development of formal models to represent linguistic information in the best possible way, thus requiring expertise at least in NLP and in typological and formal linguistics. Language resource development involves the creation of raw corpora from original sources as well as the (manual, semi-automatic or automatic) development of lexical resources and annotated corpora. Such endeavours are domains of expertise of the ALMAnaCH team. This research strand 3 will benefit the whole team and beyond, and will both benefit from and feed the work of the other research strands.
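As a minimal sketch of the kind of pre-annotation algorithm mentioned in (i), a lexicon lookup can propose candidate labels that human annotators then validate or correct, speeding up corpus annotation. The lexicon and label set below are invented for illustration.

```python
# Invented lexical resource mapping word forms to candidate POS labels.
LEXICON = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "texts": ["NOUN", "VERB"],
    "vary": ["VERB"],
}

def pre_annotate(tokens):
    """Attach candidate labels from the lexicon; tokens absent from the
    resource are flagged with '?' for full manual annotation."""
    return [(t, LEXICON.get(t, ["?"])) for t in tokens]

# Annotators only choose among candidates or fill in the flagged cases.
for token, candidates in pre_annotate(["the", "old", "texts", "vary", "greatly"]):
    print(f"{token}\t{'|'.join(candidates)}")
```

In practice such lookups are combined with learned models (including low-resource and transfer approaches, as mentioned above), but even this naive form reduces the annotators' search space.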
