ALMAnaCH is a follow-up to the ALPAGE project-team, which came to an end at the end of December 2016. ALPAGE was created in 2007 in collaboration with Paris-Diderot University and had held the status of a UMR-I since 2009. This joint team, which brought together computational linguists from Inria and Paris-Diderot computational linguists with a strong background in linguistics, proved successful. However, the context is changing, with the recent emergence of digital humanities and, more importantly, of computational humanities. This presents both an opportunity and a challenge for Inria computational linguists. It provides them with new types of data to which their tools, resources and algorithms can be applied, leading to new results in the human sciences. Computational humanities also provide computational linguists with new and challenging research problems which, if solved, open up new ways of studying the human sciences.
ALMAnaCH’s scientific positioning therefore extends ALPAGE’s. We remain committed to developing state-of-the-art natural language processing software and resources that can be used in academia and in industry. At the same time, we continue our work on language modelling in order to provide a better understanding of languages, an objective that is reinforced and addressed in the broader context of computational humanities, with an emphasis on language evolution and, as a result, on ancient languages.
This new scientific positioning has motivated the creation of a new project-team with a new partner, namely the École Pratique des Hautes Études (EPHE). The EPHE is a leading institution in France in the human sciences in general and in digital and computational humanities in particular. Two EPHE research directors, who have already been working together for some time in computational humanities, are permanent members of the project-team: a philologist and a computer scientist, both specialists in computational approaches to philology and ancient language studies, in line with the above-mentioned scientific positioning.
Scientific context and objectives
One of the main challenges in computational linguistics is modelling and analysing language variation. Language varies with respect to domain and genre (news wires, scientific literature, poetry, oral transcripts…) and to sociolinguistic factors (age, background, education; variation attested for instance on social media). But language is also in constant evolution at all time scales. Most current approaches to addressing this variability are not satisfactory: they often require huge amounts of (costly) annotated data, and they do not fully succeed in dealing with the high level of variability found, for instance, in contemporary user-generated content or in ancient texts.
ALMAnaCH tackles the challenge of language variation in two complementary and mutually interacting directions:
- Firstly, we focus on linguistic representations that are less affected by language variation (research strand 1). This first requires improving the production of semantic representations (semantic parsing). More importantly, it also involves investigating the integration of both linguistic and non-linguistic contextual information to improve automatic linguistic analysis. This is an emerging and promising line of research in the field of natural language processing. It requires identifying the type of contextual information available in each case, how to extract it and how to integrate it.
- Secondly, language variation must be better understood and modelled in all its forms (research strand 2). In this regard, we put a strong emphasis on two types of language variation and their mutual interaction: sociolinguistic variation in synchrony and diachronic variation, of which the latter is of particular importance in ALMAnaCH. It is the main motivation behind the creation of a joint project-team between Inria and EPHE. We will concentrate on research questions pertaining to the development of models, resources and tools for ancient languages (especially ancient Semitic and Indo-European languages), from both a synchronic and a diachronic perspective (computational exploration of language change and language history). This new line of research is at the core of the new collaboration between Inria and EPHE and is a fascinating new direction of research for former ALPAGE members.
These two research directions rely on the availability of language resources (corpora, lexicons). The development of raw corpora from original sources is a domain of expertise of ALMAnaCH’s EPHE members. The (manual, semi-automatic and automatic) development of lexical resources and annotated corpora is a domain of expertise of ALMAnaCH’s Inria and Paris 4 members. This complementary expertise in language resource development (research strand 3) benefits the whole team and beyond, and will both feed and benefit from the work of the other research strands.