Almanach Seminars (2018-2019 Season)
This is the research seminar of the Almanach team, a joint INRIA – EPHE team
specializing in natural language processing and digital humanities.
Venue: Inria Paris, 2 rue Simone Iff, 75012 Paris
Frequency: Irregular (in practice, every second or third Friday of each month). Contact: firstname.lastname@example.org
Anyone interested is welcome.
(To comply with security requirements, please bring a piece of ID
and a copy, possibly digital, of the seminar announcement.)
Marine Courtin and Kim Gerdes (Univ. Paris 3 et CNRS)
!! Friday, October 5, 2018, 11am !!
Building a Treebank for Naija, the English-based Creole of Nigeria.
Abstract: As an example of treebank development without pre-existing
language-specific NLP tools, we will present the ongoing work of constructing a
750 000 word treebank for Naija. The annotation project, part of the
NaijaSynCor ANR project, has a social dimension because the language,
which is not fully recognized as such by the speakers themselves, is not
yet institutionalized in any way. Yet, Naija, spoken by close to 100
million speakers, could play an important role in the nation-building
process of Nigeria. We will briefly present a few particularities of
Naija such as serial verbs, reduplications, and emphatic adverbial
particles. We used a bootstrapping process of manual annotation and
parser training to enhance and speed up the annotation process. The
annotation is done in the Syntactic Universal Dependencies scheme (SUD)
which allows seamless transformation into Universal Dependencies (UD) by
means of Grew (http://grew.fr/), a rule-based graph rewriting system. We
will present the different tools involved in this process, and we will
show a few preliminary quantitative measures on the annotated sentences.
September 24, 2018
New Resources and Ideas for Semantic Parsing
Kyle Richardson, IMS
abstract: In this talk, I will give an overview of research being done at the University of Stuttgart on semantic parser
induction and natural language understanding. The main topic, semantic parser induction, relates to the problem
of learning to map input text to full meaning representations from parallel datasets. Such resulting “semantic parsers”
are often a core component in various downstream natural language understanding applications, including automated
question-answering and generation systems. We look at learning within several novel domains and datasets being
developed in Stuttgart (e.g., software documentation for text-to-code translation) and under various types of data
supervision (e.g., learning from entailment, “polyglot” modeling, or learning from multiple datasets).
bio: Kyle Richardson is a finishing PhD student at the University of Stuttgart (IMS), working on semantic parsing and
various applications thereof. Prior to this, he was a researcher in the Intelligent Systems Lab at the Palo Alto
Research Center (PARC), and holds a B.A. from the University of Rochester, USA.
He’ll be joining the Allen Institute for AI in November.
September 21, 2018
Historical text normalization with neural networks
Marcel Bollmann (University of Copenhagen, Department of Computer Science)
With the increasing availability of digitized historical documents,
interest in effective NLP tools for these documents is on the rise. The
abundance of variant spellings, however, makes them challenging to work
with for both humans and machines. For my PhD thesis, I worked on
automatic normalization—mapping historical spellings to modern
ones—as a possible approach to this problem. I looked at datasets of
historical texts in eight different languages and evaluated
normalization using rule-based, statistical, and neural approaches, with
a particular focus on tuning a neural encoder–decoder model. In this
talk, I will highlight what I learned from different perspectives: Why,
what, and how to normalize? How do the different approaches compare and
which one should I use? And what can we learn from this about neural
networks that might be useful for other NLP tasks?
Almanach Seminars (2017-2018 Season)
Catherine Koshmar (Cambridge University, UK) (May 5, 2018)
Text readability assessment for second language learners
Abstract: In this talk, I will present our work on readability assessment for
texts aimed at second language (L2) learners. I will discuss the
approaches to this task and the features that we use in the machine
learning framework. One of the major challenges in this task is the
lack of significantly sized level-annotated data for L2 learners,
as most models are aimed at and trained on the large amounts of texts
for native English speakers. I will give an overview of methods for adapting
models trained on larger native corpora to estimate text readability for
L2 learners. Once the readability level of the text is assessed, the
text can be adapted (e.g., simplified) to the level of the reader. The
first step in this process is identification of words and phrases in
need of simplification or adaptation. This task is called Complex Word
Identification (CWI), and it has recently attracted much attention. In
the second part of the talk, I will discuss the approaches to CWI and
present our winning submission to the CWI Shared Task 2018.
Houda Bouamor (Carnegie Mellon University, Qatar) (November 22, 2017)
Quality Evaluation of Machine Translation into Arabic
Abstract: In machine translation, automatically obtaining a reliable assessment
of translation quality is a challenging problem. Several techniques
for automatically assessing translation quality for different purposes
have been proposed, but these are mostly limited to strict string
comparisons between the generated translation and translations produced by
humans. This approach is too simplistic and ineffective for languages with
flexible word order and rich morphology such as Arabic, a language for
which machine translation evaluation is still an under-studied problem,
despite posing many challenges.
In this talk, I will first introduce AL-BLEU, a metric for Arabic
machine translation evaluation that uses a rich set of morphological,
syntactic and lexical features to extend the evaluation beyond exact
matching. We showed that AL-BLEU has a stronger correlation with human
judgments than the state-of-the-art classical metrics.
Then, I will present a more advanced study in which we explore the use of
embeddings obtained from different levels of lexical and morpho-syntactic
linguistic analysis and show that they improve MT evaluation into
Arabic. Our results show that using a neural-network model with
different input representations produces results that clearly outperform
the state of the art for MT evaluation into Arabic, with an increase of about 75%
in correlation with human judgments on pairwise MT evaluation.
Jacobo Levy Abitbol and Marton Karsaï (ENS Lyon, Inria Dante) (November 13, 2017)
Socioeconomic dependencies of linguistic patterns in Twitter: Correlation and learning
Abstract: Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors, leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from the national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, who are more active during the daytime, use a more standard language; (ii) the southern part of the country is more prone to using standard language than the northern one, while locally the variety or dialect used is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. In the second part of the talk, we will discuss how linguistic information and the detected correlations can be used to infer socioeconomic status.