MElt

MElt is a freely available (LGPL) state-of-the-art sequence labeller designed to be trained on both an annotated corpus and an external lexicon. It was initially developed by Pascal Denis and Benoît Sagot; more recent evolutions have been carried out by Benoît Sagot. MElt can use multiclass Maximum-Entropy Markov models (MEMMs) or multiclass perceptrons (multitrons) as its underlying statistical devices. Its output is in the Brown format: one sentence per line, each sentence being a space-separated sequence of annotated words in the word/tag format.
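
For illustration, a single short sentence in the Brown format looks as follows (the part-of-speech tags shown here are illustrative Penn-Treebank-style tags, not a tagset imposed by MElt):

    This/DT is/VBZ an/DT example/NN ./.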

MElt has been trained on various annotated corpora, using for instance Alexina lexicons as sources of lexical information.

MElt also includes a normalization wrapper designed to help process noisy text, such as user-generated content retrieved from the web. This wrapper is only available for French and English. It was used for parsing web data in both English and French, respectively during the 2012 SANCL shared task (Google Web Bank) and for the development of the French Social Media Bank (Facebook, Twitter and blog data).

You can retrain MElt on your own data, provided it is in the Brown format, using the MElt-train script. You must provide an external lexicon file, but it can be an empty file if you do not want to use external lexical information.
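
As a minimal sketch of this data preparation step, the following Python snippet writes tagged sentences to a Brown-format training file and creates an empty lexicon file. The file names, helper function and example tags are illustrative, not part of MElt itself; refer to the MElt-train documentation for the actual command-line options.

    # Minimal sketch (illustrative data and file names): write tagged
    # sentences in the Brown format expected by MElt-train, i.e. one
    # sentence per line, with space-separated word/tag tokens.
    def to_brown(sentences):
        """Each sentence is a list of (word, tag) pairs."""
        return "\n".join(
            " ".join(f"{word}/{tag}" for word, tag in sentence)
            for sentence in sentences
        )

    sentences = [
        [("This", "DT"), ("is", "VBZ"), ("an", "DT"),
         ("example", "NN"), (".", ".")],
    ]

    with open("train.brown", "w", encoding="utf-8") as f:
        f.write(to_brown(sentences) + "\n")

    # The external lexicon file is mandatory but may be left empty if no
    # external lexical information is to be used.
    open("empty.lex", "w").close()

These two files correspond to the annotated corpus and the (here empty) external lexicon that MElt-train expects.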

If you use MElt, please cite one or both of the following publications:

  • Pascal Denis and Benoît Sagot (2012). Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging. Language Resources and Evaluation, 46(4), pp. 721-736. DOI 10.1007/s10579-012-9193-0.
  • Benoît Sagot (2016). External Lexical Information for Multilingual Part-of-Speech Tagging. Inria Research Report RR-8924.

Questions, comments and bug reports should be sent to Benoît Sagot (benoit.sagot@inria.fr).

Downloading MElt

The latest version of MElt is available on Inria's GitLab. Earlier versions can be retrieved from the corresponding download page on the INRIA GForge.
