We are releasing CamemBERT: a Tasty French Language Model (soon on arXiv).
CamemBERT is trained on 138 GB of French text. It establishes a new state of the art in part-of-speech (POS) tagging, dependency parsing and named entity recognition (NER), and achieves strong results in natural language inference (NLI).
CamemBERT is the result of joint work involving Inria and Facebook research: Louis Martin, Benjamin Muller, Pedro Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot.
CamemBERT’s architecture is a variant of RoBERTa (Liu et al. 2019), with SentencePiece tokenisation (Kudo and Richardson 2018) and whole-word masking. It is trained on the French part of our OSCAR corpus created from CommonCrawl (Ortiz Suárez et al. 2019).
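For readers unfamiliar with whole-word masking, here is a minimal sketch of the idea, not CamemBERT's actual training code. It assumes SentencePiece-style pieces in which a leading "▁" marks the start of a word. The example tokenisation, the `<mask>` string, the 0.15 masking rate and both function names are illustrative assumptions, and the sketch omits the random-replacement and keep-as-is variants used in BERT-style training.

```python
import random

WORD_START = "\u2581"  # "▁", the SentencePiece word-boundary marker
MASK = "<mask>"        # assumed mask symbol, for illustration only


def group_into_words(pieces):
    """Return lists of piece indices, one list per word.

    A piece that starts with "▁" opens a new word; any other piece
    continues the current word.
    """
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])
        else:
            words[-1].append(i)
    return words


def whole_word_mask(pieces, mask_prob=0.15, rng=None):
    """Mask whole words: either every piece of a word is masked or none is."""
    if rng is None:
        rng = random.Random()
    masked = list(pieces)
    for word in group_into_words(pieces):
        # One random draw per word, not per piece.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked


# Hypothetical tokenisation of "Le camembert est délicieux".
pieces = ["\u2581Le", "\u2581cam", "embert", "\u2581est", "\u2581délicieux"]
print(group_into_words(pieces))  # [[0], [1, 2], [3], [4]]
print(whole_word_mask(pieces, mask_prob=0.5, rng=random.Random(0)))
```

With per-piece masking, "▁cam" could be hidden while "embert" stays visible, which makes the prediction much easier. Drawing once per word avoids that.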
Bon appétit!