Adrien Pupier


2024

Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains
Vincent Segonne | Aidan Mannion | Laura Cristina Alonzo Canul | Alexandre Daniel Audibert | Xingyu Liu | Cécile Macaire | Adrien Pupier | Yongxin Zhou | Mathilde Aguiar | Felix E. Herron | Magali Norré | Massih R Amini | Pierrette Bouillon | Iris Eshkol-Taravella | Emmanuelle Esperança-Rodier | Thomas François | Lorraine Goeuriot | Jérôme Goulian | Mathieu Lafourcade | Benjamin Lecouteux | François Portet | Fabien Ringeval | Vincent Vandeghinste | Maximin Coavoux | Marco Dinarelli | Didier Schwab
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.
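
The approximate attention mentioned in the abstract can be illustrated with a minimal single-head sketch. It is not the Jargon implementation: the helper names (`matmul`, `softmax_rows`, `linformer_attention`), the toy sizes, and the random projection matrices `E` and `F` are assumptions made for illustration, and in a trained model the projections would be learned. The idea shown is that keys and values are projected from sequence length n down to a fixed k before attention, so the score matrix is n × k rather than n × n.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def linformer_attention(Q, K, V, E, F):
    """Q, K, V have shape (n, d); E and F have shape (k, n)."""
    d = len(Q[0])
    K_proj = matmul(E, K)  # (k, d): keys compressed along the sequence axis
    V_proj = matmul(F, V)  # (k, d): values compressed the same way
    scores = matmul(Q, transpose(K_proj))  # (n, k) instead of (n, n)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), V_proj)  # (n, d)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): sequence length 8, head dimension 4, projected length 2.
rng = random.Random(0)
n, d, k = 8, 4, 2
Q, K, V = (random_matrix(n, d, rng) for _ in range(3))
E = random_matrix(k, n, rng)  # random here; learned in a real model
F = random_matrix(k, n, rng)
out = linformer_attention(Q, K, V, E, F)
print(len(out), len(out[0]))  # prints "8 4"
```

With k equal to n and identity projections, the function reduces to standard scaled dot-product attention, which makes for a convenient sanity check.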

What Has LeBenchmark Learnt about French Syntax?
Zdravko Dugonjić | Adrien Pupier | Benjamin Lecouteux | Maximin Coavoux
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The paper reports on a series of experiments aiming at probing LeBenchmark, a pretrained acoustic model trained on 7k hours of spoken French, for syntactic information. Pretrained acoustic models are increasingly used for downstream speech tasks such as automatic speech recognition, speech translation, spoken language understanding or speech parsing. They are trained on very low-level information (the raw speech signal), and do not have explicit lexical knowledge. Despite that, they obtain reasonable results on tasks that require higher-level linguistic knowledge. As a result, an emerging question is whether these models encode syntactic information. We probe each representation layer of LeBenchmark for syntax, using the Orféo treebank, and observe that it has learnt some syntactic information. Our results show that syntactic information is more easily extractable from the middle layers of the network, after which a very sharp decrease is observed.
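
Layer-wise probing, as described in the abstract, can be sketched as follows. This is only an illustration of the general idea, not the probe used in the paper: the nearest-centroid classifier, the function names (`nearest_centroid_accuracy`, `probe_layers`) and the hand-made toy features are all assumptions. A simple classifier is fitted on the frozen representations of each layer, and its held-out accuracy is read as a measure of how easily the target labels can be extracted from that layer.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit one centroid per label on the training set, score on the test set."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in sorted(set(train_y))
    }
    correct = 0
    for x, y in zip(test_x, test_y):
        predicted = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += predicted == y
    return correct / len(test_y)


def probe_layers(layers_train, y_train, layers_test, y_test):
    """Return one held-out accuracy per layer; the model itself is never updated."""
    return [
        nearest_centroid_accuracy(x_train, y_train, x_test, y_test)
        for x_train, x_test in zip(layers_train, layers_test)
    ]


# Toy data (assumed): layer 0 carries no label information, layer 1 separates the labels.
y_train, y_test = [0, 0, 1, 1], [0, 1]
layer0_train = [[0.0, 0.0] for _ in range(4)]
layer0_test = [[0.0, 0.0] for _ in range(2)]
layer1_train = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
layer1_test = [[0.05, 0.05], [0.95, 0.95]]

accuracies = probe_layers(
    [layer0_train, layer1_train], y_train, [layer0_test, layer1_test], y_test
)
print(accuracies)  # prints "[0.5, 1.0]"
```

Plotting such per-layer scores against layer index is what makes a pattern like a peak in the middle layers visible.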

2023

PROPICTO: Developing Speech-to-Pictograph Translation Systems to Enhance Communication Accessibility
Lucía Ormaechea | Pierrette Bouillon | Maximin Coavoux | Emmanuelle Esperança-Rodier | Johanna Gerlach | Jérôme Goulian | Benjamin Lecouteux | Cécile Macaire | Jonathan Mutal | Magali Norré | Adrien Pupier | Didier Schwab
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

PROPICTO is a project funded by the French National Research Agency and the Swiss National Science Foundation that aims to create Speech-to-Pictograph translation systems, with a special focus on French as an input language. By developing such technologies, we intend to enhance communication access for non-French-speaking patients and people with cognitive impairments.

2022

Une chaîne de traitements pour la simplification automatique de la parole et sa traduction automatique vers des pictogrammes (Simplification and automatic translation of speech into pictograms)
Cécile Macaire | Lucia Ormaechea-Grijalba | Adrien Pupier
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 : 24e Rencontres Etudiants Chercheurs en Informatique pour le TAL (RECITAL)

Augmentative and Alternative Communication (AAC) weighs heavily on people with disabilities and their families because it is difficult to use. To reduce this burden, tools that translate speech into pictograms are relevant. They can also be of great help for communication accessibility in hospital settings. In this article, we present a research project aimed at developing a speech-to-pictogram translation system. It involves a processing pipeline spanning several areas of natural language and speech processing, such as automatic speech recognition, syntactic parsing, text simplification, and machine translation into pictograms. We present the difficulties associated with each of these areas and, for some of them, possible approaches to resolving them.