Maarten van Gompel


2024

pdf bib
A Workflow for HTR-Postprocessing, Labeling and Classifying Diachronic and Regional Variation in Pre-Modern Slavic Texts
Piroska Lendvai | Maarten van Gompel | Anna Jouravel | Elena Renje | Uwe Reichel | Achim Rabus | Eckhart Arnold
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We describe ongoing work for developing a workflow for the applied use case of classifying diachronic and regional language variation in Pre-Modern Slavic texts. The data were obtained via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is to develop a workflow for such historical language data, covering HTR-postprocessing, annotating and classifying the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling, accessible for Humanists with limited experience in research data infrastructures, computational analysis or advanced methods of natural language processing (NLP). The workflow starts by addressing ground truth (GT) data creation for diagnosing and correcting HTR errors via string metrics and data-driven methods. On GT and on HTR data, we subsequently show classification results using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. Each step of the workflow is complemented with describing current limitations and our corresponding work in progress.

2014

pdf bib
SemEval 2014 Task 5 - L2 Writing Assistant
Maarten van Gompel | Iris Hendrickx | Antal van den Bosch | Els Lefever | Véronique Hoste
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Translation Assistance by Translation of L1 Fragments in an L2 Context
Maarten van Gompel | Antal van den Bosch
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
CLAM: Quickly deploy NLP command-line tools on the web
Maarten van Gompel | Martin Reynaert
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

2013

pdf bib
WSD2: Parameter optimisation for Memory-based Cross-Lingual Word-Sense Disambiguation
Maarten van Gompel | Antal van den Bosch
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
Beyond SoNaR: towards the facilitation of large corpus building efforts
Martin Reynaert | Ineke Schuurman | Véronique Hoste | Nelleke Oostdijk | Maarten van Gompel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering to undertake an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quali ty semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated work flow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for.

2010

pdf bib
UvT-WSD1: A Cross-Lingual Word Sense Disambiguation System
Maarten van Gompel
Proceedings of the 5th International Workshop on Semantic Evaluation