Jorge Baptista

2024

Charting the Linguistic Landscape of Developing Writers: An Annotation Scheme for Enhancing Native Language Proficiency
Miguel Da Corte | Jorge Baptista
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This study describes a pilot annotation task designed to capture orthographic, grammatical, lexical, semantic, and discursive patterns exhibited by college native English speakers participating in developmental education (DevEd) courses. The paper introduces an annotation scheme developed by two linguists to pinpoint linguistic challenges that hinder effective written communication. The scheme builds upon patterns supported by the literature, which are known predictors of student placement in DevEd courses and of English proficiency levels. Other novel, multilayered linguistic aspects that the literature has not yet explored are also presented. The scheme and its primary categories are succinctly presented and justified. Two trained annotators used this scheme to annotate a sample of 103 text units (3 during the training phase and 100 during the annotation task proper). Texts were randomly selected from a population of 290 community-college-intending students. An in-depth quality assurance inspection was conducted to assess tagging consistency between annotators and to discern (and address) annotation inaccuracies. Krippendorff’s Alpha (K-alpha) interrater reliability coefficients were calculated, revealing a K-alpha score of k=0.40, which corresponds to a moderate level of agreement, deemed adequate for the complexity and length of the annotation task.

Enhancing Writing Proficiency Classification in Developmental Education: The Quest for Accuracy
Miguel Da Corte | Jorge Baptista
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Developmental Education (DevEd) courses align students’ college-readiness skills with higher education literacy demands. These courses often use automated assessment tools like Accuplacer for student placement. Existing literature raises concerns about these exams’ accuracy and placement precision due to their narrow representation of the writing process. These concerns warrant further attention within the domain of automatic placement systems, particularly in the establishment of a reference corpus of annotated essays for training these systems’ machine/deep learning models. This study aims to develop an enhanced annotation procedure to assess college students’ writing patterns more accurately. It examines the efficacy of machine-learning-based DevEd placement, contrasting Accuplacer’s classification of 100 college-intending students’ essays into two levels (Level 1 and 2) against that of 6 human raters. The classification task encompassed the assessment of the 6 textual criteria currently used by Accuplacer: mechanical conventions, sentence variety & style, idea development & support, organization & structure, purpose & focus, and critical thinking. Results revealed low inter-rater agreement, both on the individual criteria and the overall classification, suggesting human assessment of writing proficiency can be inconsistent in this context. To achieve a more accurate determination of writing proficiency and improve DevEd placement, more robust classification methods are thus required.

Automatic Text Readability Assessment in European Portuguese
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese

Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese
Izabela Muller | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese

Towards a Syntactic Lexicon of Brazilian Portuguese Adjectives
Ryan Martinez | Jorge Baptista | Oto Vale
Proceedings of the 16th International Conference on Computational Processing of Portuguese

Text Readability Assessment in European Portuguese: A Comparison of Classification and Regression Approaches
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese

2022

Support Verb Constructions across the Ocean Sea
Jorge Baptista | Nuno Mamede | Sónia Reis
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

This paper analyses the support (or light) verb constructions (SVC) in a publicly available, manually annotated corpus of multiword expressions (MWE) in Brazilian Portuguese. The paper highlights several issues in the linguistic definitions therein adopted for these types of MWE, and reports the results from applying STRING, a rule-based parsing system, originally developed for European Portuguese, to this corpus from Brazilian Portuguese. The goal is two-fold: to improve the linguistic definition of SVC in the annotation task, as well as to gauge the major difficulties found when transposing linguistic resources between these two varieties of the same language.

2017

Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.

Os Provérbios em manuais de ensino de Português Língua Não Materna (The Proverbs of teaching manuals in Non-Native Portuguese) [In Portuguese]
Sónia Reis | Jorge Baptista
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology

2016

metaTED: a Corpus of Metadiscourse for Spoken Language
Rui Correia | Nuno Mamede | Jorge Baptista | Maxine Eskenazi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes metaTED, a freely available corpus of metadiscursive acts in spoken language collected via crowdsourcing. Metadiscursive acts were annotated on a set of 180 randomly chosen TED talks in English, spanning different speakers and topics. The taxonomy used for annotation is composed of 16 categories, adapted from Ädel (2010). This adaptation takes into account both the material to annotate and the setting in which the annotation task is performed. The crowdsourcing setup is described, including considerations regarding training and quality control. The collected data is evaluated in terms of quantity of occurrences, inter-annotator agreement, and annotation-related measures (such as average time on task and self-reported confidence). Results show different levels of agreement among metadiscourse acts (α ∈ [0.15; 0.49]). To further assess the collected material, a subset of the annotations was submitted to experts, who validated which of the marked occurrences truly correspond to instances of the metadiscursive act at hand. As with the crowd, experts showed different levels of agreement between categories (α ∈ [0.18; 0.72]). The paper concludes with a discussion on the applicability of metaTED with respect to each of the 16 categories of metadiscourse.

