Yongbin Jeong


2024

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
ChangSu Choi | Yongbin Jeong | Seoyoon Park | Inho Won | HyeonSeok Lim | SangMin Kim | Yejee Kang | Chanhyuk Yoon | Jaewan Park | Yiseul Lee | HyeJin Lee | Younggyun Hahm | Hansaem Kim | KyungTae Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) are pretrained to predict the next word, but scaling them up requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, yet these models often overlook less-resourced languages (LRLs). This study proposes three strategies to enhance the performance of LRLs on top of publicly available MLLMs. First, the MLLM vocabulary is expanded with LRL tokens to enhance expressiveness. Second, bilingual data are used for pretraining to align the high- and less-resourced languages. Third, a high-quality, small-scale instruction dataset is constructed and instruction tuning is performed to augment the LRL. The experiments employed the Llama 2 model with Korean as the LRL; the resulting model was quantitatively evaluated against other existing LLMs across eight tasks. A qualitative assessment was also performed based on human evaluation and GPT-4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared with previously proposed Korean monolingual models.
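
As a rough sketch of the first strategy, the snippet below expands a Llama 2 tokenizer with a few Korean tokens and resizes the model's embedding matrix using the HuggingFace transformers API; the model ID and the token list are illustrative assumptions, not the paper's actual expanded vocabulary.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; assumed here for illustration
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Toy Korean tokens; the paper adds a far larger, corpus-derived vocabulary.
    num_added = tokenizer.add_tokens(["안녕하세요", "언어", "모델"])

    # Grow the embedding matrix so the new token IDs have rows; the new rows are
    # randomly initialized and learned during continued bilingual pretraining.
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")

The new embedding rows carry no meaning on their own; they are trained during the continued pretraining on bilingual data described in the second strategy.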

2023

Teddysum at MEDIQA-Chat 2023: an analysis of fine-tuning strategy for long dialog summarization
Yongbin Jeong | Ju-Hyuck Han | Kyung Min Chae | Yousang Cho | Hyunbin Seo | KyungTae Lim | Key-Sun Choi | Younggyun Hahm
Proceedings of the 5th Clinical Natural Language Processing Workshop

In this paper, we describe our system design and the various approaches we explored for Task B of MEDIQA-Chat 2023. The goal of Task B is to generate a full clinical note from a doctor-patient consultation dialogue. The task poses several challenges, including a lack of training data, long dialogue inputs, and the generation of semi-structured clinical notes with section headers. To address these issues, we conducted various experiments and analyzed their results. We used the DialogLED model, pre-trained on long dialogue data, to handle the long inputs, and we further pre-trained it on other dialogue datasets to mitigate the lack of training data. We also explored techniques such as prompting and contrastive learning for handling the sections. By analyzing these methods and their results, this paper provides insights into clinical note generation and suggests directions for future research.
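
As a minimal sketch of the long-input setup, the snippet below runs a toy consultation through a publicly released DialogLED checkpoint with the HuggingFace transformers seq2seq API; the checkpoint name and generation settings are assumptions, not the fine-tuned system described in the paper.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumed public base checkpoint; the team's fine-tuned weights are not implied here.
    ckpt = "MingZhong/DialogLED-base-16384"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

    dialogue = (
        "Doctor: What brings you in today?\n"
        "Patient: I've had a persistent cough for about two weeks."
    )
    # LED-style models accept inputs far longer than standard seq2seq encoders.
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=16384)
    ids = model.generate(**inputs, num_beams=4, max_new_tokens=512)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

A real run would fine-tune this model on the Task B training pairs and add the section-handling prompts discussed above.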

2020

Enhancing Quality of Corpus Annotation: Construction of the Multi-Layer Corpus Annotation and Simplified Validation of the Corpus Annotation
Youngbin Noh | Kuntae Kim | Minho Lee | Cheolhun Heo | Yongbin Jeong | Yoosung Jeong | Younggyun Hahm | Taehwan Oh | Hyonsu Choe | Seokwon Park | Jin-Dong Kim | Key-Sun Choi
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation