A Lifelong Multilingual Multi-granularity Semantic Alignment Approach via Maximum Co-occurrence Probability

Xin Liu, Hongwei Sun, Shaojie Dai, Bo Lv, Youcheng Pan, Hui Wang, Yue Yu


Abstract
Cross-lingual pre-training methods mask and predict tokens in multilingual text to generalize across diverse multilingual information. However, because sufficient aligned multilingual resources are lacking during pre-training, these methods may not fully exploit the multilingual correlations of masked tokens, which limits multilingual information interaction. In this paper, we propose a lifelong multilingual multi-granularity semantic alignment approach, which continuously extracts massive aligned linguistic units from noisy data via a maximum co-occurrence probability algorithm. With this approach, we release a version of the multilingual multi-granularity semantic alignment resource, supporting seven languages: English, Czech, German, Russian, Romanian, Hindi and Turkish. Finally, we show how to use this resource to improve translation performance on the WMT14–18 benchmarks in twelve directions. Experimental results show average improvements of 0.3–1.1 BLEU across all translation benchmarks. The analysis and discussion also demonstrate the superiority and potential of the proposed approach. The resource used in this work will be publicly available.
Anthology ID:
2024.lrec-main.60
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
684–694
URL:
https://aclanthology.org/2024.lrec-main.60
Cite (ACL):
Xin Liu, Hongwei Sun, Shaojie Dai, Bo Lv, Youcheng Pan, Hui Wang, and Yue Yu. 2024. A Lifelong Multilingual Multi-granularity Semantic Alignment Approach via Maximum Co-occurrence Probability. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 684–694, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Lifelong Multilingual Multi-granularity Semantic Alignment Approach via Maximum Co-occurrence Probability (Liu et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.60.pdf