Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings

Lianxi Wang, Yujia Tian, Zhuowei Chen


Abstract
Pretrained language models excel in various natural language processing tasks but often neglect the integration of different scripts within a language, such as Hindi, which constrains their ability to capture richer semantic information. In this work, we present a dual-script enhanced feature representation method for Hindi. We combine single-script features from Devanagari and Romanized Hindi RoBERTa models using concatenation, addition, cross-attention, and convolutional networks. The experimental results show that the dual-script approach significantly improves model performance across various tasks. The addition fusion technique excels in sequence generation tasks. For text classification, the CNN-based dual-script enhanced representation performs best with longer sentences, while the addition fusion technique is more effective for shorter sequences. Our approach shows significant advantages in multiple natural language processing tasks, providing a new perspective on feature representation for Hindi. Our code is available at https://github.com/JohnnyChanV/Hindi-Fusion.
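The four fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that): the function names, the single-head attention without learned projections, the random kernel, and the random arrays standing in for the Devanagari and Romanized encoder outputs are all assumptions made for illustration. Concatenation, addition, and the convolutional variant are assumed here to receive sequences already aligned to the same length.

```python
import numpy as np

def concat_fusion(dev, rom):
    """Concatenate along the feature axis: (T, d) and (T, d) -> (T, 2d)."""
    return np.concatenate([dev, rom], axis=-1)

def add_fusion(dev, rom):
    """Element-wise sum of the two representations: (T, d) -> (T, d)."""
    return dev + rom

def cross_attention_fusion(dev, rom):
    """Devanagari tokens attend over Romanized tokens.

    Single head, no learned projections (an illustrative simplification).
    dev: (T_dev, d), rom: (T_rom, d) -> (T_dev, d).
    """
    d = dev.shape[-1]
    scores = dev @ rom.T / np.sqrt(d)              # (T_dev, T_rom)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ rom

def cnn_fusion(dev, rom, kernel):
    """1-D convolution over time on the concatenated features, then ReLU.

    kernel: (k, 2d, d_out) with odd k; zero padding keeps the length T.
    """
    x = np.concatenate([dev, rom], axis=-1)        # (T, 2d)
    k = kernel.shape[0]
    pad = k // 2
    x = np.pad(x, ((pad, pad), (0, 0)))
    steps = dev.shape[0]
    out = np.stack(
        [np.einsum("kc,kco->o", x[t:t + k], kernel) for t in range(steps)]
    )
    return np.maximum(out, 0.0)

# Random stand-ins for encoder outputs (hypothetical sizes: 5 tokens, d = 8).
rng = np.random.default_rng(0)
dev = rng.standard_normal((5, 8))
rom = rng.standard_normal((5, 8))
rom_long = rng.standard_normal((7, 8))     # a different tokenization length
kernel = rng.standard_normal((3, 16, 8))   # k = 3, 2d = 16, d_out = 8

print(concat_fusion(dev, rom).shape)
print(add_fusion(dev, rom).shape)
print(cross_attention_fusion(dev, rom_long).shape)
print(cnn_fusion(dev, rom, kernel).shape)
```

Of the four, only the cross-attention variant in this sketch tolerates different sequence lengths for the two scripts; the others would need an alignment or pooling step first.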
Anthology ID:
2024.lrec-main.528
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
5966–5976
URL:
https://aclanthology.org/2024.lrec-main.528
Cite (ACL):
Lianxi Wang, Yujia Tian, and Zhuowei Chen. 2024. Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5966–5976, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings (Wang et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.528.pdf