Yupu Liang


2024

Born a BabyNet with Hierarchical Parental Supervision for End-to-End Text Image Machine Translation
Cong Ma | Yaping Zhang | Zhiyang Zhang | Yupu Liang | Yang Zhao | Yu Zhou | Chengqing Zong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Text image machine translation (TIMT) aims to translate source-language text in images into a target language, and has been proven successful by bridging a text image recognition encoder with a text translation decoder. However, how to incorporate fine-grained knowledge supervision so that the recognition and translation modules remain consistent is still an open question. In this paper, we propose a novel TIMT method named BabyNet, which is optimized with hierarchical parental supervision to improve translation performance. Inspired by genetic recombination and variation in genetics, the proposed BabyNet inherits from recognition and translation parent models, together with a variation module whose parameters are updated when training on the TIMT task. Meanwhile, hierarchical and multi-granularity supervision from the parent models is introduced to bridge the gap between the inherited modules in BabyNet. Extensive experiments on both synthetic and real-world TIMT test sets show that our proposed method significantly outperforms existing methods. Further analyses of various parent-model combinations show the good generalization of our method.
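To make the described architecture concrete, here is a minimal, illustrative PyTorch sketch of the inherited-parents-plus-variation-module design with a parental supervision loss. All names (BabyNet, parental_loss), shapes, and hyperparameters below are assumptions for illustration, not the authors' released implementation.

```python
# Minimal, illustrative PyTorch sketch only; module names, shapes, and the
# feature-matching loss are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, VOCAB = 256, 1000  # assumed hidden size and target vocabulary size

class BabyNet(nn.Module):
    """Frozen parent modules (recognition encoder, MT decoder) bridged by a
    trainable variation module, as described in the abstract."""
    def __init__(self, rec_encoder, mt_decoder, tgt_embed, out_proj):
        super().__init__()
        self.rec_encoder, self.mt_decoder = rec_encoder, mt_decoder
        self.tgt_embed, self.out_proj = tgt_embed, out_proj
        for parent in (rec_encoder, mt_decoder, tgt_embed, out_proj):
            parent.requires_grad_(False)  # inherited weights stay fixed
        # Only the variation module's parameters are updated on the TIMT task.
        self.variation = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, img_feats, tgt_tokens):
        enc = self.rec_encoder(img_feats)      # recognition-space features
        bridged = self.variation(enc)          # adapted toward translation space
        dec = self.mt_decoder(self.tgt_embed(tgt_tokens), bridged)
        return self.out_proj(dec), bridged

def parental_loss(logits, tgt, bridged, parent_states, alpha=0.5):
    """Translation loss plus a feature-matching term standing in for the
    paper's hierarchical, multi-granularity parental supervision."""
    nll = F.cross_entropy(logits.transpose(1, 2), tgt)
    distill = F.mse_loss(bridged, parent_states.detach())
    return nll + alpha * distill

# Toy stand-ins for the two parent models.
rec_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
mt_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(D, nhead=4, batch_first=True), num_layers=2)
model = BabyNet(rec_encoder, mt_decoder,
                nn.Embedding(VOCAB, D), nn.Linear(D, VOCAB))

img_feats = torch.randn(2, 32, D)              # e.g. flattened image patches
tgt = torch.randint(0, VOCAB, (2, 16))
logits, bridged = model(img_feats, tgt)
parent_states = torch.randn_like(bridged)      # placeholder parent hidden states
loss = parental_loss(logits, tgt, bridged, parent_states)
loss.backward()                                # gradients reach only `variation`
```

The design point mirrored here is that the inherited parent modules stay frozen while only the variation module receives gradients, with an auxiliary feature-matching term standing in for the paper's multi-granularity supervision.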

2023

LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder
Zhiyang Zhang | Yaping Zhang | Yupu Liang | Lu Xiang | Yang Zhao | Yu Zhou | Chengqing Zong
Findings of the Association for Computational Linguistics: EMNLP 2023

Document image translation (DIT) aims to translate text embedded in images from one language to another. It is a challenging task that requires understanding visual layout and text semantics simultaneously. However, existing methods struggle to capture the crucial visual layout of real-world, complex document images. In this work, we make the first attempt to incorporate layout knowledge into DIT in an end-to-end way. Specifically, we propose a novel Layout-aware end-to-end Document Image Translation model (LayoutDIT) with a multi-step conductive decoder. A layout-aware encoder is first introduced to model visual layout relations from raw OCR results. Then, a novel multi-step conductive decoder conducts hidden states across three step-decoders to carry out document translation step by step. Benefiting from layout-aware end-to-end joint training, LayoutDIT outperforms state-of-the-art methods with better parameter efficiency. Besides, we create a new multi-domain document image translation dataset to validate the model's generalization. Extensive experiments show that LayoutDIT generalizes well to diverse and complex layout scenes.
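As a rough illustration of the two components named in the abstract, the sketch below pairs a layout-aware encoder (token embeddings fused with quantized bounding-box embeddings from OCR) with a decoder that conducts hidden states across three chained step-decoders. Module names, the box-embedding scheme, and all sizes are assumptions, not the paper's code.

```python
# Illustrative sketch only; the box-embedding scheme, the number of decoder
# layers, and all names/sizes are assumptions, not the paper's released code.
import torch
import torch.nn as nn

D, VOCAB, MAX_POS = 256, 1000, 1024  # assumed sizes; MAX_POS = coordinate bins

class LayoutAwareEncoder(nn.Module):
    """Fuses OCR token embeddings with 2D layout (bounding-box) embeddings."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.x_emb = nn.Embedding(MAX_POS, D)  # quantized x coordinates
        self.y_emb = nn.Embedding(MAX_POS, D)  # quantized y coordinates
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, tokens, boxes):
        # boxes: (B, T, 4) quantized (x0, y0, x1, y1) from raw OCR results
        h = (self.tok(tokens)
             + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
             + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        return self.encoder(h)

class ConductiveDecoder(nn.Module):
    """Three chained step-decoders; each step receives the previous step's
    hidden states, one plausible reading of 'hidden states conduction'."""
    def __init__(self, n_steps=3):
        super().__init__()
        self.steps = nn.ModuleList(
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(D, nhead=4, batch_first=True),
                num_layers=1)
            for _ in range(n_steps))
        self.out = nn.Linear(D, VOCAB)

    def forward(self, tgt_embed, memory):
        h = tgt_embed
        for step in self.steps:                # conduct hidden states onward
            h = step(h, memory)
        return self.out(h)

tokens = torch.randint(0, VOCAB, (2, 24))      # OCR tokens for one document
boxes = torch.randint(0, MAX_POS, (2, 24, 4))  # their quantized bounding boxes
memory = LayoutAwareEncoder()(tokens, boxes)
tgt = torch.randn(2, 16, D)                    # placeholder target embeddings
logits = ConductiveDecoder()(tgt, memory)
print(logits.shape)                            # torch.Size([2, 16, 1000])
```

Summing coordinate embeddings into the token embedding is one common way to inject 2D layout into a Transformer encoder; the paper's actual fusion and the sub-task of each step-decoder may differ from this toy chaining.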