JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

Masaaki Nagata; Makoto Morishita; Katsuki Chousa; Norihito Yasuda

Abstract

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). Based on the patent families, we extracted approximately 1.4M Japanese-English document pairs that are translations of each other, and then extracted about 350M sentence pairs from these document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from dictionary-based sentence alignment. In our experiments, adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web improved patent translation accuracy by 20 BLEU points.
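
The document-pairing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout `(publication_id, country_code, family_id)`, the function name `pair_by_family`, and the sample identifiers are all hypothetical and do not reflect the actual DOCDB schema. The sketch only shows the general idea of grouping publications by patent family and pairing the Japanese and US members of each family as candidate translations.

```python
from collections import defaultdict
from itertools import product


def pair_by_family(records):
    """Pair JP and US publications that share a patent family.

    `records` is an iterable of (publication_id, country_code, family_id)
    tuples. This layout is an assumption made for illustration only.
    """
    families = defaultdict(lambda: {"JP": [], "US": []})
    for pub_id, country, family_id in records:
        # Publications from other offices are ignored in this sketch.
        if country in ("JP", "US"):
            families[family_id][country].append(pub_id)

    pairs = []
    for family_id in sorted(families):
        members = families[family_id]
        # Every JP member is paired with every US member of the same family.
        pairs.extend(product(members["JP"], members["US"]))
    return pairs


# Hypothetical sample records, not real publication numbers.
records = [
    ("JP-0001", "JP", "F1"),
    ("US-0001", "US", "F1"),
    ("JP-0002", "JP", "F2"),  # no US counterpart, so no pair
    ("US-0002", "US", "F3"),  # no JP counterpart, so no pair
    ("EP-0001", "EP", "F1"),  # other office, ignored
]
print(pair_by_family(records))
```

Document pairs produced this way would still need sentence alignment, which the abstract says is done with a translation-based method; that step is not sketched here.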

Anthology ID: 2024.lrec-main.826
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 9452–9462
URL: https://aclanthology.org/2024.lrec-main.826
Cite (ACL): Masaaki Nagata, Makoto Morishita, Katsuki Chousa, and Norihito Yasuda. 2024. JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9452–9462, Torino, Italia. ELRA and ICCL.
Cite (Informal): JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus (Nagata et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.826.pdf