Patent Machine Translation Training Corpus
We can provide large quantities of highly accurate Chinese-English parallel sentence pairs to improve the quality of your own patent machine translation system and reduce your post-editing effort. The data are extracted from over 300,000 patent documents. This resource is drawn from the training corpus and test sets developed for the Tokyo-based NTCIR 2009 and 2010 tasks on Patent Machine Translation.
Our research team was the sole provider of the training corpus and the test sets for the two international patent machine translation competitions organized by NTCIR/NII in Tokyo in 2009 and 2010. The two competitions drew over 30 international teams from well-known universities and R&D organizations in China and abroad. Download papers: NTCIR-9/NTCIR-10.
The sentences were selected from a much larger corpus of more than 300,000 Chinese-English parallel patents in different fields, according to a number of filtering parameters including word alignment, sentence length and language modeling. They were then automatically segmented and aligned. All texts are encoded as UTF-8.
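As a rough illustration of what a sentence-length filter of this kind might look like, the Python sketch below keeps a pair only if both sides fall within a length window and their lengths are not too far apart. It is not the procedure used to build this corpus: the function names `length_ok` and `filter_pairs` and all thresholds are hypothetical, Chinese length is counted in characters and English length in whitespace-separated words, and the word-alignment and language-model filters are not shown.

```python
from typing import Iterable, List, Tuple


def length_ok(zh: str, en: str,
              min_len: int = 3,        # assumed lower bound, not from the source
              max_len: int = 100,      # assumed upper bound, not from the source
              max_ratio: float = 3.0   # assumed maximum length ratio
              ) -> bool:
    """Return True if a Chinese-English pair passes a simple length check."""
    # Chinese side: count non-whitespace characters (text not yet segmented).
    zh_len = sum(1 for ch in zh if not ch.isspace())
    # English side: count whitespace-separated words.
    en_len = len(en.split())

    if zh_len < min_len or en_len < min_len:
        return False
    if zh_len > max_len or en_len > max_len:
        return False

    # Reject pairs whose two sides differ too much in length.
    ratio = max(zh_len, en_len) / min(zh_len, en_len)
    return ratio <= max_ratio


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass the length check."""
    return [(zh, en) for zh, en in pairs if length_ok(zh, en)]


pairs = [
    ("本发明涉及一种半导体装置。",
     "The present invention relates to a semiconductor device."),
    ("图1",
     "FIG. 1 is a schematic cross-sectional view of the device."),
]
kept = filter_pairs(pairs)
```

With these assumed thresholds, the first pair is kept and the second is dropped because its Chinese side is too short. When reading the corpus files, open them with `encoding="utf-8"`, since all texts are UTF-8 encoded.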
Please view this Chinese sample and English sample.
