Bilingual Sentence Pairs, supported by Patent Machine Translation Training Corpus
We can provide large quantities of high-quality Chinese-English parallel sentence pairs to improve your patent or technical machine translation system and reduce your post-editing effort. The data are extracted from over 300,000 patent documents. This resource is drawn from the training corpus and test sets developed for the Tokyo-based NTCIR 2009 and 2010 Chinese-English patent MT competitions. Our data sets are also available on the TAUS data market.
Our research team was the sole provider of the training corpus and the test sets for the two international patent machine translation competitions organized by NTCIR/NII in Tokyo in 2009 and 2010. The two competitions drew over 30 international teams from well-known universities and R&D organizations in China and abroad. Download papers: NTCIR-9/NTCIR-10.
The sentences were selected from a much larger corpus of more than 300,000 Chinese-English parallel patents in different fields, using a number of filtering parameters including word alignment, sentence length and language modeling. They were then automatically segmented and aligned. All texts are encoded as UTF-8.
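To give a rough idea of what a sentence-length filter of this kind can look like, here is a minimal Python sketch. It is not the procedure used to build the corpus: the function names `length_ok` and `filter_pairs` and all thresholds are hypothetical and chosen only for illustration, and the word-alignment and language-model filters mentioned above are not covered.

```python
from typing import Iterable, Iterator, Tuple


def length_ok(zh: str, en: str,
              min_tokens: int = 3,      # hypothetical lower bound on English tokens
              max_tokens: int = 100,    # hypothetical upper bound on English tokens
              max_ratio: float = 3.0) -> bool:  # hypothetical length-ratio limit
    """Return True if a Chinese-English pair passes simple length checks."""
    # Chinese length is counted in characters, ignoring any whitespace.
    zh_len = len("".join(zh.split()))
    # English length is counted in whitespace-separated tokens.
    en_len = len(en.split())
    if zh_len == 0 or not (min_tokens <= en_len <= max_tokens):
        return False
    # Reject pairs whose lengths are badly out of proportion in either direction.
    ratio = zh_len / en_len
    return 1.0 / max_ratio <= ratio <= max_ratio


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the (Chinese, English) pairs that pass length_ok."""
    for zh, en in pairs:
        if length_ok(zh, en):
            yield zh, en
```

A pair such as ("本发明涉及一种机器翻译方法", "The present invention relates to a machine translation method") would pass these checks, while a one-word fragment or a pair with very unbalanced lengths would be dropped. In practice the thresholds would need tuning on the data at hand.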