Bilingual Sentence Pairs: Support for Training Technical Document (Patent) Machine Translation Engines
We can provide large quantities of high-quality Chinese-English parallel sentence pairs to improve the quality of your patent or technical machine translation system and reduce your post-editing effort. The data are extracted from over 300,000 comparable Chinese-English patents. This resource supplied the core bilingual terms for the PatentLex post-editing prototype system, which came in second at the 2019 Game-Changer Competition in Singapore organized by TAUS. Extraction of the parallel sentence pairs began more than 10 years ago, and the sentences contributed to the training corpus and test sets for the first two Chinese-English Patent Machine Translation competitions, organized in Tokyo by NTCIR in 2009 and 2010. Our datasets are also available on the TAUS Data Marketplace.
Datasets can be ordered from Chilin by contacting us. Chilin can provide custom datasets based on patent classification codes. Standard datasets for Pharmaceuticals (class A61K) and Biotechnology (class C12N) are also available.
Chilin’s Pharmaceutical and Biotechnology datasets are available on the TAUS Data Marketplace. To find Chilin’s data, select “English (United States)” under “Source Language” and “Chinese (China)” under “Target Language”. On the next page, select the “Pharmaceuticals & Biotechnology” domain. You can then view samples from the datasets.
This data is not machine translated and is suitable for both English-to-Chinese and Chinese-to-English MT training. The Chilin data is available in TMX format, configured with “en-US” as the source language and “zh-CN” as the target language.
To license data, you must register and log in to the data marketplace. Through the shopping cart icon, you can navigate to Chilin’s two datasets.
The pharmaceutical dataset contains 12,947 segments; 475,509 en-US words; and 401,629 zh-CN characters. It is based on CPC patent classification category A61K, which covers pharmaceuticals.
The biotechnology dataset contains 10,377 segments; 379,898 en-US words; and 327,637 zh-CN characters. It is based on CPC patent classification category C12N, which contains many biotechnology filings. Sample data is shown below. More samples are available from Chilin upon request.
Chilin can provide larger datasets (one million records or more) upon request.
Suitable buffers include boric acid, sodium and potassium bicarbonate, sodium and potassium borates, sodium and potassium carbonate, sodium acetate, sodium biphosphate and the like, in amounts sufficient to maintain the pH at between about pH 6 and pH 8, and preferably, between about pH 7 and pH 7.5.
The price on the data marketplace is €0.001 per English word. This comes to about €475.50 for pharmaceuticals (A61K) and €379.90 for biotechnology (C12N). You may license each dataset in its entirety or fractionally (50% or 75%).
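The pricing arithmetic can be sketched in a few lines of Python. This is only an estimate under stated assumptions: that a fractional licence is priced in proportion to the full word count, and that totals are rounded half-up to the cent. Neither is confirmed by the marketplace, and plain rounding of the pharmaceutical total gives a figure one cent above the listed €475.50. The names PRICE_PER_WORD, WORD_COUNTS and license_cost are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Listed rate: 0.001 euros per English word.
PRICE_PER_WORD = Decimal("0.001")

# English (en-US) word counts stated for each dataset.
WORD_COUNTS = {"A61K": 475_509, "C12N": 379_898}

def license_cost(words, fraction=Decimal("1")):
    """Estimate the licence cost in euros for a share of a dataset.

    Assumption: cost scales linearly with the licensed fraction and is
    rounded half-up to the cent; the marketplace may round differently.
    """
    cost = PRICE_PER_WORD * words * fraction
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

for code, words in WORD_COUNTS.items():
    for fraction in (Decimal("0.5"), Decimal("0.75"), Decimal("1")):
        print(code, fraction, license_cost(words, fraction))
```

Decimal is used instead of float so that the per-word rate of 0.001 is represented exactly and the rounding step behaves predictably.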
Chilin’s data is extracted from over 300,000 comparable Chinese-English patent documents. This resource began as the training corpus and test sets used in the first two Chinese-English Patent Machine Translation competitions, organized in Tokyo by NTCIR in 2009 and 2010. For details and outcomes of these competitions, please see Overview of the Patent Machine Translation Task at the NTCIR-9 Workshop.