Bilingual Sentence Pairs: Support for Training Technical Document (Patent) Machine Translation Engines
 
We can provide large quantities of high-quality Chinese-English parallel sentence pairs to improve your patent or technical machine translation system and reduce your post-editing effort. The data are extracted from over 300,000 comparable Chinese-English patents. This resource supplied the core bilingual terms for the PatentLex post-editing prototype system, which placed second at the 2019 Game-Changer Competition in Singapore organized by TAUS. Extraction of the parallel sentence pairs began more than 10 years ago, and the sentences contributed to the training corpus and test sets for the first two Chinese-English Patent Machine Translation competitions, organized in Tokyo by NTCIR in 2009 and 2010. Our datasets are also available on the TAUS Data Marketplace.


 
 
Availability 
Datasets can be ordered through Chilin by contacting us here. Chilin can provide custom datasets based on patent classification codes. Standard datasets for Pharmaceuticals (class A61K) and Biotechnology (class C12N) are also available. 

Chilin’s Pharmaceutical and Biotechnology datasets are available on the TAUS Data Marketplace. To find Chilin’s data, under “Source Language” select “English (United States)”. Under “Target Language” select “Chinese (China)”. From the next page, select the “Pharmaceuticals & Biotechnology” Domain. You can then see samples from the datasets. 
 
 
This data is not machine translated and is suitable for English to Chinese and Chinese to English MT training.  The Chilin data is available in TMX format.  It is configured with “en-US” as the source language and “zh-CN” as the target language. 
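
As a rough illustration of how a TMX file configured this way might be loaded for MT training, the Python sketch below reads each translation unit and returns (English, Chinese) pairs. It assumes the standard TMX layout (tu elements containing tuv elements with an xml:lang attribute and a seg child). The function names read_tmx_pairs and swap_direction and the embedded one-unit sample file are hypothetical and are not part of any Chilin or TAUS tooling.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical one-unit TMX file built from a sample pair on this page.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en-US"
          srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Encapsulated dissolution formulations can be prepared either by coating particles or granules of drug with varying thicknesses of slowly soluble polymers or by microencapsulation.</seg></tuv>
      <tuv xml:lang="zh-CN"><seg>通过将药物微粒或颗粒用不同厚度的缓慢溶解的聚合物包衣或通过微囊化可制备囊化溶出制剂。</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(source, src_lang="en-US", tgt_lang="zh-CN"):
    """Return a list of (source, target) segment pairs from a TMX file.

    `source` may be a file path or a binary file-like object.
    Units missing either language are skipped.
    """
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def swap_direction(pairs):
    """Turn (English, Chinese) pairs into (Chinese, English) pairs."""
    return [(tgt, src) for src, tgt in pairs]


if __name__ == "__main__":
    en_zh = read_tmx_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8")))
    zh_en = swap_direction(en_zh)
    print(len(en_zh), "pair(s) loaded")
```

Because the same pairs serve both directions, swapping the tuple order is all that is needed to prepare Chinese to English training data from a file whose source language is set to en-US.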

To license data, you must register and log in to the data marketplace.  Through the shopping cart icon, you can navigate to Chilin’s two datasets.
 
Samples
The pharmaceutical dataset contains 12,947 segments; 475,509 en-US words; and 401,629 zh-CN characters. It is based on CPC patent classification category A61K, which covers pharmaceuticals.
 
The biotechnology dataset contains 10,377 segments; 379,898 en-US words; and 327,637 zh-CN characters. It is based on CPC patent classification category C12N, which contains many biotechnology filings. Sample data is shown below. More samples are available from Chilin upon request.

Chilin can provide larger datasets (one million records or more) upon request. 

 
Suitable buffers include boric acid, sodium and potassium bicarbonate, sodium and potassium borates, sodium and potassium carbonate, sodium acetate, sodium biphosphate and the like, in amounts sufficient to maintain the pH at between about pH 6 and pH 8, and preferably, between about pH 7 and pH 7.5.
合适的缓冲液包括硼酸、碳酸氢钠和碳酸氢钾、硼酸钠和硼酸钾、碳酸钠和碳酸钾、醋酸钠、磷酸氢钠、等等,其量足以将pH维持在大约pH6-pH8,优选大约pH7-pH7.5。

Encapsulated dissolution formulations can be prepared either by coating particles or granules of drug with varying thicknesses of slowly soluble polymers or by microencapsulation.
通过将药物微粒或颗粒用不同厚度的缓慢溶解的聚合物包衣或通过微囊化可制备囊化溶出制剂。

While significant progress has been made in identifying factors that promote and inhibit angiogenesis, no treatment is currently available to specifically treat ocular vascular disease.
虽然在鉴别具有促进和抑制血管生成作用的因子方面取得了显著进步,但目前还没有特别有效的治疗眼部血管疾病的治疗方法。
 

The price on the data marketplace is €0.001 per English word. This is €475.50 for Pharmaceuticals (A61K) and €379.90 for Biotechnology (C12N). You may license each dataset in its entirety or fractionally (50% or 75%).
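
For budgeting, the per-word rate and the en-US word counts given on this page can be combined in a few lines of Python. This is only an estimate: the helper name estimate_price is hypothetical, and the sketch assumes that a fractional license is priced in proportion to the fraction taken, which this page does not state. The marketplace's own rounding may also differ from the unrounded values by a cent.

```python
from decimal import Decimal

# Rate and en-US word counts as stated on this page.
RATE_EUR_PER_WORD = Decimal("0.001")
EN_WORDS = {"A61K": 475_509, "C12N": 379_898}


def estimate_price(words, fraction=Decimal("1"), rate=RATE_EUR_PER_WORD):
    """Unrounded estimate in euros: words x rate x fraction.

    ASSUMPTION: fractional licenses (50% or 75%) are priced
    proportionally; check the marketplace for the actual price.
    """
    return Decimal(words) * rate * Decimal(fraction)


if __name__ == "__main__":
    for code, words in EN_WORDS.items():
        for fraction in (Decimal("0.5"), Decimal("0.75"), Decimal("1")):
            print(code, fraction, estimate_price(words, fraction))
```

Decimal is used rather than float so that the tenth-of-a-cent rate does not pick up binary rounding error.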
 
 
Background
Chilin’s data is extracted from over 300,000 comparable Chinese-English patent documents. This resource began as the training corpus and test sets used in the first two Chinese-English Patent Machine Translation competitions, organized in Tokyo by NTCIR in 2009 and 2010. For details and outcomes of these competitions, please see “Overview of the Patent Machine Translation Task at the NTCIR-9 Workshop”.
 
 
Enquiries
Chilin has much more to offer beyond the above datasets.
For more information, contact us here.