Computational Linguistics

(3,269 words)

Author(s): Chu-Ren HUANG | Su QI
The earliest efforts in Chinese language processing can be traced back to the 1960s, with the invention of Chinese input methods. To adapt the QWERTY keyboard to Chinese, both orthography-based (character) and phonetic-based (pīnyīn 拼音) conversions were proposed. Further efforts to enhance the computerized processing of Chinese text began in the late 1980s, marked by the first computational linguistics conferences in both China and Táiwān in 1988 and followed by increased research…
Date: 2017-03-02

Academia Sinica Balanced Corpus

(706 words)

Author(s): Keh-Jiann CHEN | Chu-Ren HUANG
1. The Sinica Corpus. The Academia Sinica Balanced Corpus (Sinica Corpus) is the first proportionally sampled Chinese corpus with part-of-speech tagging. The first version (Sinica 1.0), containing two million words, was compiled and opened to the research community through direct license in 1995 (Huang et al. 1995). After 10 years of further development, it was upgraded in 2005 to Sinica 5.0, with ten million words. Its online web service is available at […]. The corpus can also be accessed through direct licensing from the ROCLING Society (…
Date: 2017-03-02