Language Variation in Chinese Communities:
Database & Comparative Study (LIVAC)
 Investigators: Benjamin K Tsou (PI), Samuel W K Chan, Jerome P H Hu, Terence Y W Chan, Tom B Y Lai, H L Lin, Godfrey K F Liu, W F Tsoi, C H Chew (Singapore), J Tse (Taiwan)
 Funding Source: Chiang Ching Kuo Foundation for International Exchange
 Project duration: Nov 1992 - Jun 1997

This project, with generous funding for the past four years from two sources, has two major objectives: (1) to collect comparable and reliable data from several major Chinese speech communities for a large authoritative electronic database so that a wide range of specialists from different fields may use it, and (2) to undertake rigorous comparison of the quantitative and qualitative differences in the identifiable language varieties and to explore how they contribute to our understanding of the rich fabric of Chinese culture and society.

For objective (1), Chinese newspaper data, initially from Hong Kong, Taiwan and Singapore, were collected regularly every four days. Newspapers from Shanghai and Macau were added subsequently. The data arrived via the Internet, on disks and by email. Because the sources used different internal computer coding systems for the Chinese characters, the data were regularized to traditional characters in a uniform format. Unlike English text, Chinese text is printed without breaks between words, so the texts were first segmented into words by computer, and the results were then verified by human checkers at least three times. New words were then tagged according to preset criteria and added to the master dictionary for each target community.
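The project summary does not say which segmentation algorithm was used. As an illustration only, the sketch below shows one common dictionary-based approach, greedy forward maximum matching. The function name `segment`, the `max_len` parameter and the toy dictionary are assumptions made for this example and are not taken from the project.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching (an illustrative technique, not
    necessarily the project's): at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real master dictionary would be far larger.
toy_dictionary = {"香港", "報紙", "語言", "研究"}
print(segment("香港報紙語言研究", toy_dictionary))
# → ['香港', '報紙', '語言', '研究']
```

Single characters that the dictionary does not cover come out as one-character words, which is one reason machine output of this kind would still need the human verification described above.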

For objective (2), statistical analysis was performed on the different types of words and characters used in each community, and qualitative comparisons were then made across the communities to study wide-ranging questions and issues. These included global characteristics such as the range of characters and words and the styles used in the different communities, as well as the content represented by the words and their cultural and social significance. In addition, the range and types of new words, which indicate how new concepts were represented, and new grammatical features were studied, and their spread across the different speech communities was traced. Other studies have begun on the distinctive cryptic language of news headlines. Some initial findings have been reported.
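A minimal sketch of the kind of per-community count described above, assuming each community's texts have already been segmented into a list of words. The function names and the three toy word lists are hypothetical and are not project data; they use the familiar regional words for "taxi" only to show how community-specific vocabulary would surface.

```python
from collections import Counter


def community_profile(segmented_words):
    """Count distinct word types and character types in one community's data."""
    word_counts = Counter(segmented_words)
    char_counts = Counter(ch for word in segmented_words for ch in word)
    return {
        "word_types": len(word_counts),
        "char_types": len(char_counts),
        "word_counts": word_counts,
    }


def words_unique_to(target, others):
    """Return words attested in `target` but in none of the `others`."""
    seen_elsewhere = set().union(*others)
    return set(target) - seen_elsewhere


# Hypothetical toy samples, one list of segmented words per community.
hong_kong = ["的士", "報紙", "的士"]
taiwan = ["計程車", "報紙"]
singapore = ["德士", "報紙"]

profile = community_profile(hong_kong)
print(profile["word_types"], profile["char_types"])    # → 2 4
print(words_unique_to(hong_kong, [taiwan, singapore]))  # → {'的士'}
```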

This unique synchronous database has also been compared with one developed from Hong Kong court proceedings, where Cantonese has been increasingly used as an alternative to English. Surprisingly, nearly 40% of the words used in a single case of court proceedings are not found in the overall list of 43,000 words used in Hong Kong newspapers over an entire year. This reflects a vast gap between the language used by Cantonese speakers in Hong Kong and the language they are expected to use in writing, as found in newspapers. The social, cultural, and educational implications are being studied at LISRC, which is also soliciting further support to maintain and expand this unique and useful database.
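The comparison above amounts to an out-of-vocabulary rate. The sketch below assumes the figure is computed over distinct word types rather than running tokens, which the summary does not specify. The function name and the toy word lists are hypothetical.

```python
def out_of_vocabulary_rate(case_words, reference_vocabulary):
    """Share of distinct words in `case_words` that are absent from
    `reference_vocabulary` (assumption: counted by type, not by token)."""
    case_types = set(case_words)
    if not case_types:
        return 0.0
    missing = case_types - set(reference_vocabulary)
    return len(missing) / len(case_types)


# Hypothetical toy data: five distinct words from a court transcript,
# checked against a three-word stand-in for the newspaper word list.
court_words = ["佢", "唔", "係", "法官", "證人", "法官"]
newspaper_vocabulary = {"法官", "證人", "他"}
print(out_of_vocabulary_rate(court_words, newspaper_vocabulary))  # → 0.6
```

Counting by token instead would weight frequent words more heavily and would generally give a different figure, so the choice of unit matters when interpreting a rate such as the 40% reported here.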