Language Variation in Chinese Communities : Database & Comparative Study (LIVAC) |
| | Investigators | Benjamin K Tsou (PI), Samuel W K Chan, Jerome P H Hu, Terence Y W Chan, Tom B Y Lai, H L Lin, Godfrey K F Liu, W F Tsoi, C H Chew (Singapore), J Tse (Taiwan) |
| | Funding Source | Chiang Ching Kuo Foundation for International Exchange |
| | Project duration | Nov 1992 - Jun 1997 |
|
This project, with generous funding for the past four years
from two sources, has two major objectives: (1) to collect
comparable and reliable data from several major Chinese speech
communities for a large authoritative electronic database so
that a wide range of specialists from different fields may use
it, and (2) to undertake rigorous comparison of the quantitative
and qualitative differences in the identifiable language
varieties and to explore how they contribute to our understanding
of the rich fabric of Chinese culture and society.
For objective (1), Chinese newspaper data, initially from Hong Kong,
Taiwan and Singapore, were collected regularly every four days.
Subsequently, newspapers from Shanghai and Macau were added. The
data came via the Internet, disks and email. Because the sources
used different internal computer coding systems for Chinese
characters, the data were regularized to traditional characters
in a uniform format. Unlike English text, Chinese text is printed
without breaks between words, so the texts were first segmented
into words by computer. The results were subjected to human
verification at least three times. New words were then tagged
according to preset criteria and added to the master dictionary
for each target community.
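The computer segmentation step can be illustrated with a minimal sketch. The project description does not say which algorithm was used, so the example below assumes greedy forward maximum matching against a word list, a common baseline for Chinese text. The function name `segment`, the `max_len` parameter and the toy dictionary are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching (an assumed baseline, not
    necessarily the project's method): at each position, take the
    longest dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, not taken from the LIVAC master dictionary.
toy_dictionary = {"香港", "報紙", "語言"}
print(segment("香港的報紙", toy_dictionary))
```

Characters that match no dictionary entry come out as single-character tokens. In a workflow like the one described, such tokens would be natural candidates for human verification and possible addition to the dictionary.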
For objective (2), statistical analysis was performed on the
different types of words and characters used in each community,
before qualitative comparison was made across the different
communities to study wide-ranging questions and issues. These
included global characteristics such as the range of characters
and words, and styles used in the different communities, and the
content as represented by the words and their cultural and social
significance. Additionally, the range and types of new words,
indicating how new concepts were represented, and new grammatical
features were studied, and their spread across the different speech
communities was traced. Other studies have begun on the distinctive
cryptic language of news headlines. Some initial findings have been
reported.
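One simple form of the cross-community comparison can be sketched as follows. This is only an illustration under stated assumptions: the function names `community_profiles` and `unique_to` are hypothetical, the input is assumed to be already segmented, and the sample words are toy data rather than LIVAC data.

```python
from collections import Counter


def community_profiles(corpora):
    """Map each community name to a Counter of its word frequencies."""
    return {name: Counter(words) for name, words in corpora.items()}


def unique_to(profiles, name):
    """Return the words attested in one community and in no other."""
    others = set()
    for other, counts in profiles.items():
        if other != name:
            others.update(counts)
    return set(profiles[name]) - others


# Toy segmented samples (illustrative only).
corpora = {
    "Hong Kong": ["的士", "經濟", "的士"],
    "Taiwan": ["計程車", "經濟"],
    "Singapore": ["德士", "經濟"],
}
profiles = community_profiles(corpora)
print(unique_to(profiles, "Hong Kong"))
```

Frequency counts support the quantitative side of the comparison, while the sets of community-specific words give a starting list for qualitative study.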
This unique synchronous database has also been compared with
the one developed from Hong Kong court proceedings, where Cantonese
has been increasingly used as an alternative to English.
It is rather surprising to find that nearly 40% of the words used in a
single case of court proceedings are not found in the overall list
of 43,000 words used in Hong Kong newspapers over an entire year.
This reflects a vast gap
between the language used by the Cantonese speakers in Hong Kong
and the language they are expected to use in the context of written
language, as found in newspapers. The social, cultural, and
educational implications are being studied at LISRC, which is also
soliciting further support to maintain and expand the unique and
useful database.
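The comparison between court proceedings and the newspaper word list amounts to an out-of-vocabulary measurement, which can be sketched as below. The text does not say whether the figure counts distinct words or running words; this sketch assumes distinct words. The function name and sample data are hypothetical.

```python
def out_of_vocabulary_rate(target_words, reference_vocabulary):
    """Share of distinct target words absent from the reference list.

    Assumption: the rate is computed over distinct words (types),
    not over running words (tokens).
    """
    target_types = set(target_words)
    if not target_types:
        return 0.0
    missing = target_types - set(reference_vocabulary)
    return len(missing) / len(target_types)


# Placeholder data: five distinct "court" words, three of which
# appear in the "newspaper" list.
court_words = ["a", "b", "c", "d", "e", "a"]
newspaper_vocabulary = {"a", "b", "c", "x", "y"}
print(out_of_vocabulary_rate(court_words, newspaper_vocabulary))
```

Counting running words instead would weight frequent words more heavily and would usually give a lower rate, so the choice of unit should be stated alongside any reported figure.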
|