Six-digit coding method.
In Chinese, there are approximately 2,000-4,000 commonly used characters plus a few thousand more technical characters. It is impossible to have one key for each character. Thus, a coding system for Chinese characters is not only the basis for Chinese information interchange, but also an important tool in communication, text processing, and many other fields. A good coding method can benefit the modernization of China directly, since it is essential for a computer input system. More than one hundred Chinese character coding systems have been proposed, and some of them have been adopted as tools for communication and text processing [2, 3, 8]. They have various shortcomings, however, and researchers are still looking for better coding methods. As computers begin to gain widespread use in China, there is an urgent need for a good coding system; yet there is no standard Chinese coding system at the present time.
Chinese characters incorporate shape, sound, and complex hieroglyphic meanings into an ideographic language, which is different from alphabetic languages used in most countries around the world. Consequently, there are several major difficulties in developing a good coding system. We think a good method should meet the following four requirements.
First, versatility is important. There are many forms of Chinese characters because of all the changes and complications in the long history of the Chinese language. For example, there are the traditional, variant, and simplified forms, which are unavoidable in studying ancient Chinese history. A coding system tailored to represent only the simplified form is obviously not good enough. How to code all forms of characters is not only a major challenge, but also required by the computerized study of the rich history of China.
Second, a standard style is essential for a standard coding method. There are many different shapes and complicated structures, since Chinese characters evolved from ideographs. Common printing fonts include the Old Song, Fang Tou, Zheng Kai, Fang Song, and Li Shu; handwriting styles include Li, Zhuan, Cao, Hang, Kai, and Mei Shu. Different styles may lead to different codes. Therefore, it is necessary to choose a widely used style as the standard.
Third, the One Code, One Character (OCOC) doctrine, i.e., every code should have only one corresponding character and vice versa, is obviously desirable. At present, most coding systems cannot satisfy this. For example, the methods based on Pinyin are One Code, Multiple Characters (OCMC); i.e., one code has many corresponding characters sharing a common pronunciation. The Pinyin System consists of 403 different sounds that are pronunciations of Chinese characters. Each sound can be read in four different tones that may not be used as input into a computer system. On average, every sound corresponds to 17 characters with totally different meanings. For example, in the Xin Hua Dictionary, the Pinyin shi represents 78 characters that include: city, scholar, food, to lose, matter, wet, poem, lion, world, dead body, ten, stone, time, real, to be, to drive, to try. The Pinyin fu has 98 different characters, ji 119, and yi 131. The context provides the only means of distinguishing between them in spoken Chinese. On a computer, even if the users are familiar with Pinyin, they must look for the desired character among all the different characters. Furthermore, China has eight major dialects with totally different pronunciations. A coding system based on pronunciation can raise severe problems in inputting Chinese into a computer system. Clearly, OCOC is a requisite in good Chinese character coding systems. A major disadvantage of current OCOC systems, such as the Standard Code of Chinese Characters, is that the operator has to find the code in a chart that contains thousands of characters each time a character is entered.
Finally, a good coding system should be simple enough so that users can quickly figure out the code when they see a character. This is termed "See Character, Know Code" (SCKC). In this way, the method can be grasped easily without vast knowledge of the Chinese language.
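The difference between the OCOC and OCMC properties described in the third requirement can be made concrete with a small check. The following is only a minimal sketch: the function name and both tables are hypothetical, the English glosses from the text stand in for the actual characters read shi, and "character A" and "character B" are placeholders rather than real entries.

```python
from typing import Dict, List


def is_one_code_one_character(table: Dict[str, List[str]]) -> bool:
    """Return True when each code maps to exactly one character
    and no character appears under more than one code."""
    seen = set()
    for characters in table.values():
        if len(characters) != 1:
            return False  # one code, multiple (or zero) characters
        if characters[0] in seen:
            return False  # one character, multiple codes
        seen.add(characters[0])
    return True


# Pronunciation-based table: glosses stand in for a few of the
# characters that share the Pinyin "shi" (illustrative subset only).
PINYIN_TABLE = {"shi": ["city", "scholar", "food", "to lose", "matter"]}

# Shape-based table with placeholder character names (hypothetical).
SHAPE_TABLE = {"377831": ["character A"], "111242": ["character B"]}
```

With these tables, `is_one_code_one_character(PINYIN_TABLE)` is false because the operator would still have to pick among several candidates, while `is_one_code_one_character(SHAPE_TABLE)` is true.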
The Six-Digit Coding Method (SDCM) is designed to meet all of the above requirements. It is the first coding method that is based on the shape of characters. This article is organized as follows. We first illustrate the principles of this method. The coding rules are briefly explained next, and the advantages of this method are summarized. Finally, we discuss various viewpoints.
In SDCM, a Chinese character is divided into six sections. Each section is represented by a decimal digit. All six digits then make up a code that denotes the character. The standard form of characters for this method follows the Font Table of Commonly Used Characters in Printing, jointly published by the Ministry of Culture and the Language Reform Committee of the People's Republic of China in Beijing in 1964. For convenience, we have adopted the most widely used Old Song style. A Chinese character can consist of single, double, and triple structures or combinations of these. For example, one character [glyph omitted] is considered to have a single structure, another [glyph omitted] a double structure, and a third [glyph omitted] a triple structure. On this basis, we classify each character into one of the following four types of structures:
1. Single and Double Structures
A character of this type is divided into six sections as shown in Figure 1.
Section 1 is the upper left corner;
Section 2 is the lower left corner;
Section 3 is the upper right corner;
Section 4 is the lower right corner;
Section 5 is the combination of Section 1 and Section 3;
Section 6 is the combination of Section 2 and Section 4.
We have categorized about 100 commonly used strokes into nine comprehensive groups, and assigned digits 1-9 to these groups. The coding table of strokes is not included here.
Each section is coded by a decimal digit that corresponds to the stroke of the character that occupies the section and belongs to the group to which the digit is assigned. Unoccupied sections or sections with ungrouped strokes are represented by the digit 0. Section 1 corresponds to the left-most digit; section 2 corresponds to the second left-most digit; and so on.
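The digit ordering just described can be sketched in a few lines. This is an illustrative sketch only: the coding table of strokes is not reproduced in this article, so the function takes the stroke-group digit of each section as given, and the names `SECTIONS` and `assemble_code` are hypothetical.

```python
from typing import Mapping

SECTIONS = (1, 2, 3, 4, 5, 6)  # digit order: section 1 is the left-most digit


def assemble_code(section_digits: Mapping[int, int]) -> str:
    """Build a six-digit code from per-section stroke-group digits.

    `section_digits` maps a section number (1-6) to the digit (1-9) of the
    stroke group chosen for it. Sections left out are treated as unoccupied
    (or as holding an ungrouped stroke) and are coded as 0.
    """
    for section, digit in section_digits.items():
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        if not 0 <= digit <= 9:
            raise ValueError(f"digit out of range for section {section}: {digit}")
    return "".join(str(section_digits.get(s, 0)) for s in SECTIONS)
```

For instance, `assemble_code({1: 3, 2: 7, 3: 7, 4: 8, 5: 3, 6: 1})` yields `"377831"`, the example code quoted in the coding rules, and a character with only section 1 occupied by a group-1 stroke would yield `"100000"`.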
THE CODING RULES OF SDCM
All basic building blocks of Chinese characters and their corresponding codes are listed in the coding table of strokes. A character is first classified into one of the four types; then it is coded digit by digit from section 1 to section 6. If the start or the end of a stroke sticks out in a section, then this stroke is chosen to code the section. A stroke cannot be coded more than once, unless it crosses other strokes. If it crosses another stroke once, it may be coded twice; and if it crosses twice, it may be coded three times. In writing Chinese, characters are generally written from top to bottom, left to right. The start of a stroke is where you begin the stroke, and the end of a stroke is where you finish the stroke.
SDCM utilizes this feature in its coding. For example, in the character [character omitted], there are two strokes starting in section 1, namely [stroke omitted] and [stroke omitted]. The former is selected to code this section because it is higher than the latter. In section 2, [stroke omitted] is chosen because its end is more exposed at the lower left corner than L is. Similarly, [stroke omitted] is picked for section 3, and [stroke omitted] is selected for section 4. In section 5, [stroke omitted] is selected again because it is the most exposed and it crosses [stroke omitted]. For section 6, [stroke omitted] is coded. Thus, according to the coding table, the code for [character omitted] is 377831. Another example for the crossing rule is the character [character omitted], whose code is 111242.
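The crossing rule lends itself to a simple consistency check on a proposed stroke selection. The sketch below is hypothetical in every name it uses; it also extrapolates the rule linearly beyond two crossings, which the text does not state and which should be read as an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def max_codings(crossings: int) -> int:
    """Times a stroke may be coded: once, plus once per crossing.

    The text gives this only for 0, 1, and 2 crossings; larger values
    are an extrapolation (assumption).
    """
    if crossings < 0:
        raise ValueError("crossings must be non-negative")
    return crossings + 1


def overused_strokes(
    selected: Sequence[Optional[str]], crossings: Dict[str, int]
) -> List[str]:
    """Return stroke labels coded more often than the crossing rule allows.

    `selected` lists the stroke chosen for sections 1-6 (None if unoccupied);
    `crossings` gives how many other strokes each labelled stroke crosses.
    """
    uses = Counter(s for s in selected if s is not None)
    return sorted(
        s for s, n in uses.items() if n > max_codings(crossings.get(s, 0))
    )
```

A selection that reuses stroke `"a"` in sections 1 and 5 passes when `"a"` crosses one other stroke, `overused_strokes(["a", "b", "c", "d", "a", "e"], {"a": 1})` returns an empty list, and fails with `["a"]` when it crosses none.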
THE ADVANTAGES OF SDCM
First, SDCM adopts the Old Song style which is unified, widely established, and used in printing. To avoid errors in coding, we have standardized SDCM by revising a few of the irregularly formed characters that resulted from printing errors.
Second, 11,100 characters have been coded with SDCM, and no two characters share the same code. This fulfills the OCOC requirement.
Third, SDCM is the first method of coding based on the shape of characters. It is efficient and convenient. To code a character, a stroke is selected for each section and then its shape is compared with the strokes in the coding table to determine the digit for this section. In other words, the code depends only on the positions and shapes of the strokes in the characters. It is independent of the order of the strokes in the character. Consequently, it makes SCKC possible and requires little knowledge of the Chinese language. Furthermore, on the basis of psychological analysis done in China, people recognize a character by first looking at the exposed parts of the sections of the characters at the sides and corners. If they still cannot tell what the character is, they then proceed to look at the subtle differences in the middle [5, 9]. Therefore, the four corners of the character sections, as well as the top and the bottom, can be effectively used to differentiate between various characters. After coding thousands of characters, we have found that six sections suffice to identify a character even when some strokes are not coded, as shown in Examples 4, 5, and 6 in the previous section.
Fourth, by using six decimal digits to code Chinese characters, we can potentially code one million characters and still achieve the OCOC requirement. It is estimated that there are 50,000-60,000 characters in total.
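The capacity figure follows from simple counting, and it can be compared with the character counts quoted in this article. The sketch below is only illustrative; the function names are hypothetical, and the character counts are the ones given in the text.

```python
CODE_LENGTH = 6    # digits per code
DIGIT_VALUES = 10  # decimal digits 0-9


def code_capacity(length: int = CODE_LENGTH, radix: int = DIGIT_VALUES) -> int:
    """Number of distinct codes of the given length."""
    return radix ** length


def occupancy(n_characters: int, length: int = CODE_LENGTH) -> float:
    """Fraction of the code space used if every character has its own code."""
    return n_characters / code_capacity(length)


print(code_capacity())    # 10**6 distinct six-digit codes
print(occupancy(11_100))  # share used by the characters coded so far
print(occupancy(60_000))  # share used by the upper estimate of all characters
```

Under these figures, the 11,100 characters already coded occupy about 1.1 percent of the code space, and the upper estimate of 60,000 characters would occupy 6 percent.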
Finally, with SDCM, one can code characters quickly. We expect that anyone with limited knowledge of Chinese would be able to master all the coding rules and code characters, without checking the coding table, after only two or three days of training. The code can then be quickly entered on a numerical keyboard.
Some critics say that a six-digit number is too long for one character and that it would slow down the input speed. For instance, they argue that the Pinyin System requires an average of three keystrokes to input the Pinyin of a character, but SDCM always needs six keystrokes per character, so the keystroke ratio of Pinyin to SDCM is 1:2. The difference here is that the six keystrokes in SDCM call up the exact desired character because SDCM satisfies the OCOC requirement. In contrast, after the three keystrokes in Pinyin, you have identified only the pronunciation of the character, and you still have to choose the exact character of interest among all the characters that share the same pronunciation. The actual number of characters to choose from can range from three or four to over one hundred, depending on the specific pronunciation. The average, as we noted in the introduction, is seventeen. Since we plan to code 50,000 to 60,000 characters with the SDCM, six decimal digits are necessary. Moreover, SDCM is designed to reduce the amount of memorization. The additional typing can be compensated by the efficiency and the convenience of coding with the method.
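The keystroke comparison can be explored with a toy model. Everything about the selection step here is an assumption made for illustration and does not come from the article: candidates are imagined to be shown in pages of nine, each page turn and the final pick cost one keystroke, and every candidate is taken to be equally likely. The model counts keystrokes only and ignores the time spent scanning the candidates.

```python
from fractions import Fraction

SDCM_KEYS = 6             # fixed code length stated in the text
PINYIN_SPELLING_KEYS = 3  # average quoted by the critics in the text


def expected_selection_keys(candidates: int, page_size: int = 9) -> Fraction:
    """Expected keystrokes to pick one of `candidates` homophones.

    Assumption: the candidate at 0-based position i costs i // page_size
    page-turn keys plus one selection key, with all positions equally likely.
    """
    if candidates < 1 or page_size < 1:
        raise ValueError("candidates and page_size must be positive")
    page_turns = sum(i // page_size for i in range(candidates))
    return Fraction(page_turns, candidates) + 1


def pinyin_keys(candidates: int, page_size: int = 9) -> Fraction:
    """Spelling keystrokes plus expected selection keystrokes (toy model)."""
    return PINYIN_SPELLING_KEYS + expected_selection_keys(candidates, page_size)
```

Under these assumptions, the average case of seventeen candidates still costs fewer than six keystrokes, while a pronunciation such as yi with 131 candidates costs more, so the outcome depends heavily on the specific pronunciation and on how selection is implemented.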
The SDCM provides a useful tool for Chinese character processing on computers. We believe that the SDCM will have a great impact on Chinese information processing because it adopts the standard style, achieves One Code, One Character, and can be used to code all forms of characters efficiently. We have already coded over eleven thousand characters, and we plan to develop SDCM software and use neural networks to code characters.
(Tables and other figures omitted)
1. Archer, N.P., et al. A Chinese-English microcomputer system. Commun. ACM 31, 8 (Aug. 1988), 977-982.
2. Chen, C.K., and Gong, R.W. Evaluation of Chinese input methods. Comput. Process. Chinese & Oriental Lang. 1 (Nov. 1984), 236-247.
3. Chiu, A., and Wang, F. An intelligent, knowledge-based Chinese input system. Comput. Process. Chinese & Oriental Lang. 3 (May 1987), 25-32.
4. DeFrancis, J. The Chinese Language: Fact and Fantasy. Univ. of Hawaii Press, Honolulu, Hawaii, 1984.
5. Li, J.L. Ji Zhong Shi Zi Xin Li Chu Tan (A primary research on [the] psychology of character recognition). In Ji Zhong Shi Zi Jiao Xue Xuan (Selected Articles on the Education of Character Recognition), Edited by the Central Research Institute of Science of Education. Science of Education Publishing Co., Beijing, 1980, 195-215. (In Chinese)
6. The People's Republic of China National Standard Code of Chinese Graphic Characters Set for Information Interchange Primary Set. Technical Standards Press, Beijing, 1981. (In Chinese)
7. Xin Hua Dictionary, 6th ed. Shang Wu Press, Beijing, 1987. (In Chinese)
8. Yu, W.C.P., and Chen, T.C. Two-level encoding for Chinese input systems. Comput. Process. Chinese & Oriental Lang. 1 (1984), 225-235.
9. Zhou, Y.G. Zhong Guo Yu Wen de Xian Dai Hua (The Modernization of Chinese Language). Shanghai Education Publishing Company, Shanghai, 1986. (In Chinese)
CR Categories and Subject Descriptors: H.4.1 [Information Systems Applications]: Office Automation; I.7.1 [Text Processing]: Text Editing
General Terms: Languages
Additional Key Words and Phrases: Chinese coding method, Chinese input system, Chinese text processing
Title Annotation: new coding method for Chinese characters
Author: Qiao, Jinan; Qiao, Yizheng; Qiao, Sanzheng
Publication: Communications of the ACM
Date: May 1, 1990