A Chinese-English microcomputer system.
Computer handling of Japanese text is related to Chinese since the Kanji characters used in Japanese are of Chinese origin. In addition, there have been at least 400 projects  undertaken concerning the handling of Chinese character computer input by universities and other organizations in the PRC (see  for example). An indication of the interest and advancement in Chinese computing can be sampled by examining the proceedings of some of the International Conferences on Chinese Computing (for example, ) which contain numerous papers (in English) on the subject.
Part of the current growth in computer use in the PRC is due to increased use of mainframes and minicomputers [9, 26], with associated development in Chinese character systems [7, 11, 23, 27]. There is still a shortage of computing facilities in the PRC; however, a significant part of the recent growth in computer use has come from low-cost microcomputers. Moreover, much progress has been made recently with Chinese-English microcomputer systems that allow both Chinese and English to be used, thus gaining the advantage of access to existing English software and systems.
Conversion of an English-based ASCII microcomputer system is possible by inserting a software interface between the operating system and the input/output control system so that Chinese characters are detected and managed in a special way. With the appropriate system software, the modified system can support programs for input, display, print and communication of both types of characters--allowing the user to input either Chinese or English characters on the same keyboard, display them on the same screen, and print them on the same printer. When the system software is modified to handle Chinese as well as ASCII characters, existing English application software can be modified to include Chinese characters where text is used (in menus, for example). This allows the Chinese-speaking user ready and straightforward access to the software.
Chinese-English system packages have been developed for some of the more popular microcomputers such as the IBM PC, Apple II, and the Tandy TRS-80. There is a wide variety of these packages now in use, including the Dragon system, developed in Taiwan [5, 17]; the NK-DOS system, developed at Nankai University in the PRC; and a system developed by the PRC-based Electronic Research Institute. The examples given here will be based on the implementation of the NK-DOS system, due to our familiarity with it. Figure 1 shows the initial screen for the NK-DOS system.
A diversity of applications has been developed at Nankai University and an associated company (Nankai Technological Development Corporation) to interface with NK-DOS, including: database management systems used for medical information; libraries and warehouse inventory management; software for maintaining technical drawings and scientific designs; an expert system providing diagnostic checks for conventional Chinese medicine; an industrial financial management package; a production management package; and others.
COMPUTER CODING SYSTEMS FOR CHINESE
Many Chinese words consist of one character, but there are many more with two, three and sometimes four or five characters. The average number of characters in a sample of English text is about 3.7 times that of Chinese characters in equivalent text. Since Chinese is an ideographic language there are many thousands of Chinese characters, but the Chinese character set is relatively uniform throughout the PRC, even though there are eight major distinct languages spoken there and many more dialects within these languages. Mandarin Chinese (Putonghua) is the standard language in the PRC, which provides a standardizing influence advantageous to the class of computer input methods which rely on phonetics.
While most traditional Chinese writing is in vertical columns, modern Chinese usage in business and technical writing is from left to right as in Western languages, allowing for easier adaptation of Chinese to Western computer systems.
Many techniques have been developed for entering Chinese characters from a keyboard. Whichever method is used, the internal code which represents each character in the computer should adhere to some standard. Unfortunately, there is no single standard for computer coding of Chinese characters. Some coding systems use 2 bytes per character, others use 3, and some use as many as 6 bytes for the internal representation of each character.
The most popular coding system in the PRC uses a 2-byte internal code, which is faster to manipulate than the longer codes. This Standard Chinese Character Code for Information Interchange (GB 2312-80) has been defined by the PRC, and contains 2-byte codes for the 6,763 most commonly used Chinese characters in two classes: the first class contains the 3,755 most frequently used characters, and the second contains the 3,008 next most frequently used. It also includes characters from other languages such as English and Russian. If this code is used with a Chinese-English computer system, the most significant bit (MSB) of each byte can be flagged as a "1" if it represents a Chinese character and a "0" if it represents an ASCII character. A disadvantage is that an 8-bit character must be used for data communications or the MSB will be lost during transmission. Otherwise, code transformation must be performed before the data are transmitted.
Standards developed thus far for Chinese character codes also do not necessarily cross national boundaries and this inhibits the development of data communication systems. Modern reform of the Chinese character set in the PRC by eliminating variant forms and simplifying characters now requires translation of Chinese characters between the PRC and other countries using the old Chinese character set. Other countries have developed their own Chinese coding systems. For example, Taiwan has developed a Chinese character code for information interchange (CCCII) which contains 4,807 of the most frequently used and 16,197 less frequently used characters. There is also an International Syllabus of Chinese Characters with another coding scheme. Japan has adopted a 2-byte standard code--JIS C6228--to represent Kanji characters (which originated from China). Work is underway through the International Standards Organization (ISO) to develop standard international communication character coding systems for a number of languages, including Chinese. The proposed standard will also use a 2-byte character code.
CHINESE CHARACTER INPUT
The most difficult problem facing the designers of Chinese language computer systems has been in developing techniques for efficient and easy character input. Once the characters have been input, the problem of displaying, printing or transmitting Chinese characters can be handled as long as there is an agreed-upon code for the internal representation of the characters. Input techniques for Chinese characters may be separated into four general classifications: defining each character by the whole character, phonetically by sound, by shape, or by mixing sound and shape characteristics.
Input keying by whole character is based upon the use of a unique numerical code for each Chinese character. There are two well-known standards for these unique codes. The four-digit telegraph code is one of the earliest systems and is still in use. A second system is the GB 2312-80 standard character set discussed previously, represented by a 2-byte code. The obvious difficulty with using either of these systems is that it is impossible to memorize the thousands of numbers needed to represent the characters, and users must resort to looking these up in a table.
Other methods use whole character input through special keyboards. A big keyboard which has been used contains up to 4,000 keys, one for each of the more commonly used Chinese characters. But searching a keyboard this size to find the required key is tedious and awkward. A "middle" keyboard has many fewer keys, but allows the use of special keys to select from among many possible "pages" of character images on the same keyboard. While this is an improvement on the big keyboard, it still requires a great deal of practice to become proficient in its use. Other methods [28] allowing the use of much simpler keyboards are based upon two-level encoding of Chinese characters, requiring two keystrokes for each character.
A second classification uses the phonetic symbols of Mandarin Chinese. These are written in Roman characters, a phonetic spelling known as "hanyu-pinyin." The difficulty with the phonetic entry of Mandarin characters is that there may be many characters with the same pronunciation (homonyms). To solve this problem, the homonym character set for each phonetic string entered may be displayed automatically on the screen, allowing the user to select the desired character. While phonetic input of characters or words allows a reasonable rate of speed after some practice, users who are not very familiar with Mandarin Chinese and its phonetic spelling will obviously still have difficulties. However, most Chinese children are trained in the phonetic spelling of Mandarin Chinese during their early school years, providing an educational background to build upon.
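Phonetic entry with homonym selection can be sketched as a table lookup followed by a numbered choice. The table below is a toy with three hypothetical entries (tones ignored), and the function names are illustrative only; a real system would hold the full character set and would probably order the candidates by frequency of use.

```python
# Toy homonym table: pinyin syllable (tone ignored) -> candidate characters.
HOMONYMS = {
    "ma": ["妈", "麻", "马", "骂"],
    "mu": ["木", "目", "母"],
    "lin": ["林", "临", "邻"],
}


def candidates(pinyin: str):
    """Return the homonym list to display for a phonetic string."""
    return HOMONYMS.get(pinyin.lower(), [])


def select(pinyin: str, choice: int) -> str:
    """Return the character picked from the numbered menu (1-based)."""
    options = candidates(pinyin)
    if not 1 <= choice <= len(options):
        raise ValueError(f"no candidate {choice} for {pinyin!r}")
    return options[choice - 1]


# Show the menu for "mu", then pick the first entry ("wood").
for number, char in enumerate(candidates("mu"), start=1):
    print(number, char)
print(select("mu", 1))
```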
A third classification of character entry methods involves representing the shape or position of the character. There are many ways of breaking down character shapes into various types of fundamental characteristics. These approaches are feasible, but very tedious for the more complex characters. A method of building more complex characters from duplications of simple characters can also be used. For example, the Chinese character for "wood" (shown later in Figure 3) can be repeated to form a new character which is a combination of two of the original characters depicted. This Chinese character means "forest." The technique is complex because of the knowledge required about character structure, but Chiu and Wong have recently developed a knowledge-based system to handle the complexities of shape-based input.
There is an additional input classification which attempts to combine the features of shape and sound of Chinese characters, and there are also several other more advanced techniques. For example, Becker describes a multilingual word processing system developed by the Xerox Corporation which uses phonetic entry of entire Chinese words. The occurrence of homonyms in words is far less likely than with characters, and therefore users of this technique are not often required to make choices among homonyms. This requires storage of a Chinese dictionary of commonly used words which is searched by the computer for each word entered. It needs a relatively powerful computer system, since it is a larger task than searching a more limited set of characters. (To gain a better perspective of the many Chinese character input techniques, a paper by Chen and Gong should be consulted for a relative evaluation of 10 of the more commonly used systems, out of approximately 30 available from computer vendors in Taiwan.)
Because users may wish to use more than one of the many input techniques available, depending upon the knowledge of the user or the particular text being input, it is customary to offer several input options to the user of a Chinese system. To illustrate, Figure 2 shows the input option screen for the NK-DOS system which runs on an IBM PC or equivalent microcomputer.
The software for handling the input of Chinese characters may allow several such options for character input, including the ASCII character set, and thus requires software to handle each such option. However, all such character information must be transformed into a standard internal character code which can be processed by the character management software. In the newer microcomputer-based systems, keyboard control is through operating system functions. For example, in NK-DOS an expanded 1 Kbyte of RAM is used for the keyboard buffer. The input from the keyboard is stored in a FIFO queue. NK-DOS analyzes the keyboard input to determine whether it is a control, Chinese, or ASCII character. Special subroutines are called in each case to perform the required function. The entire keyboard interface program occupies roughly 10 Kbytes of memory space and is stored in EPROM.
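The queue-and-classify step can be sketched as follows. The classification rule is an assumption made for illustration (MSB set means the first byte of a 2-byte Chinese code, the ASCII control range means a control character); the text does not state the rule NK-DOS applies, and the function names are hypothetical.

```python
from collections import deque


def classify(queue: deque):
    """Pop one complete code from the FIFO queue and classify it."""
    first = queue.popleft()
    if first & 0x80:                       # MSB set: 2-byte Chinese code
        if not queue:
            raise ValueError("queue ended in the middle of a 2-byte code")
        second = queue.popleft()
        return "chinese", bytes([first, second])
    if first < 0x20 or first == 0x7F:      # ASCII control range
        return "control", bytes([first])
    return "ascii", bytes([first])         # printable ASCII


def drain(queue: deque):
    """Classify every code in the queue, oldest first."""
    results = []
    while queue:
        results.append(classify(queue))
    return results


# An ASCII letter, a carriage return, then one Chinese character.
keyboard_buffer = deque(b"a\r" + "木".encode("gb2312"))
for kind, code in drain(keyboard_buffer):
    print(kind, code.hex())
```

In a full system each of the three kinds would be dispatched to its own subroutine, as the text describes.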
CHINESE CHARACTER PRINTING
Chinese characters are usually printed on dot matrix or laser printers. Because of the complexity of Chinese characters, a resolution of at least 16 x 16 pixels is normally required to display the characters in a readable form on printers or screens. For most standard printers, the Chinese character sets are not included in ROM, so they must be transmitted in graphic form to the printer. The entire character set can be stored in computer storage in a form that can be suited to the resolution of the printer. Chinese characters are typically generated at twice the height and width of ASCII characters.
For printer control under NK-DOS, for example, the Chinese print function calls must cover a much wider spectrum of functions than the English function counterparts. Depending upon the type of printer that is used, NK-DOS will execute different print programs. The major function of the print program is to create a Chinese/English print buffer so that one row of characters in dot matrix format will be stored and then transmitted to the printer in graphic print mode. The process includes acquiring printer information, setting up and converting the dot matrix for each character, and deciding on print format. For printing the Chinese characters in different sizes, NK-DOS relies on the MS-DOS operating system functions for execution.
Chinese characters can be displayed on standard screens in 16 x 16 pixel cells, but higher resolution systems could use 24 x 24, 32 x 32 or even more pixels per character. Since at least one column in each character must be used as a separator, only 16 x 15, 24 x 22, or 32 x 30 pixels are normally used for the character itself. The display of Chinese characters is no different than for ASCII characters, but using a standard microcomputer system requires that the Chinese characters be bit-mapped as special graphical characters. Figure 3 shows the bit map for the Chinese character "wood." The internal hexadecimal representation of the 16 x 16 bit map of the character shown in the figure is 00, 80, 00, 80, 00, 80, 00, 80, 7F, FF, 01, C0, 02, A0, 04, 90, 08, 88, 10, 84, 20, 82, 40, 81, 00, 80, 00, 80, 00, 80, 00, 80.
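The 32 bytes quoted above can be turned back into a picture with a few lines of Python. The sketch assumes the bytes are stored row by row, two bytes per 16-pixel row, with the most significant bit as the leftmost pixel; the text does not state the layout, but this reading yields a recognizable character. The names WOOD and render are illustrative.

```python
# The 16 x 16 bit map for "wood", copied from the hexadecimal listing.
WOOD = bytes([
    0x00, 0x80, 0x00, 0x80, 0x00, 0x80, 0x00, 0x80,
    0x7F, 0xFF, 0x01, 0xC0, 0x02, 0xA0, 0x04, 0x90,
    0x08, 0x88, 0x10, 0x84, 0x20, 0x82, 0x40, 0x81,
    0x00, 0x80, 0x00, 0x80, 0x00, 0x80, 0x00, 0x80,
])


def render(bitmap: bytes, width: int = 16):
    """Return one text row per pixel row: '#' for a set bit, '.' for a clear bit."""
    bytes_per_row = width // 8
    rows = []
    for start in range(0, len(bitmap), bytes_per_row):
        value = int.from_bytes(bitmap[start:start + bytes_per_row], "big")
        bits = format(value, f"0{width}b")      # zero-padded binary string
        rows.append(bits.replace("0", ".").replace("1", "#"))
    return rows


print("\n".join(render(WOOD)))
```

Under this layout the horizontal stroke (7F, FF) leaves the leftmost column clear, which is consistent with one column being reserved as a separator.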
Typically there are 25 rows and 80 columns on a standard PC ASCII display screen, giving 2,000 characters per screen. On the IBM PC, the monochrome display system has a character generator and a 4 Kbyte memory buffer (1 byte for each character and 1 byte for its attributes) to store the codes for the characters to be displayed. If the character generator is not used, as with the high resolution monochrome graphics mode of the IBM color graphics adapter used by the NK-DOS 02 system, then 16 Kbytes of display memory are available to store the necessary character bit maps for the 640 x 200 screen. For 16 x 16 characters the display memory can store only 480 character cells, and display 12 rows with 40 characters per row. For NK-DOS 04 and later versions, a screen resolution of 640 x 400 may be used with a monochrome graphics adapter, allowing the display of 25 lines with 40 characters per line.
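The capacities quoted above follow from simple arithmetic, worked through here as a check. The function names are illustrative, and the sketch assumes one bit per pixel in monochrome graphics mode.

```python
def cell_grid(screen_w: int, screen_h: int, cell_w: int = 16, cell_h: int = 16):
    """Return (columns, rows, total cells) of whole character cells on a screen."""
    cols = screen_w // cell_w
    rows = screen_h // cell_h          # a partial row at the bottom is dropped
    return cols, rows, cols * rows


def frame_bytes(screen_w: int, screen_h: int) -> int:
    """Bytes needed for a monochrome frame at 1 bit per pixel."""
    return screen_w * screen_h // 8


print(80 * 25 * 2)               # text mode: character byte + attribute byte per cell
print(frame_bytes(640, 200))     # graphics mode frame, fits within 16 Kbytes
print(cell_grid(640, 200))       # 16 x 16 cells on the 640 x 200 screen
print(cell_grid(640, 400))       # 16 x 16 cells on the 640 x 400 screen
```

The 640 x 200 screen yields 40 columns by 12 rows, or 480 cells, because 200 pixels hold only 12 whole rows of 16 pixels; doubling the height to 400 gives the 25 lines mentioned for later versions.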
The 16 Kbytes of display memory in the NK-DOS 02 system is used for direct storage of the Chinese and ASCII character bit maps. In a Chinese-English system, an ASCII character normally takes up half the number of columns and therefore half the display space of a Chinese character.
For the screen display of text, a "display buffer" corresponds to the block of text in RAM which is also concurrently stored in the display memory on the graphics controller card. The normal PC display capacity for ASCII characters is 2 Kbytes. Every byte in the display buffer is linked to a character in the display memory. Using the 2-byte coding system for Chinese characters, there are 2 bytes linked to a display memory character. The same buffer can be used for mapping both Chinese and English characters. Part of it can be used as a window which maps the Chinese character display memory. The high level software recognizes this memory as a virtual display buffer and accesses it, linking the virtual display buffer to the physical display screen by moving the window about on it. There is then no need to distinguish between Chinese characters and other characters which can also be represented in dot matrix form. The standard English characters can be written into display memory by a function in the BIOS (Basic Input-Output System) routines, invoked through a standard interrupt feature of the MS-DOS operating system.
To edit an ASCII screen display when a 2-Kbyte display buffer is used, one needs only to modify the display buffer and then call the BIOS display subroutine. However, in generating a graphic display, the character dot matrix as supplied in a user table is written into the 16-Kbyte display memory when the display function routine is called. Any ASCII character can also be displayed in the same manner in graphics mode. Editing of the screen (insert, delete, etc.) requires rewriting the entire display memory. If we want to use the 2-Kbyte character generator memory map which stores the ASCII character information, we can use a simple subroutine to align the display buffer with the 16-Kbyte display memory. This allows editing the display memory as if we were using the character generator.
To display Chinese characters, we can add a "Chinese character function call" subroutine. This is equivalent to the function call used to manage the display of ASCII characters. This routine links the Chinese part of the display buffer to the display memory. In the NK-DOS 02 system, for example, the Chinese character area occupies 960 bytes (2 bytes per character referenced) of the 2-Kbyte display buffer. To handle this situation, the NK-DOS system uses part of the display buffer as the Chinese display buffer. By changing the low address of the 960-byte Chinese virtual display buffer, this window can be moved about in the 2-Kbyte buffer, and thus scroll any of the characters referenced in the buffer onto the screen. The window movement is controlled by a Chinese display function call. For any high level software, this becomes a transparent interface controlled by either the ASCII or the Chinese display function calls, as determined by the initial system boot.
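The moving window can be modelled as a low address that slides over the 2-Kbyte buffer. This is a minimal sketch rather than the NK-DOS routine: the class and method names are hypothetical, and the scroll step of 80 bytes (one row of 40 two-byte codes) is an assumption.

```python
BUFFER_SIZE = 2048   # 2-Kbyte display buffer
WINDOW_SIZE = 960    # 480 character cells x 2 bytes per code
ROW_BYTES = 80       # assumed scroll step: 40 cells per row x 2 bytes


class ChineseWindow:
    """A 960-byte window positioned by its low address within the display buffer."""

    def __init__(self, buffer: bytearray):
        if len(buffer) != BUFFER_SIZE:
            raise ValueError("display buffer must be 2 Kbytes")
        self.buffer = buffer
        self.low = 0

    def move_to(self, low: int) -> None:
        """Set the low address, keeping the whole window inside the buffer."""
        if not 0 <= low <= BUFFER_SIZE - WINDOW_SIZE:
            raise ValueError(f"low address {low} puts the window outside the buffer")
        self.low = low

    def scroll(self, rows: int) -> None:
        """Move the window by whole rows (negative values scroll back)."""
        self.move_to(self.low + rows * ROW_BYTES)

    def visible(self) -> bytes:
        """Return the codes currently mapped onto the screen."""
        return bytes(self.buffer[self.low:self.low + WINDOW_SIZE])


window = ChineseWindow(bytearray(BUFFER_SIZE))
window.scroll(2)
print(window.low, len(window.visible()))
```

Only the low address changes when the window moves; the codes in the buffer stay where they are, which is what makes scrolling cheap.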
The NK-DOS software also attends to other screen display and editing functions. For communicating with the scan control chip in the graphics controller board, the original MS-DOS function calls are used, under MS-DOS control.
OPERATING SYSTEM AND HIGH LEVEL
To provide a system which works easily with both Chinese and ASCII characters, the operating system must be modified so it can recognize Chinese characters and support the associated character processing routines. Newer operating systems for use with Chinese character input are likely to have built-in capacity to handle multiple character sets, as in the TRON operating system being developed in Japan. However, it is possible to do this with existing hierarchical operating systems such as CP/M and MS-DOS. For example, the BIOS in the IBM PC has many function calls which control the operation of the display terminal, keyboard, printer and disk drives. The MS-DOS operating system has an interface level IBMBIO which executes a DOS command by calling BIOS. As shown in Figure 4, if a Chinese character processing interface level program is inserted between BIOS and IBMBIO by modifying the addresses in certain standard DOS interrupts, then basic Chinese character processing routines can be used to intercept calls which involve Chinese characters for input, display, printing, or communication. Using these procedures when DOS calls BIOS, the Chinese interface level passes control to the appropriate Chinese processing function if Chinese functions are involved. If not, the call is passed directly to BIOS, as in handling ASCII characters.
The Chinese-English interface for the NK-DOS system between the MS-DOS operating system and BIOS on an IBM PC consists of two parts. One part distinguishes between ASCII and Chinese characters. The other part of the interface is used for controlling peripheral devices. When the machine is booted, the user selects either ASCII or Chinese mode. If the Chinese mode is chosen, the system automatically begins execution of NK-DOS. Otherwise, the system operates under MS-DOS control.
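The interception scheme can be imitated in a few lines of Python, with a dictionary standing in for the interrupt vector table. Everything here is a simulation under stated assumptions: the handlers are stand-ins, the MSB test is the assumed way of detecting a Chinese code, and interrupt number 0x10 is used only because it is the PC's video service interrupt.

```python
def bios_display(code: bytes) -> str:
    """Stand-in for the original BIOS routine, which handles ASCII only."""
    return "BIOS:" + code.decode("ascii")


def chinese_display(code: bytes) -> str:
    """Stand-in for a Chinese character processing routine."""
    return "CHINESE:" + code.hex()


# Simulated vector table: interrupt number -> handler.
vectors = {0x10: bios_display}


def install_chinese_layer(table: dict, number: int, chinese_handler):
    """Replace a vector with an interface that filters Chinese codes.

    The original handler is kept so non-Chinese calls pass straight through.
    """
    original = table[number]

    def interface(code: bytes) -> str:
        if code and code[0] & 0x80:        # MSB set: route to Chinese processing
            return chinese_handler(code)
        return original(code)              # otherwise pass the call on to BIOS

    table[number] = interface
    return original


install_chinese_layer(vectors, 0x10, chinese_display)
print(vectors[0x10](b"A"))
print(vectors[0x10]("木".encode("gb2312")))
```

Callers keep using the same vector number, so software written for the unmodified system is unaware of the inserted layer.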
DATA COMMUNICATIONS WITH CHINESE
Data communications are still in their infancy in China and there is no standard technique for asynchronous communication. As an example, the NK-DOS 02 and NK-DOS 03 systems can transmit on both Omninet and Ethernet local area networks (LANs).
Some of the difficulty with Chinese character communication is due to the size of the character codes. When a 2-byte internal code is used for Chinese characters, all 16 bits are used for data. For synchronous communications or for asynchronous systems using 8-bit characters, all 16 bits are transmitted. However, for asynchronous communication systems using a 7-bit character, the MSB will be destroyed and thus Chinese character identification will be lost. Therefore, before characters are transmitted, it may be necessary to transform the data in order to avoid the loss of this identifying information. This is often done by using more than 2 bytes for the Chinese character information. The data transmitted may then be re-assembled into the original 2-byte code at the destination. Some Chinese character systems add a header in front of a Chinese character string. This header may be only 1 byte, but it is often 2 to 4 bytes. For a 4-byte header, the first 2 bytes will be a specific flag to identify the following string as Chinese characters. The next 2 bytes in the header will contain such information as the packet size in bytes, including the header. A packet with a 4-byte header can contain up to 8 Kbytes. Each Chinese character will use 2 bytes including the MSB, even though the MSB could be masked during communication to another system. If the Chinese system interface program is processing information for communication when it includes a Chinese character string, it puts headers in front of strings to be transmitted. It also strips headers off the strings it receives and resets MSBs to 1 for Chinese characters.
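The header scheme described above can be sketched for a 7-bit channel. The flag value and the way the size is packed are assumptions chosen for illustration (the text gives neither): the flag is an arbitrary pair of 7-bit bytes, and the size is split into two 7-bit bytes so that the header itself survives transmission.

```python
FLAG = b"\x1b\x24"       # hypothetical 2-byte flag marking a Chinese string
HEADER_LEN = 4           # 2 flag bytes + 2 size bytes
MAX_PACKET = 8 * 1024    # 8 Kbytes, header included


def frame(chinese: bytes) -> bytes:
    """Wrap a string of 2-byte Chinese codes for a 7-bit channel."""
    if len(chinese) % 2:
        raise ValueError("Chinese codes must be 2 bytes each")
    size = HEADER_LEN + len(chinese)
    if size > MAX_PACKET:
        raise ValueError("packet exceeds 8 Kbytes")
    # Assumed packing: size split into two 7-bit bytes, high part first.
    header = FLAG + bytes([size >> 7, size & 0x7F])
    payload = bytes(b & 0x7F for b in chinese)      # mask the MSBs for transmission
    return header + payload


def unframe(packet: bytes) -> bytes:
    """Strip the header and reset the MSB of every payload byte to 1."""
    if packet[:2] != FLAG:
        raise ValueError("missing Chinese string flag")
    size = (packet[2] << 7) | packet[3]
    if size != len(packet):
        raise ValueError("packet size does not match header")
    return bytes(b | 0x80 for b in packet[HEADER_LEN:])


original = "木林".encode("gb2312")
packet = frame(original)
print(packet.hex(), unframe(packet) == original)
```

Every byte of the framed packet has its MSB clear, so nothing is lost when the channel strips the eighth bit.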
CHINESE CHARACTER FONTS
Altogether there are about 7,500 characters in common use in the People's Republic of China, including the 6,763 Chinese characters in the first and second most commonly used groups, and counting Roman, Greek, Japanese kana, and special characters. In creating a character font, space should be reserved for about 8,000 characters. For 16 x 16 pixel resolution for each character, the font would require 256 Kbytes of storage. For 24 x 24 resolution, about 576 Kbytes is needed. The character font may be stored on floppy disk, hard disk, RAM, ROM or EPROM. While a floppy disk or hard disk may have enough storage for 16 x 16 resolution, both are too slow for most applications. The entire font may also be transferred into RAM when the program is being booted. However, this takes up so much RAM storage it may limit the use of application programs, particularly on most PCs using current versions of DOS, which are limited to 640 Kbytes of RAM storage. The best option is to store the character font in ROM or EPROM on a card which is inserted into the microcomputer system. This font can be shared by the printer and the screen. Some printers allow the installation of a Chinese ROM in the printer, thus simplifying the printing of the combined Chinese and ASCII character sets.
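The storage figures can be checked with a short calculation, assuming one bit per pixel and no compression. The function name is illustrative. The quoted Kbyte values match when 1 Kbyte is counted as 1,000 bytes.

```python
def font_bytes(characters: int, cell: int) -> int:
    """Storage for a square-cell bit-mapped font at 1 bit per pixel."""
    bytes_per_character = cell * cell // 8
    return characters * bytes_per_character


for cell in (16, 24):
    total = font_bytes(8000, cell)
    print(f"{cell} x {cell}: {cell * cell // 8} bytes per character, {total} bytes in all")
```

At 16 x 16 each character takes 32 bytes, giving 256,000 bytes for 8,000 characters; at 24 x 24 each takes 72 bytes, giving 576,000 bytes, which is most of the 640 Kbytes available under DOS.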
While most commercially available systems supply the basic set of Chinese characters for input, display and printing purposes, it may be necessary to develop special characters which are not available with the system. For this, a program must be provided for the user to develop a dot matrix display map of the character and store it in an auxiliary character font, along with the storage code which will be used to manipulate the character internally. However, user-developed characters can create problems in a distributed network environment, since they cannot be transmitted to other systems such as servers or PCs which do not recognize them.
Acknowledgments. This work was supported partly by the Chinese State Education Commission and the World Bank through financial assistance for Professor Huang and Professor Liu during a one year visit at McMaster University, and partly by the Natural Sciences and Engineering Research Council of Canada.
Author: Archer, N. P.; Chan, M. W. L.; Huang, S. J.; Liu, R. T.
Publication: Communications of the ACM
Date: Aug 1, 1988