Communicating globally: the advent of Unicode.
With the exception of IBM mainframes, which use an 8-bit encoding scheme called EBCDIC (Extended Binary Coded Decimal Interchange Code), ASCII has become the standard for all data communications and computers. Our keyboards, video displays, system hardware, printers, font files, operating systems, e-mail, Internet, and information services (DIALOG, BRS, etc.) are all based on ASCII. The widespread use of ASCII is a tribute to its usefulness. The plain ASCII text file represents the lowest common denominator of text files because it is, with the exception of line-feeds and tabs, unformatted text. Before word processors came with elaborate file conversion utilities, this lack of formatting made ASCII the ideal tool for transferring text from one word processor to another. These same "plain vanilla" qualities have also made ASCII files the obvious choice for uploading messages to electronic mail and communication services.(2)
The ASCII Squeeze
While ASCII resolved many data transmittal problems, it suffers from some serious limitations. The most obvious problem is indicated by the first two words of the acronym ASCII: American Standard. ASCII is truly a standard for the U.S. It is no longer an appropriate standard for the highly interconnected global village that requires the easy transmission of data between countries using languages other than English. The basic problem that has made ASCII an outmoded standard is that there is simply no way all of the world's written languages can be represented in ASCII's 128 characters, or even in the 256 characters available when all 8 bits of a byte are used.
ASCII's lack of formatting can also be a problem. Its very simplicity causes difficulties for printers, publishers, and authors. For instance, when trying to include a quotation from any language except English, we have no simple way of representing umlauted or accented characters. In practice, we are forced to insert symbols from our word processors, or even worse, we may have to cut and paste paper versions of those quotes into a printed portion of the document. As a result we have lost the ability to create and electronically transmit a text document in a single file. Of course, we can avoid cutting and pasting by scanning the non-English text and then transmitting the text as a graphics file instead of as ASCII text. But this solution almost invariably results in compatibility problems with the publisher's software. ASCII's limitations even complicate life for programmers, who are forced to use multiple characters to represent such common mathematical operators as "not equal to," "less than or equal to," or "greater than or equal to." Once again, the problem can be traced back to the limited character set that ASCII can represent.
One response to the limitations of ASCII was the double-byte character set (DBCS). DBCS represents some characters using 1 byte (8 bits), while other characters are represented by 2 bytes (16 bits). Because the length of DBCS characters was not uniform, programs written to work with DBCS had to continually test each character to see if it was represented using 1 or 2 bytes.
Unicode: The 16-Bit Solution
A far more elegant solution that avoids the limitations of ASCII and the complications of DBCS is to switch to an encoding scheme in which every character is represented as 2 bytes (16 bits). And that's where Unicode comes in. Like ASCII, Unicode assigns a number for each character; unlike ASCII, Unicode's fixed 16-bit length will provide room for 65,536 characters, which, supposedly, will accommodate all major living languages, including ideographs used in Japan and China, plus non-Roman alphabets such as Cyrillic, Hebrew, Arabic, Greek, and Sanskrit. Unicode's character set also includes an expanded set of math and technical symbols, subscripts, superscripts, accent marks, control codes, and special codes that indicate the direction of the text (left to right or vice versa).(3)
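As a rough illustration of why a fixed 2-byte length simplifies programming, the Python sketch below stores three characters as 16-bit codes and fetches one of them by position alone. Python's `utf-16-be` codec is used here only as a stand-in for a fixed 16-bit layout; it matches that layout for characters whose codes fit in 16 bits, which is the case the article describes. The function name `nth_char_16bit` is hypothetical.

```python
CODE_SPACE = 2 ** 16  # number of distinct 16-bit codes: 65,536


def nth_char_16bit(data: bytes, n: int) -> str:
    """Return character n from text stored as fixed 2-byte big-endian codes."""
    code = int.from_bytes(data[2 * n:2 * n + 2], "big")
    return chr(code)


text = "Aé日"
stored = text.encode("utf-16-be")  # 2 bytes per character for these three

print(CODE_SPACE)
print(len(stored))
print(nth_char_16bit(stored, 2))
```

No per-character length test is needed: character n always begins at byte 2n.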
In fact, the name Unicode was coined to reflect the new code's important characteristics. It is meant to be universal--code designed to meet the needs of the international community--uniform--fixed-length codes for efficiency and simplicity of programming--and unique--with minimal duplication of character codes (important for Chinese characters).
The Evolution of Unicode: A Historical Perspective
U.S. software publishers currently produce approximately 75 percent of all installed software packages in the world. The American English versions of software are certainly convenient for U.S. users, but their predominance forces international users to work with software that, from their standpoint, is written in a foreign language. Although other countries have developed their own coding systems, the incompatibilities among the coding systems have made international software development both difficult and expensive. A better solution would be to establish an international standard coding system. And that is exactly what the International Organization for Standardization (ISO) decided to do.
In 1983, ISO began developing a 2-byte (16-bit) standard for character encoding, ISO 10646. The ANSI committee working on this new standard wanted it to be compatible with ASCII as well as all of the current national and international standards, for example, individual hardware and software manufacturers' standards, ISO 646 (an international standard almost identical to ASCII), and similar standards developed under the auspices of groups such as the ECMA (European Computer Manufacturers Association) and JISC (Japanese Industrial Standards Committee).
The first problem confronting the proposed new standard was what to do with the existing control codes, such as carriage return, formfeed, tab, and line-feed. Since these codes already exist in each of the international standards, one solution was to include all of the existing codes in the new standard; however, that solution would use up 40 percent of the 65,536 possible characters. On the other hand, if these codes were simply eliminated, the new standard would be incompatible with the existing codes.
The next major problem was how to deal with the Chinese, Japanese, and Korean ideograms. Because most of the ideograms are derived from Chinese, they are often referred to as Han characters--after the Han dynasty. In China the Han characters are called Hanzi; in Japan, they are called Kanji; and in Korea, they are called Hanja. Each Han character represents a word or concept as opposed to a letter. In many cases the characters are very similar and have the same meaning in Chinese, Japanese, and Korean. Rather than repeat each character for each language, the ISO 10646 standard committee tried to eliminate many thousands of the very similar and duplicated characters, so that the characters used by all of the major languages would fit into the 65,536 possible characters that can be formed using 2 bytes.(4) As expected, there was not unanimous consent among the nations, so the committee decided to include each character even though there was considerable overlap between the ideograms. It quickly became obvious that 16-bit character encoding would not provide a large enough character set. After considerable discussion, the committee reluctantly decided to use a 4-byte code, which would permit around 4 billion characters.
While the 4-byte code provided a large enough character set, many in the computer community objected to 4-byte encoding schemes because of the added data storage and communication costs. Data compaction schemes were considered, but most felt that they were too complicated. Clearly, a compromise was needed that would meet the demands of both the international community and the computer community.
In 1987, Joe Becker and Lee Collins of the Xerox Palo Alto Research Center and Mark Davis of Apple took a new approach. They decided to develop a code that was simpler than ISO 10646, yet still capable of encoding all characters necessary for a truly international computing standard. Representatives of other companies--including some big players in the computing world--soon joined the discussions, and in January 1991, the Unicode consortium was incorporated as Unicode, Inc. Research Libraries Group, Metaphor Computer Systems, Microsoft, IBM, Sun Microsystems, DEC, Adobe, Claris, NeXT, Pacific Rim Connections, Aldus, Go Corp., Lotus, WordPerfect, and Novell all became involved. Unicode 1.0--a specification for a universal encoding standard--was the result of the group's collaboration. ISO approved Unicode as a subset of ISO 10646. To minimize the conversion costs, ASCII and Latin-1 (8-bit ASCII) are subsets of Unicode.(5)
Because of the emphasis placed by ISO 10646 on migration to a new standard while maintaining compatibility with the old standard, ISO's standard is much larger than the Unicode system. There are other significant differences as well. ISO 10646 stores characters in 4 bytes (32 bits) while Unicode characters are always 2 bytes (16 bits) long. ISO 10646 supports the character codes from many existing character sets while Unicode takes the "unification" approach of eliminating the duplicate Han characters. The ISO standard also reserves some 28,672 codes to represent all the control codes already established while Unicode reserves space for only the 65 ASCII control codes.
The influence of the computer community on Unicode is quite evident in the designers' emphasis on completeness, simplicity, efficiency, unambiguity, and fidelity. As a universal code, Unicode was designed to have enough space to represent all the unique characters that appear in the world's languages. To be able to represent all those characters in a 2-byte code, the Unicode consortium spent years identifying and eliminating over 11,000 duplicated ideograms. By eliminating these "duplicates," enough space was freed to permit the representation of some "dead" languages such as Sanskrit. Even though each language does not have its own separate character set within Unicode, the unique characters from each of the languages are represented in Unicode.
For efficiency and simplicity, each character is represented as a unique, unambiguous code that occupies a fixed amount of storage space: 2 bytes. Using the Unicode standard eliminates the need for complex modes or escape codes to specify modified characters or special cases. Another advantage is that Unicode has built-in special control characters for handling changes in text direction within a single line of text.(6) To make sure that Unicode is a truly universal code, it is not only necessary that each character can be uniquely represented, but each character must also be interpreted in the same way by every software package, regardless of the language being used. Unicode also strives for fidelity; that is, textual data should not lose anything when it is being converted into or out of pre-existing character encoding standards. When you transmit an 'A' or an 'e', the receiver should get an 'A' or an 'e'.(7)
Making the Transition
Because ASCII was (and still is) the dominant standard for data encoding, Unicode's first 128 codes are the same as ASCII's.(8) Although this means that some of the sorting problems experienced with ASCII will remain, the migration from ASCII to Unicode will be greatly simplified. In Unicode, as in ASCII, the digits zero through nine are coded so that the 4 low-order bits of the code equal their binary values--zero is encoded as 0000, one is 0001, etc. Uppercase letters will continue to be sorted before lowercase letters. As a result, a capital 'Z' will continue to appear in a sorted list before all lowercase letters so that 'Z' will be sorted ahead of 'a'. Intellectually, it might be nice if Unicode would use traditional alphabetical sorting in which we do not distinguish between uppercase and lowercase letters. However, ASCII is so widely used that these inconveniences are minor compared to the enormous advantages of keeping Unicode as compatible with ASCII as possible.
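Both properties mentioned above are easy to check in Python, whose strings are compared by character code. The word list is made up for this sketch, and `str.casefold` is just one way to get the case-insensitive ordering the paragraph wishes for.

```python
# The 4 low-order bits of each digit's code equal the digit's value.
for digit in "0123456789":
    assert ord(digit) & 0x0F == int(digit)

low_bits_of_seven = format(ord("7") & 0x0F, "04b")  # the 4 low-order bits as text

# Sorting by code puts every uppercase letter ahead of every lowercase one.
words = ["apple", "Zebra", "mango"]
by_code = sorted(words)
alphabetical = sorted(words, key=str.casefold)  # ignores case when comparing

print(low_bits_of_seven)
print(by_code)
print(alphabetical)
```

The second sort shows that traditional alphabetical order can still be produced by software on top of the code order; it simply is not what a raw comparison of codes yields.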
The first 8,192 codes in Unicode are allotted to standard alphabetic characters, with additional room for ancient characters that might be added in the future. Because of the need to have Unicode released as soon as possible, the definition of some scripts has been delayed for the present time.(9) The next 4,096 codes are occupied by punctuation, mathematical operators, technical symbols, shapes, patterns, and even dingbats (the decorative characters that are so popular with word processor users). The next 4,096 characters are reserved for Chinese, Japanese, and Korean alphabets (as opposed to the logographic characters) and punctuation.(10)
By far the largest part of Unicode's character set is reserved for the unified Han characters--some 27,000 characters, as specified by the Chinese National Standard GB 13000. There is also room for future expansion of other code sets in case linguists decide that additional characters need to be added to existing scripts. Finally, Unicode includes 5,632 spaces in a private user area, for users to implement as they see fit under private agreement. There are also 495 code points in a compatibility area that is designed to help developers convert to Unicode.(11)
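The layout described above can be turned into a small lookup, sketched here in Python. Only the first three zones are modeled, because they are the ones whose sizes the text gives in sequence; the sketch assumes they are laid out back to back starting at code 0. The zone labels and the function name `zone_of` are invented for the example.

```python
# (size in codes, label) for the first three zones, in order from code 0.
ZONES = [
    (8192, "alphabets"),
    (4096, "punctuation and symbols"),
    (4096, "CJK alphabets and punctuation"),
]


def zone_of(char: str) -> str:
    """Return the zone a character's code falls into, per the sizes above."""
    code = ord(char)
    start = 0
    for size, name in ZONES:
        if start <= code < start + size:
            return name
        start += size
    return "beyond the first three zones"


for sample in ("A", "≥", "あ", "日"):
    print(sample, hex(ord(sample)), zone_of(sample))
```

Under these assumptions the letter A lands in the alphabetic zone, the "greater than or equal to" sign in the symbols zone, a Japanese hiragana letter in the third zone, and a Han character beyond all three.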
The Future of Unicode
While Unicode appears to be ASCII's heir apparent, it has not solved all of the problems involved in creating a truly universal computing language. Even something as seemingly straightforward as representing the time becomes surprisingly complicated when we try to establish a global standard. For instance, in the U.S. we might represent the time as 8:32 p.m.; in Canada, it would be written as 20:32, while in Switzerland, the same time would be written as 20,32,00. Similarly, each country has its own formats and conventions for representing dates, measurements, and money. But unless we implement universal standards for all numeric formats, dates, quantities, symbols, punctuation, etc., there is just no way that any encoding system will be able to resolve these inconsistencies. Software designers will simply have to continue to live with these different conventions.
Unicode, as might be expected, is not without its detractors. One of the major concerns for many librarians is the simplified encoding used to represent the Chinese characters. Several librarians working in Taiwan have told us that they have reservations because they won't be able to depict all of the classical Chinese characters. Meanwhile, some members of the computing community argue that the increased storage required to store and transmit Unicode characters is a definite drawback. After all, it appears that Unicode characters will take twice as long to transmit over communication lines as well as use twice as much disk storage space as the same text stored in ASCII.
Actually, the situation is not this bad. For one thing, word processing documents are not pure ASCII text; only about 10 to 20 percent of the file is used to store text characters; the rest is used to encode formatting and to add information so that the operating system knows how to process and store the file.
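The arithmetic behind both claims can be sketched in Python. The sample sentence, the 100,000-byte file size, and the function name `growth_factor` are made up for illustration, and the sketch assumes that only the text portion of a file doubles while the formatting portion stays the same size.

```python
text = "Unicode will certainly have a profound impact."
ascii_size = len(text.encode("ascii"))        # 1 byte per character
unicode_size = len(text.encode("utf-16-be"))  # 2 bytes per character


def growth_factor(total_bytes: int, text_fraction: float) -> float:
    """Factor by which a file grows if only its text portion doubles."""
    text_bytes = total_bytes * text_fraction
    return (total_bytes + text_bytes) / total_bytes


print(ascii_size, unicode_size)
print(growth_factor(100_000, 0.10))  # text is 10 percent of the file
print(growth_factor(100_000, 0.20))  # text is 20 percent of the file
```

Pure text doubles in size, but a word processing file in which text is 10 to 20 percent of the total grows by only about 10 to 20 percent under this assumption.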
Unicode will certainly have a profound impact on the computing industry. To implement Unicode, software developers must rewrite their existing software. However, unless they do convert to Unicode, they run the risk of losing their international markets. And as mentioned earlier, the list of major players behind Unicode reads like a "Who's Who" of the computing world. Unicode also has the potential to simplify software development. No longer will a software vendor be required to develop multiple versions of every software package, with each version specifically tailored to meet each market's individualized method of character encoding. The major software vendors are definitely betting that Unicode will simplify the process of modifying, updating, and upgrading their software packages, which is why Unicode's proponents argue that the new standard "will make multilingual software easier to write, information systems easier to manage, and international exchange of information more practical."(12)
What It Means for Librarians
Unicode will definitely have a major impact on the library community. As we mentioned in the accompanying sidebar, the MARC record uses its own definition of extended ASCII. Although the first 128 characters in the MARC record present no special problems in translating the records into Unicode, the characters from 128 to 255, currently used to represent foreign language characters not available in ASCII, will need special programs written to translate them into Unicode. While this is conceptually not a difficult problem, it is never easy or inexpensive to convert more than 40 million records, which is exactly the task facing the library community when it makes the switch from ASCII to Unicode. On the positive side, using Unicode in MARC records will allow us to add foreign titles to our electronic databases without transliterating the data. End users will then be able to search library catalogs in all languages for textual data rather than just the call number or ISBN. Of course, this assumes that the users have the software and the customized keyboards (or other input devices) needed to enter all the different international characters.
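A translation program of the kind described above is essentially a table lookup. The Python sketch below shows the shape of such a program only: the two table entries are placeholders invented for this example and are NOT the real MARC character assignments, and the names `PLACEHOLDER_TABLE` and `translate_record` are hypothetical.

```python
# Placeholder assignments for illustration only; these are NOT the actual
# MARC definitions for codes 128 to 255.
PLACEHOLDER_TABLE = {
    0x80: "\u00e9",  # e with acute accent
    0x81: "\u00fc",  # u with umlaut
}


def translate_record(data: bytes, table: dict) -> str:
    """Convert 8-bit record text to Unicode one byte at a time."""
    out = []
    for byte in data:
        if byte < 128:
            out.append(chr(byte))      # first 128 codes carry over unchanged
        elif byte in table:
            out.append(table[byte])    # codes 128 to 255 need the lookup table
        else:
            raise ValueError(f"no mapping for byte {byte:#04x}")
    return "".join(out)


print(translate_record(b"Caf\x80", PLACEHOLDER_TABLE))
```

Raising an error on an unmapped byte, rather than guessing, is a deliberate choice in this sketch: in a conversion of millions of records, silent substitutions would be hard to find afterward.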
For a complete definition of Unicode, see the two volumes compiled by The Unicode Consortium and published by Addison-Wesley in 1991. Volume 1, The Unicode Standard: Worldwide Character Encoding, Version 1.0, "covers the general principles of the Unicode standard, conformance issues, code space allocation, control characters, composite characters, directionality, implementation, future plans, case mapping tables, and mappings to and from various company code pages."(13) Volume 2 discusses the complex problems of encoding Chinese, Japanese, and Korean.
Unicode, Inc. can be reached at 408/777-5870; fax 408/777-5082. The e-mail address is unicode-inc@unicode.org and the URL for their Web site is http://www.stonehand.com/unicode/consort.html.
(1.) Dean Abramson, "Globalization of Windows," Byte (November 1994): 181.
(2.) "Typing Unicode characters from the Keyboard," PC Magazine (December 7, 1993): 442.
(3.) D. Barker, "Beyond ASCII: Group Promoting 'Global Code' for Information Exchange," Byte (May 1991): 36.
(4.) Although daily communications can get by using only 2,000 to 4,000 Han characters, Albertine Gaur estimates that the total number of ideograms in Chinese is around 50,000 characters; see Albertine Gaur, A History of Writing, rev. ed., New York: Cross River Press, 80.
(5.) John C. Dvorak, "Kiss Your ASCII Goodbye," PC Magazine (September 15, 1992): 93.
(6.) Chris Miller, "Transborder Tips and Traps," Byte (June 1994): 94.
(7.) Kenneth M. Sheldon, "ASCII Goes Global," Byte (July 1991): 108.
(8.) Charles Petzold, "Move Over, ASCII! Unicode Is Here," PC Magazine (October 26, 1993): 374-376.
(12.) Jonathan Beard, "Computer Code Speaks in Many Tongues," New Scientist (March 9, 1991): 8.
(13.) Cen Huang, "The Unicode Character Encoding Standard," in the document /English-Menu/ifcss.org/china-studies/compute/ccnet-archive/unicode.std, received (September 17, 1992); available from chuang@educ.ucalgary.ca; INTERNET.
RELATED ARTICLE: Encoding Characters: From Two Bits to a Byte
To show why Unicode is so important, here's a discussion of how computers store textual data, with implications about why current standards of encoding textual data (ASCII and EBCDIC) are inadequate to meet the needs of modern information technology.
Like their distant cousin the telegraph, computers represent letters, numbers, punctuation, and special symbols using a binary code. Instead of calling these characters a dot or a dash, however, the computer world represents data as the binary numbers: zero and one. Each zero or one represents one 'bit' of data. The term 'bit' is a contraction of BInary digiT. At first glance, it appears that our binary numbering system can only represent two unique states (a one or a zero). Of course, any system that can only represent two different states is not going to be very useful or interesting unless we can figure out a way in which the individual bits can be combined to represent all of the letters, numbers, punctuation, and special characters needed to store text inside a computer.
The solution is surprisingly simple: let each character be represented using a combination of bits. The more bits used to represent a character, the larger the number of characters we can represent. For example, if we use two bits we can represent four unique characters (or states): 00, 01, 10, and 11. If we use three bits to store a character, then we can represent eight unique characters; with four bits, our "alphabet" increases to 16 different characters. The number of unique characters that can be represented, thus, depends on how many bits we use to represent an individual character. Put another way, the number of different characters that can be represented in n bits is equal to $2^n$. According to this formula, if we use six bits ($2^6 = 2 \times 2 \times 2 \times 2 \times 2 \times 2$) we will be able to represent 64 unique characters.
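The formula can be tried out directly. The short Python sketch below lists the four 2-bit patterns and then counts the codes available at several widths; the function name `distinct_codes` is invented for the example, and the 7-, 8-, and 16-bit rows are included only for comparison with the codes discussed elsewhere in this article.

```python
def distinct_codes(bits: int) -> int:
    """Number of different characters that fit in the given number of bits."""
    return 2 ** bits


# Every pattern that two bits can form.
two_bit_patterns = [format(value, "02b") for value in range(distinct_codes(2))]
print(two_bit_patterns)

for bits in (2, 3, 4, 6, 7, 8, 16):
    print(bits, "bits:", distinct_codes(bits), "characters")
```

Each added bit doubles the size of the available "alphabet," which is why the jump from 8 to 16 bits is so much larger than the jump from 6 to 8.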
In fact, the first method of encoding data that was universally accepted as a standard was 6-bit BCD (binary coded decimal). Because of the high cost of computer storage, the decision to represent characters using only 6 bits actually made sense. Besides, in the late 1950s and early 1960s, computers were primarily used to crunch numbers. Once there was a need to print out reports or store text, the limitations of the 6-bit code became painfully obvious. With only 64 characters available, there were only enough unique codes to represent the letters A-Z, the digits 0-9, and 28 other characters: special symbols (* / + - @ % #), punctuation (period, comma, exclamation, and question mark), and some special functions (tab, space, line feed, etc.).
Although widely used, the 6-bit code was soon found to be too limiting. For starters, there were simply not enough unique characters available to represent lowercase letters or all of the characters on a standard keyboard. To overcome these serious limitations, the computing community changed over to what is undoubtedly the most famous standard in the world: ASCII (American Standard Code for Information Interchange). ASCII, in whose design Robert W. Bemer played a leading part, was first published as a standard in 1963 and was reissued by ANSI (the American National Standards Institute) in 1977. ASCII is a 7-bit code, which means that we can now represent 128 different characters. ASCII's expanded character set can represent both uppercase and lowercase letters, the digits, 33 symbols, and 33 control characters (characters that control peripheral devices such as a modem or a printer, for example, carriage return, linefeed, and tab).
The decision to standardize on a 7-bit code appears, in retrospect, a bit odd. Since it is so convenient to work with storage units based on a power of 2, virtually all modern computers are constructed to store and process characters as 8-bit units called bytes. In practice, each text character is actually stored in a byte (8 bits); consequently, computers can actually represent 256 different characters, since 256 unique bit patterns can be made using 8 bits ($2^8$). Because computers are designed to work with 8-bit bytes, ASCII is often referred to as an 8-bit code. Technically, this is not true; only the first 128 characters (numbered 0 through 127) are defined by the ASCII standard. The rest of the possible characters, represented by the numbers 128 through 255, were originally defined on PCs by an unofficial standard called IBM extended ASCII. This former industry standard has now been replaced for Windows-based applications by Microsoft's extended character set.(1) There is, however, one important library application that does not use either the Windows or IBM extended character set definition: the MARC record. Although MARC records do adhere to the ASCII standard for the first 128 characters, the definition of the final 128 characters is unique to MARC.
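The effect of having competing definitions for codes 128 through 255 can be seen with Python's built-in codecs. In this sketch `cp437` is taken as a representative of the IBM extended set and `cp1252` as a representative of the Windows set; that pairing is an assumption of the example, and MARC's own definitions are not modeled at all.

```python
high_byte = bytes([0x82])  # one code from the upper half, 128 through 255

as_ibm = high_byte.decode("cp437")       # IBM PC extended character set
as_windows = high_byte.decode("cp1252")  # Windows extended character set

# The lower half (codes 0 through 127) is read identically by both.
lower_half = bytes(range(128))
same_lower_half = lower_half.decode("cp437") == lower_half.decode("cp1252")

print(repr(as_ibm), repr(as_windows), same_lower_half)
```

The same stored byte comes back as an accented letter under one definition and as a punctuation mark under the other, while plain ASCII text is unaffected.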
1. Microsoft Word User's Guide (Redmond, Wash.: Microsoft Corp., 1993), 757-766.
|Title Annotation:||a new computer alphabet to replace ASCII|
|Author:||McClure, Wanda L.; Hannah, Stan A.|
|Publication:||Computers in Libraries|
|Date:||May 1, 1995|