Examining XML: New Concepts and Possibilities in Web Authoring.
XML, the eXtensible Markup Language, is being touted by many as the "next wave" of authoring options available on the World Wide Web. The XML draft, first proposed in November 1996, and the XML 1.0 specification, released in February 1998 as a World Wide Web Consortium (W3C) Recommendation,[1] have received a lot of attention. Many in the Internet community are introducing XML as a tool for revolutionizing information on the Web. As librarians try to stay abreast of forthcoming technologies, some wonder what XML is and how it will impact the library field.
Markup languages such as XML and HTML use tags or other modifiers to tell a computer how to display, recognize, or otherwise work with the marked material. Both XML and HTML originate from SGML, the Standard Generalized Markup Language. But XML is unlike HTML in that it can be extended to suit many different uses, hence the name eXtensible Markup Language. XML was created specifically with the Web environment in mind, so it is tailored to meet the Web's demands. The author can modify the markup language rather than the document. In HTML, if no tag does what the author needs, the author must either wait for a later version of HTML to add the necessary tag or devise a work-around. But an extensible language like XML allows for adjustment and growth.
The XML family includes several specifications that address different functions of markup. XLL (eXtensible Linking Language) governs the linking capabilities defined in the XPointer[2] and XLink[3] specifications. The XLink and XPointer specifications will take advantage of the full potential of hypertext by allowing the creation of new options such as multidirectional links. XSL (eXtensible Stylesheet Language)[4] specifies formatting and presentation of XML information, and XML governs the semantic and syntactic structures of the data. We will give only brief mention to the XLL and XSL specifications; the majority of this article will be concerned with examining XML.
Discovering What XML Is
XML is a metalanguage,[5] or a set of rules governing the development of unique tags for encoding documents. Instead of using the pre-existing tagset available in HTML, users can design their own tags, which define the content, syntax, and semantics of their data. In other words, to encode a MARC record, you could design a tagset specifically for this purpose. A possible example of an XML-encoded MARC record is shown in Figure 1.
We have just created the rudiments of a markup language. The library field, for instance, could create its own markup language containing tags such as <abstract>, <article>, <dissertation>, or any of the tags mentioned in Figure 1. Markup languages using XML applications for mathematics (MathML)[6] and chemistry (CML)[7] are already under development. XML provides the ability to formulate precise tags relating to librarianship, or any other subject.
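To make the idea of a purpose-built tagset concrete, here is a minimal sketch in Python, using the standard xml.etree.ElementTree module to read a small record. The tag names (MARC_RECORD, BOOK, AUTHOR, TITLE, YEAR, SUBJECT_HEADING) and the record contents are hypothetical, chosen only for illustration; they are not taken from any published MARC tagset.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagset and made-up sample data, for illustration only.
RECORD = """<MARC_RECORD>
  <BOOK>
    <AUTHOR>Doe, Jane</AUTHOR>
    <TITLE>Sample Title</TITLE>
    <YEAR>1998</YEAR>
  </BOOK>
  <SUBJECT_HEADING>Sample subject</SUBJECT_HEADING>
</MARC_RECORD>"""

root = ET.fromstring(RECORD)
book = root.find("BOOK")

# Each tag name states what its content means, so fields can be read by name.
fields = {child.tag: child.text for child in book}
subject = root.findtext("SUBJECT_HEADING")

print(fields["TITLE"], "/", fields["AUTHOR"])
print("Subject:", subject)
```

Because the tags describe the data rather than its appearance, a program can ask for the TITLE or the AUTHOR directly instead of guessing from layout.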
Comparing HTML and XML
Because SGML is complex, hard to learn, and dependent on expensive software, alternative markup languages were developed. HTML is simple, easy to learn, and needs little in the way of software. XML provides the extensibility and adaptability of SGML, but, like HTML, is specifically designed for use on the Web. It is important to understand that XML is not likely to replace HTML and that the two are complementary. HTML addresses structure and presentation; it handles textual information very well. XML is best suited for handling data semantics and meaning, as well as data presentation issues not addressed in a purely text-oriented language like HTML. However, HTML is also much simpler to code and use than XML, which makes it better for the majority of casual Web authors. HTML will most likely remain the language of choice for encoding text and graphics as well as for arranging basic layouts.
So, what does all this mean? After all, everyone has been getting perfectly good information so far. But as many librarians know, there are limits to the types of data HTML can show. Charts and diagrams often cannot be displayed in text, but instead require the inclusion of an image. Data stored in an image format cannot be transferred to other applications directly. Complex equations and formulas in the science and mathematics fields cannot be typed out easily; often pages with multiple equations have to be entered in Adobe Portable Document Format (PDF), which does not allow for much use and manipulation of the data on the receiving end. Spreadsheets and other forms of data are limited in how they can be expressed via the Web. Since HTML is mainly geared toward documents with images and text, existing Web formats do not allow for interactive use of the many types of data. XML allows a great deal of new information to be transmitted via the Web.
Furthermore, since XML allows customization of tags, it permits authors to add a higher level of information, or metadata, to the document. Metadata can provide commentary or qualifications for the encoded data, including semantics and meaning. Using our previous example, MARC-record field information is directly embedded into each specific tag. Each tag therefore carries both the field name and its data, which could then be imported directly into a spreadsheet, database, or OPAC. The tags tell the program how to interpret the field names and the data contained in each. Thus the document is more portable and can be manipulated much more easily than in the current architecture of HTML.
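As a rough sketch of what "imported directly into a spreadsheet" could look like, the Python example below turns tagged records into comma-separated rows that a spreadsheet program can open. The CATALOG and BOOK wrappers, the field names, and the sample data are all assumptions made for this illustration, and the function records_to_csv is our own hypothetical helper, not part of any library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tag names and made-up records, for illustration only.
RECORDS = """<CATALOG>
  <BOOK><AUTHOR>Doe, Jane</AUTHOR><TITLE>Sample Title</TITLE><YEAR>1998</YEAR></BOOK>
  <BOOK><AUTHOR>Roe, Richard</AUTHOR><TITLE>Another Title</TITLE><YEAR>1997</YEAR></BOOK>
</CATALOG>"""


def records_to_csv(xml_text, columns):
    """Return CSV text with a header row and one row per BOOK element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for book in root.iter("BOOK"):
        # The tag names double as column names; a missing field becomes "".
        writer.writerow([book.findtext(name, default="") for name in columns])
    return out.getvalue()


csv_text = records_to_csv(RECORDS, ["AUTHOR", "TITLE", "YEAR"])
print(csv_text)
```

The same tag names that label the data in the document serve as the column headings, which is the portability the paragraph above describes.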
Style, presentation, and formatting issues can be addressed by implementing style sheets. Style sheets provide the advantage of separating style from content, and allow regulation and consistency of style. The main types of style sheets used with HTML include Cascading Style Sheets (CSS) and Document Style Semantics and Specification Language (DSSSL). Both of these can be used with XML documents as well; in fact, the XSL specification is based in part on DSSSL. The eXtensible Stylesheet Language (XSL) is being developed to specify conversion and formatting of XML documents. In the same way that XML allows authors to specialize their markups, XSL will allow for increased specialization of style for different types of documents.
The major disadvantage of using XML to author Web documents is the newness of the technology.[8] Microsoft Internet Explorer 4.0 seems to be the only commercial browser at this time that includes partial support for XML. For more information, visit the Microsoft XML home page at http://www.microsoft.com/xml/contents.htm. Both Microsoft Internet Explorer 5.0 and Mozilla (formerly known as Netscape Navigator 5.0) promise more comprehensive support for XML documents.
The cost of new software could be prohibitive to users and authors alike. Librarians must look at their user bases and determine if there is support for a new technology. Since the formal XML 1.0 Recommendation was just issued, it is very difficult to predict exactly how much growth it will see. Some members of the Internet community are pinning high hopes on XML, though, and we can anticipate that more XML-derived languages will be proposed and adopted.
Writing Documents in XML
Currently, some tools do exist for authoring and reading XML documents. A good starting point for learning more about XML applications is The Web Developer's Library Software Guide found at http://www.wdvl.com/software/xml. The list of software includes browsers, editors, database tools, and parsers that work with the XML specification, and this list continues to grow daily as interest in XML increases. Microsoft also has a Web page dedicated to possible XML scenarios at http://www.microsoft.com/xml/scenario/intro.asp. According to Microsoft, Word, Excel, Office, drawing packages, mailers, and spellcheckers all will use XML soon because its flexibility makes it wonderful for adapting to a variety of uses.
Writing XML documents is a little bit more complicated than writing HTML documents. Some of the rules are similar to well-written HTML, although in XML those rules must be followed whereas in HTML they are not mandatory but mainly a matter of traditionally accepted practice.
There are two major concepts in proper XML authoring. XML can be referred to as being "well-formed" and "valid." Simply, one could say that the well-formedness constraints are items that are absolutely mandatory under any circumstances, and the validity constraints have to be followed only under certain circumstances--particularly if you expect that the document will be read with a validating parser. (An XML parser is software that checks for the validity and well-formedness of an XML document.) An XML document that is not well-formed will give a fatal error, which means that all XML documents must be well-formed or they will not be displayed. Not all XML documents must be valid, however. A non-validating processor can be used with a document whether it is valid or not, while a validating parser may find an error, but not a fatal error, with an invalid document. The XML specification discusses a huge number of validity and well-formedness constraints.
Well-formed XML documents are those that conform to the basic format in which XML is designed. This means that elements must be nested properly, with no overlapping tags. Using HTML tags, properly nested elements would appear as such: <H1><BLINK>Sample Text</BLINK></H1>. As demonstrated, the first tag opened must also be the last tag closed. All opening tags must be paired with a corresponding closing tag. All empty tags must be marked as empty within the tag itself. All attribute values must be surrounded by quote marks. All entities must be properly formatted; that also means that symbols such as "&" or "<", which are usually recognized as parts of markup, must be written as "&amp;" and "&lt;", respectively. Veteran HTML authors should be familiar with the need to specify entities in this way. In fact, many of the qualities that are necessary for an XML document to be well-formed are also considered proper format for HTML, even though they are not strictly necessary there. The XML specification discusses well-formedness at some length. At its simplest, well-formedness can be equated to the quality of the document being written in the correct style.
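The well-formedness rules above can be tried out with any XML parser. The sketch below uses the non-validating parser in Python's standard library; the helper name is_well_formed and the sample strings are our own, chosen for illustration. It treats any parse error as the "fatal error" described earlier.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested = "<H1><BLINK>Sample Text</BLINK></H1>"       # first opened, last closed
overlapping = "<H1><BLINK>Sample Text</H1></BLINK>"  # tags overlap
unclosed = "<P>Sample Text"                          # no closing tag
unquoted = "<A HREF=page.html>link</A>"              # attribute value not quoted
bare_ampersand = "<P>Smith & Jones</P>"              # "&" not written as an entity

# escape() rewrites "&", "<", and ">" as "&amp;", "&lt;", and "&gt;".
escaped_text = escape("Smith & Jones < 1998")
escaped = "<P>" + escaped_text + "</P>"

for label, sample in [("nested", nested), ("overlapping", overlapping),
                      ("unclosed", unclosed), ("unquoted", unquoted),
                      ("bare ampersand", bare_ampersand), ("escaped", escaped)]:
    print(label, "->", is_well_formed(sample))
```

Only the properly nested sample and the escaped sample are accepted; each of the others breaks one of the rules listed above.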
Valid XML includes and conforms to a DTD (Document Type Definition). DTDs are a concept from SGML, but an XML DTD does not have to be as complex as an SGML DTD does. The DTD names the sections of the document and also specifies in what order they are to appear. The DTD is necessary because the document has to tell the processor or parser what set of rules it is following, figuratively speaking. Since each author can make his or her own set of tags, the document needs a declaration at the beginning to inform the receiving application how the tags are to be read. One of the reasons for making standardized XML-derived languages is so a processor can know in advance what sets of tags will be used by accessing a prearranged DTD. However, an organization using XML could make a company-wide DTD to be used in all its documents, so individual authors would not have to make their own. Writing a new DTD for each document would probably be too much of a hassle for most authors.
Not all XML documents are required to have a DTD. A non-validating processor can still read documents lacking a DTD, but omitting the DTD removes a certain amount of the specialization that makes XML so useful. DTDs can be declared internally or externally. An internal DTD includes all of the information needed to read the document. Figure 2 shows how an internal DTD might look.
The example in Figure 2 starts off with a statement declaring that the document was created using XML version 1.0 and that a DTD is included. The first line of the DTD states the type of document, in this case a MARC_RECORD. The second line names the element, MARC_RECORD, and then specifies the parts of that element: BOOK and SUBJECT HEADING. The sections or parts of the book are then listed. The statement #PCDATA (parsed character data) means that each field may contain text as its content. This example is by no means definitive; it is merely a demonstration of the structure of a DTD. Please consult the specification for further information.
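As a hedged illustration of the structure just described, the sketch below embeds a small internal DTD in a document and reads it with Python's standard expat parser. The element names (including SUBJECT_HEADING, written with an underscore because XML names cannot contain spaces) and the sample data are our assumptions, not a reproduction of the figure. The parser used here is non-validating, so the function undeclared_elements, a hypothetical helper of our own, performs only one small part of a validity check: it reports elements that are used but never declared. It does not check the order or content of elements.

```python
import xml.parsers.expat

# Hypothetical internal DTD and made-up record, for illustration only.
DOCUMENT = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE MARC_RECORD [
  <!ELEMENT MARC_RECORD (BOOK, SUBJECT_HEADING)>
  <!ELEMENT BOOK (AUTHOR, TITLE)>
  <!ELEMENT AUTHOR (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT SUBJECT_HEADING (#PCDATA)>
]>
<MARC_RECORD>
  <BOOK>
    <AUTHOR>Doe, Jane</AUTHOR>
    <TITLE>Sample Title</TITLE>
  </BOOK>
  <SUBJECT_HEADING>Sample subject</SUBJECT_HEADING>
</MARC_RECORD>"""


def undeclared_elements(xml_text):
    """Return sorted names of elements used in the body but not declared in the DTD."""
    declared, used = set(), set()
    parser = xml.parsers.expat.ParserCreate()
    # Called once for each <!ELEMENT ...> declaration in the internal DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once for each opening tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(xml_text, True)
    return sorted(used - declared)


print(undeclared_elements(DOCUMENT))

# A document can be well-formed and still use a tag the DTD never declared.
altered = DOCUMENT.replace("<TITLE>Sample Title</TITLE>", "<NOTE>Sample note</NOTE>")
print(undeclared_elements(altered))
```

The altered document still parses without a fatal error, which is the distinction between well-formed and valid: a validating parser would flag the NOTE element, while a non-validating one reads it without complaint.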
The DTD can also be attached to the XML document by means of an external DTD. Basically, the same information is contained in both the external and internal DTDs, but an external DTD is a separate file to which the code in the document refers. This might be used when an organization has its DTD for company use stored in a standard location on its network. Figure 3 is an example of a reference to an external DTD.
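A reference to an external DTD is a single declaration near the top of the document. The sketch below, again in Python, shows one possible form and uses the standard expat parser to pull out the location the document points to. The DTD address http://www.example.org/dtds/marc_record.dtd is a placeholder of our own, as are the element names and the helper doctype_reference; the parser only reads the reference and does not fetch the DTD file.

```python
import xml.parsers.expat

# The DTD address below is a placeholder, not a real file.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE MARC_RECORD SYSTEM "http://www.example.org/dtds/marc_record.dtd">
<MARC_RECORD>
  <BOOK><TITLE>Sample Title</TITLE></BOOK>
</MARC_RECORD>"""


def doctype_reference(xml_text):
    """Return the document type name and the external DTD location, if any."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["root"] = name
        found["system_id"] = system_id
        found["internal"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


reference = doctype_reference(DOCUMENT)
print(reference["root"], "->", reference["system_id"])
```

An organization could point every document at the same address in this way, so that individual authors never need to write the declarations themselves.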
Converting HTML to XML
At this stage of development, the structure of XML tags is essentially similar in format to HTML. Basic differences between the two include case sensitivity, mandatory inclusion of quotation marks, the pairing of non-empty tags, and the explicit marking of empty tags.[9] These differences will require some changes to ensure complete compatibility between HTML and XML. The changes stem mainly from the well-formedness constraints mentioned earlier.
First of all, XML is case sensitive: <P> and <p> are different tags, so an opening tag and its closing tag must match exactly. Since XML is still in its early stages, no guidelines have been made that specify which case HTML should adopt in order to conform. The practical rule so far is consistency: choose upper or lower case for tag names and keep it the same throughout the document.
Secondly, all HTML attribute values must be enclosed in quotation marks. At the moment, it is possible in HTML to include some attributes, such as the URL in the anchor tag <a href=http://www.mysite.com>, without the quotes. But using one quote without closing it will not work. In XML-compatible HTML, the tag must read <a href="http://www.mysite.com">.
Furthermore, all non-empty tags would have to be paired. That is to say, any tag intended to be paired must be paired. In HTML as it is now used, it is possible to leave many tags unclosed; in XML-compatible HTML they will all have to be closed. The paragraph tag is the best example: in HTML, <p> alone is acceptable, whereas XML requires both <p> and </p>.
Finally, in XML, it is necessary to mark all empty tags, which can simply be understood as tags not intended to be paired and represented with a slash at the end. For example, instead of <br> one would use <br/>. At this point in time, this would not be valid HTML.
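The four changes can be seen side by side in the short Python sketch below, which feeds an HTML-style fragment and an XML-compatible rewrite of it to a non-validating XML parser. The fragments and the helper is_well_formed are our own illustrations, not output from any conversion tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the XML parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Common HTML habits: mixed case, unquoted attribute, unclosed <P>, bare <BR>.
html_style = "<P>Visit <A HREF=http://www.mysite.com>my site</a><BR>"

# The same content with consistent case, quotes, paired tags, and a marked empty tag.
xml_style = '<p>Visit <a href="http://www.mysite.com">my site</a><br/></p>'

# Case alone is enough to break a document: <P> is not closed by </p>.
mixed_case = "<P>Sample Text</p>"

print("HTML style:", is_well_formed(html_style))
print("XML style: ", is_well_formed(xml_style))
print("Mixed case:", is_well_formed(mixed_case))

paragraph = ET.fromstring(xml_style)
link_target = paragraph.find("a").get("href")
print("Link target:", link_target)
```

Only the rewritten fragment is accepted, and once it is, the link target can be read out by name like any other piece of data.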
It is important to note that it is not necessary to implement these changes yet. Someday changes may be necessary to make HTML readable in XML browsers. At the moment it seems more likely that traditional Web browsers will be modified to read XML in much the same way current browsers read Java scripts embedded in HTML documents. Alternatively, one might use a plug-in, which would pop up when an XML declaration was encountered and then would open an XML browser. So, until we know how the two languages will interact, it is unnecessary to adapt our HTML for XML use, but it is reassuring to know that when and if the time comes, the task will not be very complex.
XML's Impact on Libraries
First of all, the advent of XML should cause the variety of materials available on the World Wide Web to increase. The breadth of scientific and research information made available will increase as specifications for new fields are made. Patrons already use the Web extensively as an information source, so understanding the developments in information availability is important. Additionally, the publishing industry should find the new developments to be of enormous benefit. XML will have greater versatility than PDF, and will be more economical and easier to implement than SGML. That means electronic publishing could benefit enormously from the new developments, which will naturally impact the already fast-changing field of electronic collections. Electronic reserves will also see a tangential effect, as professors can put greatly varied materials on the Web. Naturally, library Web designers may also find that new developments will offer new opportunities and challenges as we try to keep our digital collections up to date. The ever-increasing field of distance education may also see effects from the new variety of data potentially coming to the Web with XML.
For now, the best thing that librarians can do is to stay abreast of the developments in XML and other emerging Web technologies. It is good to know what is coming so we can be aware of new terms and trends. XML is certainly a significant advance in the handling of data and information in the Web environment, and anything that affects information will also impact the library field.
1. Dan Connolly and Jon Bosak. Extensible Markup Language (XML), 1997. Available at http://www.w3.org/XML.
2. Eve Maler and Steve DeRose. XML Pointer Language (XPointer), 1998. Available at http://www.w3.org/TR/1998/WD-xptr-19980303.
3. Eve Maler and Steve DeRose. XML Linking Language (XLink), 1998. Available at http://www.w3.org/TR/1998/WD-xlink-19980303.
4. James Clark and Steven Deach. Extensible Stylesheet Language (XSL), Version 1.0, 1998. Available at http://www.w3.org/TR/1998/WD-xsl-19980818.
5. Rohit Khare and Adam Rifkin. X Marks the Spot: eXtensible Markup Language opens the door to a motherlode of automated web applications, October 1997. Available at http://www.cs.caltech.edu/adam/papers/xml/x-marks-the-spot.html.
6. Patrick Ion and Robert Miner. Mathematical Markup Language 1.0 Specification, April 1998. Available at http://www.w3.org/TR/REC-MathML.
7. Peter Murray-Rust. Chemical Markup Language (CML), Version 1.0, January 1997. Available at http://www.venus.co.uk/omf/cml/intro.html.
8. Heather Empey. XML, April 1998. Available at http://www.slis.ualberta.ca/538/hempey.
9. Dan Connolly, Rohit Khare, and Adam Rifkin. The Evolution of Web Documents: The Ascent of XML, January 1998. Available at http://www.cs.caltech.edu/adam/adamlpapers/xml/ascent-of-xml.html.
Title Annotation: eXtensible Markup Language
Author: Exner, Nina; Turner, Linda F.
Publication: Computers in Libraries
Date: Nov 1, 1998