
Anatomy of an XML document.

Many XML-related languages, applications, and Web sites have appeared since XML development began in the mid-1990s. The pace of development is accelerating, too, but without properly constructed XML documents, none of them can be effective.

What Are XML Documents?

Documents have evolved from files created by text applications to electronic files of any size for any media (for example, text, audio, video, and graphics) created by any application. As noted, the XML 1.0 Recommendation states that a data object is an XML document "if it is well-formed, as defined in (Extensible Markup Language Recommendation). Each XML document has both a logical and a physical structure."

Expanding that definition, each XML document contains a unique instance of logically structured data, plus additional instructions for the parser and the application. The data instance portion contains data components with unique values. All the components and their respective values must conform to definitions in the language's conformance-checking mechanisms (in other words, a document type definition or schema). After being processed by an XML parser, the data in a document is structured and then passed to the application.

However, the W3C has drawn a bit of a boundary around XML documents when they refer to them as data objects. They are not quite the same as, say, Java objects, which can contain a combination of data and procedures to manipulate the data. With XML, manipulation is left to the parsers and applications. As you progress, you will begin to understand why those who think XML documents are just text documents (mostly because, on the surface, text is all they seem to contain) tend to underestimate XML's capability to structure and integrate data of all types.

XML Document Processing

XML documents can't do anything on their own. Applications must be written to process the data contained in them. Here is an overview of the process by which applications call for and use XML documents.


Used alone, the term application means a program or group of programs intended for end users and designed to access and manipulate data (in our case, the data in XML documents). Don't confuse this term with XML application, which is one of several terms used to refer to a derivative markup language created according to XML 1.0.

To clarify, consider the following comparison: A Web browser is an application that can access and display the information from XML documents. But the Synchronized Multimedia Integration Language (SMIL) is an XML application because it is its own language, developed using XML 1.0 specifications.

To process XML documents, applications must have XML parsers integrated within them.

XML Parsers

XML processors, more commonly called XML parsers, are reusable pieces of code that are integrated with computer applications. Application developers can write their own parsers, but they don't need to; several are available for free on the Internet, and developers can include them in their applications. Later, when an application calls for an XML document, the parser is activated, reads the XML document, and screens it on behalf of the application. Screening means the parser performs checks on the document, creates a data structure, and passes the structured data to the application. Figure 1.0 illustrates the process.
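
To make the hand-off from parser to application concrete, here is a minimal sketch in Python, which is our choice for illustration and not something this feature prescribes. It uses the ElementTree parser from Python's standard library. The sample data and the function name load_gems are hypothetical; only the element names come from the diamonds example used later in this feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names come from the article.
DOCUMENT = """<diamonds>
  <gem><name>Gem A</name><carats>1.5</carats></gem>
  <gem><name>Gem B</name><carats>2.0</carats></gem>
</diamonds>"""


def load_gems(xml_text):
    """Ask the parser to screen the document, then walk the tree it built."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well formed
    return [
        (gem.findtext("name"), float(gem.findtext("carats")))
        for gem in root.findall("gem")
    ]


gems = load_gems(DOCUMENT)
print(gems)
```

The application code never touches the raw characters of the document. It works only with the tree that the parser hands over.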

XML parsers are of two general types: those that check only for well-formedness and those that check for well-formedness and validity. The second type, which consults DTDs or schemas to check the document for conformance to the respective XML-related language, is called a validating parser. Parsers generally contain four basic types of operators:

A content handler, which turns the document's string of characters into a sequence of events that are then translated into a treelike data structure (Figure 1.0), which it then provides to the application.

An error handler that determines the nature of any errors in the XML document and then acts accordingly.

A DTD and schema handler, which examines the DTD or schema and then checks the XML document for conformity with the DTD or schema. This operator only appears in validating parsers.

An entity resolver, which incorporates any data referenced within the XML document's referential markup that is located outside the XML document entity itself or that is not intended to be parsed in a customary manner.
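
Two of these operators, the content handler and the error handler, can be sketched with the xml.sax module from Python's standard library. This is an illustration under our own assumptions, not the design of any particular parser: the class names EventLogger and ErrorCollector and the function parse_events are hypothetical, and the parser behind xml.sax does not validate, so no DTD and schema handler appears.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Content handler: records the events the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():  # skip whitespace-only character data
            self.events.append(("text", content.strip()))


class ErrorCollector(xml.sax.ErrorHandler):
    """Error handler: notes each fatal error, then stops the parse."""

    def __init__(self):
        self.fatal = []

    def fatalError(self, exception):
        self.fatal.append(exception.getMessage())
        raise exception


def parse_events(xml_bytes):
    logger, errors = EventLogger(), ErrorCollector()
    try:
        xml.sax.parseString(xml_bytes, logger, errors)
    except xml.sax.SAXParseException:
        pass  # already recorded by the error handler
    return logger.events, errors.fatal


events, fatal = parse_events(
    b"<diamonds><gem><name>Gem A</name></gem></diamonds>"
)
print(events)
```

Here the application receives the raw event sequence. A tree-building parser would consume the same events to assemble the treelike structure described above.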

Document Errors

Parsers occasionally encounter errors in XML documents. The W3C classifies errors in two ways: nonfatal and fatal errors. A nonfatal error is a violation of the rules of XML 1.0. For these errors, the W3C does not define specific penalties. They leave that up to the respective parser and application developers. They just say that "conforming software may detect and report an error and may recover from it."

Fatal errors are a different matter. The W3C stipulates that a conforming XML parser must be able to detect fatal errors and must then report them to the application, which can then produce its own error message. It is up to the application developer to code that in. The W3C goes on to say if a parser detects a fatal error, it may continue processing, but only to look for more errors; it is not allowed to continue normal content processing.
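
As a rough sketch of how an application might "code that in", the following Python fragment catches the fatal error reported by the standard library's ElementTree parser and turns it into the application's own message. The function name parse_or_report and the wording of the message are our own, hypothetical choices.

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Return (root, None) on success or (None, message) on a fatal error."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return None, f"fatal error at line {line}, column {column}: {err}"


# The end tag </name> is missing, so the document is not well formed.
root, message = parse_or_report("<gem><name>Gem A</gem>")
print(message)
```

Note that no partial tree is returned once the error is detected, which matches the rule that normal content processing must not continue.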

(For a more comprehensive explanation of errors and fatal errors, consult the XML 1.0 Recommendation.)

The Structure of XML Documents

XML 1.0 states that XML documents have two kinds of structure: a logical structure and a physical structure. We will discuss the basic physical structure of an XML document in this feature together with the logical structure.

This is because:

* It's the easiest way to give you an idea of how the languages and their respective documents are supposed to work; that is, to show you how to create and structure components to achieve your objectives.

* The logical approach provides a good model for understanding, comparing, and even combining XML-related vocabularies and documents.

The physical structure of XML documents tends not to stray far from the basics we'll show you in this feature. Before we begin discussing the logical structure, though, let's fine-tune three of our fundamental definitions. Here we've paraphrased the text, markup, and character data definitions listed by the W3C in XML 1.0:

* Text consists of intermingled markup and character data.

* Markup consists of the following:

In the prolog: XML declarations, processing instructions, document type declarations, comments, and any white space. In the data instance (that is, within the scope of the root element): start tags, end tags, empty element tags, attributes, entity references, character references, and CDATA section delimiters.

* Character data is all text that is not markup.
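
The split between markup and character data can be seen by asking a parser for only the character data. The sketch below uses Python's standard ElementTree module; the sample text and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags and the entity reference &amp; are markup; the rest is character data.
TEXT = "<gem><name>Gem A</name><cut>round &amp; brilliant</cut></gem>"

root = ET.fromstring(TEXT)

# itertext() yields the character data of each element, in document order,
# with entity references already replaced by the parser.
character_data = list(root.itertext())
print(character_data)
```

The tags have disappeared from the result, and the entity reference has been resolved to the character it stands for.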

The Logical Structure

The basic logical structure of an XML document consists of the following:

* The prolog

* The data instance (that is, the root element and any elements contained in the root element)

The Prolog

The prolog is a preface or introduction to the XML document. It is the first major logical component of an XML document and, because of its content, must be inserted prior to the next major logical component, the data instance. The prolog provides initial advice to the application, the parser, and any human reader, about the document and, especially, prepares the parser to better handle the data instance.

The prolog may contain up to five types of components:

* An XML declaration

* Processing instructions

* A document type declaration

* Comments

* White space

Refer to the simple XML document gems-excerpt-02.xml in Figure 2.0. It has a five-line prolog right at the beginning, consisting of an example of each of the five components listed previously. In fact, there are two comments. The use of white space may not be so obvious to you, but if there were no spaces or end-of-line indicators in the prolog of this document, we would have trouble recognizing the components easily and quickly; they would all run together.
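
Because Figure 2.0 is not reproduced here, the sketch below uses a stand-in document that we assembled from the details given in this feature (the style sheet diamonds2.css, the DTD diamonds2.dtd, and two comments). Treat the document text, the gem data, and the name top_level_components as assumptions. The code uses Python's standard xml.dom.minidom module to list the components that sit at the top level of the document.

```python
from xml.dom import Node, minidom

# Stand-in for Figure 2.0, assembled from details in the article text.
DOC = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="diamonds2.css"?>
<!DOCTYPE diamonds SYSTEM "diamonds2.dtd">
<!-- Gems Version 1 - Space Gems, Inc. -->
<!-- filename: gems-excerpt-02.xml -->
<diamonds><gem><name>Gem A</name></gem></diamonds>
"""

KINDS = {
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "root element",
}


def top_level_components(xml_bytes):
    """Name each node found directly under the document, in order."""
    doc = minidom.parseString(xml_bytes)
    return [KINDS.get(node.nodeType, "other") for node in doc.childNodes]


components = top_level_components(DOC)
print(components)
```

Everything before the root element in the result belongs to the prolog. The XML declaration and the white space do not show up as nodes in this particular API, although both are part of the prolog.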

The XML Declaration

The XML 1.0 Recommendation suggests that every XML document should begin with an XML declaration that states, basically, that the document is indeed an XML document. The declaration (also called the header) must be on the document's first line. XML 1.0 also states that all prolog components are optional, but that a well-formed XML document should begin with an XML declaration.

We strongly recommend that you include an XML declaration at the beginning of every XML document to help ensure that it is well formed.

Let's examine the XML declaration statement from Figure 2. The basic tag for an XML declaration statement is <?xml ... ?>. XML 1.0 specifies that xml must be lowercase. The XML declaration is actually a kind of processing instruction (discussed next); that is, it talks to the application, not to the parser. What it says, in a way, is "activate the XML parser; this is an XML document" and then provides additional information about the document for use by the application and the parser. The information appears in three pseudo-attributes: the XML version number (version="1.0"), the document's language encoding designation (encoding="UTF-8"), and the standalone pseudo-attribute specification (standalone="no").

In the XML declaration, the XML version pseudo-attribute refers to the version of the XML Recommendation whose specifications the document has been written to. It is mandatory to state the version number. Currently, there is only Version 1.0, corresponding to the W3C's XML Recommendation 1.0, so 1.0 is the value that must be specified.

The encoding pseudo-attribute is optional. XML supports several character sets listed on the Internet Assigned Numbers Authority's Official Names for Character Sets Web site. Several values can be specified for the encoding pseudo-attribute. If you do not specify a value, the parser will use the UTF-8 default value.

The third part of the declaration, the standalone pseudo-attribute, is also optional. If the document will be parsed by itself (that is, if there will be no need to refer to any external entities like DTDs or schemas that contain declarations for the components in the XML document), the standalone value should be yes. If there are declarations in such external entities, however, and they must be enlisted by the XML parser before it can process the document, specify no. When the pseudo-attribute is omitted from a document that does have external markup declarations, XML 1.0 tells the parser to assume no.
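
One way to see the three pseudo-attributes as a parser reports them is through the XmlDeclHandler hook of Python's standard xml.parsers.expat module. This is only a sketch; the function name read_declaration and the sample documents are hypothetical.

```python
import xml.parsers.expat


def read_declaration(xml_bytes):
    """Return the pseudo-attributes of the XML declaration, if there is one."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 (yes), 0 (no), or -1 (omitted).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found


full = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><diamonds/>'
)
minimal = read_declaration(b'<?xml version="1.0"?><diamonds/>')
print(full)
print(minimal)
```

In the second call only the mandatory version is present, so the optional pseudo-attributes come back as None.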

Processing Instructions

The second line of Figure 2 is an example of a processing instruction (PI). PIs are instructions passed by the XML processor to the application and, so, are rather frowned on by XML purists. Processing instruction syntax looks similar to the following:

<?piname pseudo-attributes?>

As in the XML declaration statement, a single question mark appears at the beginning and the end of a processing instruction. The piname, also called the PI name or PI target, tells the application what type of PI it is. It is up to the application developers to code in which PI targets will be recognized. The second line of Figure 2 is a common PI that is recognized by browsers like Internet Explorer and Netscape Navigator. The PI name is the fairly common xml-stylesheet; we're telling the application that we are associating a style sheet with this document. The type pseudo-attribute tells the application to look for a text-type cascading style sheet that will instruct it how to display the components found in the XML document. The style sheet uniform resource identifier (URI) is simply diamonds2.css, meaning the name of the style sheet document is diamonds2.css and it is found locally on the system because the URI contains no additional pathing information.

A PI similar to the following:

<?xml-stylesheet type="text/xsl" href="gems1.xsl"?>

points the application to a different type of style sheet, one that will help transform an XML document to an HTML document.

If you are coding any other type of PI, don't use PI names beginning with the characters "XML," "xml," or similar. They have been reserved by the W3C for future XML standardization.
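
The division of labor described above, where the parser passes the PI along and the application decides what it means, might look like the following sketch. It uses the ProcessingInstructionHandler hook of Python's standard xml.parsers.expat module. The function name stylesheet_instructions is hypothetical, and the type value text/css is our reading of "text-type cascading style sheet", since Figure 2 is not reproduced here.

```python
import re
import xml.parsers.expat


def stylesheet_instructions(xml_bytes):
    """Collect (target, pseudo-attributes) for every PI in the document."""
    found = []

    def on_pi(target, data):
        # The parser hands over the data as one string; splitting it into
        # pseudo-attributes is the application's job.
        pseudo_attributes = dict(re.findall(r'([\w-]+)="([^"]*)"', data))
        found.append((target, pseudo_attributes))

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(xml_bytes, True)
    return found


instructions = stylesheet_instructions(
    b'<?xml version="1.0"?>\n'
    b'<?xml-stylesheet type="text/css" href="diamonds2.css"?>\n'
    b"<diamonds/>"
)
print(instructions)
```

The XML declaration on the first line is not reported through this hook; only the xml-stylesheet PI is.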

The Document Type Declaration

XML does not require the inclusion of the document type declaration in all circumstances. The document type declaration (also called a DOCTYPE definition) tells the parser what function the document's author expects the document to play: that is, it tells the parser what type of document it is, then indicates to the parser how the document's components will be defined and related to one another. Let's look at the declaration on the third line of Figure 2.


The opening keyword DOCTYPE tells the XML parser that this statement is indeed a document type declaration. The name diamonds indicates that the class the document belongs to is diamonds; that is, the document is a diamonds type of document. The class name is arbitrarily specified by the document developer and often coincides with the name of the document element. For example, a developer who is writing a book might name the class of the basic document book and then import other XML documents, whose class names might be chapter, section, or whatever, into the book document.

Let's deviate from the Figure 2 example for a moment. If a developer chooses to provide the appropriate component declarations and then have the parser validate the document as well as check the document for well-formedness, the DOCTYPE definition statement is the place where the declarations would be inserted. For the Figure 2 document components, the document type declaration, complete with the inserted declarations, would resemble the following:

<!DOCTYPE diamonds [
  <!ELEMENT diamonds (gem)*>
  <!ELEMENT gem (name, carats, color, clarity, cut, cost, reserved?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT carats (#PCDATA)>
  <!ELEMENT color (#PCDATA)>
  <!ELEMENT clarity (#PCDATA)>
  <!ELEMENT cut (#PCDATA)>
  <!ELEMENT cost (#PCDATA)>
  <!ELEMENT reserved EMPTY>
]>

Notice that if the DOCTYPE definition (to use the alternate name) lists these declarations within its own confines, the developer must place the declarations between an opening square bracket and a closing square bracket. Doing so creates an internal DTD. If such an internal DTD is constructed and the document relies on no external declarations, the standalone pseudo-attribute in the XML declaration can be set to standalone="yes".

Returning to the Figure 2.0 example, the keyword SYSTEM indicates to the parser that the declarations for the document's components will not be found in the Figure 2.0 document, but within an external document. Further, the parser should be ready to look for that external document on the local system and then check the Figure 2.0 document for validity against the declarations in the external document. But which external document and where is it? That is specified next in the URI that appears in quotation marks. The parser is to look for an external document named diamonds2.dtd.

If that external document is located elsewhere, for example on a remote system, the full path to the document would have to be specified in the URI instead of just the filename.
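
The parts of the declaration just described (the class name and the URI that follows SYSTEM) can be read back with Python's standard xml.dom.minidom module. In this sketch the declaration line is our reconstruction from the description above, and the function name describe_doctype is hypothetical. The parser used here does not validate and, by default, does not fetch the external document.

```python
from xml.dom import minidom


def describe_doctype(xml_bytes):
    """Return the name and system URI from the document type declaration."""
    doctype = minidom.parseString(xml_bytes).doctype
    if doctype is None:  # the declaration is optional
        return None
    return {"name": doctype.name, "system_id": doctype.systemId}


described = describe_doctype(
    b'<!DOCTYPE diamonds SYSTEM "diamonds2.dtd">\n<diamonds/>'
)
print(described)
```

A document without a document type declaration yields None, which an application could treat as "check for well-formedness only".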

Questions & Answers

Q: So you're saying that the declarations can be located in the XML document or in that other external document, right?

A: Not quite. We realize that, at this point, we have left you with that impression. However, declarations can exist in both places and work together. Your XML document may contain extra components in addition to those declared in the external document. Or maybe, for this document, you want to alter one or more of the component declarations from those in the external document. To do so, you would declare the additional or updated components right there in the Figure 2 document, in what is termed an internal subset, and rely on the external document (that is, the external subset) to provide the declarations for the rest of the components. The combination of the internal subset and the external subset is what you would correctly call the document type definition. In other words, both portions would form the complete DTD.

Even though document type declarations are optional, one is required if the developer intends the parser to validate the document by internal or external markup declarations. As a best practice to avoid ambiguity, we recommend always including a document type declaration in the prolog.
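
To see which components an internal subset declares, an application can listen for element declarations as the parser reads them. The sketch below uses the ElementDeclHandler hook of Python's standard xml.parsers.expat module on a shortened, hypothetical version of the diamonds DTD. Keep in mind that expat is a non-validating parser: it reports the declarations but does not check the document against them, and with its default settings it does not read an external subset.

```python
import xml.parsers.expat

# Shortened, hypothetical internal subset based on the diamonds example.
DOC = b"""<?xml version="1.0" standalone="yes"?>
<!DOCTYPE diamonds [
<!ELEMENT diamonds (gem)*>
<!ELEMENT gem (name, carats)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT carats (#PCDATA)>
]>
<diamonds><gem><name>Gem A</name><carats>1.5</carats></gem></diamonds>
"""


def declared_elements(xml_bytes):
    """List the element names declared in the document's internal subset."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler also receives the content model; we keep only the name.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(xml_bytes, True)
    return names


declared = declared_elements(DOC)
print(declared)
```

Actual validation against the combined internal and external subsets requires a validating parser, which Python's standard library does not include.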


Comments

The purpose of adding comment statements to an XML document is not to provide instruction to the parser or to the application, because comments are ignored by the parser. Here are three purposes for comments:

* To say something to anyone who will later examine the XML document

* Combined with white space, to break a document into sections

* To temporarily disable sections of the document

XML uses the same comment syntax as HTML. The following are two examples:
<!-- Gems Version 1 - Space Gems, Inc. -->
<!-- filename: gems_excerpt_04.xml -->

Properly constructed, comments can be placed almost anywhere in a document outside other markup; however, do not place a comment before the XML declaration statement, because XML 1.0 requires the declaration, when present, to come first.

After you have begun a comment, be careful not to use the literal string -- (that is, two hyphens in a row) anywhere in it except at the very end. The XML parser will otherwise see the string and presume that the comment has ended, then create errors based on any remaining characters in the rest of the intended comment.
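
These comment rules are easy to try out. The sketch below uses Python's standard ElementTree parser, which is built on expat; the helper name is_well_formed and the sample strings are hypothetical. It shows a legal comment, a comment containing two hyphens in a row, a comment placed ahead of the XML declaration, and the fact that this parser leaves comments out of the tree it builds.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


GOOD = "<!-- Gems Version 1 - Space Gems, Inc. --><diamonds/>"
DOUBLE_HYPHEN = "<!-- Gems Version 1--Space Gems, Inc. --><diamonds/>"
BEFORE_DECLARATION = '<!-- too early --><?xml version="1.0"?><diamonds/>'

results = [
    is_well_formed(GOOD),
    is_well_formed(DOUBLE_HYPHEN),
    is_well_formed(BEFORE_DECLARATION),
]

# By default the comment is skipped, so the root has one child, not two.
root = ET.fromstring("<diamonds><!-- note to readers --><gem/></diamonds>")
child_count = len(root)
print(results, child_count)
```

Other parsers may word their error messages differently, but a conforming parser should reject the second and third documents as well.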

From XML in 60 minutes a day. Wiley Tech Publishing. ISBN 0-471-43254-1

Book Browser

sendmail Cookbook

The "sendmail Cookbook" by Craig Hunt provides step-by-step solutions for the administrator who needs to solve configuration problems quickly. Suppose you need to configure sendmail to relay mail for your clients without creating an open relay that will be abused by spammers. A recipe in the Cookbook shows you how to do just that. No more wading through pages of dense documentation and tutorials to create a custom solution; just go directly to the recipe that addresses your specific problem.

The fact that the "sendmail Cookbook" provides quick answers to common problems is of critical importance to system administrators, says Hunt. "The one thing that most system administrators do not have enough of is time," he explains. "They're swamped with other work and have very little time to devote to sendmail. This book is for the busy administrator who needs to solve a problem fast."

Each recipe in the "sendmail Cookbook" outlines a configuration problem, presents the configuration code that solves that problem, and then explains the code in detail. The discussion of the code is critical because it provides the insight administrators need to tweak the code for their own circumstances.

"Readers should understand that the recipes in this book are complete solutions." The "sendmail Cookbook" begins with an overview of the configuration languages, offering a quick how-to for downloading and compiling the sendmail distribution. This is followed with a baseline configuration recipe upon which many of the subsequent configurations, or recipes, in the book are based. Recipes in the following chapters stand on their own and offer solutions for properly configuring important sendmail functions such as:

--Delivering and forwarding mail

--Routing mail

--Controlling spam

--Strong authentication

--Securing the mail transport

"sendmail Cookbook" is more than just a new approach to discussing sendmail configuration. The book also provides lots of new material that doesn't get much coverage elsewhere: STARTTLS and AUTH are given entire chapters, and LDAP is covered in recipes throughout the book.


"Learning XSLT", Michael Fitzgerald

"Learning XSLT" helps developers find a clear path into this technology by explaining XSLT in detail, while realising that much of XSLT isn't obvious, even to experienced programmers. Readers will explore a broad range of XSLT features, from simple templates to the more obscure corners of the technology, practising the techniques along the way. The book is rich with hands-on examples to help readers begin doing useful work with XSLT the same day they start reading it.

The focus of "Learning XSLT" is getting readers up to speed quickly. The book contains little reference material. For that, Fitzgerald recommends O'Reilly's "XSLT" by Doug Tidwell. As Fitzgerald explains, "Learning XSLT" will get readers off to a running start. It's not bogged down in getting all the excruciating details right; it's all about getting XSLT to do stuff immediately. It will be hard for a reader not to feel successful in the first chapter.

"Learning XSLT" moves smoothly from the simple to the complex, illustrating all aspects of XSLT 1.0 through step-by-step examples. Thorough in its coverage of the language, the book makes few assumptions about what readers may already know. The book covers XSLT's template-based syntax, how XSLT templates work with each other, and XSLT variables. "Learning XSLT" also explains how the XML Path Language (XPath) is used by XSLT and provides a glimpse of what the future holds for XSLT 2.0 and XPath 2.0. The ability to transform one XML vocabulary to another is fundamental to exploiting the power of XML. "Learning XSLT" is a carefully paced, hands-on introduction to the technology that will have readers understanding and using XSLT in no time, even if it's their first
COPYRIGHT 2004 A.P. Publications Ltd.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Teach-In
Author:McKinnon, Al
Publication:Software World
Date:Jan 1, 2004