XML - BASICS


   
Intro

 

XML, or eXtensible Markup Language, looks similar to HTML, the markup language used for the various 'browser pages' on the World Wide Web. It uses the same sort of 'tags', delimited by the less-than and greater-than signs, called angle brackets (< is the left angle bracket, > the right angle bracket). In HTML, for example, a horizontal line is specified as <HR>. In XML, 'tags' are referred to as elements (strictly, an element is the start tag, the end tag, and everything between them). Elements may have arguments/parameters, called attributes in both HTML and XML, but XML requires that every attribute value be placed in quotes (double quotes are customary, though single quotes are also allowed), for example <HR SIZE="2">.

 
XML as 'mark-up'

 

For some time, XML was touted as a sort of HTML with custom extensions. People saw that and shrugged their shoulders. It turns out there was more to it, though the 'early adopters' were not very clear themselves just what that was; they hadn't really been told. Some, in fact, insist that XML isn't a markup language at all, even though it is called one, but rather a meta-language (a language used to create other languages), defined as a subset of SGML (a long-standing standard meta-language).

The XML specification, then, doesn't set aside any particular element or attribute names (apart from names beginning with 'xml', which are reserved), unlike HTML, which clearly does. You can call the elements and attributes whatever you want. That is what makes it a meta-language: you use XML to build your own set of elements and attributes. Elements may carry parameters/attributes, as seen already, but they don't have to.

 
Empties

 

Something like the HR tag in HTML doesn't need a closing tag; it's an empty tag. XML elements can be empty as well. The requirement, however, is that every element be written as a container: there must be a start 'tag' and a closing 'tag'. So one can close an HR that way, <HR></HR>, or use the shorthand, <HR />. To repeat: for XML, except for particular proprietary implementations (generally based on namespaces), which elements and attributes you use is entirely up to you.

A tag with some data: <tag1> data </tag1>
An 'empty' tag, without data: <tag1 />
With a parameter/attribute and data: <tag1 parm1="10"> something </tag1>
Attribute only: <tag1 parm1="10" />
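
The element forms listed above can be checked with any XML parser. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the two deliberately broken samples at the end are illustrative additions, not part of the original list.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as a single well-formed XML element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    '<tag1> data </tag1>',                  # data between start and end tag
    '<tag1 />',                             # empty element, shorthand form
    '<tag1 parm1="10"> something </tag1>',  # attribute plus data
    '<tag1 parm1="10" />',                  # attribute only
    '<tag1 parm1="10"> something <tag1>',   # broken: closing slash missing
    '<tag1 parm1=10 />',                    # broken: attribute value not quoted
]

for s in samples:
    print(is_well_formed(s), s)
```

The first four samples should be accepted and the last two rejected, which is the "every element is a container" rule in action.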

 

HR, again using it as the example, can also take a "noshade" attribute. Noshade is an HTML attribute that does not use any value; it might be called an empty attribute. XML does not allow an attribute without a value, so this translates to XML as noshade="noshade". To make an empty HTML tag into a proper XML element, you use the trailing slash. That's what you do for the IMG tag, and it's the same for the BR tag: you write <BR> as <BR />. Adding a space before that last slash might help older browsers. If there's a real problem for a particular kiosk or network that you must use, you might find that explicitly closing empty elements will work, so that <HR> is literally written as <HR></HR>, again.

The above applies to XHTML. It also comes up with XSLT transform programs, where HTML output is being specified but under XML rules; that is, as XHTML. In such an XSLT program/template, you might need to write <BR /> rather than <BR>, for example, or else you would get an error when running the transform.
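
The difference between HTML habits and XML rules for empty tags and value-less attributes can be seen by feeding both styles to an XML parser. This is only a sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the XML parser accepts `text`."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    '<HR noshade />':              parses('<HR noshade />'),              # HTML-style attribute, no value
    '<HR noshade="noshade" />':    parses('<HR noshade="noshade" />'),    # XML-style attribute
    '<HR noshade="noshade"></HR>': parses('<HR noshade="noshade"></HR>'), # explicit closing tag
    '<BR>':                        parses('<BR>'),                        # HTML-style, never closed
    '<BR />':                      parses('<BR />'),                      # trailing slash, with space
    '<BR/>':                       parses('<BR/>'),                       # trailing slash, no space
}

for text, ok in checks.items():
    print(ok, text)
```

To the XML parser, <BR /> and <BR/> are the same thing; the space matters only to older HTML browsers.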

 
Root Element

 

Now XML by itself might look something like:

<!-- comment -->
<root>
   <author>Adams
      <book price="5.95">Biography
      </book>
      <avail />
   </author>
</root>

 

That is, XML is just a list of nested elements. Every XML document needs a root node, even if it's not actually called root, and there can only be one such root node. One can also add comments to XML with the same opening, <!--, and closing, -->, used for comments in HTML.

There's some confusion about 'root node'. It can be called <mable>, if you like, but there can't be another element at the same level; it has to be the 'top of the outline'. If it isn't, the various XML parsers will return an error. You'll notice, though, that XML refers to elements, so the root is the root element. The confusion comes from XPath, where a system node is created above that element and also called the root; but that one is called the root node. So if one just keeps calling this topmost element in XML the root element, which is what it is, there shouldn't be that confusion.
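
The single-root rule is easy to see in practice. A small sketch, assuming Python's standard xml.etree.ElementTree module; the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

one_root = "<mable><author>Adams</author></mable>"
two_roots = "<mable>one</mable><mable>two</mable>"   # second top-level element

# A document with exactly one top-level element parses, whatever it is called.
root = ET.fromstring(one_root)
print(root.tag)

# A second element at the top level makes the parser give up.
try:
    ET.fromstring(two_roots)
    second_root_accepted = True
except ET.ParseError as err:
    second_root_accepted = False
    print("rejected:", err)
```

Note that the root element need not be named root; the parser only cares that there is exactly one.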

 
Database

 

It may also seem that XML, as a 'table of contents' type list, a tree, could lend itself to database use. Data held in a relational set of tables could, for example, be translated into a tree, sent over the wire as XML, and then re-parsed or reassembled back into relations that lend themselves more easily to standard SQL data retrieval (this, in fact, is partly what Microsoft's SQLXML is designed to do). XML could serve as an open, non-proprietary standard for encoding literally millions of pieces of information in this tree, table of contents, form, even if that isn't necessarily the best way to store large amounts of information permanently for fast retrieval. Trees, it might be noted, allow for a changeable hierarchy in a way that is more difficult, or impossible, with fixed-table relational databases that don't use self-referential tables. Still, most would prefer to fit XML into a table/relational format.
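
The round trip described above (tables to tree, tree over the wire, tree back to tables) might be sketched as follows. This assumes Python's standard xml.etree.ElementTree module; the table layouts, the id and name attributes, and the function names rows_to_xml and xml_to_rows are all hypothetical, and the sketch has nothing to do with how SQLXML itself works.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows from two related tables.
authors = [(1, "Adams")]                    # (author_id, name)
books = [(10, 1, "Biography", "5.95")]      # (book_id, author_id, title, price)

def rows_to_xml(authors, books):
    """Nest each author's books under the author and serialize to XML text."""
    root = ET.Element("root")
    for author_id, name in authors:
        a = ET.SubElement(root, "author", id=str(author_id), name=name)
        for book_id, owner_id, title, price in books:
            if owner_id == author_id:
                b = ET.SubElement(a, "book", id=str(book_id), price=price)
                b.text = title
    return ET.tostring(root, encoding="unicode")

def xml_to_rows(text):
    """Flatten the tree back into the two row lists."""
    root = ET.fromstring(text)
    author_rows, book_rows = [], []
    for a in root.findall("author"):
        author_id = int(a.get("id"))
        author_rows.append((author_id, a.get("name")))
        for b in a.findall("book"):
            book_rows.append((int(b.get("id")), author_id, b.text, b.get("price")))
    return author_rows, book_rows

wire = rows_to_xml(authors, books)          # what would travel over the wire
print(wire)
print(xml_to_rows(wire) == (authors, books))
```

The foreign key (author_id) disappears in the XML form because the nesting carries the relationship; it is rebuilt from the nesting on the way back.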

 
Web Page

 

A particularly important ramification of XML is that a proper XML version of HTML, XHTML, can be created when generating a page with XSLT (and by other methods, too). At a minimum, all the tags that were opened are closed and accounted for. You could make sure of this by hand as well, but pages generated this way will be XHTML. In that sense only, the web page will be 'valid' (well-formed, in XML terms). That means the entire web page can be held in memory as a tree (the Document Object Model, or DOM), using javascript and various built-in classes or program objects. Once it is stored as an internal hierarchy, changes can be made to that tree, and they will be reflected in the web page. Content which didn't originally exist in the page can be loaded from the server and 'suddenly' appear on the page, simply by adding the new text or info from the server to the internal tree structure of the web page, using javascript or vbscript programs/scripts on the web page itself, for example. It's likely, as well, that a query of the server will send back that new information for the web page in XML format.
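
The idea of holding a page as a tree and grafting server content onto it can be sketched outside the browser as well. The example below assumes Python's standard xml.dom.minidom module, which offers DOM-style calls similar to the ones a browser script would use; the page markup, the "news" id, and the server reply are invented for illustration, and no real network request is made.

```python
from xml.dom import minidom

# A tiny well-formed page, parsed into an in-memory tree.
page = minidom.parseString(
    '<html><body><div id="news">Loading</div></body></html>'
)

# Stand-in for an XML reply from the server (hypothetical content).
reply = minidom.parseString("<item>New headline</item>")

# Locate the element to update by scanning for its id attribute.
target = next(
    d for d in page.getElementsByTagName("div")
    if d.getAttribute("id") == "news"
)

# Copy the reply's element into the page's document, then attach it.
new_node = page.importNode(reply.documentElement, True)
target.appendChild(new_node)

print(page.documentElement.toxml())
```

None of this would be possible if the page were not well-formed: parseString would fail before any tree existed to modify.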

 
Element/Attribute Design

 

Whether to use attributes or to put data/text between opening and closing tags is probably a matter of taste. In the author/book example above, it might even be better if price were a separate tag rather than an attribute. So instead of <book price="5.95">Biography</book> it would be <book>Biography</book> <price>5.95</price>. The avail element is just a flag in this case, an empty tag with no attributes, whose presence might suggest that the book is available.
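
Either design carries the same information; what changes is how a program reads it back. A brief sketch, assuming Python's standard xml.etree.ElementTree module and two hand-written variants of the author/book example:

```python
import xml.etree.ElementTree as ET

# Price as an attribute of book, plus the empty avail flag.
as_attribute = ET.fromstring(
    '<author>Adams<book price="5.95">Biography</book><avail /></author>'
)

# Price as its own element, and no avail flag.
as_element = ET.fromstring(
    '<author>Adams<book>Biography</book><price>5.95</price></author>'
)

price_from_attribute = as_attribute.find("book").get("price")
price_from_element = as_element.find("price").text

# An empty flag element is tested by presence, not by content.
available_first = as_attribute.find("avail") is not None
available_second = as_element.find("avail") is not None

print(price_from_attribute, price_from_element)
print(available_first, available_second)
```

One practical difference: an attribute can appear only once per element and holds plain text, while a child element can repeat or grow its own substructure later.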

Here's another example:

<?xml version="1.0"?>

<root>
   <group>Group1
      <Italic>
         <Block>Heading2
         </Block>
      </Italic>
   </group>
</root>

 

The custom is to include the XML declaration, which gives the XML version, at the top. It's recommended.

You can also see where it might be easier to determine what should be a tag and what should more properly be an attribute. In this case, the thinking could be tied to how the XML will translate into HTML. In HTML, italic is a nested tag, a structural element, rather than an attribute, so Italic could translate directly into an HTML <i> or <em>, and the structure of the XML would indicate how it might appear once translated to HTML.

But consider where an attribute might more easily be used as HTML:

<?xml version="1.0"?>

<root>
   <group bgcolor="#f0f0f0" width="100%">Group1
      <font size="-1">
         <italic>
            <block>Heading2
            </block>
         </italic>
      </font>
   </group>
</root>

 

If one were of a mind to do so, it might make more sense to place "width" and "size" as attributes, as done here, rather than as tags. In HTML these would typically be used as attributes, not as structural or possibly nested elements. It's also a rule that all attribute values have to be quoted, as seen above (double quotes are customary; single quotes are also allowed). You cannot include spaces in element or attribute names, and names are case-sensitive, so opening and closing tags must use the same case (e.g. <Able> </able> would not match). And a literal ampersand should always be written as the entity &amp; (similarly the less-than sign, the left or 'open' angle bracket, should be &lt; instead of <, and so on).
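
The rules in the paragraph above (quoting, names, case, and entities) can each be tried against a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and the escape function from xml.sax.saxutils; the helper name parses and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def parses(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "Adams & Sons < Ltd"      # made-up text containing & and <
escaped = escape(raw)           # rewrites & and < (and >) as entities

results = {
    "raw ampersand":       parses("<name>" + raw + "</name>"),
    "escaped text":        parses("<name>" + escaped + "</name>"),
    "case mismatch":       parses("<Able></able>"),
    "space in name":       parses("<my tag></my tag>"),
    "unquoted value":      parses("<font size=-1></font>"),
    "double-quoted value": parses('<font size="-1"></font>'),
    "single-quoted value": parses("<font size='-1'></font>"),
}

for rule, ok in results.items():
    print(ok, rule)

# The parser turns the entities back into the original characters.
round_trip = ET.fromstring("<name>" + escaped + "</name>").text
print(round_trip == raw)
```

Only the escaped text and the two quoted attribute forms should be accepted. When building XML from arbitrary text, running it through a function such as escape is safer than replacing characters by hand.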


 
 
 More to read:

ZVON Basics: Simple examples of basic XML
Inet.com Basics: Multi-page simple tutorial on XML
Dev Shed: Ad-laden basics and links
W3 Schools: W3schools XML tutorial
XML News: Nicely presented, more simple basics and links
XML FAQ: Suggesting a more formal presentation and standards links
Microsoft lead page: Microsoft's MSDN section concerning XML
Microsoft Help: Microsoft's original compiled help file for msxml 4.0 (click: download the SDK)
(Robin) Cover Pages: Cover's extensive XML links (long download)
XML 10-Points: WWW Consortium basic intro to XML
W3C XML 1.0 Spec: WWW Consortium specification for XML 1.0