Introduction to XML

An XML (short for Extensible Markup Language) document consists of:

1. the prolog (optional)
2. the document type definition (DTD, optional)
3. the root element (containing further elements, i.e. a tree structure)

Comments and processing instructions may appear between elements and also outside the root element, but never inside a tag.
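To make the tree structure tangible, here is a minimal Python sketch that parses a small document and prints its element hierarchy. The library/book/title document is made up for illustration, and the standard-library parser used here silently skips the comment in front of the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: prolog, a comment, then the root element.
DOC = """<?xml version="1.0" ?>
<!-- a comment before the root element -->
<library>
  <book><title>XML Basics</title></book>
  <book><title>More XML</title></book>
</library>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one level of nesting below the root element.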

Prolog

The basic prolog looks like this:

<?xml version="1.0" ?>
<!-- or -->
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>

Attributes explained:

  • version: XML version
  • encoding: Character set, defaults to UTF-8
  • standalone: declares whether the document depends on external entities/DTDs ("yes" means it does not)
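The effect of the encoding attribute is easiest to see when parsing raw bytes. The following sketch assumes a made-up greeting element and encodes the document as ISO-8859-1, so the parser has to rely on the prolog to decode it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the prolog announces ISO-8859-1, so the text is
# encoded accordingly before being handed to the parser as bytes.
SOURCE = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>\n'
    "<greeting>café</greeting>"
)
raw = SOURCE.encode("iso-8859-1")

# The parser reads the encoding from the prolog and decodes the bytes with it.
greeting = ET.fromstring(raw)
text = greeting.text
print(text)
```

Without the encoding attribute the parser would assume UTF-8 and reject the single byte that ISO-8859-1 uses for "é".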

Document Type Definition

A DTD defines the structural validation rules for a class of documents. In essence, we declare the allowed elements together with their content types, comparable to a database schema.

Reasons to use DTD:

  • validation purposes
  • gathering information about the document
  • force same structure for multiple documents
  • comparability of documents
  • automated processing of specific document types

The general syntax looks like this:

<!DOCTYPE root-element [
...
]>

The DTD above can be placed directly in the XML document. To reference an external DTD file instead, use:

<!DOCTYPE root-element SYSTEM "path-to-dtd.dtd">

An example of an external DTD file (note that it does not require the doctype declaration):

<!ELEMENT root-element (test-element*)>
<!ELEMENT test-element (#PCDATA)>
<!ATTLIST test-element id ID #REQUIRED>
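Python's standard library does not validate against DTDs, so the following is only a hand-written sketch of what the three declarations above demand: a root-element containing nothing but test-element children, each without child elements and with a unique id. The function name conforms is an assumption, and a real validating parser checks more than this (for example that ID values are valid XML names).

```python
import xml.etree.ElementTree as ET


def conforms(xml_text):
    """Rough manual check mirroring the example DTD (not a real validator)."""
    root = ET.fromstring(xml_text)
    if root.tag != "root-element":
        return False
    seen_ids = set()
    for child in root:
        # <!ELEMENT root-element (test-element*)>: only test-element children
        # <!ELEMENT test-element (#PCDATA)>: text only, no nested elements
        if child.tag != "test-element" or len(child) > 0:
            return False
        # <!ATTLIST test-element id ID #REQUIRED>: present and unique
        ident = child.get("id")
        if ident is None or ident in seen_ids:
            return False
        seen_ids.add(ident)
    return True


print(conforms('<root-element><test-element id="a">x</test-element></root-element>'))
```

In practice you would hand this job to a validating parser; the sketch only spells out what the declarations mean.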

Elements

General syntax: <!ELEMENT name category> or <!ELEMENT name (content)>. A quick overview of some category keywords:

Syntax                   Meaning
<!ELEMENT name EMPTY>    An empty element
<!ELEMENT name ANY>      Element with arbitrary content

Content is furthermore specified through these keywords:

Syntax                               Meaning
<!ELEMENT name (#PCDATA)>            Element containing parsed character data
<!ELEMENT name (child1, child2)>     Element containing child1 followed by child2 (strict order!)
<!ELEMENT name (child1 | child2)>    Element containing either child1 or child2

Note: there is no (#CDATA) content model for elements; CDATA exists only as an attribute type and as the CDATA section described at the end of this post.

Occurrences of these children can also be specified:

Syntax                      Meaning
<!ELEMENT name (child)>     Exactly one child
<!ELEMENT name (child?)>    0..1 children
<!ELEMENT name (child*)>    0..N children
<!ELEMENT name (child+)>    1..N children
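The occurrence indicators boil down to lower and upper bounds on the number of children. A small Python sketch of that idea (the LIMITS table and the function occurrence_ok are assumptions made for illustration, and other children of the parent are simply ignored):

```python
import xml.etree.ElementTree as ET

# Occurrence indicator -> (minimum, maximum); None means "no upper bound".
LIMITS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def occurrence_ok(parent_xml, child_name, indicator):
    """Check how often child_name occurs directly below the parent element."""
    parent = ET.fromstring(parent_xml)
    count = len(parent.findall(child_name))
    low, high = LIMITS[indicator]
    return count >= low and (high is None or count <= high)


print(occurrence_ok("<name><child/><child/></name>", "child", "+"))
```

An empty string stands for the plain (child) form without any indicator.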

Attributes

An ATTLIST binds one or more attributes to specific elements.

<!ATTLIST element-name attribute-name attribute-type attribute-value>

<!ATTLIST human
  id     ID              #REQUIRED
  salary (Dollar | Euro) "Dollar">

As you may have noticed, an attribute can be given an explicit list of allowed values in place of a type. The following table lists some of the attribute-types:

Syntax                  Meaning
CDATA                   Character data
(val1 | val2 | ...)     Explicit list of allowed values
ID                      A unique ID
IDREF                   Reference to an ID
IDREFS                  List of IDREF values
NMTOKEN                 Valid XML name token
NMTOKENS                List of NMTOKEN values
ENTITY                  Name of an unparsed entity
ENTITIES                List of ENTITY values

Note: the values of a list type are separated by whitespace.

Now a list of valid attribute-values:

Syntax             Meaning
"value"            Explicit default value
#REQUIRED          Attribute is required
#IMPLIED           Attribute is optional
#FIXED "value"     Explicit fixed value
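Default values declared in an internal DTD are visible even to a non-validating parser. The sketch below uses Python's expat binding, which by default reports defaulted attributes alongside the ones written in the document. The human document is a made-up example and read_attributes is an assumed helper name. Because expat does not validate, #REQUIRED is not enforced here.

```python
from xml.parsers import expat

# Hypothetical document with an internal DTD; salary is not written in the
# start tag, so its value has to come from the ATTLIST default.
DOC = """<!DOCTYPE human [
  <!ELEMENT human EMPTY>
  <!ATTLIST human
    id     ID              #REQUIRED
    salary (Dollar | Euro) "Dollar">
]>
<human id="h1"/>"""


def read_attributes(xml_text):
    """Collect the attributes reported for every start tag, keyed by element name."""
    found = {}

    def on_start(name, attrs):
        found[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found


attributes = read_attributes(DOC)
print(attributes["human"])
```

The reported attributes should contain both the explicit id and the defaulted salary.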

XML Schema

XML Schema is an alternative to DTDs. Schemas are themselves written in XML syntax, support more data types and offer better referencing mechanisms than IDREF.

Note: XML schemas make extensive use of the namespace mechanism (see the Namespace section below).

XML Schema Structure

The schema element is the root element of every XML Schema:

<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
...
</xsd:schema>

Simple Element

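<xsd:element name="test-element" type="xsd:string" default="Default Value"/>
<xsd:element name="test-element2" type="xsd:string" fixed="Fixed Value"/>

DTDs have no direct equivalent, because default and fixed values can only be declared for attributes, not for element content. The closest DTD counterpart (some-element and the attribute names are placeholders):

<!ATTLIST some-element test-attribute CDATA "Default Value">
<!ATTLIST some-element test-attribute2 CDATA #FIXED "Fixed Value">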

Types: xsd:string, xsd:decimal, xsd:integer, xsd:boolean, xsd:date, xsd:time.

Note: simple elements cannot contain attributes.

Complex Element

<xsd:element name="employee" type="person-info"/> <!-- Reference to a complex type (similar to nesting in DTD) -->

<xsd:complexType name="person-info">
  <xsd:sequence><!-- Firstname, then lastname (similar to DTD: comma separation) -->
    <xsd:element name="firstname" type="xsd:string"/>
    <xsd:element name="lastname" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>
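Since a schema is itself an XML document, it can be inspected with any XML parser. The following sketch wraps the employee example in a schema element and lists the elements of the sequence in order. Validating an instance document against the schema is a different matter and needs a third-party library, which is not shown here. The helper name sequence_of is an assumption.

```python
import xml.etree.ElementTree as ET

XSD_NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}

SCHEMA = """<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="employee" type="person-info"/>
  <xsd:complexType name="person-info">
    <xsd:sequence>
      <xsd:element name="firstname" type="xsd:string"/>
      <xsd:element name="lastname" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def sequence_of(schema_text, type_name):
    """Return (name, type) pairs of the elements in a complex type's sequence."""
    schema = ET.fromstring(schema_text)
    path = f"xsd:complexType[@name='{type_name}']/xsd:sequence/xsd:element"
    return [(e.get("name"), e.get("type")) for e in schema.findall(path, XSD_NS)]


fields = sequence_of(SCHEMA, "person-info")
print(fields)
```

The order of the returned pairs matches the order required by xsd:sequence.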

Referencing an XML Schema

And finally a reference to an XML schema:

<?xml version="1.0" ?>
<root-element xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="namespace_path_for_schema schema.xsd">
</root-element>

Or, if the schema does not declare a target namespace:

<?xml version="1.0" ?>
<root-element xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema.xsd">
</root-element>
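The schema reference is an ordinary namespaced attribute, so a program can read it back. A minimal Python sketch follows; note that the namespace URI of the xsi prefix starts with http://, not https://, and that the standard-library parser exposes namespaced names in the form {URI}local-name.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOC = f"""<?xml version="1.0" ?>
<root-element xmlns:xsi="{XSI}" xsi:noNamespaceSchemaLocation="schema.xsd">
</root-element>"""

root = ET.fromstring(DOC)

# Namespaced attributes are looked up by "{namespace-uri}local-name".
schema_file = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
print(schema_file)
```

The parser itself does nothing with this hint; loading and applying the schema is up to a validating tool.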

Entities

An entity is a named unit of data within an XML document. Entity references are resolved before validation takes place.

<!ENTITY entityname "value">

Entities fall into two categories:

Parsed entity (XML-fragment):

  • Internal: defined within a DTD
  • External: defined in another document

Unparsed entity (miscellaneous data):

  • Value of an attribute with type ENTITY or ENTITIES
  • Reference to an external file

Predefined Entities

Syntax     Meaning
&lt;       <
&gt;       >
&amp;      &
&apos;     '
&quot;     "

A declared entity is referenced as &entityname;.
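A short Python sketch shows both kinds of entity in action: an internal entity declared in the DTD and the predefined ones. The note document and the author entity are made up for illustration. The second half goes the other way and escapes special characters before they are written into XML, using the standard-library helper xml.sax.saxutils.escape.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical document with one internal entity and two predefined ones.
DOC = """<!DOCTYPE note [
  <!ENTITY author "Jane Doe">
]>
<note>Written by &author; &amp; friends, 1 &lt; 2</note>"""

# The parser replaces every entity reference with its value.
resolved = ET.fromstring(DOC).text
print(resolved)

# escape() handles &, < and > by default; quotes need an explicit mapping.
escaped = escape('1 < 2 & "3"', {'"': "&quot;", "'": "&apos;"})
print(escaped)
```

Be careful with entity expansion when parsing untrusted input, since deeply nested entity definitions can be abused to exhaust memory.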

References In XML

References to entities (as we already know): &entityname;.
References to elements: an element with an ID attribute can be referenced through IDREF(S).

Note: references via IDREF(S) only work within a document.
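A non-validating parser treats ID and IDREF attributes as plain strings, so following a reference is left to the application. One possible sketch in Python is shown below; the staff document, the boss attribute and the resolve helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" plays the role of the ID attribute,
# "boss" the role of an IDREF attribute pointing at another element.
DOC = """<staff>
  <human id="h1" name="Ann"/>
  <human id="h2" name="Bob" boss="h1"/>
</staff>"""

root = ET.fromstring(DOC)

# Index every element that carries an id, once, for constant-time lookups.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}


def resolve(element, attribute):
    """Return the element referenced by the given attribute, or None."""
    return by_id.get(element.get(attribute))


boss_of_bob = resolve(by_id["h2"], "boss")
print(boss_of_bob.get("name"))
```

The index only covers the document that was parsed, which matches the restriction that IDREF(S) cannot point into other documents.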

Namespace

Motivation: mixing different XML vocabularies results in conflicts when they contain elements with the same name.
General syntax: xmlns:PREFIX="URI".

<store xmlns:s="http://gimu.org/s">
<s:title>Something</s:title>
<s:example s:id="1">Example</s:example>
</store>

Note: an unprefixed attribute does not inherit the namespace of its parent element, and the default namespace does not apply to attributes either.
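How a parser sees these names can be checked with a few lines of Python. The sketch reuses the store example and adds an unprefixed attribute called plain, which is not part of the original example, to show that it stays outside the namespace.

```python
import xml.etree.ElementTree as ET

DOC = """<store xmlns:s="http://gimu.org/s">
  <s:title>Something</s:title>
  <s:example s:id="1" plain="x">Example</s:example>
</store>"""

NS = {"s": "http://gimu.org/s"}

store = ET.fromstring(DOC)
title = store.find("s:title", NS)
example = store.find("s:example", NS)

# Qualified names are reported as "{namespace-uri}local-name".
print(store.tag)
print(title.tag)
print(sorted(example.attrib))
```

The prefix itself is only a shorthand; what counts for comparison is the namespace URI.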

Default Namespace

A default namespace applies to the element it is declared on and to all of its unprefixed descendant elements.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>All unprefixed elements belong to the xhtml namespace</title>
</head>
...
</html>
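A common stumbling block follows from this: once a default namespace is declared, lookups by the bare element name no longer match. A minimal Python sketch, where the body element and the prefix x used for the lookup are assumptions chosen for illustration:

```python
import xml.etree.ElementTree as ET

DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hello</title></head>
  <body id="main"/>
</html>"""

# The prefix used in queries is free to choose; only the URI has to match.
XHTML = {"x": "http://www.w3.org/1999/xhtml"}

html = ET.fromstring(DOC)

unqualified = html.find("head")             # bare name: no match
title = html.find("x:head/x:title", XHTML)  # namespace-aware lookup
body = html.find("x:body", XHTML)

print(unqualified, title.text, body.attrib)
```

The id attribute of body shows up without any namespace, in line with the rule that attributes do not take part in the default namespace.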

XPath

XML documents are treated as trees. There are several distinguishable node types:

  • root node
  • element nodes
  • attribute nodes
  • text nodes
  • processing instruction nodes
  • comment nodes
  • namespace nodes

Explicit Path

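The XPath query /PersonalFile/Particulars/Firstname selects every Firstname element that is a child of a Particulars element directly below the root element PersonalFile. For a document containing two such elements, the result would be:

Result
<Firstname>Foo</Firstname>
<Firstname>Foo2</Firstname>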

Predicates

<AAA>
<BBB/>
<BBB/>
<CCC/>
<BBB/>
</AAA>

Query                Result
/AAA/BBB             <BBB/>, <BBB/>, <BBB/> (all BBB children of AAA)
/AAA/BBB[1]          <BBB/> (first one)
/AAA/BBB[last()]     <BBB/> (last one)
/AAA/*               <BBB/>, <BBB/>, <CCC/>, <BBB/> (any child of AAA)
//BBB                <BBB/>, <BBB/>, <BBB/> (all BBB elements, regardless of their position in the hierarchy)
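These queries can be tried out in Python, with one caveat: the standard-library ElementTree module only supports a subset of XPath and expects paths relative to an element, so /AAA/BBB becomes BBB evaluated on the root element, and //BBB becomes .//BBB. The n attributes are added here purely to tell the elements apart.

```python
import xml.etree.ElementTree as ET

# Same structure as the example, with hypothetical n attributes as labels.
aaa = ET.fromstring('<AAA><BBB n="1"/><BBB n="2"/><CCC n="3"/><BBB n="4"/></AAA>')


def numbers(path):
    """Evaluate a path relative to AAA and return the labels of the matches."""
    return [element.get("n") for element in aaa.findall(path)]


for path in ("BBB", "BBB[1]", "BBB[last()]", "*", ".//BBB"):
    print(path, numbers(path))
```

For full XPath support, including absolute paths and functions beyond last(), a third-party library would be needed.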

Processing Instructions

To influence XML processing, you can use processing instructions: <?name data?>.
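A processing instruction consists of a target (the name) and free-form data, and a DOM parser exposes both. The sketch below uses the widely used xml-stylesheet instruction; the file name style.xsl and the helper name processing_instructions are assumptions.

```python
from xml.dom import Node, minidom

# Hypothetical document with one processing instruction before the root element.
DOC = '<?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'


def processing_instructions(xml_text):
    """Return (target, data) for each top-level processing instruction."""
    document = minidom.parseString(xml_text)
    return [
        (node.target, node.data)
        for node in document.childNodes
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    ]


instructions = processing_instructions(DOC)
print(instructions)
```

The data part looks like a list of attributes here, but to the parser it is just an opaque string that the addressed application has to interpret.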

Character Data

Arbitrary character data, including markup such as HTML, can be embedded in XML without being parsed.
This is done by enclosing the data in <![CDATA[...]]>.

<xml>
<![CDATA[
<html>
<b>Bold text</b><br/>
</html>
]]>
</xml>
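That the content of a CDATA section really is treated as text, not as markup, can be verified with a short Python sketch. The element name content is an assumption chosen for this example.

```python
import xml.etree.ElementTree as ET

DOC = "<content><![CDATA[<b>Bold text</b><br/>]]></content>"

content = ET.fromstring(DOC)

# The markup inside the CDATA section arrives as plain text,
# so the element has no child elements at all.
text = content.text
child_count = len(content)
print(repr(text), child_count)
```

The only character sequence that cannot appear inside a CDATA section is ]]>, because it ends the section.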