Notes on Well-Formed XML

Taken from XML 1.0 (Third Edition), W3C Recommendation, 04 February 2004

Author: Daniel Kemper http://www.dankemper.net

 

A textual object is a well-formed XML document if:

           

1.       Taken as a whole, it matches the production labeled document.

2.       It meets all the well-formedness constraints given in this specification.

3.       Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
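In practice, the quickest way to test these conditions is to hand the text to an XML parser and see whether it reports an error. The sketch below is one way to do that in Python. It assumes the standard library's xml.etree.ElementTree is an acceptable stand-in for a non-validating processor, and the helper name is_well_formed is made up for illustration. The parser does not fetch external entities, so condition 3 is only exercised for entities declared inside the document itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised for any violation the parser detects, such as a
        # mismatched tag or a second root element.
        return False
    return True


if __name__ == "__main__":
    for sample in ("<note><to>Ann</to></note>",   # one root, tags balanced
                   "<note><to>Ann</note>",        # <to> is never closed
                   "<a/><b/>"):                   # two root elements
        print(repr(sample), is_well_formed(sample))
```

The second sample fails a well-formedness constraint (start and end tags must match), while the third fails condition 1, because the document production allows only one element.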

 

So, what is a document?  Here, document is the name of a production in the XML grammar.  A production, for our purposes, is a rule of the form (α → β), meaning that α generates β, or that α can be rewritten in terms of β.

When a grammar is used to define the syntax of a computer language, the productions are usually written as α ::= β. 

 

The document production is defined as

 

            document ::= prolog element Misc*

 

To understand a document, we must understand the three productions within it: prolog, element, and Misc.  The first of these is prolog.

 

The prolog production is defined as

 

            prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

 

Like document, the prolog production is defined in terms of other productions.  The idea is to keep drilling down until every production is defined only by literal strings and character classes rather than by further productions.  Below is a table of the productions that define the prolog.  The following standard quantifiers are recognized:

 

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 0 or 1 times
 
Do not confuse a quoted question mark, which is a literal character (as in ‘<?xml’ and ‘?>’), with the unquoted quantifier ?.
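The same three quantifiers exist in regular expressions, so a small Python sketch can illustrate both their meaning and the difference between a quantifier and a literal question mark. The patterns and their names are illustrative only and are not taken from the specification.

```python
import re

# Each pattern is the letter 'a' followed by a quantified 'b'.
ZERO_OR_MORE = re.compile(r"ab*")   # a, ab, abb, ...
ONE_OR_MORE = re.compile(r"ab+")    # ab, abb, ... but not a
ZERO_OR_ONE = re.compile(r"ab?")    # a or ab only

# A literal question mark must be escaped in a regex, much as it is quoted
# in the grammar. This pattern matches the two characters '?>' themselves.
LITERAL_QUESTION = re.compile(r"\?>")

if __name__ == "__main__":
    for text in ("a", "ab", "abbb"):
        print(text,
              bool(ZERO_OR_MORE.fullmatch(text)),
              bool(ONE_OR_MORE.fullmatch(text)),
              bool(ZERO_OR_ONE.fullmatch(text)))
```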

 

Production L-value

Production R-value

XMLDecl

‘<?xml’ VersionInfo EncodingDecl? SDDecl? S? ‘?>’

VersionInfo

S ‘version’ Eq (“’” VersionNum “’” | ‘”’ VersionNum ‘”’)

Eq

S? ‘=’ S?

VersionNum

‘1.0’

S

(#x20 | #x9 | #xD | #xA)+

 

/* note: these are space, tab, carriage return, and line feed respectively. */

 

EncodingDecl

S ‘encoding’ Eq (‘”’ EncName ‘”’ | “’” EncName “’”)

EncName

[A-Za-z] ([A-Za-z0-9._] | ‘-’)*

 

/* note: an encoding name contains only Latin characters.  It must start with a letter (upper or lower case), which may be followed by any number of letters, digits, periods, underscores, and hyphens.  */

 

SDDecl

S ‘standalone’ Eq ( (“’” (‘yes’ | ‘no’) “’” ) | (‘”’ (‘yes’ | ‘no’) ‘”’) )
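Every production from XMLDecl down to this point bottoms out in literal strings and character classes, so the whole XML declaration can be transcribed into a single regular expression. The following is a minimal Python sketch of that transcription; the variable names and the quoted helper are invented for illustration, and a real parser does considerably more than this.

```python
import re

S = r"[\x20\x09\x0D\x0A]+"                 # one or more white space characters
EQ = rf"(?:{S})?=(?:{S})?"                 # '=' with optional white space around it
VERSION_NUM = r"1\.0"
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"


def quoted(body: str) -> str:
    """Pattern for `body` wrapped in matching single or double quotes."""
    return rf"""(?:'{body}'|"{body}")"""


VERSION_INFO = rf"{S}version{EQ}{quoted(VERSION_NUM)}"
ENCODING_DECL = rf"{S}encoding{EQ}{quoted(ENC_NAME)}"
STANDALONE_DECL = rf"{S}standalone{EQ}{quoted('(?:yes|no)')}"

# VersionInfo is mandatory; the other parts are optional but their order is fixed.
XML_DECL = re.compile(
    rf"<\?xml{VERSION_INFO}(?:{ENCODING_DECL})?(?:{STANDALONE_DECL})?(?:{S})?\?>"
)

if __name__ == "__main__":
    for decl in ('<?xml version="1.0"?>',
                 '<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>',
                 '<?xml encoding="UTF-8"?>'):
        print(decl, bool(XML_DECL.fullmatch(decl)))
```

The last sample is rejected because the version information is missing. Swapping the encoding and standalone parts would be rejected as well, since the production lists them in a fixed order.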

Misc

Comment | PI | S

Comment

‘<!--’ ((Char - ‘-’) | (‘-’ (Char - ‘-’)))* ‘-->’
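The effect of the Comment production is that a comment body may contain single hyphens but never two in a row, and may not end with a hyphen. A rough Python sketch of the rule follows. It assumes the opening delimiter is the four characters '<!--' as in the XML 1.0 specification, and it simplifies Char to "any character", so it is more permissive than the real grammar in that one respect.

```python
import re

# Body: either a non-hyphen character, or a hyphen followed by a non-hyphen.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text: str) -> bool:
    """Return True if `text` is exactly one comment under the simplified rule."""
    return COMMENT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("<!-- a single - hyphen is fine -->",
                   "<!-- two -- hyphens are not -->",
                   "<!-- trailing hyphen --->"):
        print(sample, is_comment(sample))
```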

Char

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF */

PI

‘<?’ PITarget (S (Char* - (Char* ‘?>’ Char*)))? ‘?>’

PITarget

Name – ((‘X’ | ‘x’) (‘M’ | ‘m’) (‘L’ | ‘l’))

Name

(Letter | ‘_’ | ‘:’) (NameChar)*

 

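/* a note from the REC-xml-20040204 specification regarding Letter and NameChar.  For the purposes of these notes, just think of Letter as the Latin letters, and of NameChar as those letters plus the digits and the characters ‘.’, ‘-’, ‘_’, and ‘:’: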

Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished. */
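The PI, PITarget, and Name productions above can be combined into a short Python sketch that splits a processing instruction into its target and data. It leans on two assumptions: Name is narrowed to ASCII letters, digits, and the punctuation listed in the XML 1.0 specification (a name starts with a letter, '_' or ':'), and the names NAME, PI and parse_pi are made up for this example.

```python
import re

S = r"[\x20\x09\x0D\x0A]+"
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"       # ASCII-only simplification of Name

# Target, then optionally white space plus data that never contains '?>'.
PI = re.compile(rf"<\?({NAME})(?:{S}((?:(?!\?>).)*))?\?>", re.DOTALL)


def parse_pi(text: str):
    """Return (target, data) for a processing instruction, or None."""
    match = PI.fullmatch(text)
    if match is None:
        return None
    target, data = match.group(1), match.group(2) or ""
    # PITarget excludes the name 'xml' in any mix of upper and lower case.
    if target.lower() == "xml":
        return None
    return target, data


if __name__ == "__main__":
    print(parse_pi('<?xml-stylesheet type="text/xsl" href="style.xsl"?>'))
    print(parse_pi('<?xml version="1.0"?>'))
```

Note that 'xml-stylesheet' is a legal target: only the exact three-letter name is reserved, which is why the XML declaration itself is not a processing instruction.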

 

doctypedecl

‘<!DOCTYPE’ S Name (S ExternalID)? S? (‘[’ intSubset ‘]’ S?)? ‘>’

 

/* note: validity constraint on document type declaration – the Name in the document type declaration MUST match the element type of the root element. */
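Because this is a validity constraint rather than a well-formedness constraint, a non-validating parser will not complain when the two names differ. A check can be sketched on top of Python's xml.parsers.expat by recording both names and comparing them. The function name doctype_matches_root is hypothetical, and the sketch returns False for a document that has no document type declaration at all.

```python
import xml.parsers.expat


def doctype_matches_root(text: str) -> bool:
    """Return True if the DOCTYPE name equals the root element's name."""
    names = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        names["doctype"] = name

    def on_start(name, attrs):
        # Only the first start tag, i.e. the root element, is recorded.
        names.setdefault("root", name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return "doctype" in names and names["doctype"] == names.get("root")


if __name__ == "__main__":
    good = ("<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)>]>"
            "<greeting>Hello</greeting>")
    bad = "<!DOCTYPE greeting><salute/>"
    print(doctype_matches_root(good), doctype_matches_root(bad))
```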

 

ExternalID

‘SYSTEM’ S SystemLiteral | ‘PUBLIC’ S PubidLiteral S SystemLiteral

SystemLiteral

( ‘”’ [^”]* ‘”’) | (“’” [^’]* “’”)

PubidLiteral

‘”’ PubidChar* ‘”’ | “’” (PubidChar – “’”)* “’”

PubidChar

#x20 | #xD | #xA | [a-zA-Z0-9] | [-‘()+,./:=?;!*#@$_%]

intSubset

(markupdecl | DeclSep)*

markupdecl

elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment

elementdecl

‘<!ELEMENT’ S Name S contentspec S? ‘>’

Examples of element type declarations:

<!ELEMENT br EMPTY>
<!ELEMENT p (#PCDATA|emph)* >
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>

 

AttlistDecl

‘<!ATTLIST’ S Name AttDef* S? ‘>’

AttDef

S Name S AttType S DefaultDecl

AttType

StringType | TokenizedType | EnumeratedType

StringType

‘CDATA’

TokenizedType

‘ID’ | ‘IDREF’ | ‘IDREFS’ | ‘ENTITY’ | ‘ENTITIES’ | ‘NMTOKEN’ | ‘NMTOKENS’

 

/* notes on the TokenizedType production:

Validity constraint: ID

Values of type ID MUST match the Name production. A name MUST NOT appear more than once in an XML document as a value of this type; i.e., ID values MUST uniquely identify the elements which bear them.

Validity constraint: One ID per Element Type

An element type MUST NOT have more than one ID attribute specified.

Validity constraint: ID Attribute Default

An ID attribute MUST have a declared default of #IMPLIED or #REQUIRED.

Validity constraint: IDREF

Values of type IDREF MUST match the Name production, and values of type IDREFS MUST match Names; each Name MUST match the value of an ID attribute on some element in the XML document; i.e. IDREF values MUST match the value of some ID attribute.

Validity constraint: Entity Name

Values of type ENTITY MUST match the Name production, values of type ENTITIES MUST match Names; each Name MUST match the name of an unparsed entity declared in the DTD.

Validity constraint: Name Token

Values of type NMTOKEN MUST match the Nmtoken production; values of type NMTOKENS MUST match Nmtokens.

*/
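As an illustration of the first ID constraint, the following Python sketch looks for ID values that are used more than once. It assumes the attribute-list declarations live in the internal subset, where xml.parsers.expat reports them through AttlistDeclHandler; the function name duplicate_ids and the sample document are invented for the example, and none of the other validity constraints are checked.

```python
import xml.parsers.expat


def duplicate_ids(text: str) -> list:
    """Return the ID values that occur more than once, in document order."""
    id_attrs = set()     # (element name, attribute name) pairs declared as ID
    seen = set()
    duplicates = []

    def on_attlist(elname, attname, att_type, default, required):
        if att_type == "ID":
            id_attrs.add((elname, attname))

    def on_start(name, attrs):
        for attname, value in attrs.items():
            if (name, attname) in id_attrs:
                if value in seen:
                    duplicates.append(value)
                seen.add(value)

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return duplicates


DOC = """<!DOCTYPE catalog [
  <!ATTLIST item key ID #REQUIRED>
]>
<catalog><item key="a1"/><item key="b2"/><item key="a1"/></catalog>"""

if __name__ == "__main__":
    print(duplicate_ids(DOC))
```

The sample document is well-formed, so the parser accepts it, but it is not valid because the value a1 identifies two different elements.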

 

EnumeratedType

NotationType | Enumeration

NotationType

‘NOTATION’ S ‘(‘ S? Name (S? ‘|’ S? Name)* S? ‘)’

Enumeration

‘(‘ S? Nmtoken (S? ‘|’ S? Nmtoken)* S? ‘)’

Nmtoken

(NameChar)+

DefaultDecl

‘#REQUIRED’ | ‘#IMPLIED’ | ((‘#FIXED’ S)? AttValue)

 

/* note: The meaning of #REQUIRED, #IMPLIED, and #FIXED

 

In an attribute declaration, #REQUIRED means that the attribute MUST always be provided, #IMPLIED that no default value is provided. [Definition: If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute MUST always have the default value. When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it MUST report the attribute with the declared default value to the application.]

 

*/
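The last sentence of that note can be observed directly with a parser. The Python sketch below declares one attribute with a plain default, one #FIXED attribute, and one #IMPLIED attribute, then prints the attributes that xml.parsers.expat reports for each element. The document, the element names, and the helper element_attributes are all made up for illustration. Since expat is non-validating, it would not enforce a #REQUIRED attribute, so none is shown.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE lists [
  <!ATTLIST list
      type    (bullets|ordered) "ordered"
      version CDATA             #FIXED "1.0"
      title   CDATA             #IMPLIED>
]>
<lists><list/><list type="bullets" title="Groceries"/></lists>"""


def element_attributes(text: str) -> list:
    """Return (element name, attribute dict) pairs in document order."""
    found = []

    def on_start(name, attrs):
        found.append((name, dict(attrs)))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return found


if __name__ == "__main__":
    for name, attrs in element_attributes(DOC):
        print(name, attrs)
```

The first list element carries no attributes in the document, yet the parser reports type and version with their declared defaults. The #IMPLIED title attribute appears only where the document supplies it.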

 

EntityDecl

GEDecl | PEDecl

GEDecl

‘<!ENTITY’ S Name S EntityDef S? ‘>’

PEDecl