3 Parsing DocBook Documents

$Revision: 7790 $

$Date: 2008-03-03 09:16:36 -0500 (Mon, 03 Mar 2008) $

A key feature of XML markup is that you can validate it. The DocBook schema is a precise description of valid nesting, the order of elements, and their content. All DocBook documents must conform to this description or they are not DocBook documents (by definition). The validation technology that is built into XML is the Document Type Definition or DTD. A validating parser is a program that can read the DTD and a particular document and determine whether the exact nesting and order of elements in the document is valid according to the DTD.

DocBook is now defined by a RELAX NG grammar, so it is no longer necessary to validate with the DTD. In fact, DTD validation isn't even very valuable, since the DTD version doesn't enforce many DocBook constraints. Instead, an external RELAX NG validator must be used.

RELAX NG validation is performed on a document after it has been parsed. It is possible for parse errors to occur as well as validation errors (if, for example, your document isn't well-formed XML). We're going to assume that your documents are well formed and not discuss XML parsing errors.

If you are not using a structured editor that can enforce the markup as you type, validation with an external tool is a particularly important step in the document creation process. You cannot expect to get rational results from subsequent processing (such as document publishing) if your documents are not valid.

There are several free RELAX NG validators, including Jing and MSV. For more detail about available RELAX NG tools, see http://www.relax-ng.org/.

ID/IDREF constraints and validation

Before we begin, we need to get a slightly tricky subject out of the way: ID/IDREF constraints. In XML, attributes of type ID and IDREF provide a straightforward cross-referencing mechanism. The value of an attribute of type IDREF must be the same as the value of some other attribute of type ID in the document. Checking these constraints is not a core part of RELAX NG; instead, it is provided by a set of “DTD Compatibility” extensions.

Unfortunately, schema extensibility and DTD compatibility don't mix well. Several aspects of the RELAX NG grammar for DocBook introduce errors with respect to ID/IDREF constraint checking.

Luckily, because DocBook uses xml:id for its ID attribute, it's not necessary to enforce the constraints with RELAX NG.

You can either tell your processor not to perform the DTD compatibility extension checks, or ignore the warning messages that they produce.
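
As a small illustration of the cross-referencing mechanism described above, the following sketch gives a section an xml:id and points at it with the linkend attribute of xref. The identifier intro, the section titles, and the paragraph text are invented for this example.

```xml
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<!-- xml:id plays the role of the ID attribute. -->
<section xml:id="intro">
<title>Introduction</title>
<para>
This section carries the identifier "intro".
</para>
</section>
<section>
<title>Details</title>
<para>
<!-- linkend plays the role of the IDREF attribute; its value must
     match an xml:id somewhere in the document. -->
For background, see <xref linkend="intro"/>.
</para>
</section>
</chapter>
```

If linkend named an identifier that does not exist in the document, that is the kind of error an ID/IDREF check is meant to catch.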

Validating Your Documents

As examples, we'll describe how you can use Jing and MSV for validation. For information about your particular validator, consult the documentation that came with it.

Using Jing

The jing tool performs RELAX NG validation.

java -jar jing.jar -t -i docbook.rng test.xml
Elapsed time 562+75=637 milliseconds

The elapsed time is printed because we used the -t option. Without that option, jing produces no output if there are no errors. The -i option suppresses ID/IDREF checks.

Using MSV

The MSV tool performs RELAX NG validation.

java -jar msv.jar docbook.rng test.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating test.xml
the document is valid.

If you use the -warning option, you'll see the ID/IDREF warnings.

Understanding Validation Errors

Every validator produces slightly different error messages, but most indicate exactly (at least technically^[2]) what is wrong and where the error occurred. With a little experience, this information is all you'll need to quickly identify what's wrong.

In the rest of this section, we'll look at a number of common errors and the messages they produce in msv. We've chosen msv because it generally produces informative error messages.

Character Data Not Allowed Here

Out of context character data is frequently caused by a missing start tag, but sometimes it's just the result of typing in the wrong place!

<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
You can't put character data here.
<para>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>

java -jar msv.jar docbook.rng badpcdata.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating badpcdata.xml
Error at line:10, column:7 of badpcdata.xml
  unexpected character literal

You can't put character data directly in a chapter. Here, a wrapper element, such as para, is missing around the sentence between the first two paragraphs.
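
One possible repair, assuming the stray sentence was meant to be a paragraph of its own, is sketched below. The surrounding paragraphs are shortened, and the wording of the new paragraph is changed only for illustration.

```xml
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard.
</para>
<!-- The stray sentence now has a para wrapper of its own. -->
<para>
You can put character data here.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard.
</para>
</chapter>
```

If the sentence really belonged to the preceding paragraph, moving it above that paragraph's end tag would be an equally reasonable fix.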

Misspelled Start Tag

If you spell a tag name wrong, the parser gets confused.

<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
<paar>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>

java -jar msv.jar docbook.rng misspell.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating misspell.xml
Error at line:9, column:7 of misspell.xml
  tag name "paar" is not allowed. Possible tag names are: <address>,<anchor>,
<annotation>,<bibliography>,<bibliolist>,<blockquote>,<bridgehead>,
<calloutlist>,<caution>,<classsynopsis>,<cmdsynopsis>,
<constraintdef>,<constructorsynopsis>,<destructorsynopsis>,<epigraph>,
<equation>,<example>,<fieldsynopsis>,<figure>,<formalpara>,
<funcsynopsis>,<glossary>,<glosslist>,<important>,<index>,
<indexterm>,<informalequation>,<informalexample>,<informalfigure>,
<informaltable>,<itemizedlist>,<literallayout>,<mediaobject>,
<methodsynopsis>,<msgset>,<note>,<orderedlist>,<para>,
<procedure>,<productionset>,<programlisting>,<programlistingco>,
<qandaset>,<refentry>,<remark>,<revhistory>,<screen>,<screenco>,
<screenshot>,<section>,<section>,<segmentedlist>,<sidebar>,
<simpara>,<simplelist>,<simplesect>,<synopsis>,<table>,<task>,
<tip>,<toc>,<variablelist>,<warning>

Luckily, these are pretty easy to spot, unless you accidentally spell the name of another element. In that case, your error might appear to be out of context.

Out of Context Start Tag

Sometimes the problem isn't spelling, but placing a tag in the wrong context. When this happens, the parser tries to figure out what it can add to your document to make it valid. To recover from the error, it then proceeds as if it had actually seen the added markup, which can cause further errors later in the document.

<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
<para><title>Paragraph With Inlines</title>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>

java -jar msv.jar docbook.rng context.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating context.xml
Error at line:9, column:14 of context.xml
  tag name "title" is not allowed. Possible tag names are: <abbrev>,<accel>,
<acronym>,<address>,…,<varname>,<warning>,<wordasword>,<xref>

In this example, we probably wanted a formalpara, so that we could have a title on the paragraph. But note that the parser didn't suggest this alternative. The parser only tries to add additional elements, rather than rename elements that it's already seen.
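
If a titled paragraph is indeed what was intended, the repair might look like the following sketch: the title moves into a formalpara wrapper and the text into a para nested inside it. The unrelated paragraphs are shortened here.

```xml
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard.
</para>
<!-- formalpara holds the title; the text moves into a nested para. -->
<formalpara>
<title>Paragraph With Inlines</title>
<para>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
</formalpara>
</chapter>
```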

^[2] It is often the case that you can correct an error in the document in several ways. The parser suggests one possible fix, but this is not always the right fix. For example, the parser may suggest that you can correct out of context data by adding another element, when in fact it's “obvious” to human eyes that the problem is a missing end tag.