Overview

xmlroff produces PDF or PostScript output using the GNOME Print library. Other output formats can be added.

xmlroff is written in C and uses libxml2 and libxslt plus the GLib, GObject and Pango libraries that underlie GTK+ and GNOME (although it does not require either GTK+ or GNOME). GLib is a general-purpose utility library, GObject is a flexible, extensible object-oriented framework for C, and Pango is a framework for the layout and rendering of internationalized text. This combination made it easier to develop the formatter, makes it easier for current GTK+ and GNOME developers to work on the formatter, and allows the formatter to use the internationalization support of Pango.

xmlroff has no connection with troff, nroff, groff or related programs, although both the names and the purposes are similar. The roff(7) man page from the groff distribution describes a roff type-setting system as "an extensible text formatting language and a set of programs for printing and converting to other text formats." Since this program's input and its extensible text formatting language are both XML, it makes sense to call this program "xmlroff" in homage to the traditional Unix type-setting programs.

XSL describes formatting – for both paper and screen – in terms of "formatting objects." There are formatting objects, for example, for pages, blocks of text, list items, and tables. XSL also defines their allowed properties and the properties' meaning: for example, all formatting objects containing text may specify the font size or font weight (normal, bold, etc.) of the text, but only a table cell may use the "number-rows-spanned" property that indicates how many rows that cell spans.

XSL also describes the conceptual procedure for processing the input XML document and XSL stylesheet to create the formatted output. xmlroff implements formatting largely in accordance with the stages described in the XSL Recommendation. The stages are shown in the context diagram and summarized in the following sections.

Context Diagram

The context diagram is in the form used in Software Requirements & Specifications by Michael Jackson.

Source XML to Result Tree

This stage transforms the source XML into a representation of the formatting objects that are used to direct the formatting.

The inputs to xmlroff are an XML document that is to be formatted -- typically described as the "source" document -- and the XSL stylesheet that specifies the transformation of the source XML document into the XML vocabulary used for specifying formatting objects and their properties.

xmlroff incorporates the libxslt XSLT processor, which performs the transformation. The result, termed the result tree in the XSLT Recommendation, is an in-memory representation of the structure of an XML document. The element names and attribute names appearing in the result tree are the names of the XSL formatting objects and their properties, respectively. Later processing stages use the structure of the result tree and the property values specified to determine the appearance of the formatted output.
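To make the mapping from names to formatting objects and properties concrete, the following is a minimal sketch in Python, using only the standard library rather than libxslt or any xmlroff code. The sample document in RESULT_XML and the helper list_formatting_objects are hypothetical, chosen for illustration; a real result tree stays in memory and is never serialized like this.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XSL formatting objects vocabulary.
FO_NS = "http://www.w3.org/1999/XSL/Format"

# A tiny, hypothetical result tree, written out as XML only for illustration.
RESULT_XML = f"""
<fo:root xmlns:fo="{FO_NS}">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page" page-width="210mm" page-height="297mm">
      <fo:region-body margin="25mm"/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="12pt" font-weight="bold">Hello</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
"""


def list_formatting_objects(xml_text):
    """Return (formatting object name, {property: value}) pairs in document order."""
    root = ET.fromstring(xml_text)
    prefix = "{" + FO_NS + "}"
    found = []
    for element in root.iter():
        # Element names in the FO namespace are formatting object names;
        # their attributes are the properties, still held as plain text.
        if element.tag.startswith(prefix):
            found.append((element.tag[len(prefix):], dict(element.attrib)))
    return found


for name, properties in list_formatting_objects(RESULT_XML):
    print(name, properties)
```

Every property value printed here is still a string, which is why a later stage has to turn the result tree into typed objects.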

The result tree could be identical to the source document's tree or could be radically different, since the stylesheet can drop any part of the source tree, duplicate any part of the source tree, create elements and text in the result tree, and merge in any part of other XML documents that could be specified in the source XML, in the stylesheet, or in a parameter passed to the XSLT processor.

XSLT

The mechanics of this transformation, from source XML to a different XML document, are standardized not by the XSL Recommendation itself but by the separate XSLT Recommendation. XSLT is conceptually part of XSL, and in early drafts of XSL, the XSLT specification was in one section and the formatting objects' descriptions were in another. XSLT was broken out as a separate W3C Recommendation because it now has widespread use for general XML-XML, XML-HTML, and XML-text transformations in addition to its initial purpose of transforming arbitrary XML into the specific XML vocabulary used for expressing formatting objects and their properties.

A beneficial side effect of XSLT's success is the availability of a choice of free, stable, and high-performance XSLT processors that can be incorporated into xmlroff as the much preferred alternative to writing an XSLT processor. Accordingly, xmlroff incorporates Daniel Veillard's libxslt XSLT processor.

XSLT operates on documents as trees. That is, XSLT's processing model views the source document not as a sequence of characters, and not as start-tags and end-tags with text between them, but as a tree of nodes, where, for example, each element, each attribute, and each contiguous run of text is a separate node. The structure of the source XML document — for example, the containment of one element by another — is reflected in the structure of the tree of nodes comprising the source tree.
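The tree view can be illustrated with a short Python sketch that walks a parsed document and reports each element, attribute, and run of text as a separate node. It uses the standard library's DOM, which is close to but not identical to the XSLT data model; the sample SOURCE document and the describe_nodes helper are assumptions made for this example and are not part of xmlroff or libxslt.

```python
from xml.dom import minidom, Node

# Hypothetical source document used only for this illustration.
SOURCE = '<doc id="d1"><title>Intro</title><para>Some <em>marked up</em> text.</para></doc>'


def describe_nodes(node, depth=0, lines=None):
    """Collect one indented line per element, attribute, and text node."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append(f"{indent}element {node.tagName}")
        # Attributes are nodes attached to the element, not its children.
        for name, value in node.attributes.items():
            lines.append(f"{indent}  attribute {name}={value!r}")
        # Containment in the XML becomes parent/child structure in the tree.
        for child in node.childNodes:
            describe_nodes(child, depth + 1, lines)
    elif node.nodeType == Node.TEXT_NODE:
        lines.append(f"{indent}text {node.data!r}")
    return lines


tree = minidom.parseString(SOURCE)
for line in describe_nodes(tree.documentElement):
    print(line)
```

Note that the para element has three children: a text node, the em element, and another text node.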

The result tree is similarly a tree of nodes, where a node represents an element, an attribute, some text, etc. In a general-purpose XSLT processor, the result tree is usually written out to a file or transmitted to another application as an XML document. The result tree doesn't have to be written out in any form, however, and it can be used as-is by the application. Accordingly, xmlroff uses this in-memory representation as the input to the next processing stage.

Result Tree to Formatting Object and Area Trees

This processing stage transforms the result tree into a tree of real programmatic objects with properties that are expressed as numeric, boolean, color, or other datatypes (instead of just text).

The result tree is a representation of an XML document. XML documents are just text, so in the result tree, formatting object and property names are represented as text, as are property values. Some property values, however, represent numeric quantities, and many may contain expressions that need to be evaluated to determine the exact value to use. Furthermore, there are complex interactions and dependencies between formatting objects and between properties.
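As a rough illustration of turning textual property values into typed ones, the sketch below converts a few absolute lengths to points and a row-span count to an integer. It is written in Python for brevity and does not reflect xmlroff's GObject-based datatypes; the names Length, parse_length, and parse_property are hypothetical, and expressions, percentages, keywords, and relative units such as em are deliberately left out.

```python
from dataclasses import dataclass

# Points per unit for the absolute units handled by this sketch.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "pc": 12.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}


@dataclass(frozen=True)
class Length:
    """A length normalized to points."""
    points: float


def parse_length(text):
    """Convert text such as '12pt' or '8.5in' into a Length."""
    text = text.strip()
    for unit, factor in POINTS_PER_UNIT.items():
        if text.endswith(unit):
            return Length(float(text[: -len(unit)]) * factor)
    raise ValueError(f"unsupported length: {text!r}")


def parse_property(name, text):
    """Return a typed value for a small, illustrative subset of properties."""
    if name in ("font-size", "page-width", "page-height"):
        return parse_length(text)
    if name in ("number-rows-spanned", "number-columns-spanned"):
        return int(text)
    return text  # anything else stays a string in this sketch


print(parse_property("font-size", "12pt"))
print(parse_property("page-width", "8.5in"))
print(parse_property("number-rows-spanned", "2"))
```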

This stage also creates the area tree representation of the formatted document laid out onto pages.

The formatting object tree expresses the specification for the formatted document (with expressions, interactions, and dependencies fully resolved) in terms of objects and datatypes that are useful for manipulation by a program. If the output "page" of a formatter were always infinitely wide and infinitely long (as is both possible with an electronic display and supported by the XSL Recommendation), then creating the output would be a comparatively simple matter of writing out the formatting object tree.

When the output isn't infinitely wide, however, a formatter has to support breaking lines, and when the output also isn't infinitely long, a formatter has to support breaking the output into discrete pages. In the real world, formatting content into pages also means numbering pages, supporting running headers and footers, and possibly handling different page sizes or margins on different pages.
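The effect of a finite page can be sketched in a few lines of Python. This is only a toy model under stated assumptions: widths are counted in characters, lines are broken greedily with the standard textwrap module, and every page holds the same number of lines. xmlroff itself relies on Pango for text layout, and the function names break_into_pages and render are hypothetical.

```python
import textwrap


def break_into_pages(text, chars_per_line, lines_per_page):
    """Break text into lines of limited width, then lines into pages."""
    lines = textwrap.wrap(text, width=chars_per_line)
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


def render(pages, title):
    """Add a running header with the page number to each page."""
    output = []
    for number, page in enumerate(pages, start=1):
        output.append(f"{title} | page {number} of {len(pages)}")
        output.extend(page)
        output.append("")  # blank line standing in for the page boundary
    return output


sample = "The quick brown fox jumps over the lazy dog. " * 6
for line in render(break_into_pages(sample, 30, 4), "Sample"):
    print(line)
```

Even in this toy version, the header text depends on the total page count, which is known only after all pages have been made.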

The formatting objects each create zero, one, or more than one areas in the area tree:

  1. Some formatting objects, e.g. wrapper, generate no areas because that's how they're specified, and some don't generate areas because of some conditionality in the formatting process; e.g., only the "preferred" of any number of applicable marker formatting objects is formatted in place of a retrieve-marker formatting object.

  2. Many formatting objects generate one area.

  3. Some formatting objects generate more than one area; for example, a block formatting object split across two pages, or a page-sequence that, by definition, generates as many pages as necessary to contain its content.
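The three cases in the list above can be sketched as a single function that returns the areas a formatting object generates, measured in lines. This is a simplified Python model, not xmlroff's area-generation code: the function areas_for and its parameters are hypothetical, only wrapper and block-like objects are modeled, and a block is assumed to split at whatever line falls at the bottom of the page.

```python
def areas_for(fo_name, line_count, lines_left_on_page, lines_per_page):
    """Return the sizes, in lines, of the areas one formatting object generates."""
    if fo_name == "wrapper":
        return []  # case 1: the object itself generates no areas

    areas = []
    remaining = line_count
    available = lines_left_on_page
    while remaining > 0:
        if available == 0:
            available = lines_per_page  # current page is full, start a new one
        used = min(remaining, available)
        areas.append(used)
        remaining -= used
        available = lines_per_page  # any further area starts on a fresh page
    return areas


print(areas_for("wrapper", 0, 10, 40))  # no areas
print(areas_for("block", 5, 10, 40))    # fits on the current page: one area
print(areas_for("block", 15, 10, 40))   # split across two pages: two areas
```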

Many formatting object properties may be expressed as percentages of another value, often a percentage of a dimension of the area generated by an ancestor formatting object. xmlroff builds the formatting object tree and the area tree in parallel, so expressions containing percentages are resolved when a formatting object is added to the formatting object tree.
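The benefit of building both trees together can be shown with a small Python sketch: because the parent's area already exists when a child is added, a percentage width can be resolved immediately. The classes Area and FormattingObject and the functions resolve_width and add_child are hypothetical names for this illustration, and only widths given in points or percentages are handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Area:
    width_pt: float


@dataclass
class FormattingObject:
    name: str
    width_spec: str                      # as written, e.g. "50%" or "400pt"
    area: Optional[Area] = None          # filled in as the object is added
    children: List["FormattingObject"] = field(default_factory=list)


def resolve_width(spec, reference_pt):
    """Resolve a width against the width of the parent's area."""
    if spec.endswith("%"):
        return float(spec[:-1]) / 100.0 * reference_pt
    if spec.endswith("pt"):
        return float(spec[:-2])
    raise ValueError(f"unsupported width: {spec!r}")


def add_child(parent, name, width_spec):
    """Add a formatting object and create its area in the same step."""
    child = FormattingObject(name, width_spec)
    # The parent's area is already known, so the percentage resolves now.
    child.area = Area(resolve_width(width_spec, parent.area.width_pt))
    parent.children.append(child)
    return child


body = FormattingObject("region-body", "400pt", Area(400.0))
container = add_child(body, "block-container", "50%")
block = add_child(container, "block", "25%")
print(container.area.width_pt, block.area.width_pt)
```

If the area tree were built only after the whole formatting object tree, the 25% on the inner block could not be resolved until a second pass.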

Area Tree Adjustment

This stage works on the area tree as a whole to optimize the arrangement of the areas.

Creating each page in isolation produces a workable result, but there can be dependencies between pages; for example, page number citations to other pages. In addition, producing quality pages (i.e., pages that look good) means, for example:

  • Balancing the amount of text on facing pages so the content of both pages extends the same distance down the page

  • Aligning the lines on facing pages and on back-to-back pages

  • Not splitting a block of text such that only one line appears before a page break or only one line appears after a break

  • Not ending a page on a hyphen
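The rule about single lines before or after a break can be sketched as a small decision function. This is a Python illustration under stated assumptions, not xmlroff's adjustment logic: the name choose_break is hypothetical, the block is measured in whole lines, and the defaults of 2 follow the initial values of the XSL orphans and widows properties.

```python
def choose_break(block_lines, lines_available, orphans=2, widows=2):
    """Return how many lines of a block to place on the current page.

    At least `orphans` lines must stay before the break and at least
    `widows` lines must follow it; otherwise the block is not split here.
    """
    if block_lines <= lines_available:
        return block_lines  # the whole block fits, no break needed

    # Keep enough lines back so that `widows` lines start the next page.
    keep = min(lines_available, block_lines - widows)
    if keep < orphans:
        return 0  # too few lines would remain here; move the block whole
    return keep


print(choose_break(10, 1))  # one line left on the page: move the block
print(choose_break(10, 9))  # would leave one line over: break one line earlier
print(choose_break(10, 5))  # an ordinary split
```

A caller would still need to handle the case where a block cannot be split at all and does not fit on an empty page.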

Area Tree to Output

This stage writes out the area tree in a format that can be used by other programs or sent to a printer. The initial output format is PDF.

