Role of XSL-FO XML
XSL-FO is a vocabulary for expressing the formatting of XML. It is defined in the XSL 1.1 Recommendation 1 from the W3C. XSL-FO is used for formatting XML, but it originally wasn’t meant to be written as XML. Your XML was meant to be transformed to XSL-FO inside the FO processor, as shown in Figure 1.
The first XSL 1.0 Working Draft 2 mentions the option of producing formatting objects as XML output, without mentioning it as a possible input:
When the result tree uses the formatting vocabulary, a conforming XSL implementation must be able to interpret the result tree according to the semantics of the formatting vocabulary as defined in this document; it may also be able to externalize the result tree as XML, but it is not required to be able to do so.
Section 1.1, Processing a Stylesheet, from the current XSL 1.1 Recommendation contains both:
An XSL stylesheet processor accepts a document or data in XML and an XSL stylesheet and produces the presentation of that XML source content that was intended by the designer of that stylesheet.
In some implementations of XSL/XSLT, the result of tree construction can be output as an XML document. This would allow an XML document which contains formatting objects and formatting properties to be output. This capability is neither necessary for an XSL processor nor is it encouraged.
The above definition of an XSL stylesheet processor as accepting XML and XSL first appeared in the 1 March 2000 XSL 1.0 Working Draft 3, which was the first XSL 1.0 WD for a while to have this sort of introduction and which also postdates XSLT 1.0 4. In contrast, the combined acknowledgement and discouragement of outputting the result tree as XML first appeared two WDs later, in the 18 October 2000 Working Draft 5. In a similar fashion, the first version of the DTD for XSL tests 6 produced by the XSL FO Subgroup on 10 January 2001 did not support using FO files as input, and that was later allowed in an update on 4 April 2001.
Another indication that you were not meant to externalize XSL-FO as XML is that the XSL-FO vocabulary has no mechanism to indicate the version number of the XSL-FO document. This means, in part, that there’s no way to indicate whether the XSL-FO document uses features added in XSL 1.1. This isn’t a problem when the transformation happens inside the XSL processor, but it could be if the transformation was not intimately tied to the XSL formatter that does the presentation of that XML source content.
At this point, it’s not possible to say how much of the de-emphasis of producing XSL-FO XML was from adherence to the processing model of the original XSL submission 7, which makes no mention of XML output, from a general acknowledgement that translation to the formatting object semantics loses the original semantics, or from the lingering shadow of the “Formatting Objects considered harmful” 8 essay[1] by Håkon W Lie, which caused some discussion when it was published in April 1999.
In practice, however, the transformation part was broken out as XSLT, and now most people do the transformation as a separate step using an XSLT processor, and every FO processor that I know of accepts XSL-FO as input. The two benefits of doing the transformation separately are that you can use the XSLT processor of your choice, possibly with your own extensions, and that you can debug the XSLT.
There are also times when you want to write your own XSL-FO, or edit an existing XSL-FO file, particularly when you want to experiment with XSL-FO markup to see what it produces. Most FO processors operate in batch mode either from the command line or through an API. oXygen ships with the FOP processor and it makes it easy to use the Antenna House and RenderX formatters.
Why make a schema?
XSL-FO documents are typically generated as the result of an XSLT transformation and are rarely edited by hand. However, validating generated XSL-FO markup is useful as a check of the correctness of the transformation. Also, people do edit XSL-FO by hand either when prototyping the XSL-FO markup that will later be generated using XSLT or when debugging generated XSL-FO. Being able to validate the XSL-FO in an XML editor helps in both scenarios.
Validating XSL-FO is not easy because:
-
Constraints in the definitions of FOs are hard or impossible to express in structure-checking schema languages.
-
Some FOs can appear almost anywhere in an XSL-FO document but, equally, cannot appear where they are not allowed.
-
The properties of an FO are expressed in the XML as attributes of the XML element representing the FO, but inherited properties 11 are allowed to appear on any FO, not just on the FOs for which they are defined.
-
While the XSL 1.1 Recommendation 12 defines the allowed values of properties, most properties can contain expressions in the expression language 13 that is defined in the spec, so determining the correctness of an attribute in the XML initially requires evaluating it. Figure 2 shows part of the EBNF for the XSL-FO expression language.
A schema for XSL-FO was in the requirements for XSL 2.0 14, but the design of XSL was shaped by the requirements of formatting rather than any requirement to conform to a schema language. The result has been that XSL-FO was hard to validate except by running it through an FO formatter. Systems for checking the formatted result exist 15, 16, but they require usable input.
Schemas for XSL-FO do exist, including several from RenderX 17 and the schema that is provided by the oXygen XML Editor 18 prior to oXygen 17.1, but: they do not cover XSL 1.1; they each cut corners in their models for element content; and they do not properly evaluate property value expressions. An added problem that we experienced more often than most people is that none of the existing schemas covered the extensions built into the Antenna House formatter.
One of the validation methods tried by RenderX but noted as no longer used 17 is a validator written in XSLT 19. A 2004 paper 20 by Alexander Peshkov of RenderX describes the XSLT approach as powerful but requiring more resources than, for example, DTD validation and also not being suitable for visual XSL-FO editors or document builders. That paper then describes a Relax NG schema that includes a limited ability to handle property value expressions.
The approach taken by Antenna House combines Relax NG and Schematron for detailed validation of the XSL-FO. The Relax NG handles structural validation and is, we believe, more correct than pre-existing schemas. The Schematron handles the additional constraints that cannot be expressed in Relax NG. The Schematron parses property value expressions using an XSLT-based parser generated by the REx parser generator 21 plus an XSLT library for reducing the parse tree to XSL-FO datatypes.
The Relax NG and Schematron are available on GitHub (https://github.com/AntennaHouse/focheck) and you can download an oXygen add-on framework for XSL-FO validation directly from the GitHub page 22. An earlier focheck version was bundled with oXygen 17.1 and 18.0.
We also considered wiring the Schematron directly to the expression parser built into an FO formatter through XSLT extension functions. However, doing the interfacing would have been a non-trivial task, plus the Antenna House AH Formatter 23 is a native application on each platform and wouldn’t be as portable as purely-XSLT Schematron.
Why Relax NG?
Three features of Relax NG made it the best choice for the schema:
-
Non-deterministic content models
-
Easy extensibility by redefining or extending patterns
-
Ability to interleave elements in content models
Why Relax NG Compact Syntax?
The schema is written in Relax NG compact syntax (and usable with Emacs) and then converted into the Relax NG XML syntax for use with oXygen (and into W3C XML Schema for use with other editors). It is not written directly in the XML syntax for multiple reasons:
-
Relax NG compact syntax closely matches the syntax of the content models in the spec, which made it easier to include their text in the generated schema. If the Relax NG XML syntax had been used, we would have had to replicate part of the work of the Relax NG validator and parse the delimiters and occurrence indicators when converting the textual content models into XML syntax.
-
It was easy to write and check the initial patterns that would be replicated by the programmatically generated schema.
-
Reading the generated schema to check it is easier with the compact syntax than with the XML syntax.
-
The handwritten parts, including the schema module defining Antenna House extensions, were only ever going to be written in the compact syntax.
Generating the Relax NG and Schematron
The bulk of the FO portion of the Relax NG and Schematron is generated by transforming the XML source 24 for the XSL 1.1 Recommendation using XSLT. The XML is consistent enough for this to be feasible: it’s not the first time that I’ve generated code from the XML, nor am I the only person to have done it 25.
Validating FOs
At first glance, this seems quite straightforward to do using Relax NG: the content models are in the spec, where every FO is in a separate div3 element and the FO’s content model is easy to identify:
<div3 id="fo_block"><head>fo:block</head>
<p><emph>Common Usage:</emph></p>
<p>The fo:block formatting object is commonly used for formatting paragraphs, titles, headlines, figure and table captions, etc.</p>
...
<p><emph>Contents:</emph></p>
<eg xml:space="preserve">
(#PCDATA|<loc href="#inline.fo.list" xlink:type="simple"
  xlink:show="replace" xlink:actuate="onRequest"
  xmlns:xlink="http://www.w3.org/1999/xlink">%inline;</loc>|<loc href="#block.fo.list" xlink:type="simple"
  xlink:show="replace" xlink:actuate="onRequest"
  xmlns:xlink="http://www.w3.org/1999/xlink">%block;</loc>)*
</eg>
<p>In addition this formatting object may have a sequence of zero or more fo:markers as its initial children, optionally followed by an fo:initial-property-set. </p>
The %inline; and %block; behave like parameter entities in a DTD, though there isn’t a DTD, and their expansions are given in the text of the Recommendation 26:
The parameter entity, "%block;" in the content models below, contains the following formatting objects:
block block-container table-and-caption table list-block
The parameter entity, "%inline;" in the content models below, contains the following formatting objects:
bidi-override character external-graphic instream-foreign-object inline inline-container leader page-number page-number-citation page-number-citation-last scaling-value-citation basic-link multi-toggle index-page-citation-list
So far, so good; the corresponding Relax NG pattern generated for fo:block looks like:
fo_block.model = (text|inline.fo.list|block.fo.list)*
where inline.fo.list and block.fo.list are defined in literal text that is included in the generated schema.
However, the XSL 1.1 Recommendation defines neutral and out-of-line classes of FOs that can appear anywhere where #PCDATA, %inline; or %block; is allowed in FO content models (although additional constraints apply). Handling those simply required matching on #PCDATA, %inline;, or %block; in the content models in the spec. The generated pattern for fo:block then becomes:
fo_block.model = ( ( (text|inline.fo.list|block.fo.list)* & (inline.out-of-line.fo.list)* ) & (neutral.fo.list)* )
The neutral and out-of-line FO classes were also in XSL 1.0. XSL 1.1 added fo:change-bar-begin and fo:change-bar-end as point FOs that may be used anywhere as a descendant of fo:flow or fo:static-content 26. Since that couldn’t be handled by just looking at either the FO or its content model, the XSLT contains a list of FOs to which to not add the point FOs:
<xsl:variable name="no-point-fos"
              select="'root layout-master-set declarations bookmark-tree
                       page-sequence page-sequence-wrapper color-profile title
                       folio-prefix folio-suffix simple-page-master
                       page-sequence-master flow-map
                       single-page-master-reference
                       repeatable-page-master-reference
                       repeatable-page-master-alternatives
                       conditional-page-master-reference region-body
                       region-before region-after region-start region-end
                       flow-assignment flow-source-list flow-target-list
                       flow-name-specifier region-name-specifier'"
              as="xs:string" />
<xsl:variable name="no-point-fo-list"
              select="tokenize($no-point-fos, '\s+')"
              as="xs:string+" />
such that every FO not in the list will allow fo:change-bar-begin and fo:change-bar-end, so the model for fo:block becomes:
fo_block.model = ( ( (text|inline.fo.list|block.fo.list)* & (inline.out-of-line.fo.list)* ) & (neutral.fo.list)* & (point.fo.list)* )
But there are also the additional constraints about allowing fo:marker and fo:initial-property-set as initial children of an fo:block. This is handled by adding those elements to content models only where the significant “zero or more fo:markers” or “optionally followed by an fo:initial-property-set” text occurs in the FO’s definition. The complete, and completely auto-generated, model for fo:block is:
fo_block.model = fo_marker*, fo_initial-property-set?, ( ( (text|inline.fo.list|block.fo.list)* & (inline.out-of-line.fo.list)* ) & (neutral.fo.list)* & (point.fo.list)* )
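The generation steps described above for fo:block can be sketched in a few lines of Python. This is only an illustration: the real generator is XSLT run over the XML source of the spec. The function name content_model, the regular expression, and the abbreviated NO_POINT_FOS set are assumptions made for this example. The rule that an EMPTY content model gets no point FOs is inferred from the generated patterns, not taken from the stylesheet.

```python
import re

# Abbreviated, illustrative stand-in for the $no-point-fos list in the XSLT.
NO_POINT_FOS = {"root", "layout-master-set", "declarations", "page-sequence"}

# Tokens that can occur in a content model as written in the spec.
TOKEN = re.compile(r"#PCDATA|EMPTY|%[a-z]+;|[a-z][a-z-]*")
# Tokens whose presence means neutral and out-of-line FOs are also allowed.
MIXED = {"#PCDATA", "%inline;", "%block;"}


def content_model(fo, spec_model, definition_text=""):
    """Return a Relax NG compact-syntax pattern for one FO (hypothetical helper)."""
    found = set()

    def rename(match):
        token = match.group(0)
        found.add(token)
        if token == "#PCDATA":
            return "text"
        if token == "EMPTY":
            return "empty"
        if token.startswith("%"):
            return token[1:-1] + ".fo.list"  # %block; -> block.fo.list
        return "fo_" + token                 # footnote-body -> fo_footnote-body

    model = TOKEN.sub(rename, spec_model.strip())
    parts = [model]
    if found & MIXED:
        parts = [f"( {model} & (inline.out-of-line.fo.list)* )",
                 "(neutral.fo.list)*"]
    # Assumption: empty FOs and the FOs in the list get no point FOs.
    if model != "empty" and fo not in NO_POINT_FOS:
        parts.append("(point.fo.list)*")

    # Initial children are added only when the definition text mentions them.
    prefix = ""
    if "zero or more fo:markers" in definition_text:
        prefix += "fo_marker*, "
    if "optionally followed by an fo:initial-property-set" in definition_text:
        prefix += "fo_initial-property-set?, "
    return f"fo_{fo}.model = {prefix}( {' & '.join(parts)} )"


BLOCK_TEXT = ("In addition this formatting object may have a sequence of zero or "
              "more fo:markers as its initial children, optionally followed by an "
              "fo:initial-property-set.")
print(content_model("block", "(#PCDATA|%inline;|%block;)*", BLOCK_TEXT))
```

Matching on the wording of the definition text is fragile, but it works here because the XML source of the spec phrases these additions consistently.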
fo:block is actually a quite straightforward FO to validate. fo:footnote 27, for example, would appear to be even easier, since its content model is:
(inline,footnote-body)
The neutral and out-of-line FOs don’t apply to fo:footnote, but the point FOs do, so the generated model is:
fo_footnote.model = ( (fo_inline,fo_footnote-body) & (point.fo.list)* )
If only it was that simple. There are additional constraints in the text of the XSL 1.1 Recommendation:
It is an error if the fo:footnote occurs as a descendant of a flow that is not assigned to one or more region-body regions, or of an fo:block-container that generates absolutely positioned areas...
...
An fo:footnote is not permitted to have an fo:float, fo:footnote, or fo:marker as a descendant.
Additionally, an fo:footnote is not permitted to have as a descendant an fo:block-container that generates an absolutely positioned area.
From its content model, fo:retrieve-table-marker 28 (added in XSL 1.1) would appear to be even simpler:
EMPTY
producing:
fo_retrieve-table-marker.model = ( empty )
but it has its own constraints:
An fo:retrieve-table-marker is only permitted as the descendant of an fo:table-header or fo:table-footer or as a child of fo:table in a position where fo:table-header or fo:table-footer is permitted.
These are the sorts of constraints that can’t be expressed in Relax NG (except by exploding the size of the schema through making separate versions of every FO that can appear in each constrained context) but that are well suited to Schematron. There aren’t enough of these constraints that are expressed in a consistent way for it to be worthwhile autogenerating them, so they have to be written by hand. For example, this is the fo:retrieve-table-marker constraint as a Schematron rule:
<rule context="fo:retrieve-table-marker">
  <assert test="exists(ancestor::fo:table-header) or
                exists(ancestor::fo:table-footer) or
                (exists(parent::fo:table) and
                 empty(preceding-sibling::fo:table-body) and
                 empty(following-sibling::fo:table-column))"
    >An fo:retrieve-table-marker is only permitted as the descendant of an fo:table-header or fo:table-footer or as a child of fo:table in a position where fo:table-header or fo:table-footer is permitted.</assert>
</rule>
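To make the logic of that assertion easier to follow, here is the same test restated as a small Python function over an ElementTree document. It is a sketch for illustration only, not part of focheck: the function name misplaced_markers and the sample table are made up, and because ElementTree has no ancestor axis the sketch builds its own child-to-parent map.

```python
import xml.etree.ElementTree as ET

FO = "{http://www.w3.org/1999/XSL/Format}"


def misplaced_markers(root):
    """Return fo:retrieve-table-marker elements that break the constraint."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    misplaced = []
    for marker in root.iter(FO + "retrieve-table-marker"):
        # Collect the tags of all ancestors (the XPath ancestor:: axis).
        ancestors = set()
        node = marker
        while node in parent:
            node = parent[node]
            ancestors.add(node.tag)
        in_header_or_footer = bool(
            ancestors & {FO + "table-header", FO + "table-footer"})
        # Child of fo:table with no fo:table-body before it and no
        # fo:table-column after it, as in the Schematron test.
        allowed_child = False
        p = parent.get(marker)
        if p is not None and p.tag == FO + "table":
            siblings = list(p)
            i = siblings.index(marker)
            allowed_child = (
                all(s.tag != FO + "table-body" for s in siblings[:i])
                and all(s.tag != FO + "table-column" for s in siblings[i + 1:]))
        if not (in_header_or_footer or allowed_child):
            misplaced.append(marker)
    return misplaced


# Made-up sample: one marker inside the header, one inside the body.
DOC = ET.fromstring(
    '<fo:table xmlns:fo="http://www.w3.org/1999/XSL/Format">'
    '<fo:table-header><fo:table-row><fo:table-cell>'
    '<fo:retrieve-table-marker retrieve-class-name="ok"/>'
    '</fo:table-cell></fo:table-row></fo:table-header>'
    '<fo:table-body><fo:table-row><fo:table-cell>'
    '<fo:retrieve-table-marker retrieve-class-name="bad"/>'
    '</fo:table-cell></fo:table-row></fo:table-body>'
    '</fo:table>')
names = [m.get("retrieve-class-name") for m in misplaced_markers(DOC)]
print(names)
```

Only the marker inside fo:table-body is reported, which is what the Schematron assertion would flag.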
Validating properties
Generating Relax NG patterns for the properties is straightforward. The XML for each FO includes a list of its allowed properties or groups of properties. For example, for fo:footnote 27:
<p><emph>The following properties apply to this formatting object:</emph></p>
<slist>
  <sitem><specref ref="common-accessibility-properties"/></sitem>
  <sitem><specref ref="id"/></sitem>
  <sitem><specref ref="index-class"/></sitem>
  <sitem><specref ref="index-key"/></sitem>
</slist>
Here, the specref/@ref refers either to a div2 containing the div3 for multiple properties or to a div3 for a property. Each div2 generates a named pattern, so the pattern for the properties of fo:footnote is:
fo_footnote.attlist = common-accessibility-properties, id, index-class, index-key
where common-accessibility-properties is:
common-accessibility-properties = source-document, role
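The mapping from the spec’s property list to the attlist pattern is mechanical, as this Python sketch shows. The real generator is XSLT; the helper name attlist_pattern and the trimmed copy of the fo:footnote list are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the property list for fo:footnote from the spec XML.
FOOTNOTE_PROPERTIES = """<slist>
<sitem><specref ref="common-accessibility-properties"/></sitem>
<sitem><specref ref="id"/></sitem>
<sitem><specref ref="index-class"/></sitem>
<sitem><specref ref="index-key"/></sitem>
</slist>"""


def attlist_pattern(fo, slist_xml):
    """Build the attlist pattern for one FO from its <slist> (hypothetical helper)."""
    # Every specref/@ref names either a property or a group of properties,
    # and both kinds are generated as named patterns.
    refs = [ref.get("ref") for ref in ET.fromstring(slist_xml).iter("specref")]
    return f"fo_{fo}.attlist = {', '.join(refs)}"


print(attlist_pattern("footnote", FOOTNOTE_PROPERTIES))
```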
Because, as stated previously, the properties are evaluated as expressions, each property is generated in the Relax NG as containing only text. For example, for the column-count property:
column-count =
  ## <number> | inherit
  attribute column-count { text }?
where ## begins an annotation that is the property’s allowed value as extracted from the XML for the XSL 1.1 spec (and, similarly, annotations for FOs are also extracted from the spec). The annotations appear in oXygen as tool-tips, as shown in Figure 3.
Whether or not a particular property is required for an FO is not easy to automatically determine from the XML for the XSL 1.1 spec, so that is enforced by the Schematron, not by the Relax NG.
Some property values are described in terms of compound datatypes 29, which are expressed in the XML as multiple attributes. For example, the value of the “space-before” property is the compound <space> datatype. A space-before may be specified as:
space-before.minimum="2.0pt"
space-before.optimum="3.0pt"
space-before.maximum="4.0pt"
space-before.precedence="0"
space-before.conditionality="discard"
In the Relax NG, the properties that may have a value that is a compound datatype each generate multiple attribute definitions. For example:
space-before =
  ## <space> | inherit
  attribute space-before { text }?,
  attribute space-before.minimum { text }?,
  attribute space-before.optimum { text }?,
  attribute space-before.maximum { text }?,
  attribute space-before.precedence { text }?,
  attribute space-before.conditionality { text }?
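The expansion of a compound datatype into several attribute definitions can be sketched as follows. This is an illustration in Python, not the XSLT that generates the schema; the helper name property_pattern and the exact line layout of the output are assumptions, while the component names are those of the <space> datatype used above.

```python
# Components of the compound <space> datatype.
SPACE_COMPONENTS = ["minimum", "optimum", "maximum", "precedence", "conditionality"]


def property_pattern(name, value, components=()):
    """Return a Relax NG compact pattern for one property (hypothetical helper)."""
    # The property itself, then one optional attribute per component.
    attributes = [f"attribute {name} {{ text }}?"]
    attributes += [f"attribute {name}.{c} {{ text }}?" for c in components]
    # The '##' line is the documentation annotation holding the allowed value.
    return f"{name} =\n  ## {value}\n  " + ",\n  ".join(attributes)


print(property_pattern("space-before", "<space> | inherit", SPACE_COMPONENTS))
```

Calling the helper without components gives the single-attribute form used for simple properties such as column-count.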
As stated previously, property values are evaluated using a parser generated by the REx parser generator 21. The productions in the XSL 1.1 spec 13 were mostly suitable for feeding to REx, although to get a functioning parser required a lot of modifications based on the example of the grammar for XPath 2.0 30 that is provided on the REx website.
Running the parser on a property value expression produces markup corresponding to the productions in the grammar. For example, for -1 - -2, the expression parser produces:
<Expression>
  <Expr>
    <AdditiveExpr>
      <MultiplicativeExpr>
        <UnaryExpr>
          <TOKEN>-</TOKEN>
          <UnaryExpr>
            <PrimaryExpr>
              <Numeric>
                <AbsoluteNumeric>
                  <AbsoluteLength>
                    <Number>1</Number>
                  </AbsoluteLength>
                </AbsoluteNumeric>
              </Numeric>
            </PrimaryExpr>
          </UnaryExpr>
        </UnaryExpr>
      </MultiplicativeExpr>
      <TOKEN>-</TOKEN>
      <MultiplicativeExpr>
        <UnaryExpr>
          <TOKEN>-</TOKEN>
          <UnaryExpr>
            <PrimaryExpr>
              <Numeric>
                <AbsoluteNumeric>
                  <AbsoluteLength>
                    <Number>2</Number>
                  </AbsoluteLength>
                </AbsoluteNumeric>
              </Numeric>
            </PrimaryExpr>
          </UnaryExpr>
        </UnaryExpr>
      </MultiplicativeExpr>
    </AdditiveExpr>
  </Expr>
  <EOF/>
</Expression>
This, obviously, has little resemblance to an XSL-FO datatype. The Schematron uses a handwritten parser-runner.xsl library that runs the expression parser and (mostly) reduces the elements for the grammar productions into elements representing XSL-FO datatypes. For example, this is the current implementation of the function for the AdditiveExpr element:
<xsl:function name="ahf:AdditiveExpr" as="element()">
  <xsl:param name="parse-tree" as="element()*" />

  <xsl:variable name="term1" as="element()"
                select="ahf:reduce-tree($parse-tree/MultiplicativeExpr[1])" />

  <xsl:choose>
    <xsl:when test="count($parse-tree/MultiplicativeExpr) = 1">
      <xsl:sequence select="$term1" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:sequence
          select="ahf:nextAdditiveExpr($term1,
                    $parse-tree/MultiplicativeExpr[position() > 1])" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

<xsl:function name="ahf:nextAdditiveExpr" as="element()">
  <xsl:param name="term1" as="element()" />
  <xsl:param name="otherTerms" as="element()+" />

  <xsl:variable name="term2"
                select="ahf:reduce-tree($otherTerms[1])"
                as="element()" />
  <xsl:variable name="token"
                select="$otherTerms[1]/preceding-sibling::TOKEN[1]"
                as="xs:string" />
  <xsl:variable name="result" as="element()">
    <xsl:choose>
      <xsl:when test="$term1 instance of element(Object) or
                      $term2 instance of element(Object)">
        <Object />
      </xsl:when>
      <xsl:when test="$term1 instance of element(Length) and
                      $term2 instance of element(Length)">
        <Length />
      </xsl:when>
      <xsl:when test="$term1 instance of element(Number) and
                      $term2 instance of element(Number)">
        <Number>
          <xsl:if test="$term1/@value and $term2/@value">
            <xsl:variable name="value"
                          select="if ($token eq '+')
                                    then $term1/@value + $term2/@value
                                  else $term1/@value - $term2/@value"
                          as="xs:double" />
            <xsl:sequence select="ahf:number-attrs($value)" />
          </xsl:if>
        </Number>
      </xsl:when>
      <xsl:otherwise>
        <ERROR>Not coping.</ERROR>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <xsl:choose>
    <xsl:when test="count($otherTerms) = 1">
      <xsl:sequence select="$result" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:sequence
          select="ahf:nextAdditiveExpr($result,
                    $otherTerms[position() > 1])" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>
The result returned from the parser-runner.xsl for -1 - -2 is:
<Number value="1" is-positive="yes" is-zero="no"/>
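The recursion in ahf:nextAdditiveExpr is a left fold over the terms, combining the reduced datatype of the running result with that of the next term. The Python sketch below mirrors that logic with (kind, value) tuples in place of elements. The function names additive and fold and the tuple representation are assumptions for illustration; the sketch deliberately ignores attributes such as is-positive and is-zero.

```python
def additive(term1, token, term2):
    """Combine two reduced terms, each a (kind, value-or-None) tuple."""
    kinds = {term1[0], term2[0]}
    if "Object" in kinds:
        return ("Object", None)
    if kinds == {"Length"}:
        return ("Length", None)
    if kinds == {"Number"}:
        # A Number without a value stays a Number, but its value is unknown.
        if term1[1] is None or term2[1] is None:
            return ("Number", None)
        value = term1[1] + term2[1] if token == "+" else term1[1] - term2[1]
        return ("Number", value)
    return ("ERROR", "Not coping.")


def fold(terms, tokens):
    """Reduce term1 (token term)* from left to right."""
    result = terms[0]
    for token, term in zip(tokens, terms[1:]):
        result = additive(result, token, term)
    return result


# -1 - -2: the unary minus has already been applied to each term.
print(fold([("Number", -1.0), ("Number", -2.0)], ["-"]))
```

As in the XSLT, mixing a Length with a Number falls through to the error case; the sketch does not try to be cleverer than the original.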
Expression evaluation is used in three Schematron phases:
-
Automatically generated Schematron rules that report syntax errors and incorrect datatypes.
-
Handwritten Schematron rules for the additional constraints in the XSL 1.1 Recommendation.
-
Handwritten Schematron rules for Antenna House extensions.
For example, the value of the column-count property is defined as <number> | inherit, but the definition of <number> for column-count is:
<number>
A positive integer. If a non-positive or non-integer value is provided, the value will be rounded to the nearest integer value greater than or equal to 1.
The automatically generated rule for column-count is:
<rule context="fo:*/@column-count">
  <let name="expression" value="ahf:parser-runner(.)"/>
  <assert test="local-name($expression) = ('Number', 'EnumerationToken', 'ERROR', 'Object')"> 'column-count' should be Number, EnumerationToken. '<value-of select="."/>' is a <value-of select="local-name($expression)"/>.</assert>
  <report test="$expression instance of element(EnumerationToken) and not($expression/@token = ('inherit'))"> Enumeration token is: '<value-of select="$expression/@token"/>'. Token should be 'inherit'.</report>
  <report test="local-name($expression) = 'EMPTY'" role="Warning">column-count="" should be Number or 'inherit'.</report>
  <report test="local-name($expression) = 'ERROR'">Syntax error: 'column-count="<value-of select="."/>"'</report>
</rule>
and the handwritten rule is:
<rule context="fo:*/@column-count" role="column-count">
  <let name="expression" value="ahf:parser-runner(.)"/>
  <report test="local-name($expression) = 'Number' and
                (exists($expression/@is-positive) and $expression/@is-positive eq 'no' or
                 $expression/@is-zero = 'yes' or
                 exists($expression/@value) and not($expression/@value castable as xs:integer))"
          role="column-count">Warning: @column-count should be a positive integer. A non-positive or non-integer value will be rounded to the nearest integer value greater than or equal to 1.</report>
  <sqf:fix id="column-count-fix">
    <sqf:description>
      <sqf:title>Change the @column-count value</sqf:title>
    </sqf:description>
    <sqf:replace node-type="attribute" target="column-count" select="max((1, round(.)))"/>
  </sqf:fix>
</rule>
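Stripped of the Schematron machinery, the handwritten check and its quick fix amount to the following, shown here as a Python sketch over an already-evaluated number. The function names are hypothetical, and the sketch is a simplification: the real rule works on the element returned by ahf:parser-runner, not on a Python float. Note that XPath round() rounds halves towards positive infinity whereas Python’s round() rounds halves to even, so the sketch uses floor(x + 0.5) to match the select expression of the sqf:replace.

```python
import math


def column_count_warning(value):
    """True when a numeric column-count is not a positive integer."""
    return value <= 0 or value != int(value)


def column_count_fix(value):
    """Mirror max((1, round(.))) using XPath-style rounding of halves."""
    return max(1, math.floor(value + 0.5))


for value in (-1, 2.5, 3):
    print(value, column_count_warning(value), column_count_fix(value))
```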
Note that the expression evaluation stops short of evaluating the inherited value. Also, a <Number> might not have a @value; for example, if it is the result of ‘evaluating’ an XSL-FO function that isn’t fully implemented in parser-runner.xsl.
The sqf:fix element is a Schematron QuickFix 31. Schematron QuickFix is an extension of ISO standard Schematron originally by Nico Kutscherauer that is now being developed by a W3C Community Group and is implemented by oXygen. It allows a Schematron processor to fix the errors that it finds or, in an editor such as oXygen, allows the editor to offer possible fixes, as shown in Figure 4.
Some of the property value definitions that are shared with CSS need to be expanded 32 into multiple enumeration tokens or XSL-FO datatypes before generating the Schematron for checking a property’s value. For example, the value of the border-start-width 33 property is defined as:
<border-width> | <length-conditional> | inherit
but <border-width> is considered a ‘notational shorthand’ in XSL 1.1 32, so the value to be checked for expands to:
thin | medium | thick | <length> | <length-conditional> | inherit
although only the presence or absence of attributes for border-start-width.length or border-start-width.conditionality would determine whether a length value for border-start-width is a <length> or a <length-conditional>.
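The expansion of a notational shorthand in a value definition is a simple substitution, sketched here in Python. The SHORTHANDS table holds only the one entry needed for this example, and both it and the function name expand are assumptions for illustration; the real expansion happens in the XSLT that generates the Schematron.

```python
# Only the shorthand used in the example; illustrative, not exhaustive.
SHORTHANDS = {"<border-width>": ["thin", "medium", "thick", "<length>"]}


def expand(value_definition):
    """Replace each notational shorthand with its alternatives."""
    expanded = []
    for alternative in (a.strip() for a in value_definition.split("|")):
        expanded.extend(SHORTHANDS.get(alternative, [alternative]))
    return " | ".join(expanded)


print(expand("<border-width> | <length-conditional> | inherit"))
```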
Antenna House extensions
Antenna House AH Formatter 34 implements a number of extensions 35, 36, 37 to the XSL 1.1 Recommendation to provide improvements to the formatted output. Validation of AH Formatter extensions is also implemented using a combination of Relax NG and Schematron.
The documentation for the AH Formatter extensions is in XML, as you would expect. However, it’s not in a format that is useful for automating the connections between extensions and applicable FOs and properties, so the Relax NG and Schematron both needed to be handwritten.
The modules for the AH Formatter extensions use the Relax NG include pattern to include the schema for XSL-FO and merge it with the definitions of the extensions. The shortened schema module below demonstrates this:
default namespace axf = "http://www.antennahouse.com/names/XSL/Extensions"
namespace fo = "http://www.w3.org/1999/XSL/Format"

include "fo.rnc" {
  # http://www.antennahouse.com/product/ahf60/docs/
  #ahf-ext.html#fo.change-bar-begin
  fo_change-bar-begin.model =
    ( fo_float? )

  # http://www.antennahouse.com/product/ahf60/docs/
  #ahf-ext.html#axf.document-info
  fo_root.model =
    ( (axf_document-info*,
       fo_layout-master-set,
       axf_document-info*,
       fo_declarations?,
       axf_document-info*,
       fo_bookmark-tree?,
       axf_document-info*,
       (fo_page-sequence|
        fo_page-sequence-wrapper)+) )
}

# http://www.antennahouse.com/product/ahf60/docs/
#ahf-ext.html#axf.document-info
axf_document-info =
  element axf:document-info {
    attribute name {
      ('document-title' | 'subject' | 'author' | 'author-title' |
       'description-writer' | 'keywords' | 'copyright-status' |
       'copyright-notice' | 'copyright-info-url' | 'xmp' | 'pagemode' |
       'pagelayout' | 'hidetoolbar' | 'hidemenubar' | 'hidewindowui' |
       'fitwindow' | 'centerwindow' | 'displaydoctitle' | 'openaction' )},
    attribute value { text },
    empty
  }

common-border-padding-and-background-properties &=
  axf_border-radius,
  axf_border-top-right-radius

axf_border-radius =
  attribute axf:border-radius { text }?
axf_border-top-right-radius =
  attribute axf:border-top-right-radius { text }?
This module includes fo.rnc. The definitions of fo_change-bar-begin.model and fo_root.model redefine and override the corresponding definitions in fo.rnc. Conversely, the definition of common-border-padding-and-background-properties that is outside the include interleaves the axf_border-radius and axf_border-top-right-radius patterns with the existing common-border-padding-and-background-properties defined in fo.rnc to add additional optional attributes to any FO defined by the XSL 1.1 spec to already have the common border, padding, and background properties.
The definitions of axf_document-info, axf_border-radius, and axf_border-top-right-radius have to be outside the include pattern. It would be an error to put any of them inside the include since there are no corresponding definitions in fo.rnc that they would override.
MathML and SVG
MathML and SVG are the two formats most commonly used in fo:instream-foreign-object, so their schemas have been included in the focheck schema using 'external' references. The new top-level schema file, axf-mathml3-svg11.rnc, contains:
default namespace axf = "http://www.antennahouse.com/names/XSL/Extensions"
namespace fo = "http://www.w3.org/1999/XSL/Format"
namespace m = "http://www.w3.org/1998/Math/MathML"
namespace svg = "http://www.w3.org/2000/svg"
namespace local = ""

# MathML 3.0 schema with Antenna House overrides.
math = external "axf-mathml3-override.rnc"

# SVG 1.1 schema with Antenna House overrides.
svg = external "axf-svg11-override.rnc"

include "axf.rnc" {
  # For fo:instream-foreign-object
  # Exclude axf:* elements and attributes from counting as 'non-xsl'.
  # Allow MathML m:math and svg:svg
  non-xsl =
    ( attribute * - ( local:* | axf:* | xml:* ) { text }*,
      ( ( element * - ( local:* | fo:* | axf:* | m:* | svg:* ) {
            attribute * - (id | ref-id | internal-destination ) { text }*,
            anything }* ) |
        math |
        svg) )

  # For fo:instream-foreign-object
  anything =
    ( element * -( fo:* ) {
        attribute * - (id | ref-id | internal-destination ) { text }*,
        anything } |
      text )*
}
This module overrides the non-xsl and anything patterns from several layers down in fo.rnc to include the elements from the 'start' patterns of the external MathML 3.0 and SVG 1.1 grammars and to also exclude XSL-FO elements and Antenna House extension elements and attributes from appearing in or on non-XSL-FO elements.
The standard MathML and SVG schemas need to be modified, however. For MathML, all XSL-FO, Antenna House, and SVG elements are additionally excluded from the content of the MathML annotation-xml element:
# MathML 3.0
include "mathml3.rnc" {
  anyElement =
    element (* - (m:* | axf:* | fo:* | svg:*)) {
      (attribute * {text}|text| anyElement)*
    }
}
and for SVG 1.1, the svg/@version attribute definition is relaxed so that SVG 1.0 documents are also valid and, because the same SVG document may be included multiple times in one XSL-FO document, the svg/@id attribute is made an NMTOKEN to avoid ID value clashes:
# SVG 1.1
include "svg11-flat-20110816.rnc" {
  # Repeat entire 'attlist.svg' pattern just to override @version.
  attlist.svg &=
    SVG.xmlns.attrib,
    SVG.Core.attrib,
    SVG.Conditional.attrib,
    SVG.Style.attrib,
    SVG.Presentation.attrib,
    SVG.DocumentEvents.attrib,
    SVG.GraphicalEvents.attrib,
    SVG.External.attrib,
    attribute x { Coordinate.datatype }?,
    attribute y { Coordinate.datatype }?,
    attribute width { Length.datatype }?,
    attribute height { Length.datatype }?,
    attribute viewBox { ViewBoxSpec.datatype }?,
    [ a:defaultValue = "xMidYMid meet" ]
    attribute preserveAspectRatio { PreserveAspectRatioSpec.datatype }?,
    [ a:defaultValue = "magnify" ]
    attribute zoomAndPan { "disable" | "magnify" }?,
    [ a:defaultValue = "1.1" ] attribute version { string }?,
    attribute baseProfile { Text.datatype }?,
    [ a:defaultValue = "application/ecmascript" ]
    attribute contentScriptType { ContentType.datatype }?,
    [ a:defaultValue = "text/css" ]
    attribute contentStyleType { ContentType.datatype }?
  # end of SVG.svg.attlist

  # Don't treat SVG @id as an ID since may be repeated in multiple SVGs
  # in one FO document.
  SVG.id.attrib = attribute id { xsd:NMTOKEN }?
}
Putting it all together – the onion and the string
As shown in Figure 5, the Relax NG schema resembles an onion: the outer layer is axf-mathml3-svg11.rnc, which brings in the external MathML and SVG schemas. The first substantive layer is axf.rnc with the definitions and redefinitions for the Antenna House extensions. The next layer, which is included by axf.rnc, is the auto-generated definitions that interleave the inheritable Antenna House extension properties with the properties that are defined for each FO. That layer then includes the auto-generated module with definitions for the XSL 1.1 inherited properties, which in turn includes the inner layer that is the auto-generated definitions for the FOs and their properties.
The Relax NG compact syntax schema is also converted into Relax NG XML syntax for use with oXygen (since oXygen does not extract documentation annotations from a compact syntax schema) and into W3C XML Schema for use with other editors. As noted previously, the annotations in the schema, which were extracted from the XML for the XSL 1.1 spec, are presented as tool-tips when editing an FO document with oXygen.
The Schematron is written as multiple phases strung together. With a Schematron implementation that supports progressive validation by executing each phase in order of its appearance, this will lead to progressively more refined error checking. The phases are:
- Handwritten rules for FO constraints that aren’t captured by the Relax NG.
- Autogenerated rules for checking property values for syntax errors and correct datatypes.
- Handwritten rules for extra constraints on property values, such as the rule that column-count should be a positive integer.
- Handwritten rules for the Antenna House extensions.
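As a rough illustration of how phases like those listed above might be strung together, the following sketch declares two Schematron phases, each activating one pattern. The phase and pattern ids are hypothetical and are not focheck's own. The fo:footnote rule is based on my reading of the fo:footnote definition in XSL 1.1, and the column-count test only handles a literal integer, whereas focheck evaluates the full property value expression.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="fo" uri="http://www.w3.org/1999/XSL/Format"/>
  <ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>

  <!-- Hypothetical phase ids: each phase activates one group of rules. -->
  <phase id="fo-constraints">
    <active pattern="fo"/>
  </phase>
  <phase id="property-extras">
    <active pattern="extras"/>
  </phase>

  <!-- A structural constraint of the kind a grammar does not easily express. -->
  <pattern id="fo">
    <rule context="fo:footnote">
      <report test="descendant::fo:footnote">An fo:footnote is not permitted to have an fo:footnote as a descendant.</report>
    </rule>
  </pattern>

  <!-- An extra constraint on a property value (literal integers only). -->
  <pattern id="extras">
    <rule context="fo:*/@column-count">
      <assert test="if (. castable as xs:integer) then xs:integer(.) gt 0 else true()">column-count="<value-of select="."/>" should be a positive integer.</assert>
    </rule>
  </pattern>
</schema>
```

A validator that is asked to run the fo-constraints phase would apply only the first pattern; running the phases in document order gives the progressive refinement described above.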
There is an oXygen framework file that refers to both the Relax NG and the Schematron so that oXygen can automatically validate FO files using them. The framework is available as a downloadable add-on for oXygen. An earlier version of the framework was bundled with oXygen 17.1 and 18.0.
You can also validate FO files from the command line using the validate target from the build-focheck.xml Ant build file.
Testing
There are multiple levels of testing of the Relax NG and Schematron.
At the lowest level, the parser-runner.xsl XSLT library is tested using XSpec 38 tests, for example:
<x:scenario label="AdditiveExpr">
  <x:call function="axf:parser-runner"/>
  <!-- ... -->
  <x:scenario label="-1 - -2">
    <x:call>
      <x:param name="input" select="'-1 - -2'" as="xs:string"/>
    </x:call>
    <x:expect label="is a number">
      <Number value="1" is-positive="yes" is-zero="no"/>
    </x:expect>
  </x:scenario>
</x:scenario>
At the next level up, the Schematron is tested using stf 39:
<?stf column-count:1 ?>
<fo:retrieve-table-marker column-count="-1"
    xmlns:fo="http://www.w3.org/1999/XSL/Format" />
Finally, complete documents can be validated against both the Relax NG and the Schematron using the validate Ant target.
Need for speed
The bottleneck for the validation is obviously going to be executing the XSLT for the Schematron validation and, in particular, the expression parser.
In oXygen, the XSLT generated from the Schematron is run using Saxon EE. Saxon EE includes a saxon:memo-function 40 extension attribute that memoizes the result for every combination of arguments passed to an xsl:function. The second and subsequent times that you call the function with a set of arguments that has been seen before, it immediately returns the stored result instead of running the function all over again just to return the same value. The same property values tend to appear many times in the average XSL-FO file, so this shortcuts a lot of parsing with the REx-generated parser. Figure 6 shows saxon:memo-function in use with the xsl:function for parsing XSL-FO property value expressions.
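For readers who have not used the attribute, a minimal sketch of a memoized xsl:function might look like the following. The ex: namespace, the function name, and the trivial body are assumptions made for illustration; in focheck the body would be the call into the REx-generated parser. Processors that do not recognize the saxon: attribute should simply ignore it and run the function normally.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:ex="urn:example:focheck-sketch"
    exclude-result-prefixes="#all">

  <!-- saxon:memo-function="yes" asks Saxon to remember the result for
       each distinct argument value, so a repeated property value is
       only processed once. -->
  <xsl:function name="ex:parse-expression" as="xs:string"
                saxon:memo-function="yes">
    <xsl:param name="expression" as="xs:string"/>
    <!-- Hypothetical stand-in for the expensive parse. -->
    <xsl:sequence select="normalize-space($expression)"/>
  </xsl:function>

</xsl:stylesheet>
```

Memoization only makes sense for functions whose result depends solely on their arguments, which is the case for parsing a property value string.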
oXygen also caches the schema used to validate a document 41, and oXygen Support have confirmed that this includes caching a Schematron validator, so the memoized property expression values are available across documents.
When I demonstrated focheck running on a laptop at XML London 2015, the RELAX NG validation and Schematron checking for a 28 MB file took six seconds. Since then, however, I have added more ID-IDREF checking to the schema, so validation is slower now, and I have added some keys to the Schematron to help check some XSL constraints. It is not clear how keys interact with the XSLT caching by oXygen. On the plus side, focheck now catches 14 errors in the 28 MB file instead of just one.
The REx parser generator 21 is able to generate a parser as a Saxon extension function. It should, therefore, be possible to optionally include the compiled extension function in the classpath for Saxon and have the ‘parser runner’ library use it when it is available and fall back to the XSLT parser when it is not. However, using Saxon extension functions with Schematron validation is not a common use case, so it is not possible with oXygen 18 to just add the JAR file for an extension function to an oXygen framework and have it be used when validating with Schematron. Using the extension function currently requires registering it in the default Saxon configuration in the oXygen preferences. Since that can’t be done just by downloading the oXygen add-on, it’s not currently part of focheck.
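One way the fallback could be expressed, sketched here under stated assumptions, is with complementary use-when conditions so that the call to the compiled function is only compiled when the function exists. The ext: namespace and ext:parse-expression name are hypothetical, as is ex:parse-with-xslt, which stands in for the existing XSLT parser. Whether a registered extension function is visible to function-available() inside use-when depends on the processor and its configuration, so this is not a tested part of focheck.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="urn:example:focheck-sketch"
    xmlns:ext="urn:example:compiled-parser"
    exclude-result-prefixes="#all">

  <xsl:function name="ex:parse" as="item()*">
    <xsl:param name="input" as="xs:string"/>
    <!-- Kept only when the (hypothetical) compiled function is present. -->
    <xsl:sequence use-when="function-available('ext:parse-expression', 1)"
                  select="ext:parse-expression($input)"/>
    <!-- Kept otherwise: fall back to the pure-XSLT parser. -->
    <xsl:sequence use-when="not(function-available('ext:parse-expression', 1))"
                  select="ex:parse-with-xslt($input)"/>
  </xsl:function>

  <!-- Hypothetical stand-in for the XSLT parser. -->
  <xsl:function name="ex:parse-with-xslt" as="element()">
    <xsl:param name="input" as="xs:string"/>
    <Unparsed input="{$input}"/>
  </xsl:function>

</xsl:stylesheet>
```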
Translations
Antenna House is a Japanese company with a high proportion of Japanese users of AH Formatter, so I also wanted to make focheck available in Japanese.
There are four parts to translating focheck into Japanese:
- Schematron messages
- Documentation annotations
- Jing validation messages
- Xerces parsing messages
I started with the Schematron messages because they are the most bespoke, since I wrote them all. The official Schematron method for localizing messages using the diagnostic element, detailed in Annex G of the Schematron specification, outputs every translation of a message at once. oXygen 18 has its own method for outputting messages from only one language at a time, but focheck was localized before that was available, and focheck also has to work outside of oXygen.
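To make the drawback of the diagnostic-based method concrete, here is a minimal sketch in the style of the Annex G approach. The ids and the Japanese wording are my own illustration and are not taken from focheck. Because the assert refers to both diagnostics, a conforming implementation may report the English and the Japanese text together for a single failure.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="fo" uri="http://www.w3.org/1999/XSL/Format"/>
  <ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>

  <pattern id="localized">
    <rule context="fo:*/@column-count">
      <!-- One diagnostic per language, all referenced from the assert. -->
      <assert test="if (. castable as xs:integer) then xs:integer(.) gt 0 else true()"
              diagnostics="column-count.en column-count.ja">column-count should be a positive integer.</assert>
    </rule>
  </pattern>

  <diagnostics>
    <diagnostic id="column-count.en" xml:lang="en">column-count should be a positive integer.</diagnostic>
    <diagnostic id="column-count.ja" xml:lang="ja">column-count は正の整数でなければなりません。</diagnostic>
  </diagnostics>
</schema>
```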
I initially considered a gettext-like mechanism, such as Jirka Kosek's Saxon extension 42, but you end up in a sort of doublethink as, instead of writing the message that you want to write, you have to write a version that’s deconstructed for use with gettext(); for example:
name="<value-of select="@name"/>" cannot be used when axf:document-info with name="xmp" is present.
becomes:
<value-of select="t:format('name=&quot;{1}&quot; cannot be used when axf:document-info with name=&quot;xmp&quot; is present.', (@name))"/>
I also started an XSLT stylesheet to extract messages from the Schematron so the messages could be input for the open source OmegaT computer-aided translation (CAT) tool, but I didn’t get as far as reconstructing a translated Schematron from the translated messages. Since I’d had to modify the CAT tool that I was using to be able to handle the Java XML property files containing the extracted messages, it was simpler to modify it again so it could work directly on Schematron files, as shown in Figure 7. I have more recently modified it again so that the messages in Schematron Quick Fixes are also presented for translation.
Another part of translating the Schematron to Japanese was translating Schematron’s own messages into Japanese. Since the Schematron project on GoogleCode is now read-only, I created a Schematron organization on GitHub 43.
That covered the Schematron messages that I was creating, but it didn’t cover the messages from validation and well-formedness errors in the XML for the XSL-FO. Neither the Jing library that does the RELAX NG validation nor the Xerces library that does the lower-level XML parsing has had its messages translated into Japanese, so even when oXygen is set to use its Japanese interface, the parsing and validation messages have remained in English. I have now had the Jing messages translated, and they have been contributed to the official jing-trang project on GitHub 44. The Xerces messages about SAX parsing have been translated into Japanese, but it’s less clear how to contribute them to Xerces.
Having modified the CAT tool twice, it was almost inevitable that I would modify it again for the RELAX NG schema. The end result is that, through a combination of using OmegaT for natural language translation and Jing for schema language translation (see Figure 8), the three flavors of schema language are also available with documentation annotations in Japanese.
Figure 9 shows a Japanese tooltip and Japanese error messages in the Japanese version of oXygen.
Future improvements
A necessary improvement is adding to and improving the handwritten parts of the Relax NG and Schematron. The constraints in the XSL 1.1 spec are spread through much of the spec, and some of the details require careful reading. Getting the Relax NG and Schematron to be complete and correct is an ongoing and iterative process. Pull requests on the GitHub project will be appreciated.
The current expression parser cannot evaluate some of the shorthand properties, such as font 45, that are shared with CSS2. Handling those will require either writing custom XSLT or generating a different parser using REx. I would also like to try to write an expression language parser that produces a simpler parse tree to see if that can be quicker than the REx-generated parser, but the use of saxon:memo-function may reduce the significance of any speed improvements when validating large documents inside oXygen.
I experimented with NVDL as the solution for validating MathML and SVG inside fo:instream-foreign-object, but the current solution is working well enough that it doesn’t seem necessary to look at it again.
I am working on adding on-the-fly Schematron checking to the Emacs mode for focheck using `flymake-mode'. Initially, this will use Saxon on the command line to run the compiled XSLT for the Schematron. The eventual goal is to use 'beanshell' to run a persistent Java object for the stylesheet and so avoid the startup cost of starting Java and reading the stylesheet each time.
Conclusion
focheck provides the most comprehensive error checking and authoring support for XSL-FO that is currently available. It validates XSL 1.1 plus Antenna House extensions and also validates both SVG and MathML 3 in fo:instream-foreign-object. focheck is the only XSL-FO validation utility (short of running an FO processor) that evaluates property value expressions and, because it is not an FO processor, it can identify multiple fatal errors in an XSL-FO document instead of halting at the first. The addition of Schematron Quick Fixes makes it easy to also correct your XSL-FO.
focheck is open source, so the focheck approaches both to parsing attribute value expressions and to maintaining an oXygen framework add-on project can be used by other projects, as can the focheck contributions to the CAT tool for translating Schematron, RELAX NG, and Java XML property files and to the Japanese localizations of Schematron, Jing, and Xerces.
References
[1] https://www.w3.org/TR/xsl11/
[2] https://www.w3.org/TR/1998/WD-xsl-19980818#AEN50
[3] https://www.w3.org/TR/2000/WD-xsl-20000301/#section-N591-Processing-a-Stylesheet
[4] https://www.w3.org/TR/xslt
[5] https://www.w3.org/TR/2000/WD-xsl-20001018/
[6] https://www.w3.org/Style/XSL/TestSuite/tools/testsuite.dtd
[7] https://www.w3.org/TR/NOTE-XSL
[8] Formatting Objects considered harmful, Håkon Wium Lie, April 1999, http://people.opera.com/howcome/1999/foch.html
[9] Go-to statement considered harmful, Commun. ACM 11 (1968), 3: 147-148. https://www.cs.utexas.edu/users/EWD/ewd02xx/EWD215.PDF. doi:https://doi.org/10.1145/362929.362947
[10] “Considered Harmful” Essays Considered Harmful, Eric A. Meyer, 28 December 2002, http://meyerweb.com/eric/comment/chech.html
[11] http://www.w3.org/TR/xsl/#inheritance
[12] http://www.w3.org/TR/2006/REC-xsl11-20061205/
[13] http://www.w3.org/TR/xsl/#d0e5032
[14] http://www.w3.org/TR/xslfo20-req/#N67198
[15] http://www.antennahouse.com/antenna1/antenna-house-regression-testing-system/
[16] https://github.com/xmlroff/xmlroff/tree/master/testing
[17] http://www.renderx.com/tools/validators.html
[18] http://www.oxygenxml.com/
[19] http://xep.xattic.com/xep/resources/validators/folint/folint.xsl
[20] http://xep.xattic.com/xep/resources/validators/xmleurope2004-peshkov.pdf
[21] http://www.bottlecaps.de/rex/
[22] http://inasmuch.as/2013/10/23/oxygen-add-on-hosted-on-github/
[23] https://www.antennahouse.com/antenna1/formatter/
[24] http://www.w3.org/TR/2006/REC-xsl11-20061205/xslspec.xml
[25] https://lists.w3.org/Archives/Public/www-xsl-fo/2007Aug/0011.html
[26] http://www.w3.org/TR/xsl/#d0e6532
[27] http://www.w3.org/TR/xsl/#fo_footnote
[28] http://www.w3.org/TR/xsl/#fo_retrieve-table-marker
[29] http://www.w3.org/TR/xsl/#datatype
[30] http://www.bottlecaps.de/rex/xpath20.ebnf
[31] Schematron Quick Fixes Specification, Nico Kutscherauer, Octavian Nadolu, http://schematron-quickfix.github.io/sqf
[32] http://www.w3.org/TR/xsl/#cssdatat
[33] http://www.w3.org/TR/xsl/#border-start-width
[34] http://www.antennahouse.com/product/ahf60/ahf6top.htm
[35] http://www.antennahouse.com/CSSInfo/extension.html
[36] http://www.antennahouse.com/CSSInfo/float-extension.html
[37] http://www.antennahouse.com/CSSInfo/ruby-extension.html
[38] https://code.google.com/p/xspec/
[39] https://github.com/MenteaXML/stf
[40] http://www.saxonica.com/documentation/index.html#!extensions/attributes/memo-function
[41] http://www.oxygenxml.com/doc/ug-editor/#topics/validation-actions-in-user-interface.html
[42] http://xmlguru.cz/2006/10/saxon-gettext
[43] https://github.com/Schematron/schematron