How to cite this paper
Kimber, Eliot. “Loose-leaf publishing using Antenna House and CSS.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Kimber01.
Balisage: The Markup Conference 2019
July 30 - August 2, 2019
Balisage Paper: Loose-leaf publishing using Antenna House and CSS
Eliot Kimber
Senior Solutions Architect
Contrext, LLC
Eliot Kimber is an XML practitioner currently working with a U.S. government agency
on
a new report authoring, management, and delivery system. He has been involved with
SGML
and XML for more than 30 years. Eliot has contributed to a number of standards, including
SGML, HyTime, XML, XSLT, DSSSL, and DITA. While Eliot's focus has been managing
large-scale hyperdocuments for authoring and delivery, most of his day-to-day work involves
producing online and paged (or pageable) media from XML documents. Eliot maintains
a
number of open-source projects including DITA for Publishers, The Wordinator, and
the DITA
Community collection of DITA-related tools and other aids. Eliot is author of
DITA for Practitioners, Vol 1: Architecture and Technology, from
XML Press. When not trying to retire the technical debt in his various open-source
projects, Eliot lives with his family in Austin, Texas, where he practices Aikido
and
bakes bread.
Copyright ©2019 W. Eliot Kimber
Abstract
Loose-leaf publishing is the ability to typeset and print only the pages in a document
that have changed since its last publication. This presents many interesting challenges.
We developed a loose-leaf publication system using Antenna House Formatter, CSS for
pagination, and XSLT for post processing the area tree into “change packages” which
include only the changed pages. Both the CSS markup and the publication workflow warrant
a closer look.
Table of Contents
- Problem Statement
  - Loose Leaf Challenges
  - CSS Pagination Challenges
- Preparing XML For CSS Pagination
  - Providing Structured Page Edge Content
  - Styling That Depends on Descendant or Following Element Properties
  - Synthesizing or Reordering Content
  - Generated Text That Cannot Be Constructed Using CSS
- CSS Pagination and Area Trees
- Modifying the Area Tree
  - Set Page Number and Format
  - Update Page Numbers
    - Identifying Change Page Sets
    - Constructing New Page Number Sequences
  - Filter Pages
  - Renumber Absolute Page Numbers
  - Final Update Processing
    - Page History Database
    - Capturing Element Target Details In The Area Tree
    - Updating Page Number References
    - Update Page Number Database
- Conclusions and Future Work
Problem Statement
Loose-leaf publishing is the production of updates to previously-printed documents
where
the numbers of the previously-printed pages must be preserved. When an update to a
document
results in new pages those pages are given page numbers that reflect the last original
page's
number plus a modifier, i.e., "10.1", "10.2", etc. Such pages are often called "point
pages".
Updates to documents are produced that reflect only the new or changed pages, which
are then
manually inserted into copies of the target document, bringing those copies up to
date with
the latest version of the master document.
Loose-leaf publishing was commonplace before the advent of low-cost printers and digital
document delivery. In a world where you can regenerate a PDF or print 1,000 pages on a
high-volume laser printer in minutes, the need for loose-leaf publication has all but
disappeared.
One area where the requirement still exists is municipal code, a specialized area
of legal
publishing.
Legal documents, in particular codified municipal law and regulations, present several
practical problems:
-
The documents tend to be large: 2000 pages for a city's municipal code is
typical.
-
People and other documents make references to the previously-published page
numbers.
-
Municipal code is updated frequently: every city council meeting will likely result
in new or changed ordinances that cause changes to the codified municipal code.
-
The documents have very long life cycles. While cities may choose to "reflow" their
code periodically, republishing it in its entirety with new page numbers, reflows
may
only be done once a decade or less.
-
City staffers and others who work with the city code maintain printed copies of the
municipal code to support their day-to-day jobs. It would be disruptive to completely
replace these working copies every time the code is updated.
The size of the publications, the frequency of update, and the number of actively-used
printed copies would make reprinting the entire code for every update prohibitive,
leaving
aside the review and quality assurance implications of republishing a 2000-page document
with
critical legal implications.
Municipalities do not, as a rule, do their own codification. Codification and publication
of municipal code is a service. One such service provider is Municode, one of the
largest
suppliers of municipal codes in the U.S.
Municode had been for decades using the Xyvision Parlance Publisher (XPP) product
to
produce loose-leaf pages for municipal code. XPP did the job but was being used as
a
traditional typesetting system, not as an SGML or XML publishing system. Codifiers
authored
directly in XPP's typesetting format ("gencode"), going right back to the beginnings
of
structured markup and computerized typesetting.
Municode realized they needed to replace their XPP system with a modern XML-based
publishing pipeline. They developed an HTML5-based vocabulary for the source, decided to use
CSS pagination as the layout technology, and selected Antenna House Formatter (AHF), which
implements CSS pagination, as the pagination engine.
However, AHF does not itself do loose-leaf publishing, so loose-leaf processing would
need
to be implemented. In particular, AHF (and CSS generally) does not provide a direct
way to
generate point page numbers or references to them.
This author had previously designed an approach to using Antenna House area tree post
processing to produce change pages in the context of a proposal to a U.S. federal
agency for
publishing updates to publications used by field agents. The proposal was not accepted
but I
had provided my design to Antenna House as part of the proposal development process.
Antenna
House recommended me to Municode and I was hired by Municode to implement loose-leaf
publishing.
Because Municode insisted on using CSS rather than XSL-FO for doing the pagination
styling, I also ended up developing the pagination styles and the preprocessing needed
to
produce the required XML input to the CSS pagination process.
Antenna House Formatter was selected as the pagination engine because it was at the
time
the only CSS pagination implementation that met all of Municode's typesetting requirements.
If
Municode had been willing to use XSL-FO it probably would have been possible to use
Apache FOP
as FOP provides the necessary layout features and also produces an area tree, enabling
the
post processing required to generate the point page numbers.
Loose Leaf Challenges
One challenge inherent in loose-leaf publishing is determining which pages have changed
between two versions of a document. Municode does this manually as an editorial activity,
so
automatic determination of pagination differences was not a requirement for the initial
implementation. Determining changes automatically is a potential area for future
study.
The general challenge then was to implement an automated process that takes XML source
as input and produces "change packages" as output.
The high level processing pipeline is:
-
Editors prepare XML source, including marking the starts and ends of changed
pages, where the start always reflects the start of a page in the previously-published
version and the end is wherever the change ends.
-
The input XML source is preprocessed to generate XHTML, augmented as needed to
enable both CSS pagination generally and change page generation specifically.
-
The augmented XHTML is rendered by AHF using the CSS styles to produce an initial
area tree.
-
The initial area tree is processed to update the page numbers of pages that should
have point page numbers and references to those pages. If a change package is being
produced, all unchanged pages are filtered out, producing a result area tree that
reflects just the changed pages and any other pages required for the package, such
as
a generated "update instructions" section, table of contents, cover, etc.
-
A master page history database is updated to reflect the page details in the
updated version of the publication, including the mapping from elements with IDs to
their start and end pages.
-
The updated area tree is rendered by AHF to PDF.
The primary processing challenge in producing the point pages is knowing at which
point
within a sequence of changed pages point pages are required.
The incoming source marks the start of the changed pages and specifies the page number
that the first page of the change had in the previously-published version, and, if
necessary, the page number of the page that follows the
changed pages.
At authoring time, the author has no way to know how many pages the change will produce
and thus no way to know where the first point page will start or if point pages are
required
at all. If the change is at the end of a section where the following section starts
a new
page number sequence, then there is no need to specify the page number that follows
the
change. Otherwise, the page number following the change must be specified.
Given the start of the change and the page number of the page following the change,
it
is then possible to determine, once the pages are laid out, whether or not the number
of
pages is greater than the original set of pages for the same source range and thus
determine
which of the changed pages require point page numbers. It is then possible to update
the
page numbers on those pages and update any page number references to elements that
start on
those pages (for example, table of contents entries, cross references, etc.).
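As a hypothetical worked example: suppose a change is marked as starting on page 10 and the page following the change was page 13 in the previously-published version, so the original material occupied pages 10 through 12. If the revised content now lays out to five pages, the first three keep the numbers 10, 11, and 12, and the two overflow pages become point pages 12.1 and 12.2; any page number references to elements that now start on 12.1 or 12.2 must then be updated to match.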
Another challenge is generating tables of contents for change packages.
When producing a change package, the entire publication's table of contents must be
replicated. However, the only reliable page numbers in the rendered document are those
for
the changed pages themselves. All other page numbers are unreliable for a variety
of
reasons, not least of which is that the previously-published version may have been
published
with a completely different tool (XPP instead of AHF) or there could simply be differences
in layout details between versions of AHF itself or there may be changes to the CSS
that
affect the pagination details.
Thus the page numbers reflected in the table of contents for elements on pages not
in
the current set of changes must reflect the page numbers those elements fell on in
the last
published version. This requires some form of pagination database that correlates
elements
to the page numbers they were published on in a given version in time of the
publication.
A final challenge is generating the "update instructions" and "list of effective pages"
sections of the change package.
The update instructions specify, for each set of changed pages in the current update
package, what pages from the previously-published version to remove and what new pages
to
insert. This requires knowing, for a given page, what its page number was in the previous
version and what its page number is in the current version.
The list of effective pages indicates, for each page in the publication, the update
(version in time) of the publication in which it was last changed. This requires a historical
record
for each physical page that captures its update history.
With these two sets of historical data it is then possible to fully generate the update
instructions and list of effective pages. Both of these artifacts had to be manually
created
in the old system. The tasks were both tedious and error prone, adding significant
time to
the production process and limiting the ability of Municode to produce updates in
a timely
fashion.
Finally, note that Municode also publishes municipal code on the web, so another
business requirement was to have a single authored source that could then directly
serve
both print production and web delivery with no manual intervention in either production
process.
In the old system, the XPP typesetting source was first converted to an intermediate
XML
format from which the final delivered HTML was generated. This process was not 100%
automatic, did not always result in 100% correct online results, and was also time
consuming.
CSS Pagination Challenges
CSS pagination presents a number of challenges:
-
The CSS pagination specifications are not complete, either editorially or
functionally
-
The CSS pagination specifications are spread across a large number of separate
specifications
-
CSS itself is optimized for styling HTML
Compared to XSL-FO purely in terms of layout and typographic features defined in the
standard, CSS lacks a number of important features: it does not offer complete control over
header and footer geometry, provides no way to impose link behavior onto arbitrary elements
or attributes, has nothing comparable to XSL-FO's table marker feature, and omits other more
esoteric and lesser-used features of XSL-FO, especially around support for Asian languages.
The clear advantage of CSS for pagination is the ease of specifying the styles
themselves. CSS is objectively much easier to specify than XSL-FO for the simple reason
that
a CSS style sheet is completely separate from the source being styled, while XSL-FO
is a
source format that must be generated by a transform. The need to generate XSL-FO has
the
effect of conflating the data processing aspect with the styling aspect in a way that
makes
it hard to separate the two concerns.
Another advantage of using CSS for pagination is that it allows the same core styles
to
be used for browser and print delivery, if that is a requirement.
Because CSS is not itself a transformation language, it necessarily keeps separate
the
concerns of data processing and styling.
From a practical standpoint, it is much easier to find people who know CSS or are
willing to learn it than it is to find people who know XSL-FO or are willing to learn
it.
Even if the two technologies were otherwise functionally equivalent, the simple ability
to
find people to do CSS work would be a significant advantage.
As a style language, CSS can only do decoration; it cannot do reordering or creation
of
complex elements. This keeps CSS architecturally and syntactically simple, a requirement
for
use in browsers where rendering speed is paramount, but it means that typical source
documents, even if authored in HTML, cannot be completely styled as authored.
Another important CSS limitation is selectors: CSS selectors cannot look ahead of
the
current element, meaning you cannot directly style an element based on properties
of its
descendants or following elements.
In addition, because CSS is specifically designed for styling HTML it cannot be used
to
fully style arbitrary XML without extensions.
Thus, using CSS for pagination effectively requires generating HTML that is suitable
for
rendering. While the same augmentation tasks could be applied to any XML, because
CSS is
optimized for HTML, it's easier and usually more appropriate to simply generate HTML.
For
many, if not most, XML vocabularies there will already be an HTML generation transform
that
can be repurposed to generate CSS pagination-ready HTML. Note that this also avoids
the need
for a completely separate XSL-FO generation transformation, which can be a significant
savings.
This generation step is analogous to the generation often done by JavaScript in
browsers. In both cases the source HTML has to be extended or modified to meet the
specific
rendering requirements of the delivery environment.
The specific things that must be done to enable CSS pagination include:
-
Generating tables of contents, back-of-the-book indexes, and similar navigation
structures.
-
Generating elements used to populate structured headers and footers. For example,
a multi-line header where individual lines may have different formatting or there
may
be inline formatting requiring separate elements in the HTML.
-
Adding @class values or other clues to make CSS styling either possible
(lookahead) or more convenient.
-
Reordering elements that are presented out of their source order, for example,
moving a figure caption element from the top of the figure container to the end of
the
figure container or using metadata elements or attributes to synthesize displayed
content, such as a copyright page or authorship for an individual article or
chapter.
-
Adding wrapper structures to either enable specific formatting effects or to make
styling easier.
-
Normalizing markup for elements that may have different markup patterns as
authored, for example, adding paragraph elements to list items, to make the CSS style
sheet simpler.
-
Generating text that would be difficult or impossible to generate with CSS
alone.
If the source as authored is itself XHTML this transform can be relatively simple.
If
the source is some other vocabulary it is likely that there is an existing HTML generation
transform that can be adapted to produce CSS-pagination-ready HTML.
In addition to the need to generate pagination-ready HTML, the CSS pagination
specification lacks essential features needed by more challenging layouts. Thus any
complete
CSS pagination implementation will include proprietary extensions to fill this feature
gap.
AHF does this by essentially mapping every XSL-FO feature or AHF-extension to a
corresponding CSS extension. For example, AHF provides additional page sequence types
to
enable creating last- and only-page sequences.
A final challenge is the lack of parameterization in the base CSS syntax. While tools
like "less" provide a way to parameterize and modularize CSS style sheets, we did
not find a
suitable Java-based less compiler at the start of the project. Thus the CSS style
sheets
have a lot of redundancy, especially in the page master rules, but since this code
did not
change dramatically once developed, the lack of parameterization turned out to be
a minor
problem and solving it never became a priority.
When implementing the CSS style sheets the main challenges were:
-
Finding the relevant definitions in the appropriate W3C specification for a given
layout feature.
-
Determining whether or not AHF implemented a given feature as defined in the
specification.
-
For challenging layout requirements determining the best solution using
AHF.
-
Controlling page breaks dynamically.
For most layout requirements, development of the CSS was a straightforward application
of normal CSS techniques.
Challenging requirements included:
-
Managing counters and variables across element boundaries for running heads and
feet that need to reflect first or last values on a page.
-
Managing page breaks. The CSS semantics for break control are not as definitive as
for XSL-FO. In particular, CSS does not have a "keep together always" or "keep with
next always" control. Keeps in CSS are truly "hints". This sometimes resulted in
unfortunate breaks, such as between a section head that falls at the bottom of a
page and the head for a subordinate section where there is no intervening content. It
required using AHF extensions to get better control of page breaks (see the sketch
following this list).
-
Controlling the size and layout of wide page edge regions. The CSS design for page
edge regions does not explicitly allow for a single region that takes up most or all
of the edge region. This made it difficult to create right- or left-aligned headers
that had long content (for example, a long section title).
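As a minimal illustration of the break-control problem (a sketch, not the production CSS), the standard break properties can only discourage a break after a section heading; because they are hints, a conforming formatter may still break there, which is why AHF's documented extension properties were needed for stricter control:
/* Hint only: try not to break between a heading and what follows it. */
section > header {
  break-after: avoid;        /* CSS Fragmentation; advisory, not mandatory */
  page-break-after: avoid;   /* legacy alias with the same hint semantics */
  break-inside: avoid;       /* keep the heading block itself together */
}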
Preparing XML For CSS Pagination
CSS is a declarative style language rather than a transformation language. This makes
CSS
relatively simple and provides a clear separation of concerns between the visual and
behavioral style definition and the data processing applied to the source XML but
means that
most XML cannot be fully styled as authored. Thus the authored XML must be transformed
into a
form that can then be fully styled using CSS. In addition, CSS is optimized for styling
HTML
and therefore lacks certain features needed to style arbitrary XML. For example, CSS
assumes
that all links are represented by HTML <a> elements that use @href with URLs for addressing
and does not provide a way to associate linking behavior with arbitrary elements or
attributes.
While it is technically possible to apply CSS directly to arbitrary XML and get a
reasonable result, that result will necessarily be incomplete for non-trivial page
layouts.
Even with HTML, if the HTML is authored in a typical way where authors are only responsible
for content and not things that should be automatically generated or rendered, such
as running
heads and feet, the source HTML will still require some amount of augmentation to
meet the
page layout requirements.
Thus some amount of transformation is always required to create pagination-ready XML
from
whatever the source XML is, even if that source is HTML.
Compare this transformation to the transformation required to use XSL-FO: for XSL-FO
the
target vocabulary is always XSL-FO and the style details are embedded in the generated
XSL-FO
markup and content. For CSS pagination the target vocabulary can be any XML but is
most often
HTML and the transform need only modify the structural and semantic details of the
source
XML—all style details remain in the CSS style sheet.
If the authored source is HTML then the transform is really an "augmentation" task,
adding
things and doing some reordering if needed, but otherwise just an identity transform
that
preserves the HTML as authored.
If the authored source is not HTML then the transform is most effectively a transform
to
HTML with all necessary augmentation applied. If an HTML transformation already exists
for the
source XML vocabulary then all that is required is to add the augmentation required
to meet
the CSS pagination requirements.
In the context of full-featured page layout as typically required for technical
documentation, legal documents, trade books, and other highly-designed publications
that still
lend themselves to 100% automated composition, the following layout requirements require
augmenting the HTML as authored or as generated by a pre-existing HTML transform:
-
Structured page edge content, such as multi-line headers or footers or content with
typographic differences (i.e., bold or italic words in a title reflected in a running
head or foot).
-
Elements that depend on properties of descendants or following elements to determine
their style.
-
Content that is presented in an order or structure different from its authored
format, for example, moving figure titles from the start of the figure container to
the
bottom or reflecting metadata elements in the main flow.
-
Generated text that cannot be produced easily or at all using CSS (for example, text
that requires calculation or string manipulation of the source).
In addition, the CSS style definition can be made easier to create and maintain by adding
elements or attributes that, while not strictly required, make the style definition easier,
for example by simplifying the selectors required or by normalizing variable markup
patterns into a single consistent pattern.
Providing Structured Page Edge Content
CSS provides two mechanisms for reflecting content in the source in different places
via
style declarations:
-
String variables
-
Element variables
String variables copy content or attribute values into named variables that can then
be
used in content:
properties:
chapterTitle {
string-set: chapterTitle content();
display: none;
}
The string-set:
property captures the text content of the
<chapterTitle> element into a variable named "chapterTitle". The variable can then
be
used in
content:
@page portrait:right {
@top-center {
content: string(chapterTitle, last);
vertical-align: bottom;
margin-top: 0.25in;
margin-bottom: 2pc;
text-transform: uppercase;
}
...
}
Here
the last value set for the "chapterTitle" variable will be used as the content of
the
top-center page edge region for this page type.
Element variables remove elements from the source flow and capture them as variables
that can then be used in page edge regions. The elements will be styled using the
styles in
effect at the point where they occur in the
source:
sectionTitlesMultiline {
position: running(runningHead);
}
Here the position:
property sets the "position" of the
<sectionTitlesMultiline> element as being in an element variable named "runningHead".
The
<sectionTitlesMultiline> element is removed from the document where it occurs.
The element variable can then be used in a content:
property:
@page landscape:left {
@top-left {
content: element(runningHead start);
border-bottom: 0.5pt solid black;
margin-top: 0.25in;
margin-bottom: 0.33in;
vertical-align: bottom;
text-align: left;
font-size: 10pt;
}
...
}
Here the content:
property uses the element()
function to get the
value of the "runningHead" element variable and uses it as the content of the top-left
page
area.
Note that use of position: running()
consumes the element to which it
applies, which means you cannot, for example, simply reflect title elements or title
element
containers in running heads—you must have separate source elements for the title in
the main
flow and for the running heading.
For example, the Municode HTML preprocessing transform has this rule to generate the
<sectionTitlesMultiline>
elements:
<xsl:template mode="frills-detailed" match="xhtml:header">
<xsl:variable name="sectLevel" as="xs:integer"
select="count(ancestor-or-self::xhtml:section)"
/>
<sectionTitlesMultiline class="sect-{$sectLevel}">
<!-- Process ancestors from highest to lowest -->
<xsl:apply-templates select="(ancestor::xhtml:section)"
mode="frills-get-multi-line-entries"
/>
</sectionTitlesMultiline>
</xsl:template>
Where
the authored source
is:
<section id="x88B6B95C248C" data-type="section">
<header>
<h1>Sec. 1.1</h1>
<p data-type="subtitle">Incorporation.</p>
</header>
...
</section>
And
the transformed result
is:
<section id="x88B6B95C248C" data-type="section" class="">
<header>
<sectionNumberVerso>§ 1.1</sectionNumberVerso>
<sectionNumberRecto>§ 1.1</sectionNumberRecto>
<sectionTitlesMultiline class="sect-4">
<headerLine class="charter">
<h1 class="">Charter</h1>
</headerLine>
<headerLine class="chapter">
<h1 class=" has-subtitle">Chapter 1</h1>
<p data-type="subtitle" class="">General Provisions</p>
</headerLine>
<headerLine class="section">
<h1 class=" has-subtitle">Sec. 1.1</h1>
<p data-type="subtitle" class="">Incorporation.</p>
</headerLine>
</sectionTitlesMultiline>
<h1 class=" has-subtitle">Sec. 1.1</h1>
<p data-type="subtitle" class="">Incorporation.</p>
</header>
Many, if not most, publications will require element, rather than string, page edge
content and thus will need to generate this type of additional markup to provide it.
Styling That Depends on Descendant or Following Element Properties
CSS selectors can only select on the current node or nodes that have come before;
they
cannot make reference to descendant or following elements.
In the Municode content there are at least two cases where the style of a paragraph
depends in part on its descendants or content, which requires setting a specific class
value
on the <p>
element:
<xsl:template match="xhtml:p">
<!-- This will always reflect the original @class value, if any -->
<xsl:variable name="class-tokens" as="xs:string*">
<xsl:apply-templates mode="set-class-for-p" select="."/>
</xsl:variable>
<xsl:copy>
<xsl:apply-templates select="@* except (@class)" mode="#current"/>
<xsl:if test="exists($class-tokens)">
<xsl:attribute name="class"
select="string-join($class-tokens, ' ')"
/>
</xsl:if>
<xsl:apply-templates select="node()" mode="#current"/>
</xsl:copy>
</xsl:template>
<xsl:template mode="set-class-for-p" as="xs:string+" priority="10"
match="xhtml:p[.//xhtml:span[exists(@data-lf)]]">
<xsl:sequence select="'lf'"/>
<xsl:next-match/>
</xsl:template>
<xsl:template mode="set-class-for-p" as="xs:string+"
match="xhtml:p[ends-with(normalize-space(.), ':')][not(ancestor::xhtml:header)]
">
<xsl:sequence select="'keepwithnext'"/>
<xsl:next-match/>
</xsl:template>
Synthesizing or Reordering Content
This is simply a matter of constructing the appropriate structures to achieve the desired
presentation result. It is not a workaround for limitations in CSS but simply a requirement,
and for many publications the same requirement likely exists for digital delivery as well.
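For example, a reordering template of the kind described in the list of augmentation tasks above might look like the following sketch, which moves an authored <figcaption> from the start of its <figure> container to the end so that CSS can present the caption below the figure content (the element names here are illustrative; the production transform is not shown):
<xsl:template match="xhtml:figure">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <!-- Everything except the caption, in document order -->
    <xsl:apply-templates select="node() except xhtml:figcaption"/>
    <!-- Caption moved to the end of the figure container -->
    <xsl:apply-templates select="xhtml:figcaption"/>
  </xsl:copy>
</xsl:template>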
Generated Text That Cannot Be Constructed Using CSS
The CSS content:
property combined with the :before and :after pseudo-elements
can do quite a bit, but CSS does not provide functions for doing string manipulation
or more
complex calculations (for example, for converting dates and times into formatted values).
Thus it may be necessary to generate attributes or elements that contain text that
would
otherwise be generated purely as a matter of style.
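For example (a sketch with hypothetical element names, not Municode's production code), a preprocessing template can compute a formatted date that CSS alone could not produce and emit it as ordinary content for the CSS to style:
<!-- Assumes an authored element such as <time datetime="2019-07-30"/> -->
<xsl:template match="xhtml:time[@datetime castable as xs:date]">
  <span class="generated-date">
    <!-- Produces, e.g., "July 30, 2019" -->
    <xsl:value-of select="format-date(xs:date(@datetime), '[MNn] [D], [Y]')"/>
  </span>
</xsl:template>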
CSS Pagination and Area Trees
Antenna House Formatter (AHF) can produce as output an "area tree", which is an XML
representation of the composed pages. The area tree captures all the information needed
to
generate the rendered page, including all font details, placement details, and so
on. AHF can
take an area tree as input and produce the final deliverable, i.e., PDF.
In order to post-process the area tree it needs to include the following information:
-
The start and end of each change ("take" in Municode parlance)
-
The start and end of each element with an ID for which the page number needs to be
captured (sections, tables, figures)
-
The page numbers as rendered in the edge regions of pages, including distinguishing
any prefolio and postfolio text.
-
Boundaries or occurrence of other important elements, such as the update
instructions, and things that need to be counted on a per-page basis (tables, images,
etc.), and so on.
AHF does not provide a general way to inject arbitrary information into area trees.
It
does, however, preserve any @id values it finds on input elements.
We take advantage of this by constructing elements in the input HTML with @id values
that
are structured
fields:
<areaTreeMarker id="take:take-begin:job=S138-U02:d20p60"/>
which
then becomes this area tree
element:
<InlineArea id="take:take-begin:job=S138-U02:d20p60"
font-size="0pt"
width="0pt"
height="0pt"
baseline-after="0pt"
...
/>
Note that the InlineArea has an effective size of zero width and height, so it does
not
affect the rendering of the page on which it occurs.
These area tree marker elements are added during the HTML preprocessing step.
For page numbers, we use zero-width-space characters (\u200B) to separate the prefolio,
folio, and postfolio components of the page number as rendered in any context where
a page
number
occurs:
@page :blank {
size: 8.5in 11in;
margin-left: 7.5pc;
margin-right: 7.5pc;
margin-bottom: 6pc;
margin-top: 8pc;
@bottom-center {
content: string(prefolio, first) '\200B' counter(page) '\200B' string(postfolio, first);
margin-top: 1pc;
vertical-align: top;
font-family: "New Century Schoolbook", serif, 'Arial Unicode';
font-size: 10pt;
}
...
}
The zero-width-space character is not used in any other place in this content and
has no
visible result in the rendered pages.
Occurrences of page numbers in the area tree can be found
reliably:
<!-- Gets all the text areas for the page's page number, including the prefolio and postfolio, if
present.
-->
<xsl:function name="at:get-page-number-text-areas" as="element(at:TextArea)*">
<xsl:param name="context" as="element()?"/><!-- Any element that is or is within a page viewport area -->
<xsl:variable name="page" select="$context/ancestor-or-self::at:PageViewportArea"/>
<xsl:variable name="margin-region" as="element(at:MarginRegionViewportArea)?"
select="$page/at:PageReferenceArea/at:MarginRegionViewportArea[.//at:TextArea[@text = '​']]"
/>
<!-- At least in the legacy style there should be exactly one block with exactly one line with three or more text areas. -->
<xsl:variable name="result" as="element(at:TextArea)*" select="$margin-region//at:TextArea"/>
<xsl:sequence select="$result"/>
</xsl:function>
Another challenge is recording the numeric, as opposed to display, page number for
each
page.
In XSL-FO the page number for a page is a property of the page formatting object and
AHF
records the ordinal page number in the area tree.
In CSS, however, page numbers are just counters indistinguishable from any other counter,
so AHF does not record the page number.
The numeric page number is needed so that page numbers can be correctly calculated.
A
given page may not have a display page number or the display page number may be non-numeric
(roman, alphabetical, etc.). Thus, for each page we need to know the numeric (ordinal)
page
number, the display page number, and the page number format (roman, arabic, etc.).
In
particular, roman numerals cannot be distinguished from alphabetical page numbers
for
characters that are used in roman numerals, so it's not possible to determine the
page number
format by inspecting the page number itself.
To capture this information we use corner regions, which are otherwise not used for
anything in the Municode page layouts.
In CSS, there are four corner regions and for each page edge three edge regions.
The numeric page number and page number format are captured in corner regions like
so:
@page {
size: 8.5in 11in;
margin-left: 7.5pc;
margin-right: 7.5pc;
counter-reset: footnote;
background-image: attr(background-graphic, url);
background-repeat: no-repeat;
background-size: contain;
background-clip: border-box;
background-position: 50% 50%;
@bottom-left-corner {
content: '^npm^' counter(page);
visibility: hidden;
font-size: 8pt;
color: white;
font-family: monospace, 'Arial Unicode';
}
@bottom-right-corner {
content: '^pnf:1';
visibility: hidden;
font-size: 8pt;
font-family: monospace, 'Arial Unicode';
}
...
}
Note that this page rule applies to all pages (there is no page name qualifier). The
visibility value of "hidden" means that the text will be in the area tree but not
rendered on
the actual page.
This results in area tree elements like
so:
<MarginRegionViewportArea region-name="bottom-left-corner"
visibility="hidden"
...>
<MarginRegionReferenceArea ...>
<BlockArea ...>
<LineArea >
<TextArea ...
text="^npm^"
/>
<TextArea
text="1" ...
/>
</LineArea>
</BlockArea>
</MarginRegionReferenceArea>
</MarginRegionViewportArea>
The formatting of the corner regions makes them invisible
(visibility="hidden"
) but they are easily findable during post processing. A
similar technique is used to capture the page number format.
Side edge regions are also used for debugging and other purposes, for example, Municode
puts details about the source in a side edge region for printing on draft versions
of the
document.
As a debugging aid, the CSS can be quickly modified to reflect the page master name
in a
side edge
region:
@page :left {
size: 8.5in 11in;
margin-left: 7.5pc;
margin-right: 7.5pc;
margin-bottom: 6pc;
margin-top: 8pc;
/* Used for debugging page rule application. */
@left-bottom {
content: 'Page Rule: :left';
/* content: none;*/
font-size: 10pt;
font-family: "Courier New", monospace, 'Arial Unicode';
color: white;
-ah-reference-orientation: 90;
width: 4pc;
height: 6in;
}
...
}
By
changing the color property from "white" to e.g., "cyan" in each page rule, the page
rule name
is shown on every page, which is useful for debugging.
Thus, through a combination of elements added to the input HTML
(<areaTreeMarker>
), structured content (display page numbers), use and
abuse of page edge regions, and AHF's automatic capturing of element IDs, it is possible
to
inject into the area tree any information needed to support post processing of the
area
tree.
Modifying the Area Tree
The ultimate goal is to produce an area tree that represents an update package that
reflects any required pages (covers, table of contents, update instructions, etc.)
and the
changed pages with point page numbers created for changed pages that require them.
In
addition, every sequence of changed pages must have an even number of pages. By Municode's
editorial rules, a change set always starts on an odd (right-hand) page and thus must
end on an even (left-hand) page.
Municode refers to sets of changed pages as "takes" and so the markers used in the
source
to mark the boundaries of changes are "take markers". The term "take" is reflected
in the code
examples in this section.
The area tree processing is implemented as a logical pipeline with the following stages:
-
Set page number and format
Uses the ordinal page number and page number format to update attributes on each
page viewport area element.
It also applies any "page start" values that were specified on change markers (a
change marker can specify what the actual page number of the first page of a change
set
should be, irrespective of what it is based on automatic page numbering). The initial
pagination styling does not attempt to create new page sequences because the change
markers can occur anywhere within the HTML hierarchy and thus are not easily used
to
trigger new page sequences. So it's easier to just update the page numbers in the
area
tree after the fact.
The output of this stage is an area tree where the page number details are reliable
and easy to get by subsequent processing.
For debugging purposes it also allows inspection of the resulting area tree to
verify that the page numbers have been set correctly.
-
Update page numbers
Marks pages as being either in the result pages (a required page or a page in a
change set) or not.
For pages that are in change sets, updates the display page numbers to reflect any
required point pages.
Generates any required blank even pages for change sets that do not naturally end
on
an even page or for which the following page is not already blank.
-
Filter pages
If an update package is requested, filters out all pages that are not marked as
being in the update package.
-
Renumber absolute page numbers
AHF captures the absolute page number of each page and requires that they be
correct, so the pages must be renumbered to reflect the absolute page numbers following
any filtering.
-
Final update processing
Does any remaining update processing, primarily updating page number references to
reflect the final display page numbers. Also filters out things in the area tree that
were needed for post processing but that are not needed or wanted in the final PDF
given
to customers, such as the hidden corner regions. Users of the PDF often cut and paste
from the PDF in order to draft new ordinances, and hidden content can be selected
and
copied, which is not good.
-
Update page number database
If the processing is producing a "final" update package, meaning a package that will
be delivered to the client, then the page number database is updated to reflect the
details of the new and changed pages for the publication.
Set Page Number and Format
This phase determines, for each page, what its ordinal page number is in the current
page sequence and what the display format of the number is (arabic, roman, etc.).
AHF does not capture this information because it is not available in the CSS
model.
In XSL-FO page numbers are defined explicitly as properties of page sequences and
then
reflected where needed through dedicated page number reference formatting objects.
In CSS, by contrast, page numbers are simply counters like any other counters and
are
not explicit properties of pages or page sequences. CSS does define a built-in counter
named
"page" that is automatically incremented for each new page, but that is just a convenience:
there is no requirement that the "page" counter be used to reflect page numbers.
On a given page, the page's own page number is reflected by reference to the page
number
counter in a normal content
property:
@page portrait-first:right {
counter-reset: footnote;
counter-reset: page 1;
@bottom-center {
content: string(prefolio, first) '\200B' counter(page) '\200B' string(postfolio, first);
margin-top: 1pc;
vertical-align: top;
font-family: "New Century Schoolbook", serif, 'Arial Unicode';
font-size: 10pt;
}
...
}
Thus there is no reliable way for a CSS processor to know what the intended page number
is
for a given page.
The AHF area tree markup includes attributes for the page number on the pageViewportArea
element:
<PageViewportArea
...
abs-page-number="1"
page-number="1"
format="1"
>
The
page-number attribute has a value but it is not reliable.
To work around this limitation in CSS, the CSS puts the ordinal page number and format
value into corner regions as hidden
text:
@page {
size: 8.5in 11in;
margin-left: 7.5pc;
margin-right: 7.5pc;
counter-reset: footnote;
background-image: attr(background-graphic, url);
background-repeat: no-repeat;
background-size: contain;
background-clip: border-box;
background-position: 50% 50%;
@bottom-left-corner {
content: '^npm^' counter(page);
visibility: hidden;
font-size: 8pt;
color: white;
font-family: monospace, 'Arial Unicode';
}
@bottom-right-corner {
content: '^pnf:1';
visibility: hidden;
font-size: 8pt;
font-family: monospace, 'Arial Unicode';
}
}
This results in easy-to-find data in the area tree from which the page number and
format
can be found and then set on the pageViewportArea elements as though it had always
been
there:
<MarginRegionViewportArea
visibility="hidden"
region-name="bottom-left-corner"
...
>
<MarginRegionReferenceArea ...>
<BlockArea ...>
<LineArea ...>
<TextArea
text="^npm^"
...
/>
<TextArea
text="1"
...
/>
</LineArea>
</BlockArea>
</MarginRegionReferenceArea>
</MarginRegionViewportArea>
Putting the page number and format on the PageViewportArea elements is not strictly
necessary but it makes follow-on processing simpler and avoids the need to repeatedly
search
for the page number details for a given page.
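A sketch of how the corner-region text might be read and recorded (the mode name is illustrative, and the production code also handles the page number format and is more defensive):
<!-- Sketch: copy the ordinal page number from the hidden bottom-left-corner
     region onto the page; the '^npm^' marker text itself is skipped. -->
<xsl:template mode="set-page-numbers" match="at:PageViewportArea">
  <xsl:variable name="corner-texts" as="xs:string*"
    select="at:PageReferenceArea/at:MarginRegionViewportArea
              [@region-name eq 'bottom-left-corner']//at:TextArea/@text"/>
  <xsl:copy>
    <xsl:apply-templates select="@* except @page-number" mode="#current"/>
    <xsl:attribute name="page-number" select="$corner-texts[. ne '^npm^'][1]"/>
    <xsl:apply-templates select="node()" mode="#current"/>
  </xsl:copy>
</xsl:template>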
Update Page Numbers
For a given sequence of changed pages it is necessary to update the page numbers of
each
page in the sequence to reflect both the initial page number of the sequence, if specified
on the change start processing instruction, and to reflect any required point page
numbers.
Updating the page numbers involves three main processing tasks:
-
Identifying the start and end pages of a change page set
-
Constructing the sequence of new page numbers based on the starting page number and
the page number of the page that follows the change set, if specified (it can be
unspecified if the change is followed by the start of a new page sequence where page
numbers are reset, i.e., the end of a chapter).
-
Updating the display page number on each page in the change set. This may involve
adjusting the horizontal position of the page number on the page to reflect a change
in width of the page number as displayed.
Identifying Change Page Sets
Change boundaries are reflected in the area tree by inline objects with IDs that have
a specific structure, created from areaTreeMarker elements generated as part of the
HTML
preprocess from the change marking processing instructions ("take markers") inserted
by
the authors.
The source as authored
is:
<?pdf take-begin job="S01" firstpage="15"?>
<section id="x88B6B95E7E6A" data-type="chapter" >
...
<?pdf take-end job="S01" firstpage-ref="15"?>
...
</section>
Which results in this HTML input to the pagination
process:
<areaTreeMarker id="take:take-begin:job=S01:firstpage=15:d16p512"/>
<section id="x88B6B95E7E6A" data-type="chapter" >
...
<areaTreeMarker id="take:take-end:job=S01:firstpage=15:d16p512"/>
...
</section>
And then these elements in the area
tree:
<InlineArea
id="take:take-begin:job=S01:firstpage=15:d16p512"
width="0pt" height="0pt"
...
/>
...
<InlineArea
id="take:take-end:job=S01:firstpage-ref=15:15:d16p512"
width="0pt" height="0pt"
...
/>
These marker areas are then used by utility functions that find pages that are within
change sets:
<!-- Determine if the context element is the start of a take
for the specified update ID. A given page can have zero or more take starts
for a given update ID.
@param context A PageViewPortArea element (or other element that
could be an ancestor of a take marker)
@param jobID The update ID to check for
@param doDebug Turns debugging on or off.
@return true if a take begin marker is found for the specified update ID.
-->
<xsl:function name="at:is-take-start" as="xs:boolean">
<xsl:param name="context" as="element()"/>
<xsl:param name="jobID" as="xs:string?"/>
<xsl:param name="doDebug" as="xs:boolean"/>
<xsl:variable name="page" as="element(at:PageViewportArea)"
select="$context/ancestor-or-self::at:PageViewportArea"
/>
<xsl:variable name="take-specifier" as="element()*"
select="at:get-take-specifier($context, $jobID)"
/>
<xsl:variable name="result" as="xs:boolean"
select="exists($take-specifier)"
/>
<xsl:sequence select="$result"/>
</xsl:function>
<!--
Gets the element that specifies the start of a take for the specified update.
@context Element to look in for a take specifier
@jobID The ID of the update to get the take specifier for
@return The first element that starts a take for the specified update, if any.
-->
<xsl:function name="at:get-take-specifier" as="element()?">
<xsl:param name="context" as="element()"/>
<xsl:param name="jobID" as="xs:string?"/>
<xsl:variable name="take-specifier" as="element()*"
select="($context//*[starts-with(@id, 'take:take-begin:')]
[contains(@id, concat('job=', $jobID))])[1]"
/>
<xsl:sequence select="$take-specifier"/>
</xsl:function>
These functions then make it easy to distinguish changed pages from unchanged
pages:
<!-- Context is the area tree root element -->
<xsl:template name="update-page-numbers">
<xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>
<xsl:for-each-group
select="at:PageViewportArea"
group-starting-with="at:PageViewportArea[at:is-take-start(., $jobID)]">
<xsl:choose>
<xsl:when test="at:is-take-start(., $jobID)">
<xsl:call-template name="update-page-numbers-for-take-group">
<xsl:with-param name="doDebug" as="xs:boolean" select="$doDebug"/>
<xsl:with-param name="pages" as="element(at:PageViewportArea)+"
select="current-group()"
/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<!-- Must be before first take -->
<xsl:apply-templates select="current-group()" mode="update-page-numbers">
<xsl:with-param name="doDebug" as="xs:boolean" select="$doDebug"/>
</xsl:apply-templates>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:template>
At this point we know the start of a set of changed pages (the take start) but we
don't know the end page.
A set of changed pages must always be an even number of pages.
If the explicitly-marked last page of the change set happens to be even then we're
done.
However, if the last page is not even then a blank backing page has to be added to
the
change sequence.
The blank page can either come from the area tree if the page following the last
changed page happens to be a blank even page (for example, a page generated in order
to
force the following page onto an odd page), otherwise it is necessary to synthesize
a
blank page. This is done by copying the current page and adding 1 to the page's absolute
and ordinal page numbers. Only those parts of the page that are needed on a blank
page are
retained, such as the page edge region that contains the display page number or other
edge
region components that need to be kept. The XSLT provides simple configuration to
adjust
this as needed. This also requires a function that can distinguish blank pages from
non-blank pages, i.e., pages where the effective string value of the page body is
only
whitespace or matches a configured "This page intentionally left blank" marker.
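A sketch of that blank-page test, assuming a configured marker string and ignoring the page edge regions (which always carry the folio), might look like this:
<!-- Assumed configuration: text used on intentionally-blank pages -->
<xsl:param name="blank-page-marker-text" as="xs:string"
  select="'This page intentionally left blank'"/>

<!-- A page is "blank" if its non-margin text is empty or is only the marker -->
<xsl:function name="at:is-blank-page" as="xs:boolean">
  <xsl:param name="page" as="element(at:PageViewportArea)"/>
  <xsl:variable name="body-text" as="xs:string"
    select="normalize-space(string-join(
              $page//at:TextArea[not(ancestor::at:MarginRegionViewportArea)]/@text, ' '))"/>
  <xsl:sequence select="$body-text = ('', $blank-page-marker-text)"/>
</xsl:function>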
Another non-obvious complication is when a change set initially ends with a blank
odd
page. In this case, rather than adding a second blank even backing page, you simply
omit
the blank odd page. It should never be possible to have a change set that ends with
three
blank pages. Two ending blank pages can happen either because of author error, where
unnecessary page breaks have been forced, or because some other combination of factors
results in two blank pages.
Constructing New Page Number Sequences
Given the set of pages in a change set, we know the number of pages in the set, the
starting page number, and the page number of the page that follows the change set,
if
any.
With that it is a simple matter of calculating the original page count (the difference
between the following page number and the starting page number) and comparing it to the
actual count of pages. Any page whose ordinal position is greater than the number of
original pages must be a point page.
The XSLT then constructs a sequence of new page numbers and uses it to update the
pages in the change set (some details omitted for
brevity):
<xsl:template name="update-page-numbers-for-take-group">
<xsl:param name="pages" as="element(at:PageViewportArea)+"/>
<xsl:variable name="page-numbers" as="xs:string*">
<xsl:variable name="first-page-number" as="xs:string?"
select="at:get-folio($take-start)"
/>
<!-- All pages should have page numbers of one form or another
but it could happen that a page has no number -->
<xsl:if test="exists($first-page-number)">
<xsl:sequence select="at:calculate-page-numbers-for-take(
$pages-in-take, $jobID)"
/>
</xsl:if>
</xsl:variable>
<xsl:for-each select="$pages-in-take">
<xsl:variable name="pos" as="xs:integer" select="position()"/>
<xsl:variable name="new-page-folio" as="xs:string?"
select="$page-numbers[$pos]"
/>
<xsl:apply-templates select="." mode="update-page-numbers">
<xsl:with-param name="new-page-folio" as="xs:string?"
select="$new-page-folio"
/>
<xsl:with-param name="in-job" as="xs:boolean" select="true()"/>
</xsl:apply-templates>
</xsl:for-each>
</xsl:template>
...
<xsl:function name="at:calculate-page-numbers-for-take" as="xs:string+">
<xsl:param name="pages" as="element()+"/>
<xsl:param name="jobID" as="xs:string"/>
<!-- The number of pages whose page number does not need to change -->
<xsl:variable name="ordinal-page-count" as="xs:integer"
select="$next-page-num - $first-page-num"
/>
<xsl:variable name="point-page-count" as="xs:integer"
select="count($pages) - $ordinal-page-count"
/>
<xsl:variable name="point-page-count" as="xs:integer"
select="if ($point-page-count lt 0)
then 0
else $point-page-count"
/>
<xsl:variable name="before-points" as="element()+"
select="$pages[position() le $ordinal-page-count]"
/>
<xsl:variable name="point-pages" as="element()*"
select="$pages[position() gt $ordinal-page-count]"
/>
<!-- Generate display page numbers for each page before the point pages -->
<xsl:for-each select="$before-points">
<xsl:variable name="page-num-base" select="string($first-page-num + position() - 1)" as="xs:string"/>
<xsl:number value="$page-num-base" format="{$page-number-format}"/>
</xsl:for-each>
<!-- Generate display page numbers for point pages -->
<xsl:for-each select="$point-pages">
<xsl:variable name="point-number" as="xs:integer" select="position()"/>
<xsl:sequence select="concat($point-page-base-formatted, '.', $point-number)"/>
</xsl:for-each>
</xsl:function>
At this point, the display page numbers of each page in each change set have been
updated to reflect the application of starting page numbers and point page numbers.
References to these pages have not yet been updated.
This process also marks every page as being "in the job" or "out of the job", which
is
then used by the filtering step. Any page that is part of a change set or is a page
that
is always included (cover pages, insertion instructions, etc.) are marked as in the
job,
all other pages are marked as out of the job.
Note that the code does not bother to adjust the horizontal position of point page
numbers on the pages for the simple reason that there's no easy way to know how the
page
number alignment should be adjusted: centered, left aligned, or right aligned? At
least
for the Municode styles, the visual effect of adding a point page or changing a number
from 1 to 2 digits is minimal and would not normally be noticed.
However, references to the pages in some contexts do need to be adjusted, for example
in the table of contents, as the numbers are consistently right-aligned and therefore
a
change in horizontal placement will be noticeable. These documents do not normally
use
page numbers for references in normal flowed text, which simplifies the problem.
If page numbers were used in flowed text it would probably be necessary to reserve extra
space around the numeric part of the page number reference, or to use some marker technique
to indicate what the current alignment and justification are. Given that information there
should be no problem adjusting the horizontal position of the text before or after the page
number reference. In the worst case, the code would need to apply word or character spacing
adjustments to avoid having a long number push the line past a margin (for example, where
the text would end up overlapping a border or other text not tagged as being in the same
line on the page).
Because the area tree includes all geometric information down to the individual text
string level, it's always possible to adjust the layout or otherwise detect overlaps
but
that level of sophistication should not normally be needed.
In legal publications the normal practice is to refer only to a section or paragraph
number or a figure or table number (and possibly title), which simply avoids the problem.
Page numbers are usually limited to generated navigation structures like tables of
contents and indexes.
Filter Pages
If a filtered update package is requested, then pages not in the update are filtered
out. This is simply a matter of omitting any PageViewportArea that is not marked as
being in
the job, a value that is set on every page in the preceding
phase:
<xsl:template name="process-takes">
<xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>
<xsl:for-each-group select="at:PageViewportArea" group-adjacent="@in-job eq 'true'">
<xsl:choose>
<xsl:when test="self::at:PageViewportArea[@in-job eq 'true']">
<xsl:for-each-group select="current-group()" group-starting-with="*[at:is-take-start(., $jobID)]">
<xsl:sequence select="current-group()"/>
</xsl:for-each-group>
<!-- A set of take pages -->
</xsl:when>
<xsl:otherwise>
<!-- Not take pages, ignore. -->
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:template>
This code uses for-each-group, a holdover from earlier logic that was more complex and
actually required grouping. It could be done with a simple apply-templates and a pair of
templates, one that matches pages with @in-job eq 'true' and one that matches those
without.
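A minimal sketch of that simpler formulation (the mode name is illustrative), relying on the @in-job flag set in the previous phase, would be:
<!-- Keep pages that are in the job; drop everything else -->
<xsl:template mode="filter-pages" match="at:PageViewportArea[@in-job eq 'true']">
  <xsl:sequence select="."/>
</xsl:template>
<xsl:template mode="filter-pages" match="at:PageViewportArea"/>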
Renumber Absolute Page Numbers
The area tree that is input to this phase reflects the final set of pages to be
rendered. To meet AHF requirements the @abs-page-number attribute on each page
must
correctly reflect the ordinal position of the page in the
document:
<xsl:template mode="renumber-abs-pages" match="@abs-page-number">
<xsl:variable name="pageNumber" as="xs:integer"
select="count(../preceding-sibling::*) + 1"
/>
<xsl:attribute name="{name(.)}" select="$pageNumber"/>
<xsl:attribute name="orig-abs-page-number" select="."/>
</xsl:template>
The rest of this mode is just normal identity transform processing that handles
PageViewportArea elements.
Final Update Processing
The final processing step is to update page number references to reflect the updated
page numbers for changed pages and, if the area tree has been filtered, to reflect
page
numbers from the page history database for references to pages that are not changed.
This requires adjusting the horizontal position of text on the line in which a page
number reference occurs in order to account for the change in width from the original
page
number. This requires knowing the difference in displayed width between the original
page
number and the new page number.
Page numbers for pages that are not changed are pulled from the page history database's
mapping of element IDs to page numbers.
Page History Database
Conceptually the page history database consists of two separate data sets:
- A record for each physical page (odd page and backing even page) by display page
  number capturing, for each update it occurs in, the absolute page number, the folio
  details for the front and back pages, and the first line of the odd page (which
  helps with debugging).
- A record for each element ID capturing, for each update that element occurs in,
  the absolute page number and folio details of the page the element starts on, as
  well as a time stamp of when the record was created. The title of the element is
  also captured (elements with IDs must have titles or they would not be targets of
  page number references in this content).
The physical page records enable generation of the update instructions and the list of
effective pages. The element history records enable generating page number references to
elements on unchanged pages.
The history database is maintained over time for a given publication and is updated when
each update is published for delivery to the client.
The page history database could be implemented in many different ways. For this project,
the initial implementation uses an XML file that is maintained alongside the publication's
source in the version control system. The file is read during area tree post processing
and updated when a final publication is produced.
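As a rough illustration only (the element and attribute names here are invented, not the project's actual schema), such a history file might look something like this:
<page-history publication="lovettsville-code">
  <!-- One record per physical page, keyed by display page number, with the
       details captured for each update the page occurs in. -->
  <physical-pages>
    <page display-number="7" update="update-03" abs-page-number="13"
          front-folio="7" back-folio="8"
          first-line="Boards and Commissions"/>
  </physical-pages>
  <!-- One record per element ID, capturing where the element started in each update. -->
  <element-history>
    <element id="x88B6B95B4E50" update="update-03" abs-page-number="13"
             folio="7" title="Boards and Commissions"
             timestamp="2019-06-12T10:15:00-05:00"/>
  </element-history>
</page-history>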
A more robust implementation might use a dedicated database application and a REST
service to manage the database, but the scope and infrastructure for that were not
available in this project.
Capturing Element Target Details In The Area Tree
In CSS a page number reference is a counter reference scoped to a specific
element:
div.toc-entry > span.page > a.body:before,
div.minitoc.pg div.minitoc-entry > span.page > a:before
{
content: target-counter(attr(href url), page);
}
The area tree result is just the resulting text unless something more is done.
The <a> element is also rendered as a navigable hyperlink using the AHF -ah-link:
extension property:
a[href]
{
-ah-link: attr(href url)
}
In order to mark references to page numbers, the HTML preprocess adds spans with
structured ID values that identify the text as a page number reference:
<div class="toc-entry">
<span class="title">
<a href="#x88B6B95B1C04">
<span class="h1">
<span class="uppercase">Officials </span> of the <span class="uppercase">Town of
Lovettsville, Virginia At the Time of This Codification </span></span></a></span>
<span class="page" id="page:d121e33">
<span class="page-number-marker">​</span>
<a id="pageref:d121e36" class="frontmatter" href="#x88B6B95B1C04"></a>
<span class="page-number-marker">​</span>
</span>
</div>
The span with a class of "page" marks its content as being a page number (full folio)
using the structured @id value of "page:{generated-id}", while the <a> element's @id of
"pageref:{generated-id}" marks it as a page number reference (just the page number part
of the folio).
This results in reliably-findable markup in the area tree:
<InlineArea
id="page:d121e49"
...>
<InlineArea ...>
<TextArea ...
text="​"
/>
</InlineArea>
<InlineArea
internal-destination="x88B6B95B4E50"
id="pageref:d121e52"
...>
<InlineArea ...>
<TextArea ...
text-width="5.37pt"
text="v"
/>
</InlineArea>
</InlineArea>
<InlineArea ...>
<TextArea ...
text="​"
/>
</InlineArea>
</InlineArea>
The outer InlineArea is identified as a page (full folio). The InlineArea elements
whose content is a zero-width space (U+200B) mark the boundaries between the prefolio,
page number, and postfolio. There is no prefolio or postfolio in this example, and the
page number is identified by the @id "pageref:d121e52". Note that the TextArea specifies
both the text ("v") and the rendered width ("5.37pt").
In addition, the target ID is automatically captured by AHF on the
@internal-destination attribute of the InlineArea that results from the -ah-link:
extension property.
The final challenge is finding the target element's starting position within the area
tree, in order to locate the physical page referenced and then get the new page number
for that page. While AHF will capture the ID of any element in the area tree, it won't
necessarily do so at the right position.
Thus the HTML preprocess generates markers that signal the start and end of elements
to which page number references might be made (sections, figures, and tables in the
current content set):
<section id="x88B6B95B4E50" ...>
<header>
...
<h1 style="line-height:2em;">
<span class="uppercase">Boards and Commissions</span>
<br /> of the <br />
<span class="uppercase">Town of <br /> Lovettsville, Virginia</span></h1>
<p data-type="startpage">7</p>
</header>
<areaTreeMarker
id="marker:section:officialsoriginal:start:narrow:startpage=7:id=x88B6B95B4E50"
/>
...
<areaTreeMarker
id="marker:section:officialsoriginal:end:narrow:id=x88B6B95B4E50"
/>
</section>
Note that the marker is emitted after the header, which produces the section title in
the final result. This ensures that the marker in the area tree occurs on the same page
as the title and not on the page before, for example, if the CSS for the header creates
a page break or the keep rules result in a forced break before the title.
This then results in inline areas in the area tree that serve as reliable markers for
the start and end of the section. Only the start marker is needed for page number
updating, but the end marker is needed in order to count the pages taken up by some
sections, such as the tables of contents and indexes, where the page count is needed,
for example, to distinguish the count of generated pages from authored pages for billing
purposes.
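For example, a page count for a section could be derived from the two markers roughly as follows (the function name is hypothetical; the project's actual helpers differ in detail):
<!-- Hypothetical sketch: count the pages spanned by a section, using the
     start and end marker inline areas injected by the HTML preprocess. -->
<xsl:function name="at:count-section-pages" as="xs:integer">
  <xsl:param name="area-tree" as="document-node()"/>
  <xsl:param name="section-id" as="xs:string"/>
  <!-- The page containing the section's start marker. -->
  <xsl:variable name="start-page" as="element()"
    select="($area-tree//at:InlineArea
               [starts-with(@id, 'marker:') and contains(@id, ':start:')
                and contains(@id, concat('id=', $section-id))])[1]
            /ancestor::at:PageViewportArea[1]"/>
  <!-- The page containing the section's end marker. -->
  <xsl:variable name="end-page" as="element()"
    select="($area-tree//at:InlineArea
               [starts-with(@id, 'marker:') and contains(@id, ':end:')
                and contains(@id, concat('id=', $section-id))])[1]
            /ancestor::at:PageViewportArea[1]"/>
  <!-- Pages from the start page through the end page, inclusive. -->
  <xsl:sequence
    select="count($start-page |
                  $start-page/following-sibling::at:PageViewportArea[not(. >> $end-page)])"/>
</xsl:function>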
To get the actual page number for a given target element the code uses the same logic
used during initial page update processing to find the rendered page number on the page.
That is, once you have a page in the area tree, getting that page's display page number
is a single function call to the appropriate utility function:
<xsl:variable name="target-page" as="element()?"
select="at:get-referenced-page($context)"
/>
<xsl:choose>
<xsl:when test="at:is-within-take($target-page)">
<xsl:sequence select="at:get-folio($target-page)"/>
</xsl:when>
<xsl:otherwise>
<xsl:sequence select="at:get-stored-page-number-for-pageref($pageListingDoc, $context, false())"/>
</xsl:otherwise>
</xsl:choose>
Updating Page Number References
The display page numbers are constructed with a format that makes them reliable to
find:
<LineArea text-altitude="8.87939pt" ...>
<TextArea text="​" .../>
<TextArea text="2" .../>
<TextArea text="​" .../>
</LineArea>
The marker is a LineArea containing TextArea elements where the text contains a
zero-width space (\u200B). Zero-width spaces are not otherwise used for anything in
this
content and thus make ideal marker characters as they have no visual effect.
The XSLT provides helper functions that work with display page numbers: finding them,
getting the folio parts, etc.
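To give the flavor of those helpers, a simplified, hypothetical stand-in for a folio lookup (not the project's actual function) might be:
<!-- Hypothetical sketch: find the line containing the zero-width-space marker
     characters on a page and return the folio text between them. -->
<xsl:function name="at:get-display-folio" as="xs:string?">
  <xsl:param name="page" as="element()"/><!-- at:PageViewportArea -->
  <!-- The folio line is the one whose text areas include the marker character. -->
  <xsl:variable name="folio-line" as="element()?"
    select="($page//at:LineArea[.//at:TextArea[@text = '&#x200B;']])[1]"/>
  <!-- Everything other than the markers is the folio text itself. -->
  <xsl:sequence
    select="string-join($folio-line//at:TextArea[@text ne '&#x200B;']/@text, '')"/>
</xsl:function>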
The practical challenge with updating page numbers is that the width of the new page
number will usually be different from the width of the original, which means that the
horizontal position of the text area has to be adjusted to account for the difference
in width.
However, the post process does not have any direct way to know what the rendered width
of the new page number is.
One solution would be to use extension functions and something like the Java2D
libraries to render the characters using the font details available in the area
tree.
For this project we took a more brute-force but more reliable approach, namely to
include in the area tree an instance of every possible character that could occur in a
page number (0-9, a-z, A-Z, ".", "-", ":", etc.) in every font and size used in the
publication, in a form that enables lookup of the rendered characters. This set of
"character samples" is generated as part of the HTML preprocessing.
The generated HTML is:
<char-samples id="util:char-samples">
<char-sample class="body">
<char-set class="number-set sz83" id="util:char-set:body-sz83">
<decimal>.</decimal><char>0</char><char>1</char><char>2</char><char>3</char><char>4</char><char>5</char><char>6</char>
<char>7</char><char>8</char><char>9</char><char>a</char><char>b</char><char>c</char><char>d</char><char>e</char><char>f</char>
<char>g</char><char>h</char><char>i</char><char>j</char><char>k</char><char>l</char><char>m</char><char>n</char><char>o</char>
<char>p</char><char>q</char><char>r</char><char>s</char><char>t</char><char>u</char><char>v</char><char>w</char><char>x</char>
<char>y</char><char>z</char><char>A</char><char>B</char><char>C</char><char>D</char><char>E</char><char>F</char><char>G</char>
<char>H</char><char>I</char><char>J</char><char>K</char><char>L</char><char>M</char><char>N</char><char>O</char><char>P</char>
<char>Q</char><char>R</char><char>S</char><char>T</char><char>U</char><char>V</char><char>W</char><char>X</char><char>Y</char>
<char>Z</char>
</char-set>
...
</char-sample>
...
</char-samples>
Because each <char-set> specifies one of the font-setting classes from the CSS, the
font and size will be present in the area tree. Those values then serve as the lookup
key for any character, along with the character itself. The only maintenance task is
keeping the generating XSLT in sync with the CSS styles so that all relevant font and
size combinations are represented.
The resulting area tree markup is then:
<BlockViewportArea id="util:char-samples" ...>
<BlockArea
id="util:char-set:body-sz83" ...>
<LineArea ...>
<InlineArea ...>
<TextArea ...
width="2.224pt"
font-family="NewCenturySchlbk LT Std"
font-size="8pt"
text-width="2.224pt"
text="."
/>
</InlineArea>
...
</LineArea>
</BlockArea>
</BlockViewportArea>
Because each source character is in a separate element, there is exactly one text area
per character in the area tree, and each text area specifies the exact width in points
of its character.
The following XSLT looks up the sample for a character given its font family and size:
<xsl:key name="char-samples"
match="at:BlockArea[starts-with(@id, 'util:char-set:')]//at:TextArea"
use="at:make-char-sample-key(@font-family, @font-size, @text)"
/>
<!-- Try to find a character sample for the specified font family, size, and text value. -->
<xsl:function name="at:get-char-sample" as="element(at:TextArea)?">
<xsl:param name="context" as="element()"/><!-- Area tree element -->
<xsl:param name="font-family" as="xs:string"/>
<xsl:param name="font-size" as="xs:string"/>
<xsl:param name="text" as="xs:string"/>
<xsl:variable name="key" as="xs:string"
select="at:make-char-sample-key($font-family, $font-size, $text)"
/>
<xsl:variable name="result" as="element()?"
select="key('char-samples', $key, root($context))[1]"
/>
<xsl:sequence select="$result"/>
</xsl:function>
<xsl:function name="at:make-char-sample-key" as="xs:string">
<xsl:param name="font-family" as="xs:string"/>
<xsl:param name="font-size" as="xs:string"/>
<xsl:param name="text" as="xs:string"/>
<xsl:variable name="key" select="string-join(($font-family, $font-size, $text), ':')"/>
<xsl:sequence select="$key"/>
</xsl:function>
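For example, a string-width helper like the at:get-string-width function used below might be built on top of at:get-char-sample by summing the sampled widths of the individual characters (a sketch only; the project's actual implementation may differ):
<!-- Sketch: compute the rendered width of a string by summing per-character
     sample widths, using the font details from a reference text area. -->
<xsl:function name="at:get-string-width" as="xs:double">
  <xsl:param name="context" as="element(at:TextArea)"/><!-- Supplies font family and size -->
  <xsl:param name="text" as="xs:string"/>
  <xsl:variable name="char-widths" as="xs:double*">
    <xsl:for-each select="string-to-codepoints($text)">
      <xsl:variable name="sample" as="element(at:TextArea)?"
        select="at:get-char-sample($context, $context/@font-family,
                                   $context/@font-size, codepoints-to-string(.))"/>
      <xsl:sequence
        select="if (exists($sample))
                then at:get-length-value($sample/@text-width)
                else 0.0"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:sequence select="sum($char-widths)"/>
</xsl:function>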
Then it's simply a matter of doing the math to determine the amount of adjustment to
apply to the text areas in the affected line area and updating them, shown here for the
case of page references within ToC entries that use leaders (the most common case):
<!-- Page references in lines with leaders (e.g., ToC pages). This will be the most common case.
There should only be one page reference, which means we can simply calculate the adjustment to
the leader and all inline and text nodes following the leader should be good to go.
The prefolio and postfolio should always be the same.
-->
<xsl:template match="at:LineArea[at:is-within-page-reference-line(.)]">
<xsl:variable name="pagerefs" as="element(at:InlineArea)+"
select=".//at:InlineArea[starts-with(@id, 'pageref:')]"
/>
<!-- There should be exactly one TextArea in the pageref InlineArea -->
<xsl:variable name="pageref" as="element(at:InlineArea)"
select="$pagerefs[1]"
/>
<xsl:variable name="target-page" as="element()?"
select="at:get-referenced-page($pageref)"
/>
<xsl:variable name="page-ref-text-area" as="element(at:TextArea)?"
select="($pageref//at:TextArea)[1]"
/>
<xsl:variable name="orig-page-number" as="xs:string?"
select="$page-ref-text-area/@text"
/>
<!-- NOTE: This is the formatted page number, e.g., 'xix', not 19 -->
<xsl:variable name="page-number" as="xs:string?"
select="at:get-referenced-page-number($pageref)"
/>
<xsl:choose>
<xsl:when test="empty($page-number)">
<!-- Nothing to do, just use what we have -->
<xsl:next-match/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="current-width" as="xs:double"
select="at:get-length-value($page-ref-text-area/@text-width)"
/>
<xsl:variable name="new-number-width" as="xs:double"
select="at:get-string-width($page-ref-text-area, $page-number)"
/>
<xsl:variable name="width-difference" as="xs:double"
select="$new-number-width - $current-width"
/>
<xsl:choose>
<xsl:when test="$width-difference ne 0.0">
<!-- Now figure out how many dots to remove from the
leader to account for the added space: -->
<xsl:variable name="leader-area" as="element(at:LeaderArea)"
select="$pageref/preceding::at:LeaderArea[1]"
/>
<xsl:variable name="leader-string" as="xs:string"
select="$leader-area/at:TextArea/@text"
/>
<xsl:variable name="leader-width" as="xs:double"
select="at:get-length-value($leader-area/at:TextArea/@text-width)"
/>
<xsl:variable name="dots" as="xs:string*"
select="tokenize($leader-string, ' ')"
/>
<xsl:variable name="dot-width" as="xs:double"
select="$leader-width div count($dots)"
/>
<xsl:variable name="sign" as="xs:integer"
select="if ($width-difference lt 0) then 1 else -1"
/>
<xsl:variable name="dot-count-diff" as="xs:integer"
select="if (number($dot-width))
then (($width-difference idiv $dot-width) + 1) * $sign
else 0
"
/>
<xsl:variable name="dot-count" as="xs:integer"
select="$dot-count-diff + count($dots)"
/>
<xsl:variable name="new-leader-string" as="xs:string?"
select="string-join(for $n in 1 to $dot-count return $dots[1], ' ')"
/>
<xsl:variable name="new-leader-width" as="xs:double"
select="$dot-width * $dot-count "
/>
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:apply-templates>
<xsl:with-param name="new-leader-string" as="xs:string?"
tunnel="yes" select="$new-leader-string"
/>
<xsl:with-param name="new-leader-width" as="xs:double?"
tunnel="yes" select="$new-leader-width"
/>
<xsl:with-param name="left-adjust" as="xs:double"
tunnel="yes" select="$width-difference"
/>
<xsl:with-param name="page-ref" as="element(at:InlineArea)?"
tunnel="yes" select="$pageref"
/>
</xsl:apply-templates>
</xsl:copy>
</xsl:when>
<xsl:otherwise>
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:apply-templates>
<xsl:with-param name="doDebug" as="xs:boolean"
tunnel="yes" select="$doDebug"
/>
<xsl:with-param name="new-leader-string" as="xs:string?"
tunnel="yes" select="()"
/>
<xsl:with-param name="new-leader-width" as="xs:double?"
tunnel="yes" select="()"
/>
<xsl:with-param name="left-adjust" as="xs:double"
tunnel="yes" select="0.0"
/>
<xsl:with-param name="page-ref" as="element(at:InlineArea)?"
tunnel="yes" select="$pageref"
/>
</xsl:apply-templates>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
Update Page Number Database
The final task is to update the page number database if the pages being produced are the
final delivered version of the update package. (The database can be updated at any time
for a given update until that update is published, at which point the entries for that
update are fixed and should not be modified.)
The processing reads the page history database from its location in the source version
control working copy and writes a new one to a temporary location, avoiding the
restriction in XSLT on reading from and writing to the same file. A separate processing
script then copies the new version of the page history database to the working copy, if
an update of the page history database has been requested as part of the processing run.
The logic for updating the page listing is simply a matter of processing each page to
create or update a physical page entry and then processing each marked ID to create or
update the element-to-page entry for that element.
Processing starts with the existing page database: pages already reflected in it are
processed first, then any pages not yet reflected. If there is no existing database, a
new one is synthesized before the normal processing is applied to it.
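A minimal sketch of the page-record side of that update, using the invented history-file markup shown earlier (the real logic also updates entries for pages that are already recorded):
<!-- Hypothetical sketch: append records for rendered pages that are not yet
     reflected in the (invented) physical-pages structure shown earlier. -->
<xsl:template mode="update-history" match="physical-pages">
  <xsl:param name="pages" as="element(at:PageViewportArea)*" tunnel="yes"/>
  <xsl:copy>
    <!-- Keep the existing records. -->
    <xsl:copy-of select="@*, page"/>
    <!-- Add a record for each rendered page not already present, keyed by display number. -->
    <xsl:for-each
      select="$pages[not(at:get-folio(.) = current()/page/@display-number)]">
      <page display-number="{at:get-folio(.)}"
            abs-page-number="{@abs-page-number}"/>
    </xsl:for-each>
  </xsl:copy>
</xsl:template>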
For debugging purposes the project includes a simple style sheet that generates an
HTML
view of the page history database.
A final challenge with the page history database is creating it initially for
publications being migrated from the old XPP system to the new system.
While it is possible to generate a history of the physical pages from the XPP data,
there is no easy way to correlate the sections, figures, and tables in the XPP version
to the corresponding elements in the new XML version, or to capture the page numbers
those elements fell on in the XPP version. This means that the initial element history
database must be populated by hand, a tedious manual process. While there may be ways to
better automate it, the project scope did not allow us to explore them. For example, it
might be possible to examine the last published PDF and correlate elements by their
titles and other positional clues. We also considered the possibility of generating area
trees from PDFs and then populating those area trees with the necessary start and end
markers. But even with that level of automation there would still be a necessary quality
assurance step to verify that the associations were correct. Fortunately, this is a
one-time task for each publication being migrated.
Conclusions and Future Work
Given markup in the publication source that identifies the start and end of changed
content it is possible to implement generation of loose-leaf change packages with
automatically-generated page numbers using CSS for pagination and post processing
of the
initial area tree produced by the pagination engine.
While the implementation required developing a number of tricks for getting the
information needed for post processing into the area tree, the resulting data processing
was
not overly challenging. It did take us several iterations to work out the final processing
pipeline, with the usual false starts, but the final solution seems to have a level
of
complexity commensurate with the complexity of the problem itself.
The scope of the project did not include automating the identification of changes.
Given
an XML differencing tool like DeltaXML it should be possible to compare two area trees
and
determine that a sequence of pages has changed and then inject the appropriate markers
back
into the publication source for review and adjustment as needed.
Another potential area of work is automating the insertion of changed pages into an area
tree that reflects the full update history of the publication. Municode currently does
this manually in PDF, but it should be possible to automate it using area trees from
which a PDF can then be generated as needed.