Problem Statement
Loose-leaf publishing is the production of updates to previously-printed documents where the numbers of the previously-printed pages must be preserved. When an update to a document results in new pages, those pages are given page numbers that reflect the last original page's number plus a modifier, e.g., "10.1", "10.2", etc. Such pages are often called "point pages". Updates are produced that contain only the new or changed pages; these are then manually inserted into copies of the target document, bringing those copies up to date with the latest version of the master document.
Loose-leaf publishing was commonplace before the advent of low-cost printers and digital document delivery. In a world where you can regenerate a PDF or print 1000 pages on a high-volume laser printer in minutes, the need for loose-leaf publication has all but disappeared.
One area where the requirement still exists is municipal code, a specialized area of legal publishing.
Legal documents, in particular codified municipal law and regulations, present several practical problems:
-
The documents tend to be large: 2000 pages for a city's municipal code is typical.
-
People and other documents make references to the previously-published page numbers.
-
Municipal code is updated frequently: every city council meeting will likely result in new or changed ordinances that cause changes to the codified municipal code.
-
The documents have very long life cycles. While cities may choose to "reflow" their code periodically, republishing it in its entirety with new page numbers, reflows may only be done once a decade or less.
-
City staffers and others who work with the city code maintain printed copies of the municipal code to support their day-to-day jobs. It would be disruptive to completely replace these working copies every time the code is updated.
The size of the publications, the frequency of update, and the number of actively-used printed copies would make reprinting the entire code for every update prohibitive, leaving aside the review and quality assurance implications of republishing a 2000-page document with critical legal implications.
Municipalities do not, as a rule, do their own codification. Codification and publication of municipal code is a service. One such service provider is Municode, one of the largest suppliers of municipal codes in the U.S.
Municode had been using the Xyvision Parlance Publisher (XPP) product for decades to produce loose-leaf pages for municipal code. XPP did the job but was being used as a traditional typesetting system, not as an SGML or XML publishing system: codifiers authored directly in XPP's typesetting format ("gencode"), a format whose lineage goes right back to the beginnings of structured markup and computerized typesetting.
Municode realized they needed to replace their XPP system with a modern XML-based publishing pipeline. They developed an HTML5-based vocabulary for the source, decided to use CSS pagination as the layout technology, and selected Antenna House Formatter (AHF), which implements CSS pagination, as the pagination engine.
However, AHF does not itself do loose-leaf publishing, so loose-leaf processing would need to be implemented. In particular, AHF (and CSS generally) does not provide a direct way to generate point page numbers or references to them.
This author had previously designed an approach to using Antenna House area tree post processing to produce change pages, in the context of a proposal to a U.S. federal agency for publishing updates to publications used by field agents. The proposal was not accepted, but the design had been provided to Antenna House as part of the proposal development process. Antenna House recommended me to Municode, and I was hired by Municode to implement loose-leaf publishing.
Because Municode insisted on using CSS rather than XSL-FO for doing the pagination styling, I also ended up developing the pagination styles and the preprocessing needed to produce the required XML input to the CSS pagination process.
Antenna House Formatter was selected as the pagination engine because it was at the time the only CSS pagination implementation that met all of Municode's typesetting requirements. If Municode had been willing to use XSL-FO it probably would have been possible to use Apache FOP as FOP provides the necessary layout features and also produces an area tree, enabling the post processing required to generate the point page numbers.
Loose Leaf Challenges
One challenge inherent in loose-leaf publishing is determining which pages have changed between two versions of a document. Municode does this manually as an editorial activity, so automatic determination of pagination differences was not a requirement for the initial implementation. Determining changes automatically is a potential area for future study.
The general challenge then was to implement an automated process that takes XML source as input and produces "change packages" as output.
The high level processing pipeline is:
-
Editors prepare XML source, including marking the starts and ends of changed pages, where the start always reflects the start of a page in the previously-published version and the end is wherever the change ends.
-
The input XML source is preprocessed to generate XHTML, augmented as needed to enable both CSS pagination generally and change page generation specifically.
-
The augmented XHTML is rendered by AHF using the CSS styles to produce an initial area tree.
-
The initial area tree is processed to update the page numbers of pages that should have point page numbers and references to those pages. If a change package is being produced, all unchanged pages are filtered out, producing a result area tree that reflects just the changed pages and any other pages required for the package, such as a generated "update instructions" section, table of contents, cover, etc.
-
A master page history database is updated to reflect the page details in the updated version of the publication, including the mapping from elements with IDs to their start and end pages.
-
The updated area tree is rendered by AHF to PDF.
The primary processing challenge in producing the point pages is knowing at which point within a sequence of changed pages point pages are required.
The incoming source marks the start of the changed pages and specifies the page number that the first page of the change had in the previously-published version, and, if necessary, the page number of the page that follows the changed pages.
At authoring time, the author has no way to know how many pages the change will produce and thus no way to know where the first point page will start or if point pages are required at all. If the change is at the end of a section where the following section starts a new page number sequence, then there is no need to specify the page number that follows the change. Otherwise, the page number following the change must be specified.
Given the start of the change and the page number of the page following the change, it is then possible to determine, once the pages are laid out, whether or not the number of pages is greater than the original set of pages for the same source range and thus determine which of the changed pages require point page numbers. It is then possible to update the page numbers on those pages and update any page number references to elements that start on those pages (for example, table of contents entries, cross references, etc.).
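To make the arithmetic concrete, here is a minimal sketch using hypothetical numbers (not taken from any Municode publication): a change that originally occupied pages 10 through 12, where the following page is 13, re-renders as five pages.

<!-- Illustrative sketch only: first-page = 10, next-page = 13, so the original
     range held 3 pages; the re-rendered change occupies 5 pages, so 2 point
     pages are required. -->
<xsl:variable name="first-page" as="xs:integer" select="10"/>
<xsl:variable name="next-page" as="xs:integer" select="13"/>
<xsl:variable name="actual-page-count" as="xs:integer" select="5"/>
<xsl:variable name="ordinal-page-count" as="xs:integer"
  select="$next-page - $first-page"/><!-- 3 -->
<xsl:variable name="point-page-count" as="xs:integer"
  select="max(($actual-page-count - $ordinal-page-count, 0))"/><!-- 2 -->
<!-- Resulting display page numbers: 10, 11, 12, 12.1, 12.2 -->
<xsl:variable name="page-numbers" as="xs:string*"
  select="(for $i in 1 to $ordinal-page-count return string($first-page + $i - 1),
           for $i in 1 to $point-page-count
           return concat($first-page + $ordinal-page-count - 1, '.', $i))"/>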
Another challenge is generating tables of contents for change packages.
When producing a change package, the entire publication's table of contents must be replicated. However, the only reliable page numbers in the rendered document are those for the changed pages themselves. All other page numbers are unreliable for a variety of reasons, not least of which is that the previously-published version may have been produced with a completely different tool (XPP instead of AHF); there may also be differences in layout details between versions of AHF itself, or changes to the CSS that affect the pagination details.
Thus the page numbers reflected in the table of contents for elements on pages not in the current set of changes must reflect the page numbers those elements fell on in the last published version. This requires some form of pagination database that correlates elements to the page numbers they were published on in a given version (point in time) of the publication.
A final challenge is generating the "update instructions" and "list of effective pages" sections of the change package.
The update instructions specify, for each set of changed pages in the current update package, what pages from the previously-published version to remove and what new pages to insert. This requires knowing, for a given page, what its page number was in the previous version and what its page number is in the current version.
The list of effective pages indicates, for each page in the publication, in which update (version in time) of the publication it was last changed. This requires a historical record for each physical page that captures its update history.
With these two sets of historical data it is then possible to fully generate the update instructions and list of effective pages. Both of these artifacts had to be manually created in the old system. The tasks were both tedious and error prone, adding significant time to the production process and limiting the ability of Municode to produce updates in a timely fashion.
Finally, note that Municode also publishes municipal code on the web, so another business requirement was to have a single authored source that could then directly serve both print production and web delivery with no manual intervention in either production process.
In the old system, the XPP typesetting source was first converted to an intermediate XML format from which the final delivered HTML was generated. This process was not 100% automatic, did not always result in 100% correct online results, and was also time consuming.
CSS Pagination Challenges
CSS pagination[1] presents a number of challenges:
-
The CSS pagination specifications are not complete, either editorially or functionally
-
The CSS pagination specifications are spread across a large number of separate specifications
-
CSS itself is optimized for styling HTML
Compared to XSL-FO purely in terms of the layout and typographic features defined in the standard, CSS lacks a number of important features: complete control over header and footer geometry, a way to impose link behavior onto arbitrary elements or attributes, anything comparable to XSL-FO's table marker feature, and other more esoteric and lesser-used features of XSL-FO, especially around support for Asian languages.
The clear advantage of CSS for pagination is the ease of specifying the styles themselves. CSS is objectively much easier to specify than XSL-FO for the simple reason that a CSS style sheet is completely separate from the source being styled, while XSL-FO is a source format that must be generated by a transform. The need to generate XSL-FO has the effect of conflating the data processing aspect with the styling aspect in a way that makes it hard to separate the two concerns.
Another advantage of using CSS for pagination is that it allows the same core styles to be used for browser and print delivery, if that is a requirement.
Because CSS is not itself a transformation language, it necessarily keeps separate the concerns of data processing and styling.
From a practical standpoint, it is much easier to find people who know CSS or are willing to learn it than it is to find people who know XSL-FO or are willing to learn it. Even if the two technologies were otherwise functionally equivalent, the simple ability to find people to do CSS work would be a significant advantage.[2]
As a style language, CSS can only do decoration; it cannot reorder content or create complex elements. This keeps CSS architecturally and syntactically simple, a requirement for use in browsers where rendering speed is paramount, but it means that typical source documents, even if authored in HTML, cannot be completely styled as authored.
Another important CSS limitation is selectors: CSS selectors cannot look ahead of the current element, meaning you cannot directly style an element based on properties of its descendants or following elements.
In addition, because CSS is specifically designed for styling HTML it cannot be used to fully style arbitrary XML without extensions.
Thus, using CSS for pagination effectively requires generating HTML that is suitable for rendering. While the same augmentation tasks could be applied to any XML, because CSS is optimized for HTML, it's easier and usually more appropriate to simply generate HTML. For many, if not most, XML vocabularies there will already be an HTML generation transform that can be repurposed to generate CSS pagination-ready HTML. Note that this also avoids the need for a completely separate XSL-FO generation transformation, which can be a significant savings.
This generation step is analogous to the generation often done by JavaScript in browsers. In both cases the source HTML has to be extended or modified to meet the specific rendering requirements of the delivery environment.
The specific things that must be done to enable CSS pagination include:
-
Generating tables of contents, back-of-the-book indexes, and similar navigation structures.
-
Generating elements used to populate structured headers and footers. For example, a multi-line header where individual lines may have different formatting or there may be inline formatting requiring separate elements in the HTML.
-
Adding @class values or other clues to make CSS styling either possible (lookahead) or more convenient.
-
Reordering elements that are presented out of their source order, for example, moving a figure caption element from the top of the figure container to the end of the figure container or using metadata elements or attributes to synthesize displayed content, such as a copyright page or authorship for an individual article or chapter.
-
Adding wrapper structures to either enable specific formatting effects or to make styling easier.
-
Normalizing markup for elements that may have different markup patterns as authored, for example, adding paragraph elements to list items, to make the CSS style sheet simpler.
-
Generating text that would be difficult or impossible to generate with CSS alone.
If the source as authored is itself XHTML this transform can be relatively simple. If the source is some other vocabulary it is likely that there is an existing HTML generation transform that can be adapted to produce CSS-pagination-ready HTML.
In addition to the need to generate pagination-ready HTML, the CSS pagination specifications lack essential features needed by more challenging layouts. Thus any complete CSS pagination implementation will include proprietary extensions to fill this feature gap. AHF does this by essentially mapping every XSL-FO feature or AHF extension to a corresponding CSS extension. For example, AHF provides additional page sequence types to enable creating last- and only-page sequences.
A final challenge is the lack of parameterization in the base CSS syntax. While tools like "less" provide a way to parameterize and modularize CSS style sheets, we did not find a suitable Java-based less compiler at the start of the project. Thus the CSS style sheets have a lot of redundancy, especially in the page master rules, but since this code did not change dramatically once developed, the lack of parameterization turned out to be a minor problem and solving it never became a priority.
When implementing the CSS style sheets the main challenges were:
-
Finding the relevant definitions in the appropriate W3C specification for a given layout feature.
-
Determining whether or not AHF implemented a given feature as defined in the specification.
-
For challenging layout requirements determining the best solution using AHF.
-
Controlling page breaks dynamically.
For most layout requirements, development of the CSS was a straightforward application of normal CSS techniques.
Challenging requirements included:
-
Managing counters and variables across element boundaries for running heads and feet that need to reflect first or last values on a page.
-
Managing page breaks. The CSS semantics for break control are not as definitive as for XSL-FO. In particular CSS does not have a "keep together always" or "keep with next always" control. Keeps in CSS are truly "hints". This sometimes resulted in unfortunate breaks, such as between a section head that falls at the bottom of a page and the head for a subordinate section where there is no intervening content. It required using AHF extensions to get better control of page breaks.
-
Controlling the size and layout of wide page edge regions. The CSS design for page edge regions does not explicitly allow for a single region that takes up most or all of the edge region. This made it difficult to create right- or left-aligned headers that had long content (for example, a long section title).
Preparing XML For CSS Pagination
CSS is a declarative style language rather than a transformation language. This makes CSS relatively simple and provides a clear separation of concerns between the visual and behavioral style definition and the data processing applied to the source XML but means that most XML cannot be fully styled as authored. Thus the authored XML must be transformed into a form that can then be fully styled using CSS. In addition, CSS is optimized for styling HTML and therefore lacks certain features needed to style arbitrary XML. For example, CSS assumes that all links are represented by HTML <a> elements that use @href with URLs for addressing and does not provide a way to associate linking behavior with arbitrary elements or attributes.
While it is technically possible to apply CSS directly to arbitrary XML and get a reasonable result, that result will necessarily be incomplete for non-trivial page layouts. Even with HTML, if the HTML is authored in a typical way where authors are only responsible for content and not things that should be automatically generated or rendered, such as running heads and feet, the source HTML will still require some amount of augmentation to meet the page layout requirements.
Thus some amount of transformation is always required to create pagination-ready XML from whatever the source XML is, even if that source is HTML.
Compare this transformation to the transformation required to use XSL-FO: for XSL-FO the target vocabulary is always XSL-FO and the style details are embedded in the generated XSL-FO markup and content. For CSS pagination the target vocabulary can be any XML but is most often HTML and the transform need only modify the structural and semantic details of the source XML—all style details remain in the CSS style sheet.
If the authored source is HTML then the transform is really an "augmentation" task, adding things and doing some reordering if needed, but otherwise just an identity transform that preserves the HTML as authored.
If the authored source is not HTML then the transform is most effectively a transform to HTML with all necessary augmentation applied. If an HTML transformation already exists for the source XML vocabulary then all that is required is to add the augmentation required to meet the CSS pagination requirements.
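As a minimal illustration, not the Municode transform itself, an augmentation transform can be structured as an identity copy of the authored XHTML plus specific templates that add or rearrange whatever the CSS needs; the figure-caption rule below is a hypothetical example of such an augmentation.

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml">

  <!-- Identity copy: preserve the source as authored. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Example augmentation: move the figure caption to the end of the figure. -->
  <xsl:template match="xhtml:figure">
    <xsl:copy>
      <xsl:apply-templates select="@*, node() except xhtml:figcaption"/>
      <xsl:apply-templates select="xhtml:figcaption"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>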
In the context of full-featured page layout as typically required for technical documentation, legal documents, trade books, and other highly-designed publications that still lend themselves to 100% automated composition, the following layout requirements require augmenting the HTML as authored or as generated by a pre-existing HTML transform:
-
Structured page edge content, such as multi-line headers or footers or content with typographic differences (e.g., bold or italic words in a title reflected in a running head or foot).
-
Elements that depend on properties of descendants or following elements to determine their style.
-
Content that is presented in an order or structure different from its authored format, for example, moving figure titles from the start of the figure container to the bottom or reflecting metadata elements in the main flow.
-
Generated text that cannot be produced easily or at all using CSS (for example, text that requires calculation or string manipulation of the source)
In addition, the CSS style definition can be made easier to create and maintain by adding elements or attributes that, while not strictly required, make the style definition simpler, for example by simplifying the selectors required or by normalizing variable markup patterns into a single consistent pattern.
Providing Structured Page Edge Content
CSS provides two mechanisms for reflecting content in the source in different places via style declarations:
-
String variables
-
Element variables
String variables copy content or attribute values into named variables that can then be used in content: properties:

chapterTitle {
    string-set: chapterTitle content();
    display: none;
}
The string-set: property captures the text content of the <chapterTitle> element into a variable named "chapterTitle". The variable can then be used in content:
@page portrait:right {
@top-center {
content: string(chapterTitle, last);
vertical-align: bottom;
margin-top: 0.25in;
margin-bottom: 2pc;
text-transform: uppercase;
}
...
}
Here the last value set for the "chapterTitle" variable will be used as the content of the top-center page edge region for this page type.
Element variables remove elements from the source flow and capture them as variables that can then be used in page edge regions. The elements will be styled using the styles in effect at the point where they occur in the source:
sectionTitlesMultiline {
position: running(runningHead);
}
Here the position: property sets the "position" of the <sectionTitlesMultiline> element as being in an element variable named "runningHead". The <sectionTitlesMultiline> element is removed from the document where it occurs. The element variable can then be used in a content: property:
@page landscape:left {
@top-left {
content: element(runningHead start);
border-bottom: 0.5pt solid black;
margin-top: 0.25in;
margin-bottom: 0.33in;
vertical-align: bottom;
text-align: left;
font-size: 10pt;
}
...
}
Here the content: property uses the element() function to get the value of the "runningHead" element variable and use it as the content of the top-left page area.
Note that use of position: running() consumes the element to which it applies, which means you cannot, for example, simply reflect title elements or title element containers in running heads; you must have separate source elements for the title in the main flow and for the running head.
For example, the Municode HTML preprocessing transform has this rule to generate the <sectionTitlesMultiline> elements:
<xsl:template mode="frills-detailed" match="xhtml:header"> <xsl:variable name="sectLevel" as="xs:integer" select="count(ancestor-or-self::xhtml:section)" /> <sectionTitlesMultiline class="sect-{$sectLevel}"> <!-- Process ancestors from highest to lowest --> <xsl:apply-templates select="(ancestor::xhtml:section)" mode="frills-get-multi-line-entries" /> </sectionTitlesMultiline> </xsl:template>Where the authored source is:
<section id="x88B6B95C248C" data-type="section"> <header> <h1>Sec. 1.1</h1> <p data-type="subtitle">Incorporation.</p> </header> ... </section>And the transformed result is:
<section id="x88B6B95C248C" data-type="section" class=""> <header> <sectionNumberVerso>§ 1.1</sectionNumberVerso> <sectionNumberRecto>§ 1.1</sectionNumberRecto> <sectionTitlesMultiline class="sect-4"> <headerLine class="charter"> <h1 class="">Charter</h1> </headerLine> <headerLine class="chapter"> <h1 class=" has-subtitle">Chapter 1</h1> <p data-type="subtitle" class="">General Provisions</p> </headerLine> <headerLine class="section"> <h1 class=" has-subtitle">Sec. 1.1</h1> <p data-type="subtitle" class="">Incorporation.</p> </headerLine> </sectionTitlesMultiline> <h1 class=" has-subtitle">Sec. 1.1</h1> <p data-type="subtitle" class="">Incorporation.</p> </header>
Many, if not most, publications will require element, rather than string, page edge content and thus will need to generate this type of additional markup to provide them.
Styling That Depends on Descendant or Following Element Properties
CSS selectors can only select on the current node or on nodes that have come before; they cannot make reference to descendant or following elements.
In the Municode content there are at least two cases where the style of a paragraph depends in part on its descendants or content, which requires setting a specific class value on the <p> element:
<xsl:template match="xhtml:p"> <!-- This will always reflect the original @class value, if any --> <xsl:variable name="class-tokens" as="xs:string*"> <xsl:apply-templates mode="set-class-for-p" select="."/> </xsl:variable> <xsl:copy> <xsl:apply-templates select="@* except (@class)" mode="#current"/> <xsl:if test="exists($class-tokens)"> <xsl:attribute name="class" select="string-join($class-tokens, ' ')" /> </xsl:if> <xsl:apply-templates select="node()" mode="#current"/> </xsl:copy> </xsl:template> <xsl:template mode="set-class-for-p" as="xs:string+" priority="10" match="xhtml:p[.//xhtml:span[exists(@data-lf)]]"> <xsl:sequence select="'lf'"/> <xsl:next-match/> </xsl:template> <xsl:template mode="set-class-for-p" as="xs:string+" match="xhtml:p[ends-with(normalize-space(.), ':')][not(ancestor::xhtml:header)] "> <xsl:sequence select="'keepwithnext'"/> <xsl:next-match/> </xsl:template>
Synthesizing or Reordering Content
This is simply a matter of constructing the structures required to achieve the desired presentation result. It is not a workaround for limitations in CSS but simply a requirement. For many publications the requirement likely exists for digital delivery as well.
Generated Text That Cannot Be Constructed Using CSS
The CSS content: property combined with :before and :after pseudo-elements can do quite a bit, but CSS does not provide functions for string manipulation or more complex calculations (for example, converting dates and times into formatted values). Thus it may be necessary to generate attributes or elements that contain text that would otherwise be generated purely as a matter of style.
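For example, a hedged sketch of the kind of template that can do this work in the preprocessing transform; the <time> element, its data-type value, and the date format are hypothetical, not part of the Municode source:

<!-- Hypothetical sketch: generate display text during preprocessing that CSS
     cannot easily produce, here an adoption date formatted for display. -->
<xsl:template match="xhtml:time[@data-type = 'adopted']">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <!-- e.g. datetime="2024-03-12" becomes "March 12, 2024" -->
    <xsl:value-of select="format-date(xs:date(@datetime), '[MNn] [D], [Y]')"/>
  </xsl:copy>
</xsl:template>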
CSS Pagination and Area Trees
Antenna House Formatter (AHF) can produce as output an "area tree", which is an XML representation of the composed pages. The area tree captures all the information needed to generate the rendered page, including all font details, placement details, and so on. AHF can take an area tree as input and produce the final deliverable, i.e., PDF.
In order to post-process the area tree it needs to include the following information:
-
The start and end of each change ("take" in Municode parlance)
-
The start and end of each element with an ID for which the page number needs to be captured (sections, tables, figures)
-
The page numbers as rendered in the edge regions of pages, including distinguishing any prefolio and postfolio text.
-
Boundaries or occurrence of other important elements, such as the update instructions, and things that need to be counted on a per-page basis (tables, images, etc.), and so on.
AHF does not provide a general way to inject arbitrary information into area trees. It does, however, preserve any @id values it finds on input elements.
We take advantage of this by constructing elements in the input HTML with @id values that are structured fields:
<areaTreeMarker id="take:take-begin:job=S138-U02:d20p60"/>
which then becomes this area tree element:

<InlineArea id="take:take-begin:job=S138-U02:d20p60"
  font-size="0pt"
  width="0pt"
  height="0pt"
  baseline-after="0pt"
  ...
/>
Note that the InlineArea has an effective size of zero width and height, so it does not affect the rendering of the page on which it occurs.
These area tree marker elements are added during the HTML preprocessing step.
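A sketch of how that generation might look, assuming the take markers are authored as processing instructions (the authored form is shown under "Identifying Change Page Sets" below); the local:pi-attribute() helper and the use of generate-id() for the position suffix are illustrative, not the production code:

<!-- Turn a take-begin processing instruction into an area tree marker. -->
<xsl:template match="processing-instruction('pdf')[starts-with(normalize-space(.), 'take-begin')]">
  <areaTreeMarker
    id="take:take-begin:job={local:pi-attribute(., 'job')}:firstpage={local:pi-attribute(., 'firstpage')}:{generate-id(.)}"/>
</xsl:template>

<!-- Pull a name="value" pseudo-attribute out of the PI data. -->
<xsl:function name="local:pi-attribute" as="xs:string?">
  <xsl:param name="pi" as="processing-instruction()"/>
  <xsl:param name="name" as="xs:string"/>
  <xsl:variable name="after" select="substring-after(string($pi), concat($name, '='))"/>
  <xsl:sequence select="substring-before(substring($after, 2), substring($after, 1, 1))"/>
</xsl:function>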
For page numbers, we use zero-width-space characters (\u200B) to separate the prefolio, folio, and postfolio components of the page number as rendered in any context where a page number occurs:
@page :blank {
size: 8.5in 11in;
margin-left: 7.5pc;
margin-right: 7.5pc;
margin-bottom: 6pc;
margin-top: 8pc;
@bottom-center {
content: string(prefolio, first) '\200B' counter(page) '\200B' string(postfolio, first);
margin-top: 1pc;
vertical-align: top;
font-family: "New Century Schoolbook", serif, 'Arial Unicode';
font-size: 10pt;
}
...
}
The zero-width-space character is not used in any other place in this content and has no visible result in the rendered pages.
Occurrences of page numbers in the area tree can be found reliably:
<!-- Gets all the text areas for the page's page number, including the prefolio and postfolio, if
present.
-->
<xsl:function name="at:get-page-number-text-areas" as="element(at:TextArea)*">
<xsl:param name="context" as="element()?"/><!-- Any element that is or is within a page viewport area -->
<xsl:variable name="page" select="$context/ancestor-or-self::at:PageViewportArea"/>
<xsl:variable name="margin-region" as="element(at:MarginRegionViewportArea)?"
select="$page/at:PageReferenceArea/at:MarginRegionViewportArea[.//at:TextArea[@text = '​']]"
/>
<!-- At least in the legacy style there should be exactly one block with exactly one line with three or more text areas. -->
<xsl:variable name="result" as="element(at:TextArea)*" select="$margin-region//at:TextArea"/>
<xsl:sequence select="$result"/>
</xsl:function>
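The folio getter used in later examples (at:get-folio()) is not shown in this paper, but one way it could be built on the function above is to join the page-number text areas and split on the zero-width-space separators; treat this as an assumed sketch, not the production implementation:

<xsl:function name="at:get-folio" as="xs:string?">
  <xsl:param name="context" as="element()?"/>
  <xsl:variable name="text-areas" as="element(at:TextArea)*"
    select="at:get-page-number-text-areas($context)"/>
  <!-- parts[1] = prefolio, parts[2] = page number, parts[3] = postfolio -->
  <xsl:variable name="parts" as="xs:string*"
    select="tokenize(string-join($text-areas/@text, ''), '&#x200B;')"/>
  <!-- Return the folio as displayed, without the zero-width-space separators. -->
  <xsl:sequence select="string-join($parts, '')[. ne '']"/>
</xsl:function>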
Another challenge is recording the numeric, as opposed to display, page number for each page.
In XSL-FO the page number for a page is a property of the page formatting object and AHF records the ordinal page number in the area tree.
In CSS, however, page numbers are just counters indistinguishable from any other counter, so AHF does not record the page number.
The numeric page number is needed so that page numbers can be correctly calculated. A given page may not have a display page number, or the display page number may be non-numeric (roman, alphabetical, etc.). Thus, for each page we need to know the numeric (ordinal) page number, the display page number, and the page number format (roman, arabic, etc.). In particular, roman numerals cannot be distinguished from alphabetical page numbers for characters that are used in roman numerals, so it's not possible to determine the page number format by inspecting the page number itself.
To capture this information we use corner regions, which are otherwise not used for anything in the Municode page layouts.
In CSS, there are four corner regions and for each page edge three edge regions.
The numeric page number and page number format are captured in corner regions like so:
@page {
    size: 8.5in 11in;
    margin-left: 7.5pc;
    margin-right: 7.5pc;
    counter-reset: footnote;
    background-image: attr(background-graphic, url);
    background-repeat: no-repeat;
    background-size: contain;
    background-clip: border-box;
    background-position: 50% 50%;
    @bottom-left-corner {
        content: '^npm^' counter(page);
        visibility: hidden;
        font-size: 8pt;
        color: white;
        font-family: monospace, 'Arial Unicode';
    }
    @bottom-right-corner {
        content: '^pnf:1';
        visibility: hidden;
        font-size: 8pt;
        font-family: monospace, 'Arial Unicode';
    }
    ...
}
Note that this page rule applies to all pages (there is no page name qualifier). The visibility value of "hidden" means that the text will be in the area tree but not rendered on the actual page.
This results in area tree elements like so:
<MarginRegionViewportArea region-name="bottom-left-corner" visibility="hidden" ...>
  <MarginRegionReferenceArea ...>
    <BlockArea ...>
      <LineArea>
        <TextArea ... text="^npm^"/>
        <TextArea text="1" .../>
      </LineArea>
    </BlockArea>
  </MarginRegionReferenceArea>
</MarginRegionViewportArea>
The formatting of the corner regions makes them invisible (visibility="hidden") but they are easily findable during post processing. A similar technique is used to capture the page number format.
Side edge regions are also used for debugging and other purposes, for example, Municode puts details about the source in a side edge region for printing on draft versions of the document.
As a debugging aid, the CSS can be quickly modified to reflect the page master name in a side edge region:
@page :left {
    size: 8.5in 11in;
    margin-left: 7.5pc;
    margin-right: 7.5pc;
    margin-bottom: 6pc;
    margin-top: 8pc;
    /* Used for debugging page rule application. */
    @left-bottom {
        content: 'Page Rule: :left';
        /* content: none;*/
        font-size: 10pt;
        font-family: "Courier New", monospace, 'Arial Unicode';
        color: white;
        -ah-reference-orientation: 90;
        width: 4pc;
        height: 6in;
    }
    ...
}

By changing the color property from "white" to e.g., "cyan" in each page rule, the page rule name is shown on every page, which is useful for debugging.
Thus, through a combination of elements added to the input HTML (<areaTreeMarker>), structured content (display page numbers), use and abuse of page edge regions, and AHF's automatic capturing of element IDs, it is possible to inject into the area tree any information needed to support post processing of the area tree.
Modifying the Area Tree
The ultimate goal is to produce an area tree that represents an update package that reflects any required pages (covers, table of contents, update instructions, etc.) and the changed pages, with point page numbers created for changed pages that require them. In addition, every sequence of changed pages must have an even number of pages: by Municode's editorial rules, a change set always starts on an odd (right-hand) page and thus must end on an even (left-hand) page.
Municode refers to sets of changed pages as "takes" and so the markers used in the source to mark the boundaries of changes are "take markers". The term "take" is reflected in the code examples in this section.
The area tree processing is implemented as a logical pipeline with the following stages; a sketch of how the stages can be chained follows the list:
-
Set page number and format
Uses the ordinal page number and page number format to update attributes on each page viewport area element.
It also applies any "page start" values that were specified on change markers (a change marker can specify what the actual page number of the first page of a change set should be, irrespective of what it is based on automatic page numbering). The initial pagination styling does not attempt to create new page sequences because the change markers can occur anywhere within the HTML hierarchy and thus are not easily used to trigger new page sequences. So it's easier to just update the page numbers in the area tree after the fact.
The output of this stage is an area tree where the page number details are reliable and easy to get at by subsequent processing.
For debugging purposes it also allows inspection of the resulting area tree to verify that the page numbers have been set correctly.
-
Update page numbers
Marks pages as being either in the result pages (a required page or a page in a change set) or not.
For pages that are in change sets, updates the display page numbers to reflect any required point pages.
Generates any required blank even pages for change sets that do not naturally end on an even page or for which the following page is not already blank.
-
Filter pages
If an update package is requested, filters out all pages that are not marked as being in the update package.
-
Renumber absolute page numbers
AHF captures the absolute page number of each page and requires that they be correct, so the pages must be renumbered to reflect the absolute page numbers following any filtering.
-
Final update processing
Does any remaining update processing, primarily updating page number references to reflect the final display page numbers. Also filters out things in the area tree that were needed for post processing but that are not needed or wanted in the final PDF given to customers, such as the hidden corner regions. Users of the PDF often cut and paste from the PDF in order to draft new ordinances, and hidden content can be selected and copied, which is not good.
-
Update page number database
If the processing is producing a "final" update package, meaning a package that will be delivered to the client, then the page number database is updated to reflect the details of the new and changed pages for the publication.
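One way the stages can be chained is as successive modes over temporary trees inside a single XSLT entry point. This is an illustrative sketch; only the update-page-numbers and renumber-abs-pages mode names appear, under those names, in the examples that follow, and the others are hypothetical:

<xsl:template match="/" name="process-area-tree">
  <!-- Each stage works on the output of the previous one. -->
  <xsl:variable name="numbered">
    <xsl:apply-templates select="." mode="set-page-number-and-format"/>
  </xsl:variable>
  <xsl:variable name="updated">
    <xsl:apply-templates select="$numbered" mode="update-page-numbers"/>
  </xsl:variable>
  <xsl:variable name="filtered">
    <xsl:apply-templates select="$updated" mode="filter-pages"/>
  </xsl:variable>
  <xsl:variable name="renumbered">
    <xsl:apply-templates select="$filtered" mode="renumber-abs-pages"/>
  </xsl:variable>
  <xsl:apply-templates select="$renumbered" mode="final-update"/>
</xsl:template>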
Set Page Number and Format
This phase determines, for each page, what its ordinal page number is in the current page sequence and what the display format of the number is (arabic, roman, etc.).
AHF does not capture this information because it is not available in the CSS model.
In XSL-FO page numbers are defined explicitly as properties of page sequences and then reflected where needed through dedicated page number reference formatting objects.
In CSS, by contrast, page numbers are simply counters like any other counters and are not explicit properties of pages or page sequences. CSS does define a built-in counter named "page" that is automatically incremented for each new page, but that is just a convenience: there is no requirement that the "page" counter be used to reflect page numbers.
On a given page, the page's own page number is reflected by reference to the page number counter in a normal content property:
@page portrait-first:right {
    counter-reset: footnote;
    counter-reset: page 1;
    @bottom-center {
        content: string(prefolio, first) '\200B' counter(page) '\200B' string(postfolio, first);
        margin-top: 1pc;
        vertical-align: top;
        font-family: "New Century Schoolbook", serif, 'Arial Unicode';
        font-size: 10pt;
    }
    ...
}
Thus there is no reliable way for a CSS processor to know what the intended page number is for a given page.
The AHF area tree markup includes attributes for the page number on the pageViewportArea element:
<PageViewportArea ... abs-page-number="1" page-number="1" format="1" >

The page-number attribute has a value but it is not reliable.
To work around this limitation in CSS, the CSS puts the ordinal page number and format value into corner regions as hidden text:
@page {
    size: 8.5in 11in;
    margin-left: 7.5pc;
    margin-right: 7.5pc;
    counter-reset: footnote;
    background-image: attr(background-graphic, url);
    background-repeat: no-repeat;
    background-size: contain;
    background-clip: border-box;
    background-position: 50% 50%;
    @bottom-left-corner {
        content: '^npm^' counter(page);
        visibility: hidden;
        font-size: 8pt;
        color: white;
        font-family: monospace, 'Arial Unicode';
    }
    @bottom-right-corner {
        content: '^pnf:1';
        visibility: hidden;
        font-size: 8pt;
        font-family: monospace, 'Arial Unicode';
    }
}
This results in easy-to-find data in the area tree from which the page number and format can be found and then set on the PageViewportArea elements as though they had always been there:
<MarginRegionViewportArea visibility="hidden" region-name="bottom-left-corner" ...>
  <MarginRegionReferenceArea ...>
    <BlockArea ...>
      <LineArea ...>
        <TextArea text="^npm^" .../>
        <TextArea text="1" .../>
      </LineArea>
    </BlockArea>
  </MarginRegionReferenceArea>
</MarginRegionViewportArea>
Putting the page number and format on the PageViewportArea elements is not strictly necessary but it makes follow-on processing simpler and avoids the need to repeatedly search for the page number details for a given page.
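A sketch of what the extraction might look like; the mode name is illustrative, but the '^npm^' and '^pnf:' markers are the ones defined in the CSS above:

<xsl:template mode="set-page-number-and-format" match="at:PageViewportArea">
  <xsl:variable name="corner-text" as="xs:string*"
    select="for $corner in .//at:MarginRegionViewportArea
              [@region-name = ('bottom-left-corner', 'bottom-right-corner')]
            return string-join($corner//at:TextArea/@text, '')"/>
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <!-- Overwrite the unreliable attributes with the values from the corner regions. -->
    <xsl:attribute name="page-number"
      select="substring-after($corner-text[starts-with(., '^npm^')][1], '^npm^')"/>
    <xsl:attribute name="format"
      select="substring-after($corner-text[starts-with(., '^pnf:')][1], '^pnf:')"/>
    <xsl:apply-templates mode="#current"/>
  </xsl:copy>
</xsl:template>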
Update Page Numbers
For a given sequence of changed pages it is necessary to update the page numbers of each page in the sequence to reflect both the initial page number of the sequence, if specified on the change start processing instruction, and to reflect any required point page numbers.
Updating the page numbers involves three main processing tasks:
-
Identifying the start and end pages of a change page set
-
Constructing the sequence of new page numbers based on the starting page number and the page number of the page that follows the change set, if specified (it can be unspecified if the change is followed by the start of a new page sequence where page numbers are reset, i.e., the end of a chapter).
-
Updating the display page number on each page in the change set. This may involve adjusting the horizontal position of the page number on the page to reflect a change in width of the page number as displayed.
Identifying Change Page Sets
Change boundaries are reflected in the area tree by inline objects with IDs that have a specific structure, created from areaTreeMarker elements generated as part of the HTML preprocess from the change marking processing instructions ("take markers") inserted by the authors.
The source as authored is:
<?pdf take-begin job="S01" firstpage="15"?> <section id="x88B6B95E7E6A" data-type="chapter" > ... <?pdf take-end job="S01" firstpage-ref="15"?> ... </section>
Which results in this HTML input to the pagination process:
<areaTreeMarker id="take:take-begin:job=S01:firstpage=15:d16p512"/> <section id="x88B6B95E7E6A" data-type="chapter" > ... <areaTreeMarker id="take:take-end:job=S01:firstpage=15:d16p512"/> ... </section>
And then these elements in the area tree:
<InlineArea id="take:take-begin:job=S01:firstpage=15:d16p512" width="0pt" height="0pt" ... /> ... <InlineArea id="take:take-end:job=S01:firstpage-ref=15:15:d16p512" width="0pt" height="0pt" ... />
These marker areas are then used by utility functions that find pages that are within change sets:
<!-- Determine if the context element is the start of a take for the specified update ID.
     A given page can have zero or more take starts for a given update ID.

     @param context A PageViewportArea element (or other element that could be
                    an ancestor of a take marker)
     @param jobID The update ID to check for
     @param doDebug Turns debugging on or off.
     @return true if a take begin marker is found for the specified update ID.
-->
<xsl:function name="at:is-take-start" as="xs:boolean">
  <xsl:param name="context" as="element()"/>
  <xsl:param name="jobID" as="xs:string?"/>
  <xsl:param name="doDebug" as="xs:boolean"/>

  <xsl:variable name="page" as="element(at:PageViewportArea)"
    select="$context/ancestor-or-self::at:PageViewportArea"
  />
  <xsl:variable name="take-specifier" as="element()*"
    select="at:get-take-specifier($context, $jobID)"
  />
  <xsl:variable name="result" as="xs:boolean"
    select="exists($take-specifier)"
  />
  <xsl:sequence select="$result"/>
</xsl:function>

<!-- Gets the element that specifies the start of a take for the specified update.

     @context Element to look in for a take specifier
     @jobID The ID of the update to get the take specifier for
     @return The first element that starts a take for the specified update, if any.
-->
<xsl:function name="at:get-take-specifier" as="element()?">
  <xsl:param name="context" as="element()"/>
  <xsl:param name="jobID" as="xs:string?"/>

  <xsl:variable name="take-specifier" as="element()*"
    select="($context//*[starts-with(@id, 'take:take-begin:')]
                        [contains(@id, concat('job=', $jobID))])[1]"
  />
  <xsl:sequence select="$take-specifier"/>
</xsl:function>
These functions then make it easy to distinguish changed pages from unchanged pages:
<!-- Context is the area tree root element -->
<xsl:template name="update-page-numbers">
  <xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>

  <xsl:for-each-group select="at:PageViewportArea"
    group-starting-with="at:PageViewportArea[at:is-take-start(., $jobID)]">
    <xsl:choose>
      <xsl:when test="at:is-take-start(., $jobID)">
        <xsl:call-template name="update-page-numbers-for-take-group">
          <xsl:with-param name="doDebug" as="xs:boolean" select="$doDebug"/>
          <xsl:with-param name="pages" as="element(at:PageViewportArea)+"
            select="current-group()"
          />
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- Must be before first take -->
        <xsl:apply-templates select="current-group()" mode="update-page-numbers">
          <xsl:with-param name="doDebug" as="xs:boolean" select="$doDebug"/>
        </xsl:apply-templates>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>
At this point we know the start of a set of changed pages (the take start) but we don't know the end page.
A set of changed pages must always be an even number of pages.
If the explicitly-marked last page of the change set happens to be even then we're done.
However, if the last page is not even then a blank backing page has to be added to the change sequence.
The blank page can come from the area tree if the page following the last changed page happens to be a blank even page (for example, a page generated in order to force the following page onto an odd page); otherwise it is necessary to synthesize a blank page. This is done by copying the current page and adding 1 to the page's absolute and ordinal page numbers. Only those parts of the page that are needed on a blank page are retained, such as the page edge region that contains the display page number or other edge region components that need to be kept. The XSLT provides simple configuration to adjust this as needed. This also requires a function that can distinguish blank pages from non-blank pages, i.e., pages where the effective string value of the page body is only whitespace or matches a configured "This page intentionally left blank" marker.[3]
Another non-obvious complication arises when a change set initially ends with a blank odd page. In this case, rather than adding a second blank even backing page, you simply omit the blank odd page. It should never be possible to have a change set that ends with three blank pages. Two ending blank pages can happen due either to author error, where unnecessary page breaks have been forced, or to some other combination of factors that results in two blank pages.
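A sketch of the blank-page test mentioned above; the configuration parameter is hypothetical, and the test simply ignores everything in the margin regions (where the folio lives):

<!-- Hypothetical configuration value; the real configuration lives elsewhere. -->
<xsl:param name="blank-page-marker-text" as="xs:string"
  select="'This page intentionally left blank'"/>

<xsl:function name="at:is-blank-page" as="xs:boolean">
  <xsl:param name="page" as="element(at:PageViewportArea)"/>
  <!-- Body text is everything not in a page edge (margin) region. -->
  <xsl:variable name="body-text" as="xs:string"
    select="normalize-space(string-join(
              $page//at:TextArea[not(ancestor::at:MarginRegionViewportArea)]/@text, ''))"/>
  <xsl:sequence select="$body-text = ('', $blank-page-marker-text)"/>
</xsl:function>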
Constructing New Page Number Sequences
Given the set of pages in a change set we know the number of pages in the set, the starting page number, and the page number of the page that follows the change set, if any.
With that it is a simple matter of calculating the original page count (the page number following the change set minus the starting page number) and comparing it to the actual number of pages. Any page whose ordinal position within the change set is greater than the number of original pages must be a point page.
The XSLT then constructs a sequence of new page numbers and uses it to update the pages in the change set (some details omitted for brevity):
<xsl:template name="update-page-numbers-for-take-group"> <xsl:param name="pages" as="element(at:PageViewportArea)+"/> <xsl:variable name="page-numbers" as="xs:string*"> <xsl:variable name="first-page-number" as="xs:string?" select="at:get-folio($take-start)" /> <!-- All pages should have page numbers of one form or another but it could happen that a page has no number --> <xsl:if test="exists($first-page-number)"> <xsl:sequence select="at:calculate-page-numbers-for-take( $pages-in-take, $jobID)" /> </xsl:if> </xsl:variable> <xsl:for-each select="$pages-in-take"> <xsl:variable name="pos" as="xs:integer" select="position()"/> <xsl:variable name="new-page-folio" as="xs:string?" select="$page-numbers[$pos]" /> <xsl:apply-templates select="." mode="update-page-numbers"> <xsl:with-param name="new-page-folio" as="xs:string?" select="$new-page-folio" /> <xsl:with-param name="in-job" as="xs:boolean" select="true()"/> </xsl:apply-templates> </xsl:for-each> </xs:template> ... <xsl:function name="at:calculate-page-numbers-for-take" as="xs:string+"> <xsl:param name="pages" as="element()+"/> <xsl:param name="jobID" as="xs:string"/> <!-- The number of pages whose page number does not need to change --> <xsl:variable name="ordinal-page-count" as="xs:integer" select="$next-page-num - $first-page-num" /> <xsl:variable name="point-page-count" as="xs:integer" select="count($pages) - $ordinal-page-count" /> <xsl:variable name="point-page-count" as="xs:integer" select="if ($point-page-count lt 0) then 0 else $point-page-count" /> <!-- The number of pages whose page number does not need to change --> <xsl:variable name="ordinal-page-count" as="xs:integer" select="$next-page-num - $first-page-num" /> <xsl:variable name="point-page-count" as="xs:integer" select="count($pages) - $ordinal-page-count" /> <xsl:variable name="point-page-count" as="xs:integer" select="if ($point-page-count lt 0) then 0 else $point-page-count" /> <xsl:variable name="before-points" as="element()+" select="$pages[position() le $ordinal-page-count]" /> <xsl:variable name="point-pages" as="element()*" select="$pages[position() gt $ordinal-page-count]" /> <!-- Generate display page numbers for each page before the point pages --> <xsl:for-each select="$before-points"> <xsl:variable name="page-num-base" select="string($first-page-num + position() - 1)" as="xs:string"/> <xsl:number value="$page-num-base" format="{$page-number-format}"/> </xsl:for-each> <!-- Generate display page numbers for point pages --> <xsl:for-each select="$point-pages"> <xsl:variable name="point-number" as="xs:integer" select="position()"/> <xsl:sequence select="concat($point-page-base-formatted, '.', $point-number)"/> </xsl:for-each> </xsl:function>
At this point, the display page numbers of each page in each change set have been updated to reflect the application of starting page numbers and point page numbers. References to these pages have not yet been updated.
This process also marks every page as being "in the job" or "out of the job", which is then used by the filtering step. Any page that is part of a change set, or is a page that is always included (cover pages, insertion instructions, etc.), is marked as in the job; all other pages are marked as out of the job.
Note that the code does not bother to adjust the horizontal position of point page numbers on the pages, for the simple reason that there's no easy way to know how the page number alignment should be adjusted: centered, left aligned, or right aligned? At least for the Municode styles, the visual effect of adding a point page or changing a number from 1 to 2 digits is minimal and would not normally be noticed.
However, references to the pages in some contexts do need to be adjusted, for example in the table of contents, as the numbers are consistently right-aligned and therefore a change in horizontal placement will be noticeable. These documents do not normally use page numbers for references in normal flowed text, which simplifies the problem.
If page numbers were used it would probably be necessary to reserve extra space around the numeric part of the page number reference, or to use some marker technique to indicate what the current alignment and justification are. Given that information there should be no problem adjusting the horizontal position of the text before or after the page number reference. In the worst case, the code would need to apply word or character spacing adjustments to avoid having a long number cause the line to end past a margin (for example, where the text would end up overlapping a border or overlapping text not tagged as being in the same line on the page).
Because the area tree includes all geometric information down to the individual text string level, it's always possible to adjust the layout or otherwise detect overlaps but that level of sophistication should not normally be needed.
In legal publications the normal practice is to refer only to a section or paragraph number or a figure or table number (and possibly title), which simply avoids the problem. Page numbers are usually limited to generated navigation structures like tables of contents and indexes.
Filter Pages
If a filtered update package is requested, then pages not in the update are filtered out. This is simply a matter of omitting any PageViewportArea that is not marked as being in the job, a value that is set on every page in the preceding phase:
<xsl:template name="process-takes"> <xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/> <xsl:for-each-group select="at:PageViewportArea" group-adjacent="@in-job eq 'true'"> <xsl:choose> <xsl:when test="self::at:PageViewportArea[@in-job eq 'true']"> <xsl:for-each-group select="current-group()" group-starting-with="*[at:is-take-start(., $jobID)]"> <xsl:sequence select="current-group()"/> </xsl:for-each-group> <!-- A set of take pages --> </xsl:when> <xsl:otherwise> <!-- Not take pages, ignore. --> </xsl:otherwise> </xsl:choose> </xsl:for-each-group> </xsl:template>
This code uses for-each-group, reflecting a refactoring of early logic that was not as simple and actually required grouping. This could be done with a simple apply-templates and a pair of templates, one that matches @in-job eq 'true' and one that does not.
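For reference, the simpler formulation described above might look like this (the mode name is illustrative; it relies on the @in-job attribute set in the preceding phase):

<!-- Keep pages marked as in the job... -->
<xsl:template mode="filter-pages" match="at:PageViewportArea[@in-job eq 'true']">
  <xsl:copy-of select="."/>
</xsl:template>

<!-- ...and drop everything else. -->
<xsl:template mode="filter-pages" match="at:PageViewportArea"/>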
Renumber Absolute Page Numbers
The area tree that is input to this phase reflects the final set of pages to be rendered. To meet AHF requirements the @absolute-page-number attribute on each page must correctly reflect the ordinal position of the page in the document:
<xsl:template mode="renumber-abs-pages" match="@abs-page-number">
<xsl:variable name="pageNumber" as="xs:integer"
select="count(../preceding-sibling::*) + 1"
/>
<xsl:attribute name="{name(.)}" select="$pageNumber"/>
<xsl:attribute name="orig-abs-page-number" select="."/>
</xsl:template>
The rest of this mode is just normal identity transform processing that handles PageViewportArea elements.
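One way that identity processing could look (a sketch; the mode name matches the template above):

<!-- Copy everything through; the template above rewrites @abs-page-number. -->
<xsl:template mode="renumber-abs-pages" match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()" mode="#current"/>
  </xsl:copy>
</xsl:template>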
Final Update Processing
The final processing step is to update page number references to reflect the updated page numbers for changed pages and, if the area tree has been filtered, to reflect page numbers from the page history database for references to pages that are not changed.
This requires adjusting the horizontal position of text on the line in which a page number reference occurs, in order to account for the change in width from the original page number, which in turn requires knowing the difference in displayed width between the original page number and the new page number.
Page numbers for pages that are not changed are pulled from the page history database's mapping of element IDs to page numbers.
Page History Database
Conceptually the page history database consists of two separate data sets:
-
A record for each physical page (odd page and backing even page), keyed by display page number, capturing, for each update it occurs in, the absolute page number, the folio details for the front and back pages, and the first line of the odd page (which helps with debugging).
-
A record for each element ID capturing, for each update that element occurs in, the absolute page number and folio details of the page the element starts on, as well as a time stamp of when the record was created. The title of the element is also captured (elements with IDs must have titles or they would not be targets of page number references in this content).
The physical page records enable generation of the update instructions and the list of effective pages. The element history records enable generating page number references to elements on unchanged pages.
The history database is maintained over time for a given publication, updated when a given update is published for delivery to the client.
The page history database could be implemented in many different ways. For this project, the initial implementation uses an XML file that is maintained, along with the publication's source, in the version control system. The file is read during area tree post processing and updated when a final publication is produced.
A more robust implementation might use a dedicated database application and a REST service to manage the database but the scope and infrastructure for that was not available in this project.
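To make the two data sets concrete, here is a hypothetical sketch of what the records in such an XML file might look like; the element and attribute names, and the data values, are illustrative, not the production format:

<pageHistory>
  <!-- Physical page record: one entry per update the leaf appears in. -->
  <page folio="12">
    <occurrence update="S138-U02" abs-page="60"
                front-folio="12" back-folio="12.1"
                first-line="Sec. 1.1 Incorporation."/>
  </page>
  <!-- Element record: where an element with an ID started in each update. -->
  <element id="x88B6B95C248C" title="Sec. 1.1 Incorporation.">
    <occurrence update="S138-U02" abs-page="60" folio="12"
                timestamp="2020-06-01T12:00:00Z"/>
  </element>
</pageHistory>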
Capturing Element Target Details In The Area Tree
In CSS a page number reference is a counter reference scoped to a specific element:
div.toc-entry > span.page > a.body:before,
div.minitoc.pg div.minitoc-entry > span.page > a:before
{
content: target-counter(attr(href url), page);
}
And the area tree result is just the resulting text unless something more is done.
The <a> element is also rendered as a navigable hyperlink using the AHF -ah-link: extension property:
a[href] { -ah-link: attr(href url) }
In order to mark references to page numbers, the HTML preprocess adds spans with structured ID values that mark the text as a page number reference:
<div class="toc-entry"> <span class="title"> <a href="#x88B6B95B1C04"> <span class="h1"> <span class="uppercase">Officials </span> of the <span class="uppercase">Town of Lovettsville, Virginia At the Time of This Codification </span></span></a></span> <span class="page" id="page:d121e33"> <span class="page-number-marker">​</span> <a id="pageref:d121e36" class="frontmatter" href="#x88B6B95B1C04"></a> <span class="page-number-marker">​</span> </span> </div>
The span with a class of "page" marks its content as being a page number (full folio) using the structured @id value of "page:{generated-id}", while the <a> element's @id of "pageref:{generated-id}" marks it as a page number reference (just the page number part of the folio).
This results in reliably-findable markup in the area tree:
<InlineArea id="page:d121e49" ...> <InlineArea ...> <TextArea ... text="​" /> </InlineArea> <InlineArea internal-destination="x88B6B95B4E50" id="pageref:d121e52" ...> <InlineArea ...> <TextArea ... text-width="5.37pt" text="v" /> </InlineArea> </InlineArea> <InlineArea ...> <TextArea ... text="​" /> </InlineArea> </InlineArea>
The outer InlineArea is identified as a page (full folio). The InlineArea elements whose text is a zero-width space (U+200B) mark the boundaries between the prefolio, page number, and postfolio. There is no prefolio or postfolio in this example.
The page number is identified by the @id "pageref:d121e52" in this example. Note that the TextArea specifies both the text ("v") and its rendered width ("5.37pt").
In addition, the target ID is automatically captured by AHF on the @internal-destination attribute of the InlineArea that results from the -ah-link: extension property.
The final challenge is finding the target element's starting position within the area tree in order to find the physical page referenced in order to then get the new page number for that page. While AHF will capture the ID of any element in the area tree, it won't necessarily do so at the right position.
Thus the HTML preprocess generates markers that signal the start and end of the elements to which page number references might be made (sections, figures, and tables in the current content set):
<section id="x88B6B95B4E50" ...> <header> ... <h1 style="line-height:2em;"> <span class="uppercase">Boards and Commissions</span> <br /> of the <br /> <span class="uppercase">Town of <br /> Lovettsville, Virginia</span></h1> <p data-type="startpage">7</p> </header> <areaTreeMarker id="marker:section:officialsoriginal:start:narrow:startpage=7:id=x88B6B95B4E50" /> ... <areaTreeMarker id="marker:section:officialsoriginal:end:narrow:id=x88B6B95B4E50" /> </section>
Note that the marker is emitted after the header, which produces the section title in the final result. This ensures that the marker in the area tree occurs on the same page as the title and not on the preceding page, as could happen, for example, if the CSS for the header creates a page break or keep rules force a break before the title.
This then results in inline areas in the area tree that serve as reliable markers for the start and end of the section. Only the start marker is needed for page number updating; the end marker is needed to count the pages occupied by some sections, such as tables of contents and indexes, for example to distinguish the count of generated pages from authored pages for billing purposes.
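As an illustration of how the end markers support page counting, a section's page count can be computed by locating the pages that contain its start and end markers and subtracting their absolute page numbers. The function below is only a sketch: at:get-section-page-count(), at:get-containing-page(), and at:get-absolute-page-number() are hypothetical stand-ins for the project's utility functions, and the sketch assumes both markers are present.

<!-- Sketch only: count the pages occupied by a section, given its ID.
     The helper functions named here are hypothetical stand-ins. -->
<xsl:function name="at:get-section-page-count" as="xs:integer">
  <xsl:param name="area-tree" as="document-node()"/>
  <xsl:param name="section-id" as="xs:string"/>
  <!-- Inline areas produced by the start and end markers for this section -->
  <xsl:variable name="start-marker" as="element()?"
    select="($area-tree//*[starts-with(@id, 'marker:section:')
             and contains(@id, ':start:')
             and ends-with(@id, concat(':id=', $section-id))])[1]"/>
  <xsl:variable name="end-marker" as="element()?"
    select="($area-tree//*[starts-with(@id, 'marker:section:')
             and contains(@id, ':end:')
             and ends-with(@id, concat(':id=', $section-id))])[1]"/>
  <!-- Hypothetical helpers: page containing an area, and its absolute number -->
  <xsl:variable name="start-page" as="xs:integer"
    select="at:get-absolute-page-number(at:get-containing-page($start-marker))"/>
  <xsl:variable name="end-page" as="xs:integer"
    select="at:get-absolute-page-number(at:get-containing-page($end-marker))"/>
  <xsl:sequence select="($end-page - $start-page) + 1"/>
</xsl:function>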
To get the actual page number for a given target element, the code uses the same logic used during the initial page update processing to find the rendered page number on the page. That is, once you have a page in the area tree, getting that page's display page number is a single call to the appropriate utility function:
<xsl:variable name="target-page" as="element()?" select="at:get-referenced-page($context)" /> <xsl:choose> <xsl:when test="at:is-within-take($target-page)"> <xsl:sequence select="at:get-folio($target-page)"/> </xsl:when> <xsl:otherwise> <xsl:sequence select="at:get-stored-page-number-for-pageref($pageListingDoc, $context, false())"/> </xsl:otherwise> </xsl:choose>
Updating Page Number References
The display page numbers are constructed with a format that makes it reliable to find them:
<LineArea text-altitude="8.87939pt" ...>
  <TextArea text="&#x200B;" .../>
  <TextArea text="2" .../>
  <TextArea text="&#x200B;" .../>
</LineArea>
The marker is a LineArea containing TextArea elements where the text contains a zero-width space (\u200B). Zero-width spaces are not otherwise used for anything in this content and thus make ideal marker characters as they have no visual effect.
The XSLT provides helper functions that work with display page numbers: finding them, getting the folio parts, etc.
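The details of those helpers are specific to the project, but a minimal sketch of one of them, assuming the marker structure shown above, might look like the following. The function name at:get-display-page-number() is hypothetical.

<!-- Sketch only: extract the page number part of a folio line, assuming
     zero-width-space text areas delimit the page number as shown above. -->
<xsl:function name="at:get-display-page-number" as="xs:string?">
  <xsl:param name="folio-line" as="element(at:LineArea)"/>
  <!-- The zero-width space (U+200B) text areas act as delimiters -->
  <xsl:variable name="markers" as="element(at:TextArea)*"
    select="$folio-line/at:TextArea[@text = '&#x200B;']"/>
  <!-- The page number is whatever falls between the first two markers -->
  <xsl:variable name="number-areas" as="element(at:TextArea)*"
    select="$markers[1]/following-sibling::at:TextArea
            intersect $markers[2]/preceding-sibling::at:TextArea"/>
  <xsl:sequence select="string-join($number-areas/@text, '')"/>
</xsl:function>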
The practical challenge with updating page numbers is that the width of the new page number will usually be different from the width of the original, which means that the horizontal position of the text area has to be adjusted to account for the difference in width.
However, the post process does not have any direct way to know what the rendered width of the new page number is.
One solution would be to use extension functions and something like the Java2D libraries to render the characters using the font details available in the area tree.
For this project we took a more brute-force but more reliable approach, namely to include in the area tree an instance of every possible character that could occur in a page number (0-9, a-z, A-Z, ".", "-", ":", etc.), in every font and size used in the publication, in a form that enables lookup of the rendered characters. This set of "character samples" is generated as part of the HTML preprocessing.
The HTML is:
<char-samples id="util:char-samples">
<char-sample class="body">
<char-set class="number-set sz83" id="util:char-set:body-sz83">
<decimal>.</decimal><char>0</char><char>1</char><char>2</char><char>3</char><char>4</char><char>5</char><char>6</char>
<char>7</char><char>8</char><char>9</char><char>a</char><char>b</char><char>c</char><char>d</char><char>e</char><char>f</char>
<char>g</char><char>h</char><char>i</char><char>j</char><char>k</char><char>l</char><char>m</char><char>n</char><char>o</char>
<char>p</char><char>q</char><char>r</char><char>s</char><char>t</char><char>u</char><char>v</char><char>w</char><char>x</char>
<char>y</char><char>z</char><char>A</char><char>B</char><char>C</char><char>D</char><char>E</char><char>F</char><char>G</char>
<char>H</char><char>I</char><char>J</char><char>K</char><char>L</char><char>M</char><char>N</char><char>O</char><char>P</char>
<char>Q</char><char>R</char><char>S</char><char>T</char><char>U</char><char>V</char><char>W</char><char>X</char><char>Y</char>
<char>Z</char>
</char-set>
...
</char-sample>
...
</char-samples>
Because each <char-set> specifies one of the font-setting classes from the CSS, the font and size will be present in the area tree. Those values then serve as the lookup key for any character, along with the character itself. The only maintenance task is keeping the generating XSLT in sync with the CSS styles so that all relevant font and size combinations are represented.
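The preprocessing step that generates this markup is straightforward. A rough sketch follows; the list of font-setting classes is hypothetical and would in practice need to mirror the classes actually used in the CSS.

<!-- Sketch only: generate the character-sample markup during HTML
     preprocessing. The class list is hypothetical and must be kept in
     sync with the CSS. -->
<xsl:variable name="font-classes" as="xs:string*"
  select="('sz83', 'sz100', 'sz120')"/>
<xsl:variable name="sample-chars" as="xs:string*"
  select="('.', '-', ':',
           '0','1','2','3','4','5','6','7','8','9',
           (: a-z and A-Z via codepoints :)
           for $cp in (97 to 122, 65 to 90) return codepoints-to-string($cp))"/>

<xsl:template name="make-char-samples">
  <char-samples id="util:char-samples">
    <char-sample class="body">
      <xsl:for-each select="$font-classes">
        <char-set class="number-set {.}" id="util:char-set:body-{.}">
          <xsl:for-each select="$sample-chars">
            <xsl:choose>
              <xsl:when test=". = '.'"><decimal>.</decimal></xsl:when>
              <xsl:otherwise><char><xsl:value-of select="."/></char></xsl:otherwise>
            </xsl:choose>
          </xsl:for-each>
        </char-set>
      </xsl:for-each>
    </char-sample>
  </char-samples>
</xsl:template>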
The area tree markup is then:
<BlockViewportArea id="util:char-samples" ...> <BlockArea id="util:char-set:body-sz83" ...> <LineArea ...> <InlineArea ...> <TextArea ... width="2.224pt" font-family="NewCenturySchlbk LT Std" font-size="8pt" text-width="2.224pt" text="." /> </InlineArea> ... <LineArea> </BlockArea> </BlockViewportArea>
Because each source character is in a separate element, there is exactly one text area per character in the area tree, and each text area specifies the exact width in points of its character.
This XSLT code then looks up the details of a character given the font details:
<xsl:key name="char-samples" match="at:BlockArea[starts-with(@id, 'util:char-set:')]//at:TextArea" use="at:make-char-sample-key(@font-family, @font-size, @text)" /> <!-- Try to find a character sample for the specified font family, size, and text value. --> <xsl:function name="at:get-char-sample" as="element(at:TextArea)?"> <xsl:param name="context" as="element()"/><!-- Area tree element --> <xsl:param name="font-family" as="xs:string"/> <xsl:param name="font-size" as="xs:string"/> <xsl:param name="text" as="xs:string"/> <xsl:variable name="key" as="xs:string" select="at:make-char-sample-key($font-family, $font-size, $text)" /> <xsl:variable name="result" as="element()?" select="key('char-samples', $key, root($context))[1]" /> <xsl:sequence select="$result"/> </xsl:function> <xsl:function name="at:make-char-sample-key" as="xs:string"> <xsl:param name="font-family" as="xs:string"/> <xsl:param name="font-size" as="xs:string"/> <xsl:param name="text" as="xs:string"/> <xsl:variable name="key" select="string-join(($font-family, $font-size, $text), ':')"/> <xsl:sequence select="$key"/> </xsl:function>
Then it's simply a matter of doing the math to determine the amount of adjustment to apply to the text areas in the affected line area and updating them, here for the case of page references within ToC entries that use leaders (the most common case):
<!-- Page references in lines with leaders (e.g., ToC pages). This will be
     the most common case.

     There should only be one page reference, which means we can simply
     calculate the adjustment to the leader and all inline and text nodes
     following the leader should be good to go.

     The prefolio and postfolio should always be the same.
  -->
<xsl:template match="at:LineArea[at:is-within-page-reference-line(.)]">
  <xsl:variable name="pagerefs" as="element(at:InlineArea)+"
    select=".//at:InlineArea[starts-with(@id, 'pageref:')]"
  />
  <!-- There should be exactly one TextArea in the pageref InlineArea -->
  <xsl:variable name="pageref" as="element(at:InlineArea)" select="$pagerefs[1]" />
  <xsl:variable name="target-page" as="element()?"
    select="at:get-referenced-page($pageref)"
  />
  <xsl:variable name="page-ref-text-area" as="element(at:TextArea)?"
    select="($pageref//at:TextArea)[1]"
  />
  <xsl:variable name="orig-page-number" as="xs:string?"
    select="$page-ref-text-area/@text"
  />
  <!-- NOTE: This is the formatted page number, e.g., 'xix', not 19 -->
  <xsl:variable name="page-number" as="xs:string?"
    select="at:get-referenced-page-number($pageref)"
  />
  <xsl:choose>
    <xsl:when test="empty($page-number)">
      <!-- Nothing to do, just use what we have -->
      <xsl:next-match/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="current-width" as="xs:double"
        select="at:get-length-value($page-ref-text-area/@text-width)"
      />
      <xsl:variable name="new-number-width" as="xs:double"
        select="at:get-string-width($page-ref-text-area, $page-number)"
      />
      <xsl:variable name="width-difference" as="xs:double"
        select="$new-number-width - $current-width"
      />
      <xsl:choose>
        <xsl:when test="$width-difference ne 0.0">
          <!-- Now figure out how many dots to remove from the leader to
               account for the added space: -->
          <xsl:variable name="leader-area" as="element(at:LeaderArea)"
            select="$pageref/preceding::at:LeaderArea[1]"
          />
          <xsl:variable name="leader-string" as="xs:string"
            select="$leader-area/at:TextArea/@text"
          />
          <xsl:variable name="leader-width" as="xs:double"
            select="at:get-length-value($leader-area/at:TextArea/@text-width)"
          />
          <xsl:variable name="dots" as="xs:string*"
            select="tokenize($leader-string, ' ')"
          />
          <xsl:variable name="dot-width" as="xs:double"
            select="$leader-width div count($dots)"
          />
          <xsl:variable name="sign" as="xs:integer"
            select="if ($width-difference lt 0) then 1 else -1"
          />
          <xsl:variable name="dot-count-diff" as="xs:integer"
            select="if (number($dot-width))
                    then (($width-difference idiv $dot-width) + 1) * $sign
                    else 0"
          />
          <xsl:variable name="dot-count" as="xs:integer"
            select="$dot-count-diff + count($dots)"
          />
          <xsl:variable name="new-leader-string" as="xs:string?"
            select="string-join(for $n in 1 to $dot-count return $dots[1], ' ')"
          />
          <xsl:variable name="new-leader-width" as="xs:double"
            select="$dot-width * $dot-count"
          />
          <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates>
              <xsl:with-param name="new-leader-string" as="xs:string?"
                tunnel="yes" select="$new-leader-string" />
              <xsl:with-param name="new-leader-width" as="xs:double?"
                tunnel="yes" select="$new-leader-width" />
              <xsl:with-param name="left-adjust" as="xs:double"
                tunnel="yes" select="$width-difference" />
              <xsl:with-param name="page-ref" as="element(at:InlineArea)?"
                tunnel="yes" select="$pageref" />
            </xsl:apply-templates>
          </xsl:copy>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates>
              <xsl:with-param name="doDebug" as="xs:boolean"
                tunnel="yes" select="$doDebug" />
              <xsl:with-param name="new-leader-string" as="xs:string?"
                tunnel="yes" select="()" />
              <xsl:with-param name="new-leader-width" as="xs:double?"
                tunnel="yes" select="()" />
              <xsl:with-param name="left-adjust" as="xs:double"
                tunnel="yes" select="0.0" />
              <xsl:with-param name="page-ref" as="element(at:InlineArea)?"
                tunnel="yes" select="$pageref" />
            </xsl:apply-templates>
          </xsl:copy>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
Updating the Page History Database
The final task is to update the page history database if the pages being produced are the final delivered version of the update package. (The database can be updated at any time for a given update until that update is published, at which point the entries for it are fixed and should not be modified.)
The processing reads the page history database from its location in the source version control working copy and writes a new one to a temporary location, avoiding the restriction in XSLT on reading and writing to the same file. A separate processing script then copies the new version of the page history database to the working copy, if update of the page history database has been requested as part of the processing run.
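A minimal sketch of that read-then-write pattern follows. The parameter values and the mode name are hypothetical placeholders.

<!-- Sketch only: read the existing page history and write the updated
     version to a temporary location, since XSLT cannot write to a URI it
     has already read. Parameter values and mode name are hypothetical. -->
<xsl:param name="page-history-uri" as="xs:string"
  select="'page-history.xml'"/>
<xsl:param name="temp-page-history-uri" as="xs:string"
  select="'temp/page-history.xml'"/>

<!-- If there is no existing database, an empty one would be synthesized
     first (not shown here). -->
<xsl:variable name="page-history" as="document-node()?"
  select="if (doc-available($page-history-uri))
          then doc($page-history-uri)
          else ()"/>

<xsl:template name="write-updated-page-history">
  <xsl:result-document href="{$temp-page-history-uri}" method="xml" indent="yes">
    <!-- Templates in this mode do the actual merging; the data for the new
         pages would be passed via tunnel parameters (omitted). -->
    <xsl:apply-templates select="$page-history" mode="update-page-history"/>
  </xsl:result-document>
</xsl:template>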
The logic for updating the page listing is simply a matter of processing each page to create or update a physical page entry and then processing each marked ID to create or update the element-to-page entry for the element.
Processing starts from the existing page database, updating the entries for pages already reflected in it and then adding entries for any pages not yet reflected. If there is no existing database, an empty one is synthesized first and then processed normally.
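Using the hypothetical record structure sketched earlier, and assuming the post process has already reduced the new area tree to simple per-page summary elements (called new-page here, with display-page, absolute-page, and update attributes), the merge of the physical page records could be sketched roughly as follows. None of these element names or helpers are the project's actual ones.

<!-- Sketch only: merge new physical page data into the existing page
     history. The page-record, update-entry, and new-page elements are
     hypothetical, matching the earlier page-history sketch. -->
<xsl:template match="page-history" mode="update-page-history">
  <xsl:param name="new-pages" as="element(new-page)*" tunnel="yes"/>
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <!-- Update records for pages already in the database -->
    <xsl:apply-templates select="page-record" mode="#current"/>
    <!-- Add records for pages not yet reflected in the database -->
    <xsl:for-each select="$new-pages[not(@display-page = current()/page-record/@display-page)]">
      <page-record display-page="{@display-page}">
        <!-- Folio details and first line omitted from this sketch -->
        <update-entry update="{@update}" absolute-page="{@absolute-page}"/>
      </page-record>
    </xsl:for-each>
    <!-- Element records are handled analogously (not shown) -->
    <xsl:copy-of select="element-record"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="page-record" mode="update-page-history">
  <xsl:param name="new-pages" as="element(new-page)*" tunnel="yes"/>
  <xsl:copy>
    <xsl:copy-of select="@* | node()"/>
    <!-- Append an update record if this page occurs in the new update -->
    <xsl:for-each select="$new-pages[@display-page = current()/@display-page][1]">
      <update-entry update="{@update}" absolute-page="{@absolute-page}"/>
    </xsl:for-each>
  </xsl:copy>
</xsl:template>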
For debugging purposes the project includes a simple style sheet that generates an HTML view of the page history database.
A final challenge with the page history database is creating it initially for publications being migrated from the old XPP system to the new system.
While it is possible to generate a history of the physical pages from the XPP data, there is no easy way to correlate the sections, figures, and tables in the XPP version with the corresponding elements in the new XML version, or to capture the page numbers those elements fell on in the XPP version. This means the initial element history database must be populated by hand, a tedious manual process. There may be ways to better automate it, but the project scope did not allow us to explore them. For example, it might be possible to examine the last published PDF and correlate elements by their titles and other positional clues. We also considered generating area trees from PDFs and then populating those area trees with the necessary start and end markers. But even with that level of automation there would still be a necessary quality assurance step to verify that the associations were correct. Fortunately, this is a one-time task for each publication being migrated.
Conclusions and Future Work
Given markup in the publication source that identifies the start and end of changed content, it is possible to implement generation of loose-leaf change packages with automatically-generated page numbers using CSS for pagination and post processing of the initial area tree produced by the pagination engine.
While the implementation required developing a number of tricks for getting the information needed for post processing into the area tree, the resulting data processing was not overly challenging. It did take us several iterations to work out the final processing pipeline, with the usual false starts, but the final solution seems to have a level of complexity commensurate with the complexity of the problem itself.
The scope of the project did not include automating the identification of changes. Given an XML differencing tool like DeltaXML it should be possible to compare two area trees and determine that a sequence of pages has changed and then inject the appropriate markers back into the publication source for review and adjustment as needed.
Another potential area of work is automating the insertion of changed pages into an area tree that reflects the full update history of the publication. Municode currently does this in PDF manually but it should be possible to automate this using area trees from which a PDF can then be generated as needed.
References
[AHF] Antenna House Formatter, Antenna House, Inc., http://antennahouse.com
[CSS] W3C CSS Working Group Standards and Drafts, https://www.w3.org/TR/#tr_Cascading_Style_Sheets_(CSS)_Working_Group
[CCSPAGE] Kimber, W. Eliot, CSS Pagination Book, https://drmacro.github.io/css-pagination-book/, 2019
[2] The author was a contributor to the XSL-FO specification and one of the early users of the standard to do serious production (we first used it to produce mobile phone manuals for Nokia in every language Nokia published phone manuals in). I continue to do significant amounts of XSL-FO implementation work today.
However, given a choice, I would choose CSS pagination whenever it was an option.
Unfortunately, at the time of writing, only Antenna House Formatter is sufficiently complete in its implementation of CSS pagination to satisfy the layout and typographic demands of the types of documents I normally work with, including the requirements of municipal code. This means, in part, that there is no open-source solution for CSS pagination comparable to Apache FOP for XSL-FO. Thus, while CSS pagination is otherwise a compelling approach to paged document production, it requires investment in commercial software. That software offers tremendous value, but the lack of an open-source option limits the contexts in which CSS pagination can be recommended.
[3] It occurs to me as I write this that it would also be possible to use ":blank" page rules to explicitly mark pages as being blank.