Graham, Tony. “Copy-fitting for Fun and Profit.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Graham01.
Abstract
Copy-fitting is the fitting of words into the space available for them or, sometimes,
adjusting the space available to fit the words. Copy-fitting is included in “Extensible
Stylesheet Language (XSL) Requirements Version 2.0” and is a common feature of making
real-world documents. This talk describes an ongoing internal project for automatically
finding and fixing problems that can be fixed by copy-fitting in XSL-FO.
Copy-fitting has two meanings: it is both the “process of estimating the amount of
space typewritten copy will occupy when converted into type” [White] and the process of
adjusting the formatted text to fit the available space. Since
automated formatting, such as with an XSL-FO formatter, is now so common, a lot of
the manual processes for estimating the amount of space are superfluous.
Copy-fitting as estimating
There are multiple, and sometimes conflicting, aspects to the relationship between
copy-fitting and profit. For the commercial publisher, there is tension between more
pages costing more money and more pages providing a better reading experience or even
a better “shelf appeal”, for want of a better term. We tell ourselves not to judge a book by its cover, but we do still sometimes judge a book by the width of its spine when we look at
it on the shelf in the bookstore. Copy-fitting, in this first sense, is part of the
process of making the text fill the number of pages (the ‘extent’) that has been decided
in advance by the publisher. This may mean increasing the font-size, leading, other
spaces, and the margins to fill more pages, or it may mean reducing them so that more
text fits on a page.
Figure 1: Different approaches to filling pages
The six steps to copyfitting from “How to Spec Type” [White] are:
Count manuscript characters.
Select the typeface and type size you want.
Cast off to determine the number of set characters per line.
Divide set characters per line into total manuscript character count. Result is number of lines of set type.
Add leading to taste.
If too short: add leading or increase type size or decrease line length.
If too long: remove leading or decrease type size or increase line length.
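The arithmetic behind these steps can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that “leading” means the extra space added to the type size to give the depth of one line.

```python
import math


def estimate_set_lines(manuscript_chars: int, chars_per_line: int) -> int:
    """Divide the manuscript character count by the set characters per line."""
    if chars_per_line <= 0:
        raise ValueError("chars_per_line must be positive")
    # A partly filled last line still occupies a whole line.
    return math.ceil(manuscript_chars / chars_per_line)


def estimate_depth_pt(set_lines: int, type_size_pt: float, leading_pt: float) -> float:
    """Total depth of the set type, assuming line depth = type size + leading."""
    return set_lines * (type_size_pt + leading_pt)


lines = estimate_set_lines(120_000, 60)   # 2000 lines of set type
depth = estimate_depth_pt(lines, 10, 2)   # 10 pt type with 2 pt of leading
```

If the estimated depth is too short or too long for the planned extent, the leading, type size, or line length is changed and the estimate is repeated.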
However:
Most important of all, decisions should be made with the ultimate aim of benefiting
the reader.
— “Book Typography: A Designer’s Manual”, Mitchell & Wightman
Copy-fitting as adjustment
‘Copy-fitting’ as adjustment is now the more common use for the term.
Copy-fitting for fun
It’s an interesting problem to solve.
Copy-fitting for profit
Books
The total number of pages in a book is generally a multiple of 16 or 32, and any variation
can involve additional cost:
It must be remembered that books are normally printed and bound in sheets of 16, 32,
64 (or multiples upwards) pages, and this exact figure for any job must be known by
the designer at the point of designing: it will be the designer’s job to see that
the book makes the required number of pages exactly. If a book is being printed and
bound in sheets of 32 pages (16 on each side of the sheet), it is generally possible
to add one leaf of 2 pages by ‘tipping in’, or 4, 8, or any other multiple of 4 pages
extra by additional binding operation, but the former will mean hand-work, and even
the latter will involve disproportionately higher cost.
— “The Thames and Hudson Manual of Typography”, Ruari McLean, 1980
The role here for copy-fitting (in the second sense) is to ensure that the overall
page count is at or close to a multiple of the number of pages in a signature.
Even Print-On-Demand (POD) printing has similar constraints. For example, the Blurb
POD service requires that books in “trade” formats – e.g. 6×9 inches – are a multiple
of six pages [Blurb]. In a simplistic example, suppose that the document with the default styles applied
formats as 15 pages:
Figure 2: Document with default styles and six-page signatures
The (self-)publisher has the choice of three alternatives:
Leave the layout unchanged. However, the publisher is paying for the blank pages,
and the number of blank pages may be more than the house style allows.
Copy-fit to reduce the page count to the next lower multiple of the signature size.
This, obviously, is cheaper than paying for blank pages, but “the public still has
a tendency to judge the value of a book by its thickness” [Williamnos].
Figure 3: After copy-fitting to reduce page count
Copy-fit to increase the page count to better fill a multiple of the signature size.
This, just as obviously, costs more than reducing the page count, but it has the potential
for helping sales.
Figure 4: After copy-fitting to increase page count
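The page counts behind the three alternatives can be worked out as in the following sketch. The function name and the returned keys are hypothetical and are used only for illustration.

```python
def signature_options(page_count: int, signature_size: int) -> dict:
    """Page-count targets for a document printed in whole signatures."""
    if page_count < 0 or signature_size <= 0:
        raise ValueError("page_count must be >= 0 and signature_size > 0")
    # Next lower and next higher multiples of the signature size.
    lower = (page_count // signature_size) * signature_size
    upper = -(-page_count // signature_size) * signature_size  # ceiling division
    return {
        "blank_pages_if_unchanged": upper - page_count,
        "reduce_to": lower,
        "increase_to": upper,
    }


options = signature_options(15, 6)
```

For the 15-page document and six-page signatures above, leaving the layout unchanged gives three blank pages, while copy-fitting targets either 12 or 18 pages.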
Manuals and other documentation
A manufacturer who provides printed documentation along with their product faces a
different set of trade-offs. The format for the documentation may be constrained by
the size of the product and its packaging, regulatory requirements, or the house style
for documentation of a particular type. A manufacturer may need to print thousands
or even millions[1] of documents. Users expect clear documentation but may be unwilling to pay extra
for improved documentation, yet unclear documentation can lead to increased
support costs or, in some cases, to fatalities.
Suppose, for example, that the text for a document has been approved by the subject
matter experts – and, in some cases, also by the company’s lawyers and possibly the
government regulator – and has to be printed on a single standard-size page, or possibly
a single side of a standard-size page, yet there is too much text for the space that
is available. To let the editorial staff rewrite the text so that it fits the available
space is definitely not an option, so it becomes necessary to apply copy-fitting to
adjust the formatting until the text does fit. Figure 5 shows an information sheet that was included with the Canon MP830 when sold in EMEA.
The same information in 24 languages is printed on a single-sided sheet of paper.
Most users would look at the information once, at most. If the information had been
allowed to extend to the back of the sheet, it would have considerably increased the
cost of providing the information for no real benefit. Some form of copy-fitting was
probably used to make sure that the information could fit on one side of the sheet.
Figure 5: Multiple warnings on a single-sided sheet
Multilingual text has other complications. Figure 6 shows two corresponding pages from the English and Brazilian Portuguese editions
of the Canon MP830 Quick Start Guide. The same information is presented on both pages. However, translations of English
are typically longer than the corresponding English, and this is no exception. The
Brazilian Portuguese page has more lines of text on it and, as Figure 7 shows, the font-size and leading have also been reduced in the Brazilian Portuguese
page. A copy-fitting process could have been used to adjust the font-size and leading
across the whole document by the minimum necessary so that no text overflowed its
page.
Figure 6: English and Brazilian Portuguese pages
Figure 7: English and Brazilian Portuguese text
Standards for copy-fitting of XML or HTML
There are currently no standards for how to specify copy-fitting for either XML or
HTML markup. However, copy-fitting was covered in the requirements for XSL 2.0, and
the forward-looking “List of CSS features required for paged media” by Bert Bos has
an extensive section on copy-fitting.
Extensible Stylesheet Language (XSL) 1.1
XSL 1.1 does not address copy-fitting, but it does define multiple properties for
controlling aspects of formatting that, if the properties are not applied, could lead
to problems that would need to be corrected using copy-fitting:
hyphenation-keep
Controls whether the last line of a column or page may end with a hyphen.
hyphenation-ladder-count
Limits the number of successive hyphenated lines.
orphans
The minimum number of lines that must be left at the bottom of a page.
widows
The minimum number of lines that must be left at the top of a page.
Figure 8: Orphans and widows
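As a rough illustration of what the orphans and widows properties constrain, the following Python sketch checks one paragraph that breaks across two pages. The function name is hypothetical. The default of 2 for each property matches the initial values in XSL 1.1. A real formatter applies these constraints while breaking pages rather than checking afterwards.

```python
from typing import List


def broken_constraints(lines_before_break: int, lines_after_break: int,
                       orphans: int = 2, widows: int = 2) -> List[str]:
    """Return the properties violated by a paragraph split across a page break."""
    if lines_before_break == 0 or lines_after_break == 0:
        return []  # The paragraph is not split, so neither property applies.
    violated = []
    if lines_before_break < orphans:
        violated.append("orphans")  # Too few lines left at the bottom of the page.
    if lines_after_break < widows:
        violated.append("widows")   # Too few lines carried to the top of the next page.
    return violated
```

For example, a six-line paragraph that leaves one line at the bottom of a page violates `orphans`, and one that carries a single line over violates `widows`.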
Extensible Stylesheet Language (XSL) Requirements Version 2.0
“Extensible Stylesheet Language (XSL) Requirements Version 2.0” [XSLReq2.0] includes:
2.1.4 Copyfitting
Add support for copyfitting, for example to shrink or grow content (change properties
of text, line-spacing, ...) to make it constrain to a certain area. This is going to
be managed by a defined set of properties, and in the stylesheet it will be possible
to define the preference and priority for which properties should be changed. That list
of properties that can be used for copyfitting is going to be defined.
Additionally, multiple instances of alternative content can be provided to determine
best fit.
This includes copyfitting across a given number of pages, regions, columns etc, for
example to constrain the number of pages to 5 pages.
Add the ability to keep consistency in the document, e.g. when a specific area is
copyfitted with 10 pt fonts, all other similar text should be the same.
List of CSS features required for paged media
“List of CSS features required for paged media” (https://www.w3.org/Style/2013/paged-media-tasks)
by Bert Bos has a ‘Copyfitting’ section. Part of it is relevant to fitting content
into specified pages.
20. Copyfitting
Copyfitting is the process of selecting fonts and other parameters such that text
fits a given space. This may range from making a book have a certain number of pages,
to making a word fit a certain box.
20.1 Micro-adjustments
If a page has enough content, nicer-looking alignments and line breaks can often be
achieved by “cheating” a little: instead of the specified line height, use a fraction
of a point more or less. Instead of the normal letter sizes, make the letters a tiny
bit wider or narrower…
This can also help in balancing columns: In a newspaper, e.g., it may look better
to have all columns of an article the same height at the cost of a slightly bigger
line height in the last column, than to have all lines aligned but with a gap below
the last column.
The French newspaper “Le Canard enchainé” is an example of a publication that favors
full columns over equal line heights.
20.2 Automatic selection of font size
One common case is choosing a font size such that a headline exactly fills the width
of the page.
A variant is the case where each individual line of the text may be given a different
font size, as small as possible above a certain minimum.
Two models suggested for CSS are to see copyfitting either as one of several algorithms
available for justification, and thus as a possible value for ‘text-justify’; or as
a way to treat overflow, and thus as a possible value for ‘overflow-style’. Both can
be useful and they can co-exist:
The first rule could mean that in each line of the block, rather than shrinking or
stretching the interword space to fill out the line, the font size of each letter
is decreased or increased by a certain factor so that the line is exactly filled out.
The latter could mean that the font size of all text in the block is decreased or
increased by a common factor so that the font size is as large as possible without
causing the text to overflow. (As the example shows, this type of copyfitting requires
the block’s width and height to be set.)
Figure 9: The title of the chapter is one word that exactly fills the width of the page
20.3 Alternative content or style
If line breaks or page breaks turn out very bad, a designer may go back to the author
and ask if he can’t replace a word or change a sentence somewhere, or add or remove
an image.
In CSS, we assume we cannot ask the author, but the author may have proposed alternatives
in advance.
Alternatives can be in the style sheet (e.g., an alternative layout for some images)
or in the source (e.g., alternative text for some sentence).
In the style sheet, those alternatives would be selected by some selector that only
matches if that alternative is better by some measure than the first choice.
Some alternatives may be provided in the form of an algorithm instead of a set of
fixed alternatives. E.g., in the case of alternative image content, the alternative
may consist of progressively cropping and scaling the image up to a certain limit
and in such a way that the most important content always remains visible.
E.g., an image of a group of people around two main characters can be divided into
zones that are progressively less important: the room they are in, people’s feet,
the less important people, up to just the heads of the two main characters, which
should always be there.
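The headline case in section 20.2 of the quoted list can be sketched as a simple scaling calculation. The sketch assumes that the width of a single line scales linearly with its font size and that the width at the current size has already been measured, for example by a formatter. The function and parameter names are hypothetical.

```python
from typing import Optional


def fitted_font_size(current_size_pt: float, measured_width_pt: float,
                     available_width_pt: float,
                     min_size_pt: Optional[float] = None,
                     max_size_pt: Optional[float] = None) -> float:
    """Scale a font size so that one line of text fills the available width."""
    if measured_width_pt <= 0:
        raise ValueError("measured_width_pt must be positive")
    # Assumption: line width is proportional to font size.
    size = current_size_pt * available_width_pt / measured_width_pt
    if max_size_pt is not None:
        size = min(size, max_size_pt)
    if min_size_pt is not None:
        # At the minimum size the text may still overflow.
        size = max(size, min_size_pt)
    return size


headline_size = fitted_font_size(24.0, 300.0, 450.0)
```

A headline that measures 300 pt wide at 24 pt would be set at 36 pt to fill a 450 pt measure. The optional limits correspond to the “certain minimum” mentioned in the quoted text.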
Existing Extensions
Print & Page Layout Community Group
The Print and Page Layout Community Group developed a series of open-source extensions
for XSLT processors so you can run any number of iterations of your XSL-FO processor
from within your XSLT transformation, which allows you to make decisions based on
formatted sizes of areas.
The extensions are currently available for Java and DotNet and use either the Apache
FOP XSL formatter or Antenna House AH Formatter to produce the area trees.
To date, stylesheets that use the extensions have been bespoke: writing a stylesheet
that uses the extensions has required knowledge of the source XML, and the stylesheet
for transforming the XML into XSL-FO is the stylesheet that uses the XSLT extensions.
Figure 10: Balisage 2014 poster (detail)
AH Formatter
AH Formatter, from Antenna House, extends the overflow property. When text overflows the area defined for it, the text may either be replaced
or one of a set of properties – including font-size and font-stretch – can be automatically reduced (down to a defined lower limit) to make the text fit
into the defined area.
FOP
FOP provides fox:orphan-content-limit and fox:widow-content-limit extension properties for specifying a minimum length to leave behind or carry forward,
respectively, when a table or list block breaks over a page.
Copy-fitting Implementation
The currently implemented processing paths are shown in the following figure. The
simplest processing path is the normal processing of an XSL-FO file to produce formatted
pages as PDF. The copy-fitting processes require the XSL-FO to instead be formatted
and output as Area Tree XML (an XML representation of the formatted pages) that is
analyzed to detect error conditions. As currently implemented, each of the supported
error conditions is implemented as an XSLT 2.0 template defined in a separate XSLT
file. A separate XSLT stylesheet uses the Area Tree XML as input and imports the
error condition stylesheets. The simplest version of this stylesheet outputs an XML
representation of the errors found. This XML can be processed to generate a report
detailing the error conditions. Alternatively, the error information can be combined
with the Area Tree XML to generate a version of the formatted document that has the
errors highlighted. Since copy-fitting involves modifying the document, another alternative
stylesheet uses the XSLT extension functions from the Print & Page Layout Community
Group at the W3C to run the XSL-FO formatter during the XSLT transformation to iteratively
adjust selected aspects of the XSL-FO until the Area Tree XML does not contain any
errors (or the limits of either adjustment tolerance or maximum iterations have been
reached).
Figure 11: Processing paths
Error condition XSLT
The individual XSLT file for an error condition consists of an XSLT template that
matches on a node with that specific error. The result of the template is an XML
node encoding the error condition and its location. The details of how to represent
the information are not part of the template (and are still in flux anyway).
The error XML can be processed to generate a report. It is, of course, also possible
to augment the Area Tree XML to add indications of the errors to the formatted result,
as in the simple example below.
Figure 12: Error report PDF (detail)
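The idea of matching an error condition and emitting a record of the error and its location can be illustrated outside XSLT. The following Python sketch scans a deliberately simplified, hypothetical area tree. The element and attribute names (`area-tree`, `page`, `block`, `height`, `content-height`) are invented for the example and are not the Area Tree XML of any real formatter, and the project itself uses XSLT 2.0 templates rather than Python.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, simplified area tree; real Area Tree XML is far richer.
SAMPLE_AREA_TREE = """<area-tree>
  <page number="1">
    <block id="title" height="40" content-height="36"/>
    <block id="warning" height="120" content-height="131.5"/>
  </page>
  <page number="2">
    <block id="notes" height="200" content-height="200"/>
  </page>
</area-tree>"""


def find_overflows(area_tree_xml: str) -> List[Dict[str, object]]:
    """Report each block whose content is taller than the block itself."""
    root = ET.fromstring(area_tree_xml)
    errors = []
    for page in root.iter("page"):
        for block in page.iter("block"):
            available = float(block.get("height"))
            used = float(block.get("content-height"))
            if used > available:
                errors.append({
                    "code": "overflow",
                    "page": int(page.get("number")),
                    "block": block.get("id"),
                    "excess": used - available,
                })
    return errors


report = find_overflows(SAMPLE_AREA_TREE)
```

Each record carries the kind of error and where it occurred, which is enough to drive either a report or highlighting in the formatted output.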
Copy-fitting instructions
The copy-fitting instructions consist of sets of contexts and changes to make in that
context. The sets are applied in turn until either the current formatting round does
not generate any errors or the sets are exhausted, in which case the results from the
round with the fewest errors are used. Within each set of contexts and changes,
the changes can either be applied in sequence or all together. Like the rest of the
processing, the XML format is still in flux.
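The control flow for applying the sets in turn might look like the following sketch. It assumes that each set is applied to the original document rather than cumulatively, and that a caller-supplied function formats the document with the given changes and returns the number of errors found. Both the function names and the representation of a set as a dictionary are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Changes = Dict[str, str]


def apply_sets_in_turn(
    instruction_sets: Sequence[Changes],
    format_and_count_errors: Callable[[Changes], int],
) -> Optional[Tuple[int, int]]:
    """Return (index of the best set, its error count), or None if there are no sets."""
    best: Optional[Tuple[int, int]] = None
    for index, changes in enumerate(instruction_sets):
        errors = format_and_count_errors(changes)  # One formatting round.
        if best is None or errors < best[1]:
            best = (index, errors)
        if errors == 0:
            break  # A round with no errors ends the process early.
    return best
```

When no round is error-free, the first round with the lowest error count is the one whose results are kept.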
When the XSLT extensions from the Print & Page Layout Community Group are used, the
changes instruction indicates a range of values. The XSLT initially uses the .start value and, if errors are found, does a binary search between the .start and .end values. Iterations continue until no errors occur, the maximum number of iterations
is reached, or the difference between iterations is less than the allowed tolerance.
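A minimal Python sketch of this search follows. It assumes that values nearer the end of the range are more likely to be free of errors, and that a caller-supplied function formats the document with a given value and returns the number of errors. The function and parameter names are hypothetical, and the actual implementation is generated XSLT rather than Python.

```python
from typing import Callable, List, Tuple


def search_value(start: float, end: float,
                 count_errors: Callable[[float], int],
                 tolerance: float = 0.05,
                 max_iterations: int = 8) -> Tuple[float, int, List[float]]:
    """Try `start` first, then bisect towards `end` while errors remain."""
    value = start
    tried = [value]
    errors = count_errors(value)
    low, high = start, end
    while errors > 0 and len(tried) < max_iterations:
        midpoint = (low + high) / 2
        if abs(midpoint - value) < tolerance:
            break  # The next step would change the value by less than the tolerance.
        value = midpoint
        tried.append(value)
        errors = count_errors(value)
        if errors > 0:
            low = value  # Still failing, so move closer to `end`.
    return value, errors, tried


# Example: text that only fits once the font size is 8.4pt or smaller.
result = search_value(10.0, 8.0, lambda size: 0 if size <= 8.4 else 1)
```

In the example, the values tried are 10.0, 9.0, 8.5, and 8.25, and the search stops at 8.25 because that round has no errors.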
The copy-fitting instructions are transformed into XSLT that is executed by the XSLT
processor, similarly to how Schematron files and XSpec files are transformed into
XSLT that is then executed.
Future Work
There should be an XML format for selecting which error tests to use and what threshold
values to use for each test. That XML would be converted into the XSLT that is run
when checking for errors.
There is currently only a limited number of properties that can be matched on. The
range is due to be expanded as we get the hang of doing copy-fitting. The match conditions are transformed into match attributes in the generated XSLT, so there is a lot of scope for improvement.
The range of correction actions is due to be increased to include, for example, supplying
alternate text.
Conclusion
Automated detection and correction of formatting problems can solve a set of real
problems for real documents. There is a larger set of formatting problems that can
be recognized automatically and reported to the user in a variety of ways but which
so far are not amenable to automatic correction. Work is ongoing to extend both the
set of formatting problems that can be recognized and the set of problems that can
be corrected automatically.
[Williamnos] Williamson, Hugh, Methods of Book Design, 3rd ed., Yale University Press, 1983, ISBN 0-300-03035-5
[1] A prescription or over-the-counter medication comes with a printed package insert, and, for example, nearly 4 billion prescriptions were written in the U.S. in 2010,
with the most-prescribed drug prescribed over 100 million times [WebMD].
Tony Graham is a Senior Architect with Antenna House, where he works on their XSL-FO
and CSS formatter, cloud-based authoring solution, and related products. He also provides
XSL-FO and XSLT consulting and training services on behalf of Antenna House.
Tony has been working with markup since 1991, with XML since 1996, and with XSLT/XSL-FO
since 1998. He is Chair of the Print and Page Layout Community Group at the W3C and
previously an invited expert on the W3C XML Print and Page Layout Working Group (XPPL)
defining the XSL-FO specification, as well as an acknowledged expert in XSLT. Tony
is the developer of the ‘stf’ Schematron testing framework and also Antenna House’s
‘focheck’ XSL-FO validation tool, a committer to both the XSpec and Juxy XSLT testing
frameworks, the author of “Unicode: A Primer”, and a qualified trainer.
Tony’s career in XML and SGML spans Japan, USA, UK, and Ireland. Before joining Antenna
House, he had previously been an independent consultant, a Staff Engineer with Sun
Microsystems, a Senior Consultant with Mulberry Technologies, and a Document Analyst
with Uniscope. He has worked with data in English, Chinese, Japanese, and Korean,
and with academic, automotive, publishing, software, and telecommunications applications.
He has also spoken about XML, XSLT, XSL-FO, EPUB, and related technologies to clients
and conferences in North America, Europe, Japan, and Australia.