Murakami, Shinyu, and Johannes Wilm. “Vivliostyle - Open source, web browser based CSS typesetting engine.” Presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015). https://doi.org/10.4242/BalisageVol15.Wilm01.
Balisage: The Markup Conference 2015 August 11 - 14, 2015
Balisage Paper: Vivliostyle - Open source, web browser based CSS typesetting engine
Shinyu Murakami is the founder of Vivliostyle Inc. Previously, he was the lead developer
of the Antenna House Formatter. He started the Vivliostyle project with the economic
support of Antenna House.
Johannes Wilm has developed Pagination.js and simplePagination.js and is now working
with Vivliostyle. He has been working on a range of LaTeX and HTML-based text layout
and editing solutions for academic texts in the social sciences and humanities since
the early 2000s. Wilm holds a PhD in anthropology from Goldsmiths College, University
of London.
We are working on a new typesetting engine, implemented in JavaScript, that uses CSS for styling.
In this article we argue why such a project is needed, and why we think this approach is the
best fit for the digital publishing era, as it can unify web, ebook and print publishing.
Publishing textual content to several media, such as printed books, ebooks and the
web, faces a number of challenges. Most workflows seem to incorporate parts or all of one
of the common publishing solutions:
Small-scale, non-academic text publishing generally relies on text production in word
processing applications (namely Microsoft Word), which is exported as HTML. For a
web version it is then cleaned and converted to use the tags, attributes and classes
used within the site that the text is embedded into. If an ebook version is created
at all, it is often created as an EPUB by converting the HTML file further, using
a tool such as Calibre. To obtain a print version, the text is imported into and set up within a desktop
publishing application, such as Adobe InDesign. If resources for this kind of conversion are lacking, a PDF may be created directly
from Microsoft Word, which leads to suboptimal output quality.
Small-scale, academic text publishing is sometimes done instead with tools
such as LaTeX, which convert human-readable source text into good-looking PDFs that are well suited
for print, and which handle additional features such as consistent bibliography
management or mathematical formulas much better than word processors do. Engines such as pdfTeX convert the LaTeX source files into printable PDF files. For ebook and web output,
a stage of transformation to HTML has to occur first, and although conversion tools
such as HEVEA, latex2html and TeX4ht exist, conversions seldom go smoothly, and cleanup by hand is usually required. Similarly
problematic is the conversion of the input: unless the author directly offers the
text in LaTeX format, it needs to be converted from a word processor format, which can seldom
be done automatically.
Text publishing by larger companies and organizations is often done via an intermediate
XML step: the original text is first converted from a word processor format to an XML
format and then cleaned up manually. It is then converted to PDF, HTML and EPUB
using one of a number of different chains of conversion tools. For example, a PDF can
be obtained by applying an XSLT stylesheet to the XML file using an XSLT processor,
the output of which is then passed through an XSL-FO formatter. An HTML file can be
obtained from the XML file by applying another XSLT stylesheet using the XSLT processor.
An EPUB can be obtained by converting the HTML file. In theory these processes could
be entirely automated, but in practice a lot of manual editing is often
required at some stage, because the contents contain elements very specific to
the type of publication in question that had not been anticipated by the creators
of the conversion software.
A slightly different workflow is also XML-centered, but instead of converting the
XML directly, the XML is imported into InDesign where it is then styled and adjusted
for print. The problem is that if the XML file has been changed and the output file
needs to be updated, changes made in InDesign will have to be reapplied.
All these conversion systems have in common that they are rather labor intensive and
that separate and different workflow steps are needed for the different output formats.
While the most professional solution, involving XML, at least in theory can work with
just one source file which can be updated along the way, XML is not easily editable,
and it seems as if XHTML is being replaced by the non-XML-conforming HTML5 in the
context of much web publishing.
Additionally, the most common way of styling XML files, XSL-FO, is running into
trouble: while the number of print products created with XSL-FO is still increasing,
and XSL-FO continues to offer some features used in print products that are more
advanced than their CSS counterparts, further standardization of XSL-FO seems to have halted indefinitely
due to lack of interest, with the W3C believing that CSS will replace it (Graham 2014; Kelly 2014).
The need for a web-based content solution
Looking at the existing publishing solutions, it was clear to us that
none of them functions perfectly or fully automatically. We also noted that the central
place that XML currently has in many publishing workflows is likely mainly a historical
artifact from the period when HTML was to be replaced by XML in the form of XHTML
around the turn of the century (Simpson 2000). Because XML is seldom the final output format, is just about absent from the web
(Berjon 2014), and has far fewer editors that allow rich-text WYSIWYG editing than
HTML does, it creates largely unneeded conversion steps.
If, on the other hand, one used HTML as the main content file format, some steps of
conversion could be made much smaller or eliminated entirely:
In the case of EPUB files, the most common ebook file type, the textual content comes
in the form of files containing a restricted version of HTML, and the styling of these
pages is defined through restricted CSS, the same language used to define the styling
of web pages. Conversion from an HTML source file to an EPUB could therefore largely
be done automatically. If one is able to restrict the tags, attributes and CSS rules
used in the source files, the conversion should in most cases be entirely automatic.
When publishing for the web, the source file will in itself already be presentable.
If further changes are required, these can be added with simple converters. Such converters
can even be written in JavaScript and be executed in the browser of the end user.
The contents of an article or chapter can in this way be made to fit the style of
the website that it is presented on. Should the same source text be used on different
sites or presented on different media with different settings (desktop computer vs.
tablet with different, user-adjusted zooming), the styling can be adjusted to fit.
Because JavaScript is a relatively easy language to learn, a much wider community
of developers can work on converting the source text to the final output
format.
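As a rough illustration of the kind of converter described above, the following JavaScript sketch renames source class names to those of a target site and rejects tags outside a restricted set. The class names, the tag allowlist, the plain-object node shape ({ tag, attrs, children }) and all function names are assumptions made for this example only. In a browser the same idea would be applied to DOM elements rather than plain objects.

```javascript
// Hypothetical mapping from source class names to a target site's class names.
const CLASS_MAP = new Map([
  ["Heading1", "article-title"],
  ["Quote", "pullquote"],
]);

// Hypothetical allowlist of tags permitted in the restricted source format.
const ALLOWED_TAGS = new Set([
  "section", "h1", "h2", "p", "em", "strong", "blockquote", "a",
]);

// A node is either a string (text) or an object { tag, attrs, children }.
// Returns a converted copy; the input tree is not modified.
function convertNode(node) {
  if (typeof node === "string") return node;
  if (!ALLOWED_TAGS.has(node.tag)) {
    throw new Error(`Tag <${node.tag}> is outside the restricted source format`);
  }
  const attrs = { ...(node.attrs || {}) };
  if (attrs.class) {
    // Rename each class that has a mapping; keep all others unchanged.
    attrs.class = attrs.class
      .split(/\s+/)
      .filter(Boolean)
      .map((name) => CLASS_MAP.get(name) ?? name)
      .join(" ");
  }
  return { tag: node.tag, attrs, children: (node.children || []).map(convertNode) };
}

// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize a converted node tree to an HTML string.
function toHtml(node) {
  if (typeof node === "string") return escapeHtml(node);
  const attrs = Object.entries(node.attrs)
    .map(([name, value]) => ` ${name}="${escapeHtml(value)}"`)
    .join("");
  return `<${node.tag}${attrs}>${node.children.map(toHtml).join("")}</${node.tag}>`;
}

// Usage: one source paragraph adapted to the (hypothetical) target site.
const converted = convertNode({
  tag: "p",
  attrs: { class: "Quote lead" },
  children: ["a < b"],
});
console.log(toHtml(converted)); // <p class="pullquote lead">a &lt; b</p>
```

Rejecting unknown tags rather than silently dropping them is a design choice for this sketch: it makes violations of the restricted source format visible early, which is what allows the later conversion steps to be automatic.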
While solutions for web and EPUB publishing would not have to change much, the
situation in print is quite different. As we have seen above, none of the standard
print typesetting workflows is centered around HTML, and print does not require
the text ever to be converted into a web-centric format. Source files will be a mix
of Microsoft Word files, Adobe InDesign files and, in the case of large publishers, XML files.
We believe that many publishers could benefit from switching their workflow
to HTML, while some publishers will still benefit from using XML. Independently
of that choice, they will benefit from switching from styling defined through XSL-FO
to CSS for print, because it allows them to use the same or similar style definitions
for all types of output.
Existing HTML-centric print formatters
This is not the first project that will provide print processing functionality using
HTML/CSS. Two such formatters are the Antenna House Formatter and PrinceXML.
Both of these are stand-alone executables that allow for CSS and HTML input and will
output printable PDFs, and at least two major publishers have switched to HTML and
CSS for book publishing: O'Reilly Media (McKesson 2012; McKesson 2013; Kleinfeld 2013) and the Hachette Book Group (Cramer 2012). Even though the formatters accept fairly common HTML elements, the implementation
of each HTML formatter differs slightly. Those who create web-based content, and those
who build editors for creating it, try to comply not only with existing web specifications
but also with the most common web browsers' actual implementations of those standards.
This means they test their content's rendering in Google Chrome, Apple Safari, Mozilla
Firefox and Internet Explorer, but not in formatters meant solely for print. Web content
that renders without problems in all major browsers will need extra attention before
it can be converted by the above-mentioned tools, both due to the slight differences
in how features are implemented in the formatters in comparison to the browsers, and
because the formatters are relatively slow to support new CSS features, since their
vendors implement the core engines on their own with much smaller development teams
than the browser vendors have. This is one of the difficulties in current CSS typesetting
that the print-publishing industry is facing. Other difficulties are that standard
CSS does not include rules for everything needed for book styling, and that the
extensions that add styling features important for
book printing are at a very early stage of development (Bos 2013).
Using web browsers with JavaScript to create print output
Pagination.js[1] and simplePagination.js, developed 2012–14, were first trials in creating an HTML-based print layout system
that runs in standard browsers. They are written in JavaScript and add styling and
content features that are specific to printed books, such as tables of contents, running
headers, footnotes, word indexes, margin notes and page numbers, and they permit the
user to style the main content using CSS. Pagination.js has been used in the production
of books for the past two years, but is limited in that it only works in Apple Safari
using the browser's print-to-PDF feature.
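To make the idea of script-driven pagination more concrete, here is a minimal JavaScript sketch that distributes content blocks over pages and derives table-of-contents entries with page numbers. It is not the algorithm used by Pagination.js or simplePagination.js. The block shape ({ id, height, heading }), the function names and the assumption that block heights are already known are all hypothetical; a real implementation would have to measure rendered content in the browser and deal with blocks that break across pages.

```javascript
// Greedy pagination: a block is placed on the current page if it still fits,
// otherwise a new page is started. Heights and pageHeight share one unit.
function paginate(blocks, pageHeight) {
  const pages = [];
  let current = { number: 1, blocks: [], used: 0 };
  for (const block of blocks) {
    if (block.height > pageHeight) {
      // Splitting a single block across pages is out of scope for this sketch.
      throw new RangeError(`Block "${block.id}" is taller than a page`);
    }
    if (current.used + block.height > pageHeight) {
      pages.push(current);
      current = { number: current.number + 1, blocks: [], used: 0 };
    }
    current.blocks.push(block);
    current.used += block.height;
  }
  if (current.blocks.length > 0) pages.push(current);
  return pages;
}

// Collect a table of contents: every block carrying a `heading` title is
// listed together with the number of the page it ended up on.
function buildToc(pages) {
  const toc = [];
  for (const page of pages) {
    for (const block of page.blocks) {
      if (block.heading) toc.push({ title: block.heading, page: page.number });
    }
  }
  return toc;
}

// Usage with made-up block heights and a page height of 600 units.
const sampleBlocks = [
  { id: "h1", height: 40, heading: "Intro" },
  { id: "p1", height: 300 },
  { id: "p2", height: 300 },
  { id: "h2", height: 40, heading: "Method" },
  { id: "p3", height: 100 },
];
console.log(buildToc(paginate(sampleBlocks, 600)));
// [ { title: 'Intro', page: 1 }, { title: 'Method', page: 2 } ]
```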
The two JavaScript packages were made exclusively for the printing of books that have
general layouts that repeat according to specific rules across all pages, and not
for magazines or books that need page-specific layouts. They are configured through
JavaScript function arguments rather than CSS, which means that the end user will
have to configure layout options in two different ways, depending on whether it is
part of the layout controlled by the browser or the JavaScript package. Also, both
packages have been optimized to work with current browsers, using bugs in the rendering
engine to obtain better results.
As proof of concept and for very specific print layout types, the existing JavaScript
packages currently work fine. But when browsers change or new browsers come along,
or should the user wish to do something slightly different from printing a book according
to the offered rules, they are of little use.
Things needed: common styling specifications for print
Because we believe all styles should be configurable through CSS, part of our focus
lies in ensuring that the extra elements that are only important for print and other
page based media are sufficiently defined in web specifications to ensure interoperability
with other projects.
One of the more important specifications is the CSS Paged Media module. There are already several typesetting engines supporting CSS Paged Media,
the Antenna House Formatter and PrinceXML being among them.
Browsers have implemented ways for users to create PDFs of web pages. Unfortunately,
support for the CSS Paged Media specification has not been implemented in the main
browsers. The same is true for most ebook display solutions.
Additionally, the typesetting engines supporting CSS Paged Media contain proprietary
and incompatible vendor extensions which means that source files cannot easily be
moved between engines.
With the Vivliostyle project we prioritize advancing the development of web standards
so that Vivliostyle.js will be interoperable with other current and future web-based print solutions.
We have started to work with the World Wide Web Consortium (W3C) to enhance and promote
specifications such as CSS Paged Media and other related specifications such as the
CSS Page Floats or the CSS Generated Content for Paged Media specifications.
Things needed: JavaScript-based general print layout implementations
The two JavaScript book printing solutions have inspired us, yet we believe that something
more general that uses CSS rules to define styles is needed. We believe that a JavaScript-based
solution is needed for general print layout. Unlike the existing solutions,
it should work in all major browsers and it should read the associated CSS to
define page layout, so that layout can be defined entirely in CSS. Such a solution
should be usable both to prepare web content for print, and to use the browser as
an ebook reader.
Vivliostyle.js has been under development for half a year[2] and work on it continues. It parses page-related CSS properties that are ignored
by the regular browser. So far it is able to do basic page styling including footnotes,
page numbering, floats and page headers.
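The following JavaScript sketch illustrates what reading page-related CSS from script can look like in its simplest form: it extracts flat @page rules and their declarations from a stylesheet string. It is not the parser used in Vivliostyle.js. The function name parsePageRules and the returned object shape are assumptions for this example, and the regular expression deliberately ignores CSS comments and nested margin-box rules such as @top-center, which a real implementation would need a proper CSS parser to handle.

```javascript
// Extract flat "@page" rules from a CSS string.
// Returns an array of { selector, declarations } objects, where `selector`
// is the optional page selector (for example ":first") and `declarations`
// maps lower-cased property names to their raw string values.
function parsePageRules(cssText) {
  const rules = [];
  // "@page", an optional selector, then a declaration block without nesting.
  const rulePattern = /@page\s*([^{]*)\{([^{}]*)\}/g;
  let match;
  while ((match = rulePattern.exec(cssText)) !== null) {
    const declarations = {};
    for (const part of match[2].split(";")) {
      const colon = part.indexOf(":");
      if (colon === -1) continue; // skip empty or malformed fragments
      const name = part.slice(0, colon).trim().toLowerCase();
      const value = part.slice(colon + 1).trim();
      if (name) declarations[name] = value;
    }
    rules.push({ selector: match[1].trim(), declarations });
  }
  return rules;
}

// Usage: `size` and `margin` inside @page come from CSS Paged Media;
// the concrete values are made up for the example.
const sampleCss = `
  body { color: black; }
  @page { size: A5; margin: 20mm 15mm; }
  @page :first { margin-top: 40mm; }
`;
for (const rule of parsePageRules(sampleCss)) {
  console.log(rule.selector || "(all pages)", rule.declarations);
}
```

A layout script could then use the extracted values, for instance the page size and margins, to decide how much content fits on each page.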
Implementing in JavaScript new features that are in the process of being standardized
in the form of a W3C specification fits well with the Extensible Web Manifesto, a document signed by a number of leading web visionaries who are trying to speed
up the development of new features on the web. By trying out whether and how things
work in JavaScript, we can help the relevant specifications move forward and, if successful,
contribute to the finalization of print-related CSS specifications and to the
implementation of some print-related features in browsers.
Conclusion
The world of text publishing is fairly fragmented. Several approaches have been used
in the past to try to unify it, and XML was the most promising for a long time.
In a publishing world in which more publishers seek alternatives to labor-intensive
DTP-based workflows, print solutions that involve automatic content conversion with
as few steps as possible from input to final output formats need to be developed.
As we have shown, there is a strong case to be made that a combination of CSS and
HTML, despite their inherent shortcomings, represents the most promising path to a
unified publishing workflow. However, for HTML and CSS to become a viable alternative
for more publishers, time will need to be invested in the development of CSS standards
and early implementations in JavaScript.
References
Berjon, Robin. Mending Fences and Saving Babies. Presented at Symposium on HTML5 and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014). https://doi.org/10.4242/BalisageVol14.Berjon01.
Bos, Bert. Can you typeset a book with CSS? Presented at 2nd W3C Workshop on Electronic Books and the Open Web Platform, Tokyo, Japan, June 4, 2013. http://www.w3.org/Talks/2013/0604-CSS-Tokyo/.
Kleinfeld, Sanders. The Case for Authoring and Producing Books in (X)HTML5. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Kleinfeld01.