Background
Since we published our transpect framework under an Open Source license in 2013, it had become evident that we need to improve documentation to facilitate the use of our software for other developers. At this time no common standards for documentation existed in our XML team. Each developer followed it's own approach towards documentation up to doubtful “paradigms” that the code would be documentation enough. Generating a documentation seemed to be impractical due to different coding practices. These issues are outlined below.
Component Model
Before we moved to XProc, our XSLT stylesheets were executed from Make and Ruby. These scripts were also used to perform non-XML data processing, for example basic file operations, running command-line utilities or to query databases. Although these technologies fullfilled their purpose, they came with some disadvantages.
The code was difficult to deploy on other systems because it required typically an Unix-based environment and a number of preinstalled software tools. Reusability was very limited as many scripts expected resources at specific locations on the file system.
XProc helped to overcome these deficiencies. XProc is platform-neutral and provides
a declarative XML vocabulary to specify XML workflows. As a positive side effect,
the XML syntax allows us to analyze and transform XProc pipelines with Schematron
and XSLT. XProc allows to re-use pipelines in other pipelines with p:import
statements. For file operations and other tasks which are not part of the XProc standard,
we use build-in extension steps of the XProc processor XML Calabash. For specific
tasks such as unzipping archives and image analysis, we developed own extension steps.
Transpect shares some basic principles with the DAISY Pipeline project, Romain Deltour presented at XML Prague 2013kraet01: To avoid explicit file system paths in XSLT or XProc import statements, the components are identified by canonical URIs. The idea is to identify external resources by virtual addresses and to have a catalog resolver to rewrite the virtual addresses to physical locations.
In practice, each transpect module includes an XML catalog which contains its base URI. If an module is used in a project, the module catalog needs to be referenced in the project catalog as well.
Due to the modular nature of transpect, it was necessary to develop a reference which depicts not only the properties of the modules but also their dependencies. The complexity of this issue was reinforced by the heterogeneous ways in which the code was stored.
Repository Layout
Unfortunately, there was no standard how code should be organized. Transpect modules were stored in several publicly available Subversion (SVN) repositories. Some included various modules in subdirectories, other repositories just consisted of one module. Sometimes, the repository or directory names did not match the base URI in the XML catalog.
For this reason, it was not easy to derive the dependencies of a certain module. Some
repositories implemented the concept of a “project root” with three subdirectories
trunk
, branches
, and tags
. Others just provided the code at the root directory. Inside the repository the code
was either organized by language, functional logic or for historical reasons. Hence,
it could not be assumed that stylesheets or pipelines which belonged logically together
were also stored at the same location.
Naming Conventions
Following a consistent set of naming conventions not only improves the usability of a framework but also facilitates the building of tools to analyze the code. Previously, naming conventions differed among transpect modules. Directories and files which served the same purpose were named differently. Three namespaces existed to represent core transpect modules.
Lack of Documentation
Just a few transpect modules provided an extensive documentation. Although there was a verbose setup guide laid out in DocBookkraet02, a lot of other modules were poorly documented: Even for complex templates and functions, comments were missing. Some variable and function names were not self-explanatory and Readme files were rare.
In addition, there was no standard on how a transpect module should be described. The setup guide mentioned earlier was laid out in DocBook, other projects included plaintext Readme files or larger XML comments at the top of an XML document.
Establishing Standards
It became obvious that the lack is that the lack of an extensive documentation was connected to the lack of common development standards. So we identified a number of major issues and created standards addressing these issues.
Creating a Styleguide
First, we included some basic guidelines on how to write transpect code in a styleguidekraet03. These conventions provide guidance on encoding, indent style, whitespace, new lines and some recommendations on writing XProc and XSLT code.
Moving to GitHub
To facilitate the ways of using and contributing to transpect, we moved its entire codebase from our public SNV server to GitHub. Only customer-specific projects are still stored in SVN repositories including the GitHub transpect modules as SVN externals. We created a GitHub organization and stored each transpect module in a particular repository. To minimize dependency management, we tried to combine modules with two-way dependencies into one module. GitHub already provides built-in functions how to create branches, releases and tags.
Restructuring the Code Base
Taking into account XProc and XSLT only, transpect counts about 50.000 lines of code. The code was reviewed and namespaces, attributes and functions were renamed. Canonical URIs were unified and common directories e.g. for XML catalogs were named consistently. With our continuous integration system, it could be ensured our customer projects still produce the same output.
After the migration to GitHub, the repository layout for each module follows the same guidelines. The code is stored separated by its language. XML catalogs can always be found at the same location.
Creating Documentation
GitHub already provides an HTML rendering for Markdown-based Readme files. Therefore,
a Readme file named README.md
must be stored in a directory of the repository. When opening this directory with
GitHub's directory browser, the file is rendered at the bottom of the page. Since
moving to GitHub, we have been used frequently this method to provide a brief documentation
to developers.
However, it's not feasible to include all technical aspects of a module in a Readme. For example, an XProc pipeline may consist of input and output ports, a set of options and various import statements. Incorporating this information within a separate Readme for every transpect module wouldn't be particularly practicable. So we came up with the idea of generating the documentation directly from the source code.
XProc already features a p:documentation
element intended for adding documentation. This element can be used anywhere in the
pipeline and doesn't affect its execution. p:documentation
can be used as child element of p:input
to describe which kind of document is expected by this port. Moreover, you can add
the HTML namespace to use HTML markup in your documentation. This method proved to
be useful to include markup code blocks or use hyperlinks to reference other resources.
Our first project which employed p:documentation
tags was transpectdoc.
Transpectdoc is basically an XProc pipeline which generates an HTML documentation
for all XProc pipelines used
in a project. Transpectdoc starts from a frontend XProc pipeline and parse all imported
pipelines and libraries.
For each of the pipelines’ input and output ports, a linked list will be generated
that points to ports in other
pipelines that are connected to these ports.
With transpectdoc it's possible to easily create a technical documentation for a specific project. However, it was not intended to replace tutorials or an user guide. Nevertheless, transpectdoc works for projects with one front-end pipeline only. It was not supposed to generate a reference for the entire transpect framework.
Requirements
After various efforts had been made to develop common standards, we identified major requirements for a global documentation of transpect.
-
The documentation should be comprehensive and extensible. In this sense, it should include all necessary resources for each module. Furthermore, it should be possible to easily integrate new repositories.
-
Currently, the entire transpect framework includes 52 repositories on GitHub. It would not be convenient to clone each directory to your local machine to generate a documentation. Only a minimum set of data should be downloaded. Furthermore, it should be possible to easily update and extend the documentation.
-
The automatically generated reference should be accompanied by other kinds of documentation, such as tutorials.
-
The documentation should be both online and offline readable.
Implementation
Requesting GitHub Repositories
The entire process starting with requesting the transpect repositories up to generating the documentation is implemented in XProc. The documentation is written in XML and hosted as GitHub page at http://transpect.io
Initially, an method had to be established to retrieve the repository information
hosted on GitHub.
Fortunately, GitHub provides a public HTTP APIkraet04. The data is sent and received as JSON.
So we decided to use XProc's p:http-request
to request the GitHub API. The process comprises three
stages: First, an XProc step requests a list of all repositories of our GitHub organization.
Second, lists the
content of each repository recursivelykraet05. Third, XML Catalogs, XSLT stylesheets and XProc
pipelines are downloaded from GitHub.
To convert JSON to XML, we use an extension of XML Calabash, called the “transparent
JSON”kraet06. We may replace this later with a p:xslt
step that implements the function parse-json()
introduced in XSLT 3.0. Below you can find an XML output from a HTTP request of a
GitHub repository by using XProc and XML Calabash.
Generating the Documentation
All the information is combined together with XProc pipelines, XSLT stylesheets and XML catalogs in one XML document. Then an XSLT stylesheet is used to convert this intermediate format to DocBook to provide a more descriptive reference.
For each transpect module a DocBook chapter is generated. Every chapter includes general
information about the repository and sections for all XProc steps that can be imported
by other pipelines. Within the documentation, an XProc step is specified by its @type
attribute and its canonical import URI. If the step imports
steps from other repositories, references to the repositories are specified as dependencies.
If an XProc step contains a p:documentation
tag following below the root element, its content is taken as a brief description
of the step and rendered as well. Furthermore, input and output ports, options and
their default values are specified. Currently just a few steps provide p:documentation
tags within port and
option declarations. If their number is increasing in the future, we think of integrating
these contents, too.
Despite the automatically generated reference, there are other kinds of documentation
such as setup guides and tutorials. These documents are authored in DocBook and stored
together with the module reference within the same directory. They are merged later
with XProc's p:xinclude
step. After the merge, the XML document is split into chunks, each of which is converted
to a single HTML document later.
The generated HTML chunks just include a basic HTML markup and are injected into a global HTML template which provides the layout information. The entries of the navigation bar are also generated and inserted into the template.
In this way, it's possible to change the general appearance without changing the content.
The previous XSLTs just rely on some specific id
attributes in the template to add the content. However, if you want to use specific
CSS features, such as a multi-column layout or individual colors, you have to add
the corresponding HTML class
values in your DocBook role
attributes.
The process ends with the generation of the HTML snippets of the documentation. From this point, the only thing left to do is to commit and push the changed HTML files to GitHub.
Discussion and Outlook
There are various benefits in generating vital parts of the reference directly from the code base. The use of inline documentation enriched with HTML markup facilitates adding larger portions of text. Redundant information like port declarations and dependencies is automatically generated. It's no problem to keep the documentation up-to-date if new repositories were added or code has changed.
Naturally, the generated documentation is just as good as the quality of the inline
documentation. But this approach is also suitable to expose poor documentation which
was previously hidden. Furthermore, a generated technical reference cannot replace
other more descriptive types of user documentation such as tutorials, FAQs or How-tos
and it's not convenient to incorporate complex documents in p:documentation
tags. But it seems reasonable to combine these different kinds of documentation in
a single XML source that can then be published in a variety of formats.
It seems like a natural choice to use GitHub services to both generate and host documentation for code which is already hosted on GitHub. But it also involves the risk of beeing too dependent on a specific vendor. If it would be necessary to move away from GitHub, just the pipeline which requests the repositories needs to be changed. The source of the documentation is stored as XML, so another pipeline just had to generate the expected XML structure.
The project is still in a early state. As of now, the initial requirements have been implemented and additional features such as EPUB output and a graphical representation of the XProc steps are planned. Expected data types and XML schemas at XProc ports could provide a deeper level of information. Sometimes it would be convenient to have a brief overview of the structure of the pipeline. In this sense, the main focus is just on adding more informative documentation.
References
[kraet01] Deltour, Romain: XProc at the heart of an ebook production framework. The approach of the DAISY Pipeline project. XML Prague 2013 Conference Proceedings. http://archive.xmlprague.cz/2013/files/xmlprague-2013-proceedings.pdf [cited 19 Apr 2016]
[kraet02] le-tex publishing services: transpect Setup Manual. https://subversion.le-tex.de/common/transpect-demo/content/le-tex/setup-manual/en/transpect-setup.xml
[kraet03] le-tex publishing services: Styleguide. Guidelines for writing XProc, XSLT, XPath… http://transpect.github.io/styleguide.html [cited 19 Apr 2016]
[kraet04] GitHub. API Overview. https://developer.github.com/v3/ [accessed 19 Apr 2016]
[kraet05] transpect.io. GitHub-API. XProc steps that implement the GitHub API. https://github.com/transpect/github-api [accessed 19 Apr 2016]
[kraet06] Norman Walsh. Language Extensions. In XML Calabash Reference. http://xmlcalabash.com/docs/reference/langext.html [accessed 19 Apr 2016]
GitHub. API Contents. https://developer.github.com/v3/repos/contents/ [accessed 19 Apr 2016]