Note
I would like to thank my colleague Max Zhaloba for his valuable comments and clarifications related to XProc and XSLT, and also Damian Gibbs, Kate Prentice, and Anupama Wimalasooriya for helping me to review the content. Finally, I would like to thank Typefi for the motivation and support.
Introduction
This solution transforms a Microsoft Excel Open XML Spreadsheet (XLSX) file into a shallow-structured XML file used at Typefi, called Content XML (CXML), using XSLT and XProc an XML- based programming language for processing documents in pipelines.
Previously in Typefi, the XLSX to CXML transformations were performed using Java packages, usually the Apache POI package, which provides APIs for manipulating Office Open XML documents. This approach required a significant amount of Java knowledge, and for someone with little Java experience, handling the available functionalities of the Apache POI was not convenient in comparison to the XSLT transformations. Many systems convert XLSX to other XML schema transformations using Java language, but identifying the internal structure of an XLSX file and processing them is the real challenge.
A solution could use several of these technologies, but the approach outlined in this paper uses XSLT—mainly XSLT 3.0 features. While the amount of code needed to filter out all the Excel formatting information is significant, the goal is to make the code reusable. This goal is accomplished by providing reusable functions to access features within the XLSX elements easily. Moreover, the solution includes an 'xlsx-to-cals' mode that transforms Excel table information into a CALS table. The "CALS Table Model" is the most widely adopted industry standard of SGML/XML representation for tables. Further, this research provides modules that transform charts (bar chart, pie chart and line chart) into SVG files. This general approach helps to transform XLSX tables and charts into XML schemas other than CXML because CALS tables and SVG files are well-known formats. The XSLT solution outlined in this paper is executed from an XProc pipeline, with ANT scripts used to call the pipeline and XML Calabash running the XProc pipeline..
In this solution, the final output is Content XML (CXML) which is a DocBook-like XML scheme used internally in Typefi products. It is an intermediary format to facilitate converting any input file format to another format, such as Adobe InDesign. CXML is optimised for page layout (i.e. InDesign/PDF) or digital production (i.e. EPUB, HTML, DAISY).
XLSX files can contain tables, charts and graphics, which all play a significant role. Therefore, this paper focuses on processing these elements and transforming them into the corresponding CXML elements according to the CXML schema 3.2.
This paper examines how the XProc pipeline reads the .xlsx file along with the information from the various XML files and how the XSLT transforms the content using the elements and attributes into a different XML schema—CXML.
Please note: This paper uses @ in front of names to denote attributes of an element.
Figure 1 illustrates the high-level design of this system.
XLSX overview
The XLSX file type was introduced in 2007 as the disk form of Excel spreadsheets when Office 2007 was released. XLSX is a ZIP package that contains many XML files and is organised according to the Open Packaging Conventions (OPC). The XLSX system follows this folder structure with several functions to read the corresponding files such as relationship files, sheet.xml, and chart.xml.
Table I
XLSX folder structure example | Folder/sub-folder titles | Information contained within these folders |
---|---|---|
_rels | Relationships to the primary XML documents and other key parts are included here. | |
[Content_Types].xml |
|
|
docProps |
Overall document properties are contained here.
|
|
xl |
Details about the content of the file are listed here.
|
XProc pipeline
Calabash is used to implement an XProc pipeline that reads the files from the input XLSX file and passes the information as a collection() to the XSLT. More information about XProc is available in "XProc 3.0: An XML Pipeline Language" [23].
Using the ANT script we can extract the Excel file and pass the disk location to the
XSLT.
However, the advantage of Calabash is that everything flows through the pipeline and
does not
require to be saved to a disk unless we use p:store
explicitly.
The first solution writes the unzipped file into the disk. Having
cx:decode="true"
in the p:store
is important, otherwise the XSLT
function collection()
could not detect the extracted content as a valid XML
file, instead returnining an xs:base64Binary
object, while the expected output was a
document-node().
<pxp:unzip> <p:with-option name="href" select="$input.file.path"/> <p:with-option name="file" select="$filename"/> <p:with-option name="charset" select="UTF-8"/> <p:with-option name="content-type" select="xml"/> </pxp:unzip> <p:store cx:decode="true"> <p:with-option name="href" select="concat($output-dir, '/', encode-for-uri($filename))"/> </p:store>
The following code segment demontrates how to pass the Excel file as a collection to the XSLT program. This idea is proposed by one of our team members, Max Zhaloba. This is the recommended way, as it does not need to write the intermediate content into the disk.
This XProc pipeline follows these steps:
-
Get the input parameters from the ANT script (input file path and the XSLT stylesheet).
-
Unzip the XLSX file
-
Iterate over the ZIP file and add the file path as an attribute. This will be used from the XSLT side to read the file content.
-
Call the XSLT program and pass the unzip file information as a parameter.
-
Write the main output file to the disk.
-
Write the other result documents to the disk.
<p:declare-step version="3.0" xmlns:c="http://www.w3.org/ns/xproc-step" xmlns:pxp="http://exproc.org/proposed/steps" xmlns:p="http://www.w3.org/ns/xproc" name="main"> <p:input port="param" kind="parameter"/> <p:output port="result" primary="true" sequence="true"> <p:pipe port="result" step="secondary-storage"/> </p:output> <p:declare-step type="pxp:unzip"> <p:output port="result"/> <p:option name="href" required="true"/> <p:option name="file"/> <p:option name="content-type"/> <p:option name="charset"/> </p:declare-step> <p:variable name="input.file.path" select="replace(/c:param-set/c:param[@name eq 'input']/@value, '\\', '/')"> <p:pipe step="main" port="param"/> </p:variable> <p:variable name="stylesheet.file.path" select="replace(/c:param-set/c:param[@name eq 'xsl']/@value, '\\', '/')"> <p:pipe step="main" port="param"/> </p:variable> <p:load name="load-stylesheet"> <p:with-option name="href" select="$stylesheet.file.path"/> </p:load> <pxp:unzip name="unzip-excel-file"> <p:with-option name="href" select="$input.file.path"/> </pxp:unzip> <p:for-each name="iterate-zip-directory"> <p:iteration-source select="/c:zipfile/c:file[ends-with(@name, '.xml') or ends-with(@name, '.rels')]"> <p:pipe step="unzip-excel-file" port="result"/> </p:iteration-source> <p:output port="result"> <p:pipe step="unzipped-xml-files" port="result"/> </p:output> <p:variable name="unzipped-file-subpath" select="/c:file/@name"/> <pxp:unzip> <p:with-option name="href" select="$input.file.path"/> <p:with-option name="file" select="$unzipped-file-subpath"/> </pxp:unzip> <p:add-attribute name="unzipped-xml-files" attribute-name="xml:base" match="/*"> <p:with-option name="attribute-value" select="$unzipped-file-subpath"/> </p:add-attribute> </p:for-each> <p:xslt name="apply-transform" template-name="main"> <p:input port="stylesheet"> <p:pipe step="load-stylesheet" port="result"/> </p:input> <p:input port="source"> <p:pipe step="iterate-zip-directory" port="result"/> </p:input> <p:input port="parameters"> <p:inline> <c:param-set> <c:param name="sheet_ui_name" value="sheet1.xml"/> <c:param name="debug" value="true"/> </c:param-set> </p:inline> </p:input> </p:xslt> <p:store href="../output/output.cxml"/> <p:for-each name="secondary-storage"> <p:iteration-source select="."> <p:pipe port="secondary" step="apply-transform"/> </p:iteration-source> <p:output port="result"> <p:pipe port="result" step="store"/> </p:output> <p:store name="store"> <p:with-option name="href" select="document-uri(.)"/> </p:store> </p:for-each> </p:declare-step>
The following code segment can be used to read the inner files of the unzipped file from the XSLT side. All files will be accessed using the corresponding file name.
<xsl:variable name="worksheet" select="collection()[*/base-uri() eq 'xl/worksheets/sheet1.xml']" as="document-node()"/>
XLSX to CXML Methodology
The transformation to CXML begins with an XLSX input file which is processed to access the inner files. As explain in the previous section, this program is initialised using ANT scripts; then an XProc pipeline is executed using XML Calabash.
Most transformations are triggered from the sheet.xml which resides in the xl/worksheets
folder. The sheetData
element in the sheet.xml contains the information needed to
be converted to a tps:table
element and the following sections will discuss this
in more detail.
The XLSX to CXML solution was developed out of two major research areas:
-
Map the XLSX table information to the CALS table elements and attributes.
-
Convert the chart.xml into an SVG and map it to
tps:image
elements and attributes.
The XLSX to CALS transformation happens as a pre-processing mode, then in the default mode, it will be converted to CXML schema; SVG files are referred to as images in this schema. The solution provides a CXML file as the final output. The XLSX to CALS transformation adds flexibility to this allowing mapping to XML schemas other than CXML since it uses CALS table and SVG formats.
Solutions to the above scenarios are developed using XSLT 3.0.
Input XML
Workbook.xml contains information about the available worksheets. The attribute value of the r:id maps to the related sheet.xml. This XSLT transformation starts to trigger the templates from the sheet.xml.
The following is a sample sheets element in the workbook.xml
<sheets> <sheet name="Instructions" sheetId="1" r:id="rId1"/> <sheet name="DBRS Tape" sheetId="37" r:id="rId2"/> </sheets>
The following code segment can be used to find the corresponding sheet.xml using the sheet name which we get as an input parameter.
<xsl:variable name="sheet.name" select="substring-after(tps:get-workbook-rels()/rel:Relationships/rel:Relationship[@Id = $workbook/e:sheets/e:sheet[@name = $sheet]/@r:id]/@Target, '/')" as="xs:string?"/>
The following shows a sample sheetData
element from a XLSX input XML.
<sheetData> <row r="1" spans="1:2" x14ac:dyDescent="0.2"> <c r="A1" t="s"> <v>0</v> </c> <c r="B1" t="s"> <v>1</v> </c> </row> . . . </sheetData>
The system architecture
The following diagram shows the functions created to access the XLSX features. Other
than these functions, matching templates are triggered to convert the sheetData
into a CALS table as well as for the chart conversion.
Table 2 explains the available functions and the purpose of each.
Table II
Function | Description |
---|---|
tps:get-worksheet |
Read the sheet.xml |
tps:get-stylesheet |
Read the styles.xml |
tps:get-sharedstring-table |
Read the shared string table from the sharedStrings.xml |
tps:get-theme-xml |
Read the theme.xml |
tps:get-workbook-rels |
Read the workbook.xml.rels |
tps:get-worksheet-rels |
Read the sheet#.xml.rels |
tps:get-drawing-rels |
Read the drawing#.xml.rels |
tps:get-chart-xml |
Read the chart.xml |
tps:get-cellxfs |
Get the /styleSheet/cellXfs/xf elements
|
tps:get-font |
Get font elements from the style.xml |
tps:get-border-style |
Get the style of the bottom, top, left, right borders. |
tps:get-pattern-fill |
Get the /styleSheet/fills/fill/patternFill/@patternType from the styles.xml
|
tps:get-theme-colour |
Get the /styleSheet/numFmts/numFmt[2]/@formatCode from the styles.xml
|
tps:get-align |
Get the horizontal alignment of the cell |
tps:get-valign |
Get the vertical alignment of the cell |
tps:get-rotation |
Get the rotation information from the styles.xml |
tps:get-foreground-colour |
Get the font colour |
tps:get-style |
Get the style applied to text, bold, italic, underline, strikethrough, superscript and subscript. This can be individual or a combination of them. |
tps:get-background-colour |
Get background colour of the cell |
tps:get-total-col-width |
This function is used to calculate the total column width |
tps:get-time-from-the-value |
Convert the decimal value to the correct time format |
tps:get-date-from-the-serial-no |
Convert the serial no to a date |
tps:is-hidden-column |
Check whether a column is hidden |
XLSX tables
This section considers XLSX files with tables. The XSLT transformations to convert XLSX into CALS table markup (with example XPath) is explained, and guidelines to apply user-defined XSLT functions are provided.
Table 3 displays the corresponding CALS table element mapping for the Excel elements of the sheet.xml.
Table III
XLSX element | CALS table element |
---|---|
Header rows, footer rows are defined by the user through the application interface, see Figure 11. |
<table> <tgroup cols="number of columns"> <colspec/> <thead> </thead> <tbody> </tbody> <tfoot> </tfoot> </tgroup> </table> |
col
|
<colspec>
|
row
|
<row>
|
c
|
|
The following shows a sample CALS table output corresponding to the input given in section 2.1.
<table> <tgroup cols="2"> <colspec colname="1" colwidth="8"/> <colspec colname="2" colwidth="8"/> <thead> <row> <entry namest="1" nameend="1" align="left" valign="bottom">Year</entry> <entry namest="2" nameend="2" align="left" valign="bottom">Total</entry> </row> <row> <entry namest="1" nameend="1" align="left" valign="bottom">Year 2020</entry> <entry namest="2" nameend="2" align="left" valign="bottom">20 </entry> </row> </thead> <tbody> <row> <entry namest="1" nameend="1" align="left" valign="bottom">Year 2022</entry> <entry namest="2" nameend="2" align="left" valign="bottom">80</entry> </row> </tbody> </tgroup> </table>
Cell (entry)
The <entry>
element is used in this solution to create table cells and has several attributes,
so further discussion about cell-related information is important. Therefore, the
following section explains about XLSX cell elements and related features and attributes.
Cell range
In order to access the cell range, tps:get-worksheet()
will return the selected worksheet as the $worksheet
variable. This is either user-specified by using an input parameter, or it considers
the tabSelected
value to read the specific sheet (i.e <sheetView tabSelected="1" workbookViewId="0"/>
tabSelected="1" specifies that the corresponding sheet is currently selected).
The xlsx-to-cxml transformation should limit the processing of cells to the given cell range. This is defined by the dimension element for a particular table.
<dimension ref="A1:B3"/> <xsl:variable name="cell.range" select="$worksheet//excel:dimension/@ref" as="xs:string?"/>
XLSX Cells (c)
This element represents a cell in the Excel worksheet, and the information about the cell’s reference (@r) , formatting (@s), data type (@t), value (is- rich text inline, v - cell value) and the formula (f) are expressed. In this application, the formula-related transformation has been ignored.
Cell reference (@r)
The solution identifies the location of a particular cell using cell reference information, and analyses whether a cell is within the cell range or whether it is within a merged cell range.
Possible data types
-
n - number (the default)
-
b - boolean
-
d - date
-
e - error
-
inlineStr - an inline string
-
s - a shared string
-
str - a formula string
The following section describes a shared string data type (s).
Shared String Table (sst)
The Shared String Table serves as a collection of individual String Items (<si>
). It contains two attributes: @count - total number of text strings in the workbook, @uniqueCount - total number of unique text strings.
When the data type (@t) of the cell (c) indicates a 's' then the corresponding value should be read from the sharedStrings.xml
<c r="A7" s="14" t="s"> <v>6</v> </c>
The above sample cell element value is 6, therefore, it maps to the 7th string item
of the shared string table (/sst/si[7]/t
).
Merge cells
The sample mergeCells element
<mergeCells count="3"> <mergeCell ref="A7:C7"/> <mergeCell ref="A8:B8"/> <mergeCell ref="B9:C9"/> <mergeCell ref="A4:A5"/> </mergeCells>
In CALS, @nameend and @namest attributes define the horizontally-merged cells. The @r value is used to identify the cell reference, then the system has defined several functions to identify whether that particular cell is within a merged cell range.
<xsl:function name="tps:is-cell-merged" as="xs:boolean"> <xsl:param name="cell" as="xs:string"/> <xsl:variable name="merge-cells" select="$worksheet/e:worksheet/e:mergeCells/e:mergeCell" as="element(e:mergeCell)*"/> <xsl:sequence select="some $c in $merge-cells satisfies tps:is-within-cell-range($cell, $c/@ref)"/> </xsl:function>
This function tps:is-within-cell-range($cell, $ref)
is used to check the whether a given cell reference is within a cell range, the same
function is used to process the merged cells and it needs the current cell and the
mergeCell/@ref value as parameters. The following code segment demonstrates how it processes the
@nameend attribute
<xsl:variable name="merge-cells" select="sum(tps:get-merged-cell-count(@r))" as="xs:integer"/> <xsl:attribute name="nameend"> <xsl:value-of select="$merge-cells + count(preceding-sibling::excel:c) + 1"/> </xsl:attribute>
The function tps:get-merged-cell-count(@r)
gets the cell reference as a parameter, and if that cell is within a merge cell range,
then it will get the difference between the columns. For instance if the mergeCell/@ref is A7:C7 string-to-codepoints($last-col) - string-to-codepoints($first-col)
will return 2. The @namest defines the cell position corresponding to a particular row. So if the @namest is 1, then the @nameend value will be 3.
The <xsl:template
match="excel:c[not(normalize-space())][tps:is-cell-merged(@r)]"
priority="10"/>
template will remove the blank cells related to the merged
cell range.
The above mergeCells element <mergeCell ref="A4:A5"/>
indicates vertically merged cells, and the solution defines them using @morerows in the CALS tables.
Style overrides
Figure 3 can be used to explain more about these cell formats, it has applied different kinds of formatting. These can be considered in the following sections to discuss how this solution considers such input and applies cell format attributes to the final output.
-
A2 cell value is left aligned and it is styled as bold.
-
A3 cell value has applied red as the font colour.
-
A4 cell has borders in it.
-
C3 cell has a background colour.
-
the 7th row has some italic and underline text.
-
C5 cell value has a number format applied to it.
For instance, cell A3 can be considered and the following is how it is displayed in the worksheet.xml.
<c r="A3" s="9"> <v>4</v> </c>
Note: All the style-related numbers are 0-based indexes (i.e., 0 maps to the 1st element in the style.xml).
The @s maps to the /styleSheet/cellXfs/xf[10]
,
<xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"
applyAlignment="1">
, and Table 4 provides details about the available
style information in the styles.xml. The @applyFont and @applyAlignment attributes
store
boolean values indicating whether the corresponding formatting specified for this
xf
should be applied or not.
Table IV
Attribute | Sample XPath | Description |
---|---|---|
Background colour |
/styleSheet/fills/fill[4]/patternFill/fgColor/@rgb
|
cellXfs/xf/ gives the fillid , which maps to the
fill and the corresponding fillpattern.
|
Foreground colour |
/styleSheet/fonts/font[3]/colour/@rgb
|
cellXfs/xf/ gives the fontId which maps to the font
element /styleSheet/fonts/font .
|
Border type and Border pattern |
Border type : Border pattern: |
cellXfs/xf/ gives the fillId="0" borderId="0" which maps to the required information.
|
Number format e.g: 80.50% |
/styleSheet/numFmts/numFmt[@numFmtId eq 171]/@formatCode
|
numFmtId maps to the formatCode.
|
Bold, Italic, Underline e.g: "bold_italic" |
/styleSheet/fonts/font[8]/b
/styleSheet/fonts/font[8]/i
/styleSheet/fonts/font[8]/u
/styleSheet/fonts/font[20]/vertAlign/@val
/styleSheet/fonts/font[1]/strike
|
fontId maps to the corresponding font element,
vertAlign/@val can be either superscript or subscript, strike element
relates to the Strikethrough style.
|
The mapping between the XLSX cell-format attributes, number format and the style
attributes of the XLSX XML are listed in Table 4 with sample input elements and
XPaths. The CALS table format does not support styling attributes; therefore, the
information related to CXML style overrides (tps:c type="cell-format"
and the
tps:style
) are parsed from XLSX to CALS during the xlsx-to-cals processing
mode by using processing instructions. The information available with these processing
instructions is converted to CXML style overrides and the number format is done during
the stage where it processes the cell value-related information.
Here is an example of a table cell (<entry>
) with the style and cell-format information which are represented using processing
instructions:
<entry type="normal" namest="4" nameend="4" align="left" valign="bottom"> <?style bold?> Heading 1 <?cell-format border-bottom:thick; background-color:FFFFC7CE; foreground-color:FF9C0006;?> </entry>
The corresponding XSLT template which creates the CXML style overrides is implemented as follows;
<xsl:template match="processing-instruction('cell-format')"> <tps:c type="cell-format"> <xsl:value-of select="."/> </tps:c> </xsl:template>
Alignment
<xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1" applyAlignment="1"> <alignment horizontal="right"/> <alignment vertical="top"/> <alignment textRotation="90"/> </xf>
The @align and @valign attributes of the CALS table are filled from the above cellXfs/xf/alignment/@horizontal
and cellXfs/xf/alignment/@vertical
attributes and in the absence of these elements it considers left and bottom respectively as the default values.
The cell text rotation is defined using the @textRotation value of the above <alignment>
element.
Font style
In the above xf element, fontId="2" maps to the following font element /styleSheet/fonts/font[3].
<font> <b/> <i/> <u/> <sz val="12"/> <color rgb="FFFF0000"/> <vertAlign val="superscript"/> <name val="Calibri"/> <family val="2"/> <scheme val="minor"/> </font>
The font element can contain a collection of style information such as bold (b), italic (i) and underline (u). Furthermore, it provides the foreground colour and other information related to the font style. Superscript and subscript formatting details are defined using the vertAlign/@val. If a text is styled with strikethrough, this element will contain an element called strike.
Background colour
Background colour is defined by the fill element of the styles.xml, fillId="3" maps
to the following fill[4] element and
/styleSheet/fills[1]/fill[4]/patternFill[1]/fgColor[1]/@rgb
gives the
background colour information.
<fill> <patternFill patternType="solid"> <fgColor rgb="FFFFFF00"/> <bgColor indexed="64"/> </patternFill> </fill>
Both of these fgColor and bgColor relate to the background colour of a cell. If a pattern is applied to a cell then bgColor defines the background of the pattern and fgColor defines the foreground of the pattern. For instance if the cell pattern contains 'dots', fgColor is the colour of those dots. However, if it is a solid pattern the fgColor colour can be considered as the background colour and it does not have anything to do with the bgColor as demonstrated in the above example.
Border style
Border style is defined by the border element of the styles.xml, borderId="0" maps
to the corresponding border element /styleSheet/borders/border[1]
. Border pattern is defined by the /styleSheet/fills[1]/fill[4]/patternFill[1]/@patternType
value.
In the example illustrated in Figure 3, the A4 cell has borders in it. The border element defines the left, right, top, bottom and the diagonal border styles and the colour of it.
Number format
In the xf element of the styles.xml it contains the numFmtId attribute which defines
the number format of a particular cell value. For instance, numFmtId="171"
relates to the <numFmt numFmtId="171" formatCode="0.000%"/>
element. The format-number XSLT function can be used with this formatCode to format
the cell-value (format-number($value, $format-code)
).
Open xml documentation lists a set of format codes, the corresponding numFmt id is not explicitly save in the file, so when the styles.xml defines, a numFmtId and when the corresponding numFmt element is not present it needs to access the predefined format codes.
Column width
It is important to calculate the column width based on the cols element details in
the input XML; it specifies the column width as the colwidth
attribute of the colspec element in the output CXML.
The XLSX col element is not a mandatory element in the input sheet.xml. The following is an sample input of a table which has 11 columns, This means that from column number 3 to 6 the width is 10.8125 and when it does not mention the column width for 8 to 10 columns, the system considers the default column width which is 11.
<sheetFormatPr defaultColWidth="11" defaultRowHeight="30" customHeight="1"/> <cols> <col min="1" max="1" width="28.6875" customWidth="1"/> <col min="2" max="2" width="12.6875" customWidth="1"/> <col min="3" max="6" width="10.8125" customWidth="1"/> <col min="7" max="7" width="15.3125" customWidth="1"/> <col min="11" max="11" width="10.8125" customWidth="1"/> </cols>
The custom columns are defined as above, and all other columns use the default column width which is defined by the @defaultColWidth or the @baseColWidth in the sheet.xml.
The calculation of the colwidth = (column width/ total width) *100 % (eg: first column :- 28.6875/143.75 * 100% = 19.96). This calculation is done using an XSLT function which applies the xsl:iterate feature of the XSLT 3.0. The following function can be reused to calculate the total column width which needs to be considered in a particular table dimension.
<xsl:function name="tps:get-total-col-width" as="xs:decimal?"> <xsl:param name="cols" as="element(excel:col)*"/> <xsl:param name="default-col-width" as="xs:decimal"/> <xsl:iterate select="$cols"> <xsl:param name="total" as="xs:decimal" select="0"/> <xsl:param name="next-min" as="xs:decimal" select="0"/> <xsl:on-completion> <xsl:sequence select="$total"/> </xsl:on-completion> <!--Calculate the total width for a particular col element--> <xsl:variable name="cur-width" select="xs:decimal(@width * (@max - @min + 1))" as="xs:decimal"/> <!--When the custom width is not available consider the default column width--> <xsl:variable name="default-col-width-total" select="xs:decimal(if($next-min != @min and $next-min != 0) then $default-col-width * (@min - $next-min) else 0)" as="xs:decimal"/> <xsl:variable name="new-total" as="xs:decimal"> <xsl:sequence select="$total + $default-col-width-total + $cur-width"/> </xsl:variable> <xsl:next-iteration> <xsl:with-param name="total" select="$new-total"/> <xsl:with-param name="next-min" select="xs:decimal(@max + 1)"/> </xsl:next-iteration> </xsl:iterate> </xsl:function>
The sample output for these columns:
<tps:colspec colname="1" colwidth="19.96%"/> <tps:colspec colname="2" colwidth="8.83%"/> <tps:colspec colname="3" colwidth="7.52%"/> <tps:colspec colname="4" colwidth="7.52%"/> <tps:colspec colname="5" colwidth="7.52%"/> <tps:colspec colname="6" colwidth="7.52%"/> <tps:colspec colname="7" colwidth="10.65%"/> <tps:colspec colname="8" colwidth="7.65%"/> <tps:colspec colname="9" colwidth="7.65%"/> <tps:colspec colname="10" colwidth="7.65%"/> <tps:colspec colname="11" colwidth="7.52%"/>
When some sheet.xml files do not contain cols elements, the total column width is calculated as default column width * number of columns.
Custom number formats
The capability of adding custom number formats provides greater flexibility for an Excel user. The following example can be considered to explain this further, ';' is used to separate the format string into different sections.
The following can be considered as a sample customer number format string _("$"* #,##0.00_);_("$"* \(#,##0.00\);_("$"* "-"??_);_(@_)
This number format can include;
-
One section: this will be applicable for all the values.
-
Two sections: first section applies for positive numbers or zeros, and the second section applies negative numbers.
-
Four sections: first section applies to positive numbers, the second section applies to negative numbers, the third section applies to zeros, and the fourth section is for text.
Therefore, the above example has four sections and according to the cell value it considers the corresponding section. The following code segment demonstrates this implementation using XSLT.
<xsl:variable name="num-formats" select="tokenize(tps:get-number-format($style-index), ';')" as="xs:string*"/> <xsl:variable name="section-count" select="count($num-formats)" as="xs:integer"/> <xsl:when test="$section-count eq 4 and (. gt 0)"> <xsl:value-of select="format-number(., $num-formats[1])"/> </xsl:when> <xsl:when test="$section-count eq 4 and (. < 0)"> <xsl:value-of select="format-number(., $num-formats[2])"/> </xsl:when> <xsl:when test="$section-count eq 4 and (. eq 0)"> <xsl:value-of select="format-number(., $num-formats[3])"/> </xsl:when> <xsl:when test="$section-count eq 2 and (. >= 0)"> <xsl:value-of select="format-number(., $num-formats[1])"/> </xsl:when> <xsl:when test="$section-count eq 2 and (. < 0)"> <xsl:value-of select="format-number(., $num-formats[2])"/> </xsl:when>
Excel provides support for basic colours in custom number format. The following eight colours are specified using the name: [black][white][red][green][blue][yellow][magenta][cyan]. Other colours are considered as index numbers in a custom number format, but this application does not consider those.
For instance in the "$"#,##0.00_);[Red]\("$"#,##0.00\)
, the second section specifies ‘[Red]\("$"#,##0.00\)
, this indicates if the cell contains a negative number it should be formatted as
($0.00) with a foreground-color=#FF0000 (the hex value corresponding to 'Red').
Date and time formatting
In the Excel input XML, the dates are indicated as a serial number, it considers 1899-12-31 as the base date and adds the specified serial number to get the date. Further, it uses decimal numbers to specify the time, for instance, 0.5 indicates 06:00. In order to process date and time information, this solution provides two functions as explained in Table 2.
Conditional formatting
This section explains how conditional formatting is applied using the XSLT transformation. In the following example, it applies conditional formatting where it compares the value of B22 according to the operator (greaterThan) with the formula value; this formula can be a static value as well. If the value matches the given conditions it will read the dxfs[0] according to the dxfId then apply the given foreground-color and the background-color to the B22 cell.
<conditionalFormatting sqref="B22"> <cfRule type="cellIs" dxfId="0" priority="1" operator="greaterThan"> <formula>$E$12</formula> </cfRule> </conditionalFormatting>
This dxfId refers to the dxfs element of the styles.xml.
<dxfs count="1"> <dxf> <font> <color rgb="FF9C0006"/> </font> <fill> <patternFill> <bgColor rgb="FFFFC7CE"/> </patternFill> </fill> </dxf> </dxfs>
XLSX with charts and graphs
In order to process charts and graphs, the XLSX chart element is first converted into
an SVG file which is then inserted using the @ref in the tps:image
element of the CXML.
Figure 4 illustrates how the solution navigates through the available files to identify the corresponding chart.xml to be processed. When charts and graphs are inserted to the sheet.xml, the 'drawing' element can be found within the elements. It directs the _rels/sheet1.xml.rels to link with the corresponding drawing.xml. Drawing.xml contains the graphic element with the relevant relationship ID to link to the chart.xml. This solution provides separate functions to read each of these files when needed.
<a:graphic> <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/chart"> <c:chart xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId2"/> </a:graphicData> </a:graphic>
This research focuses on the chart.xml where it reads the available information to understand which conversion needs to be actioned—pie chart to SVG, line chart to SVG, or bar chart to SVG. Further, it is required to read the theme.xml to get the used colour information.
Table 5 demonstrates the common elements in a XLSX which are related to charts.
Table V
Xpath | Description |
---|---|
/c:chartSpace/c:chart/c:title/c:tx/c:rich/a:p/a:r/a:t |
Chart title |
/c:chartSpace/c:chart/c:plotArea/c:pieChart |
Pie chart |
/c:chartSpace/c:chart/c:plotArea/c:barChart |
Bar chart |
/c:chartSpace/c:chart/c:plotArea/c:lineChart |
Line chart |
/c:barChart/c:ser/c:cat/c:strRef/c:strCache/c:pt |
Categorical data |
/c:barChart/c:ser/c:val/c:numRef/c:numCache/c:pt/c:v |
Numeric values |
/c:barChart/c:ser/c:val/c:numRef/c:numCache/c:ptCount/@val |
Point count |
/c:barChart/c:ser/c:spPr/a:solidFill/a:schemeClr/@val /c:pieChart/c:ser/c:dPt/c:spPr/a:solidFill/a:schemeClr/@val |
Colour value |
The following is a sample solidFill element and it is used to read the colour information.
<a:solidFill> <a:schemeClr val="accent1"/> </a:solidFill>
The val="accent1" corresponds to the /a:accent1/a:srgbClr/@val
in the
theme.xml. This system provides access to the theme.xml and to the specific color
details.
SVG (Scalable Vector Graphics)
Table 6 explains the elements that are available in an output .svg file.
Table VI
Element | Description |
---|---|
<defs>
|
This is used to store the graphical element to be used in the following segments. |
<path>
|
|
M x, y | M x, y |
A |
Elliptical arc: This command will draw an elliptical arc from the current point to x, y.
|
L x, y | Line to x, y |
<rotate>
|
Rotate (<a> [<x> <y>]) - rotation by <a> degrees about a given point. If [<x> <y>] are not supplied, the rotation is about the origin of the current user coordinate
system. y
|
<clipPath>
|
A clipping path specifies the region to which paint should be applied. Then the parts of the drawing that lie outside of the region bounded by the clipping path are not drawn. |
<circle cx="50" cy="50" r="50"/>
|
This command will draw circles based on a centre point and a radius. |
<use>
|
This element takes nodes within the SVG document and duplicates them somewhere else. |
<rect>
|
This draws rectangles, defined by their position, width, and height. |
<text>
|
This element helps to draw graphics elements consisting of text |
<line>
|
This creates a line connecting two points |
To investigate the possibility of supporting XLSX graphs and charts using XSLT, the solution has considered an input file with a simple pie chart, a line chart, and a bar chart.
XLSX with a pie chart
The XSLT transformation that creates the pie chart uses elliptical arcs and fills them with specific colours. The feature available with clipPath in the SVG is used here so only the information within the specified region is visualised to the user. The following demonstrates the sample input (Figure 5) chart used to implement the pie chart functionality and the output (Figure 6 - .svg file) it provides from the proposed XSLT transformation.
XLSX with a line chart
The information available in Table 5 can be used to map the input chart.xml (Figure 7) into the output SVG (Figure 8) which represents a line chart.
XLSX with a bar chart
The following source code shows that the conversion of the bar chart and the line chart are similar, except in the line chart it uses a path and in the bar chart it uses a rect.
<xsl:template name="chart"> <xsl:param name="is-line-chart" as="xs:boolean"/> <xsl:variable name="max-val" select="xs:decimal(max(c:ser/c:val/c:numRef/c:numCache/c:pt/c:v))" as="xs:decimal"/> <xsl:variable name="data-count" select="count(c:ser/c:val/c:numRef/c:numCache/c:pt)" as="xs:integer"/> <xsl:variable name="chart-height" select="$max-val + 50" as="xs:decimal"/> <svg> <g> <--Chart title--> <text x="{$data-count*50}" y="10" font-size="14px" text-anchor="end"> <xsl:value-of select="ancestor::c:chart/c:title/c:tx/c:rich/a:p/a:r/a:t"/> </text> <--Chart axis--> <line id="axis-y" x1="40" y1="{$chart-height}" x2="40" y2="20" style="fill:none;stroke:rgb(0,0,0);stroke-width:2"/> <line id="axis-x" x1="40" y1="{$chart-height}" x2="{$data-count*100}" y2="{$chart-height}" style="fill:none;stroke:rgb(0,0,0);stroke-width:2"/> <xsl:for-each select="c:ser/c:val/c:numRef/c:numCache/c:pt"> <xsl:variable name="pt" select="." as="element(c:pt)"/> <xsl:variable name="color" select="'#' || tps:get-theme-xml()/a:theme/a:themeElements/a:clrScheme/a:*[local-name() eq tps:get-theme-color(current()/(ancestor::c:lineChart[$is-line-chart], ancestor::c:barChart)[1], $pt/@idx)]/a:srgbClr/@val" as="xs:string"/> <--Y axis labels--> <text x="30" y="{$chart-height - c:v}" font-size="12px" text-anchor="end"> <xsl:value-of select="c:v"/> </text> <xsl:choose> <--Line chart--> <xsl:when test="$is-line-chart"> <path d="M {50 + (position() - 1)*100},{$chart-height - c:v} L {50 + (position())*100}, {$chart-height - following-sibling::c:pt/c:v} Z" stroke="{$color}"/> </xsl:when> <--Bar chart--> <xsl:otherwise> <rect x="{40 + (position() - 1)*100}" y="{$chart-height - c:v}" width="50" height="{c:v}" style="fill:{$color};stroke:rgb(0,0,0);stroke-width:0"/> </xsl:otherwise> </xsl:choose> </xsl:for-each> <--X axis labels--> <xsl:for-each select="c:ser/c:cat/c:strRef/c:strCache/c:pt"> <text x="{100 + (position()-1)*100}" y="{$chart-height + 20}" font-size="12px" text-anchor="end"> <xsl:value-of select="c:v"/> </text> </xsl:for-each> </g> </svg> </xsl:template>
Application of the converter
The use case for Typefi is embedded Excel files within Word documents, but can also be used for standalone Excel documents.
This application considers some input parameters (Table 7.), and the following user interface prompts the user to fill them.
The following element will be added into the docx file with the user-provided information.
<w:instrText>Sheet="Historical Performance" CellRange="C6:J22" FooterRows="0" HeaderRows="2"
PreferredWidth="100" PreserveCellStyles="False" PreserveLocalOverrides="False"</w:instrText>
Table VII
Category | Sample data | Description |
---|---|---|
Sheet | Historical Performance | Sheet name |
CellRange | C6:J22 | Docx file contains the cell range of the table, which cells were included in the docx file as a table. |
FooterRows | 0 | Number of footer rows |
HeaderRows | 2 | Number of header rows |
PreferredWidth | 100 | This relates to the amount that the table can take up when it is positioned in InDesign. |
PreserveCellStyles | False | Using the Excel cell styles, this maps to the @type attribute on the <entry> in CXML which in turn maps to Cell styles in InDesign.
|
PreserveLocalOverrides | False | Add cell-format values to the entry/@style attribute. |
The sheet name, cell range, footer/header rows and the referred width values are passed to the XSLT as parameters and will be used during the transformation.
<xsl:param name="sheet" as="xs:string" select="'HistoricalMonthly Performance'"/> <xsl:param name="cell_range" as="xs:string" select="'C6:J22'"/> <xsl:param name="footer_rows" as="xs:integer" select="0"/> <xsl:param name="header_rows" as="xs:integer" select="2"/> <xsl:param name="preferred_width" as="xs:integer" select="100"/>
The proposed solution will convert the table information to a tps:table
element, and charts and image references to tps:image
elements in the CXML. Images and charts are transformed into separate image files
and the tps:image
element will reference the image file.
Discussion
The previous approach, with the Apache POI, read the XLSX file using the streaming, as it was memory intensive when handling large or memory-intensive files. Ultimately, limitations with this approach prevented us from implementing specific features and maintaining the code with bug fixes. The existing application supports Word documents that are processed using XSLT features. Therefore, our goal is to use a similar approach in the Excel converters.
Meanwhile, the XSLT approach provides the flexibility to read the required cell range of a particular sheet.xml without considering the entire XLSX file. In other words, this custom XSLT implementation will read the XLSX files at the lowest level possible, giving the maximum control and flexibility possible to help bypass the restrictions of the Apache POI API.
Compared to the Java solution, an XSLT developer would find it easier to use XSLT to do these transformations, adding more flexibility by mapping the information into the output XML schema. Finally, XSLT is the globally accepted methodology for XML transformations. Therefore Java code is less suitable than XSLT for document format transforms and conversion to XML.
The XProc pipeline was a new research area that delivered a unique solution to read an Excel file and pass it as a collection to the XSLT without writing into the disk. An inexperienced user of XML Calabash has to spend a considerable amount of time grasping the concepts of XProc language with the available online resources. Even though there are several other alternatives to achieve the same objective, we believe this was an ideal problem to demonstrate the application of XProc in such a domain. The listed resources related to XProc provide guidelines to troubleshoot various errors which occurred in several instances during the development of the final XProc pipeline. Several processors implement XProc 3.0 features, but XML Calabash was compatible with the ANT script so this solution runs the pipeline using it.
This application provides a general approach by delivering the flexibility of converting the XLSX tables into any other XML schema since this research provides a solution around CALS tables. Moreover, the XLSX chart to SVG conversion stores the output chart.svg files in a separate chart folder so it can be easily reused with any other transformation or as a standalone application. Currently, the proposed solution converts the pie chart, line chart and bar chart to an SVG output. It can be improved to support any other chart formats and with complex data; however, as a proof of concept, this solution considers only these types.
Further, the solution provides a separate set of functions to read each file, such
as the styles.xml, theme.xml, sharedStrings.xml, and relationship files. For instance,
tps:get-total-col-width
function is implemented to calculate the total column width in order to calculate
the width proportion. In this case, it has used the XSLT 3.0 xsl:iterate
feature, and it showcases a great application of that.
This paper presents the possibility of transforming an XLSX file into a different XML schema; however, this paper also identifies several limitations and future improvements. For example, the current solution does not consider shrinkToFit kind of alignment feature and formulas, but this will be implemented as a future improvement. Also, the current implementation considers the common input XLSX file structure to read the files in corresponding order, but an arbitrary file structure (considering relationship files) will be considered in future versions.
Conclusion
This research covers three research areas— XProc pipeline, XLSX table to CALS table transformation, and XLSX chart to SVG conversion. The solution uses XProc, XSLT, XPath and ANT scripts and provides a CXML file as the final output. This paper explains the input file element information and the required information to be fetched to fill the required XML output. CALS tables are derived from the information available in the sheet.xml and other related files. It further demonstrates how the solution converts the chart.xml into an SVG file by explaining the available features of the SVG schema.
The Excel file structure becomes too complex to be processed—with the size and available information of the input file. However, based on the customer files, we decided which features were required to be implemented. Ultimately, the solution will increase the scope in the future.
References
[1] Microsoft Corporation. February 15, 2022. Office Implementation Information for ISO/IEC 29500 Standards Support. https://interoperability.blob.core.windows.net/files/MS-OI29500/%5bMS-OI29500%5d.pdf
[2] Spreadsheet styles. http://officeopenxml.com/SSstyles.php
[3] Microsoft Office, API Reference. https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.spreadsheet.sheetformatproperties?view=openxml-2.8.1
[4] "Retrieving data from Excel XML using XSLT". https://stackoverflow.com/questions/33422652/retrieving-data-from-excel-xml-using-xslt
[5] "How to turn an XML file into SVG using XSL?" https://stackoverflow.com/questions/8056671/how-to-turn-an-xml-file-into-svg-using-xsl
[6] XSLT to generate SVG tutorial. 2016. http://edutechwiki.unige.ch/en/XSLT_to_generate_SVG_tutorial
[7] Florent Georges. 2008. Simple SVG chart generation with XSLT. http://fgeorges.blogspot.com/2008/04/simple-svg-chart-generation-with-xslt.html
[8] Max Froumentin, Vincent Hardy, W3C. Using XSLT and SVG together: a survey of case studies. https://www.w3.org/People/maxf/papers/2002-07-SVGOpen/svgopen.html
[9] SVG attribute reference. https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute
[10] A pure XSLT/SVG pie chart. https://www.sleepingdog.org.uk/svg/chart/pie/
[11] SvgCharts4XSL. https://franklinefrancis.github.io/SvgCharts4Xsl/
[12] Adrian McMenamin. 2011. Using XSLT to manipulate an SVG file. https://cartesianproduct.wordpress.com/2011/07/16/using-xslt-to-manipulate-an-svg-file/
[13] SVG: Scalable Vector Graphics. https://developer.mozilla.org/en-US/docs/Web/SVG
[14] Ben Hauser. Content XML 3.0 and earlier. https://help.typefi.com/hc/en-us/articles/228240428-Content-XML-3-0-and-earlier
[15] Laura Powers. 2021. "Local style overrides". https://help.typefi.com/hc/en-us/articles/360001491415#h_3be5e014-f255-49b8-8454-6c7681861b1a
[16] Erik Bruchez. "Unzipping an Excel 2007 .xlsx file and extracting data in XPL/XSLT". https://gist.github.com/ebruchez/1245692
[17] Dave Bruns. "Excel customer number formats". https://exceljet.net/custom-number-formats
[18] Appendix A: Full XML schema. https://docs.microsoft.com/en-us/openspecs/office_standards/ms-xlsx/6624db33-496c-47f7-a562-a54cb01b133f
[19] Apache POI - HSSF and XSSF Limitations. https://poi.apache.org/components/spreadsheet/limitations.html
[20] Harvey Bingham. OASIS Technical Resolution TR 9503:1995, Exchange Table Model Document Type Definition. https://www.oasis-open.org/specs/a503.htm
[21] Ari Nordström. 2020. "Pipelined XSLT Transformations". Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25. https://www.balisage.net/Proceedings/vol25/print/Nordstrom01/BalisageVol25-Nordstrom01.html. doi:https://doi.org/10.4242/BalisageVol25.Nordstrom01
[22] David Maus. 2018. XProc Step by Step: Implementing a DOCX to TEI step. https://dmaus.name/blog/2018.14/index.html
[23] XProc 3.0: An XML Pipeline Language, Community Group Report 1 July 2022. https://spec.xproc.org/master/head/xproc/
[24] Norman Walsh, Achim Berndzen. 2019. "XProc 3.0". Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23. https://www.balisage.net/Proceedings/vol23/print/Walsh02/BalisageVol23-Walsh02.html. doi:https://doi.org/10.4242/BalisageVol23.Walsh02
[25] Erik Siegel. 2020. XProc 3.0 - Strategies for merging documents. https://www.xml.com/articles/2020/11/16/xproc-30-strategies-merging-documents/#sect-pass-as-collection
[26] "How do I convert Excel Serial Date Numbers in an XML file to mm/dd/yyyy for SQL Server in an SSIS Package?" https://stackoverflow.com/questions/64018372/how-do-i-convert-excel-serial-date-numbers-in-an-xml-file-to-mm-dd-yyyy-for-sql
[27] "How to change date format in Excel and create custom formatting". https://www.ablebits.com/office-addins-blog/2015/03/11/change-date-format-excel/#excel-date-format