Note

I would like to thank my colleague Max Zhaloba for his valuable comments and clarifications related to XProc and XSLT, and also Damian Gibbs, Kate Prentice, and Anupama Wimalasooriya for helping me to review the content. Finally, I would like to thank Typefi for the motivation and support.

Introduction

This solution transforms a Microsoft Excel Open XML Spreadsheet (XLSX) file into a shallow-structured XML file used at Typefi, called Content XML (CXML), using XSLT and XProc an XML- based programming language for processing documents in pipelines.

Previously in Typefi, the XLSX to CXML transformations were performed using Java packages, usually the Apache POI package, which provides APIs for manipulating Office Open XML documents. This approach required a significant amount of Java knowledge, and for someone with little Java experience, handling the available functionalities of the Apache POI was not convenient in comparison to the XSLT transformations. Many systems convert XLSX to other XML schema transformations using Java language, but identifying the internal structure of an XLSX file and processing them is the real challenge.

A solution could use several of these technologies, but the approach outlined in this paper uses XSLT—mainly XSLT 3.0 features. While the amount of code needed to filter out all the Excel formatting information is significant, the goal is to make the code reusable. This goal is accomplished by providing reusable functions to access features within the XLSX elements easily. Moreover, the solution includes an 'xlsx-to-cals' mode that transforms Excel table information into a CALS table. The "CALS Table Model" is the most widely adopted industry standard of SGML/XML representation for tables. Further, this research provides modules that transform charts (bar chart, pie chart and line chart) into SVG files. This general approach helps to transform XLSX tables and charts into XML schemas other than CXML because CALS tables and SVG files are well-known formats. The XSLT solution outlined in this paper is executed from an XProc pipeline, with ANT scripts used to call the pipeline and XML Calabash running the XProc pipeline..

In this solution, the final output is Content XML (CXML) which is a DocBook-like XML scheme used internally in Typefi products. It is an intermediary format to facilitate converting any input file format to another format, such as Adobe InDesign. CXML is optimised for page layout (i.e. InDesign/PDF) or digital production (i.e. EPUB, HTML, DAISY).

XLSX files can contain tables, charts and graphics, which all play a significant role. Therefore, this paper focuses on processing these elements and transforming them into the corresponding CXML elements according to the CXML schema 3.2.

This paper examines how the XProc pipeline reads the .xlsx file along with the information from the various XML files and how the XSLT transforms the content using the elements and attributes into a different XML schema—CXML.

Please note: This paper uses @ in front of names to denote attributes of an element.

Figure 1 illustrates the high-level design of this system.

Figure 1: High-level project architecture

XLSX overview

The XLSX file type was introduced in 2007 as the disk form of Excel spreadsheets when Office 2007 was released. XLSX is a ZIP package that contains many XML files and is organised according to the Open Packaging Conventions (OPC). The XLSX system follows this folder structure with several functions to read the corresponding files such as relationship files, sheet.xml, and chart.xml.

Table I

The XLSX file structure

XLSX folder structure example Folder/sub-folder titles Information contained within these folders
_rels Relationships to the primary XML documents and other key parts are included here.
[Content_Types].xml
  • All XML file references included in the package are referenced here.

  • Content types for parts within the package are also listed in this file.

docProps

Overall document properties are contained here.

  • app.xml: information about the file content.

  • core.xml: information about the author, created and modified dates etc.

xl

Details about the content of the file are listed here.

  • charts: Information relating to inserted charts.

  • drawings: graphics related information.

  • media: Image files.

  • theme: Theme-related details.

  • worksheets: worksheet-related information.

XProc pipeline

Calabash is used to implement an XProc pipeline that reads the files from the input XLSX file and passes the information as a collection() to the XSLT. More information about XProc is available in "XProc 3.0: An XML Pipeline Language" [23].

Using the ANT script we can extract the Excel file and pass the disk location to the XSLT. However, the advantage of Calabash is that everything flows through the pipeline and does not require to be saved to a disk unless we use p:store explicitly.

The first solution writes the unzipped file into the disk. Having cx:decode="true" in the p:store is important, otherwise the XSLT function collection() could not detect the extracted content as a valid XML file, instead returnining an xs:base64Binary object, while the expected output was a document-node().

                  
     <pxp:unzip>
         <p:with-option name="href" select="$input.file.path"/>
         <p:with-option name="file" select="$filename"/>
         <p:with-option name="charset" select="UTF-8"/>
         <p:with-option name="content-type" select="xml"/>
     </pxp:unzip>
      
     <p:store cx:decode="true">
         <p:with-option name="href" select="concat($output-dir, '/', encode-for-uri($filename))"/>
     </p:store>

The following code segment demontrates how to pass the Excel file as a collection to the XSLT program. This idea is proposed by one of our team members, Max Zhaloba. This is the recommended way, as it does not need to write the intermediate content into the disk.

This XProc pipeline follows these steps:

  • Get the input parameters from the ANT script (input file path and the XSLT stylesheet).

  • Unzip the XLSX file

  • Iterate over the ZIP file and add the file path as an attribute. This will be used from the XSLT side to read the file content.

  • Call the XSLT program and pass the unzip file information as a parameter.

  • Write the main output file to the disk.

  • Write the other result documents to the disk.

                  
     <p:declare-step version="3.0"
                        xmlns:c="http://www.w3.org/ns/xproc-step"
                        xmlns:pxp="http://exproc.org/proposed/steps"
                        xmlns:p="http://www.w3.org/ns/xproc"
                        name="main">
     
         <p:input port="param" kind="parameter"/>

         <p:output port="result" primary="true" sequence="true">
             <p:pipe port="result" step="secondary-storage"/>
         </p:output>
     
     
         <p:declare-step type="pxp:unzip">
             <p:output port="result"/>
             <p:option name="href" required="true"/>
             <p:option name="file"/>
             <p:option name="content-type"/>
             <p:option name="charset"/>
         </p:declare-step>
     
     
         <p:variable name="input.file.path" select="replace(/c:param-set/c:param[@name eq 'input']/@value, '\\', '/')">
             <p:pipe step="main" port="param"/>
         </p:variable>
     
     
         <p:variable name="stylesheet.file.path" select="replace(/c:param-set/c:param[@name eq 'xsl']/@value, '\\', '/')">
             <p:pipe step="main" port="param"/>
         </p:variable>
     
     
         <p:load name="load-stylesheet">
             <p:with-option name="href" select="$stylesheet.file.path"/>
         </p:load>
     
     
         <pxp:unzip name="unzip-excel-file">
             <p:with-option name="href" select="$input.file.path"/>
         </pxp:unzip>
     
     
         <p:for-each name="iterate-zip-directory">
             <p:iteration-source select="/c:zipfile/c:file[ends-with(@name, '.xml') or ends-with(@name, '.rels')]">
                 <p:pipe step="unzip-excel-file" port="result"/>
             </p:iteration-source>
     
             <p:output port="result">
                 <p:pipe step="unzipped-xml-files" port="result"/>
             </p:output>
     
             <p:variable name="unzipped-file-subpath" select="/c:file/@name"/>
     
             <pxp:unzip>
                 <p:with-option name="href" select="$input.file.path"/>
                 <p:with-option name="file" select="$unzipped-file-subpath"/>
             </pxp:unzip>
     
             <p:add-attribute name="unzipped-xml-files" attribute-name="xml:base" match="/*">
                 <p:with-option name="attribute-value" select="$unzipped-file-subpath"/>
             </p:add-attribute>
         </p:for-each>
     
     
         <p:xslt name="apply-transform" template-name="main">
             <p:input port="stylesheet">
                 <p:pipe step="load-stylesheet" port="result"/>
             </p:input>
     
             <p:input port="source">
                 <p:pipe step="iterate-zip-directory" port="result"/>
             </p:input>
     
             <p:input port="parameters">
                 <p:inline>
                     <c:param-set>
                         <c:param name="sheet_ui_name" value="sheet1.xml"/>
                         <c:param name="debug" value="true"/>
                     </c:param-set>
                 </p:inline>
             </p:input>
     
         </p:xslt>
         
     
         <p:store href="../output/output.cxml"/>
         
     
         <p:for-each name="secondary-storage">
             <p:iteration-source select=".">
                 <p:pipe port="secondary" step="apply-transform"/>
             </p:iteration-source>
     
             <p:output port="result">
                 <p:pipe port="result" step="store"/>
             </p:output>
     
             <p:store name="store">
                 <p:with-option name="href" select="document-uri(.)"/>
             </p:store>
         </p:for-each>
     
     </p:declare-step>

The following code segment can be used to read the inner files of the unzipped file from the XSLT side. All files will be accessed using the corresponding file name.

                  
     <xsl:variable name="worksheet"  select="collection()[*/base-uri() eq 'xl/worksheets/sheet1.xml']" as="document-node()"/>

XLSX to CXML Methodology

The transformation to CXML begins with an XLSX input file which is processed to access the inner files. As explain in the previous section, this program is initialised using ANT scripts; then an XProc pipeline is executed using XML Calabash.

Most transformations are triggered from the sheet.xml which resides in the xl/worksheets folder. The sheetData element in the sheet.xml contains the information needed to be converted to a tps:table element and the following sections will discuss this in more detail.

The XLSX to CXML solution was developed out of two major research areas:

  • Map the XLSX table information to the CALS table elements and attributes.

  • Convert the chart.xml into an SVG and map it to tps:image elements and attributes.

The XLSX to CALS transformation happens as a pre-processing mode, then in the default mode, it will be converted to CXML schema; SVG files are referred to as images in this schema. The solution provides a CXML file as the final output. The XLSX to CALS transformation adds flexibility to this allowing mapping to XML schemas other than CXML since it uses CALS table and SVG formats.

Solutions to the above scenarios are developed using XSLT 3.0.

Input XML

Workbook.xml contains information about the available worksheets. The attribute value of the r:id maps to the related sheet.xml. This XSLT transformation starts to trigger the templates from the sheet.xml.

The following is a sample sheets element in the workbook.xml

                     
         <sheets>
             <sheet name="Instructions" sheetId="1" r:id="rId1"/>
             <sheet name="DBRS Tape" sheetId="37" r:id="rId2"/>
         </sheets>

The following code segment can be used to find the corresponding sheet.xml using the sheet name which we get as an input parameter.

                     
         <xsl:variable name="sheet.name" select="substring-after(tps:get-workbook-rels()/rel:Relationships/rel:Relationship[@Id = $workbook/e:sheets/e:sheet[@name = $sheet]/@r:id]/@Target, '/')" as="xs:string?"/>

The following shows a sample sheetData element from a XLSX input XML.

                     
         <sheetData>
             <row r="1" spans="1:2" x14ac:dyDescent="0.2">
                 <c r="A1" t="s">
                     <v>0</v>
                 </c>
                 <c r="B1" t="s">
                     <v>1</v>
                 </c>
             </row>
             .
             .
             .
         </sheetData>

The system architecture

The following diagram shows the functions created to access the XLSX features. Other than these functions, matching templates are triggered to convert the sheetData into a CALS table as well as for the chart conversion.

Figure 2: Overview of XSLT functions

Table 2 explains the available functions and the purpose of each.

Table II

Function definitions

Function Description
tps:get-worksheet Read the sheet.xml
tps:get-stylesheet Read the styles.xml
tps:get-sharedstring-table Read the shared string table from the sharedStrings.xml
tps:get-theme-xml Read the theme.xml
tps:get-workbook-rels Read the workbook.xml.rels
tps:get-worksheet-rels Read the sheet#.xml.rels
tps:get-drawing-rels Read the drawing#.xml.rels
tps:get-chart-xml Read the chart.xml
tps:get-cellxfs Get the /styleSheet/cellXfs/xf elements
tps:get-font Get font elements from the style.xml
tps:get-border-style Get the style of the bottom, top, left, right borders.
tps:get-pattern-fill Get the /styleSheet/fills/fill/patternFill/@patternType from the styles.xml
tps:get-theme-colour Get the /styleSheet/numFmts/numFmt[2]/@formatCode from the styles.xml
tps:get-align Get the horizontal alignment of the cell
tps:get-valign Get the vertical alignment of the cell
tps:get-rotation Get the rotation information from the styles.xml
tps:get-foreground-colour Get the font colour
tps:get-style Get the style applied to text, bold, italic, underline, strikethrough, superscript and subscript. This can be individual or a combination of them.
tps:get-background-colour Get background colour of the cell
tps:get-total-col-width This function is used to calculate the total column width
tps:get-time-from-the-value Convert the decimal value to the correct time format
tps:get-date-from-the-serial-no Convert the serial no to a date
tps:is-hidden-column Check whether a column is hidden

XLSX tables

This section considers XLSX files with tables. The XSLT transformations to convert XLSX into CALS table markup (with example XPath) is explained, and guidelines to apply user-defined XSLT functions are provided.

Table 3 displays the corresponding CALS table element mapping for the Excel elements of the sheet.xml.

Table III

XLSX elements map to the CALS table elements

XLSX element CALS table element

sheetData

Header rows, footer rows are defined by the user through the application interface, see Figure 11.

                                 
         <table>
             <tgroup cols="number of columns">
                 <colspec/>
                 <thead>
                 </thead>
                 <tbody>
                 </tbody>
                 <tfoot>
                 </tfoot>
             </tgroup>
         </table>
col <colspec>
row <row>
c

<entry>

  • @namest specifies the starting cell number for horizontally merging cells.

  • @nameend specifies the ending cell number for horizontally merging cells.

  • @morerows specifies the number of additional rows for vertically merging cells.

  • @valign specifies the vertical alignment for the cell contents.

  • @align specifies the horizontal alignment for the cell contents.

The following shows a sample CALS table output corresponding to the input given in section 2.1.

                  
     <table>
         <tgroup cols="2">
             <colspec colname="1" colwidth="8"/>
             <colspec colname="2" colwidth="8"/>
             <thead>
                 <row>
                     <entry namest="1" nameend="1" align="left" valign="bottom">Year</entry>
                     <entry namest="2" nameend="2" align="left" valign="bottom">Total</entry>
                 </row>
                 <row>
                     <entry namest="1" nameend="1" align="left" valign="bottom">Year 2020</entry>
                     <entry namest="2" nameend="2" align="left" valign="bottom">20 </entry>
                 </row>
             </thead>
             <tbody>
                 <row>
                     <entry namest="1" nameend="1" align="left" valign="bottom">Year 2022</entry>
                     <entry namest="2" nameend="2" align="left" valign="bottom">80</entry>
                 </row>
             </tbody>
         </tgroup>
     </table>

Cell (entry)

The <entry> element is used in this solution to create table cells and has several attributes, so further discussion about cell-related information is important. Therefore, the following section explains about XLSX cell elements and related features and attributes.

Cell range

In order to access the cell range, tps:get-worksheet() will return the selected worksheet as the $worksheet variable. This is either user-specified by using an input parameter, or it considers the tabSelected value to read the specific sheet (i.e <sheetView tabSelected="1" workbookViewId="0"/> tabSelected="1" specifies that the corresponding sheet is currently selected).

The xlsx-to-cxml transformation should limit the processing of cells to the given cell range. This is defined by the dimension element for a particular table.

                        
     <dimension ref="A1:B3"/>
     <xsl:variable name="cell.range" select="$worksheet//excel:dimension/@ref" as="xs:string?"/>

XLSX Cells (c)

This element represents a cell in the Excel worksheet, and the information about the cell’s reference (@r) , formatting (@s), data type (@t), value (is- rich text inline, v - cell value) and the formula (f) are expressed. In this application, the formula-related transformation has been ignored.

Cell reference (@r)

The solution identifies the location of a particular cell using cell reference information, and analyses whether a cell is within the cell range or whether it is within a merged cell range.

Possible data types

  • n - number (the default)

  • b - boolean

  • d - date

  • e - error

  • inlineStr - an inline string

  • s - a shared string

  • str - a formula string

The following section describes a shared string data type (s).

Shared String Table (sst)

The Shared String Table serves as a collection of individual String Items (<si>). It contains two attributes: @count - total number of text strings in the workbook, @uniqueCount - total number of unique text strings.

When the data type (@t) of the cell (c) indicates a 's' then the corresponding value should be read from the sharedStrings.xml

                           
         <c r="A7" s="14" t="s">
             <v>6</v>
         </c>

The above sample cell element value is 6, therefore, it maps to the 7th string item of the shared string table (/sst/si[7]/t).

Merge cells

The sample mergeCells element

                           
         <mergeCells count="3">
             <mergeCell ref="A7:C7"/>
             <mergeCell ref="A8:B8"/>
             <mergeCell ref="B9:C9"/>
             <mergeCell ref="A4:A5"/>
         </mergeCells>

In CALS, @nameend and @namest attributes define the horizontally-merged cells. The @r value is used to identify the cell reference, then the system has defined several functions to identify whether that particular cell is within a merged cell range.

                           
         <xsl:function name="tps:is-cell-merged" as="xs:boolean">
             <xsl:param name="cell" as="xs:string"/>
     
             <xsl:variable name="merge-cells" select="$worksheet/e:worksheet/e:mergeCells/e:mergeCell" as="element(e:mergeCell)*"/>  
             <xsl:sequence select="some $c in $merge-cells satisfies tps:is-within-cell-range($cell, $c/@ref)"/>
         </xsl:function>

This function tps:is-within-cell-range($cell, $ref) is used to check the whether a given cell reference is within a cell range, the same function is used to process the merged cells and it needs the current cell and the mergeCell/@ref value as parameters. The following code segment demonstrates how it processes the @nameend attribute

                           
     <xsl:variable name="merge-cells" select="sum(tps:get-merged-cell-count(@r))" as="xs:integer"/>
     <xsl:attribute name="nameend">
         <xsl:value-of select="$merge-cells + count(preceding-sibling::excel:c) + 1"/>
     </xsl:attribute>

The function tps:get-merged-cell-count(@r) gets the cell reference as a parameter, and if that cell is within a merge cell range, then it will get the difference between the columns. For instance if the mergeCell/@ref is A7:C7 string-to-codepoints($last-col) - string-to-codepoints($first-col) will return 2. The @namest defines the cell position corresponding to a particular row. So if the @namest is 1, then the @nameend value will be 3.

The <xsl:template match="excel:c[not(normalize-space())][tps:is-cell-merged(@r)]" priority="10"/> template will remove the blank cells related to the merged cell range.

The above mergeCells element <mergeCell ref="A4:A5"/> indicates vertically merged cells, and the solution defines them using @morerows in the CALS tables.

Style overrides

Figure 3 can be used to explain more about these cell formats, it has applied different kinds of formatting. These can be considered in the following sections to discuss how this solution considers such input and applies cell format attributes to the final output.

  • A2 cell value is left aligned and it is styled as bold.

  • A3 cell value has applied red as the font colour.

  • A4 cell has borders in it.

  • C3 cell has a background colour.

  • the 7th row has some italic and underline text.

  • C5 cell value has a number format applied to it.

Figure 3: Sample Excel table

For instance, cell A3 can be considered and the following is how it is displayed in the worksheet.xml.

                        
         <c r="A3" s="9">
             <v>4</v>
         </c>

Note: All the style-related numbers are 0-based indexes (i.e., 0 maps to the 1st element in the style.xml).

The @s maps to the /styleSheet/cellXfs/xf[10], <xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1" applyAlignment="1">, and Table 4 provides details about the available style information in the styles.xml. The @applyFont and @applyAlignment attributes store boolean values indicating whether the corresponding formatting specified for this xf should be applied or not.

Table IV

Cell format attributes

Attribute Sample XPath Description
Background colour /styleSheet/fills/fill[4]/patternFill/fgColor/@rgb cellXfs/xf/ gives the fillid, which maps to the fill and the corresponding fillpattern.
Foreground colour /styleSheet/fonts/font[3]/colour/@rgb cellXfs/xf/ gives the fontId which maps to the font element /styleSheet/fonts/font.
Border type and Border pattern

Border type : /styleSheet/borders/border[3]/top/@style

Border pattern: /styleSheet/fills/fill[4]/patternFill/@patternType

cellXfs/xf/ gives the fillId="0" borderId="0" which maps to the required information.
Number format e.g: 80.50% /styleSheet/numFmts/numFmt[@numFmtId eq 171]/@formatCode numFmtId maps to the formatCode.
Bold, Italic, Underline e.g: "bold_italic" /styleSheet/fonts/font[8]/b /styleSheet/fonts/font[8]/i /styleSheet/fonts/font[8]/u /styleSheet/fonts/font[20]/vertAlign/@val /styleSheet/fonts/font[1]/strike fontId maps to the corresponding font element, vertAlign/@val can be either superscript or subscript, strike element relates to the Strikethrough style.

The mapping between the XLSX cell-format attributes, number format and the style attributes of the XLSX XML are listed in Table 4 with sample input elements and XPaths. The CALS table format does not support styling attributes; therefore, the information related to CXML style overrides (tps:c type="cell-format" and the tps:style) are parsed from XLSX to CALS during the xlsx-to-cals processing mode by using processing instructions. The information available with these processing instructions is converted to CXML style overrides and the number format is done during the stage where it processes the cell value-related information.

Here is an example of a table cell (<entry>) with the style and cell-format information which are represented using processing instructions:

                        
     <entry type="normal" namest="4" nameend="4" align="left" valign="bottom">
         <?style bold?>
             Heading 1
         <?cell-format border-bottom:thick; background-color:FFFFC7CE; foreground-color:FF9C0006;?>
     </entry>

The corresponding XSLT template which creates the CXML style overrides is implemented as follows;

                        
         <xsl:template match="processing-instruction('cell-format')">
             <tps:c type="cell-format">
                 <xsl:value-of select="."/>
             </tps:c>
         </xsl:template>
Alignment
                           
     <xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1" applyAlignment="1">
         <alignment horizontal="right"/>
         <alignment vertical="top"/>
         <alignment textRotation="90"/>
     </xf>

The @align and @valign attributes of the CALS table are filled from the above cellXfs/xf/alignment/@horizontal and cellXfs/xf/alignment/@vertical attributes and in the absence of these elements it considers left and bottom respectively as the default values.

The cell text rotation is defined using the @textRotation value of the above <alignment> element.

Font style

In the above xf element, fontId="2" maps to the following font element /styleSheet/fonts/font[3].

                           
         <font>
             <b/>
             <i/>
             <u/>
             <sz val="12"/>
             <color rgb="FFFF0000"/>
             <vertAlign val="superscript"/>
             <name val="Calibri"/>
             <family val="2"/>
             <scheme val="minor"/>
         </font>

The font element can contain a collection of style information such as bold (b), italic (i) and underline (u). Furthermore, it provides the foreground colour and other information related to the font style. Superscript and subscript formatting details are defined using the vertAlign/@val. If a text is styled with strikethrough, this element will contain an element called strike.

Background colour

Background colour is defined by the fill element of the styles.xml, fillId="3" maps to the following fill[4] element and /styleSheet/fills[1]/fill[4]/patternFill[1]/fgColor[1]/@rgb gives the background colour information.

                           
         <fill>
             <patternFill patternType="solid">
                 <fgColor rgb="FFFFFF00"/>
                 <bgColor indexed="64"/>
             </patternFill>
         </fill>

Both of these fgColor and bgColor relate to the background colour of a cell. If a pattern is applied to a cell then bgColor defines the background of the pattern and fgColor defines the foreground of the pattern. For instance if the cell pattern contains 'dots', fgColor is the colour of those dots. However, if it is a solid pattern the fgColor colour can be considered as the background colour and it does not have anything to do with the bgColor as demonstrated in the above example.

Border style

Border style is defined by the border element of the styles.xml, borderId="0" maps to the corresponding border element /styleSheet/borders/border[1]. Border pattern is defined by the /styleSheet/fills[1]/fill[4]/patternFill[1]/@patternType value.

In the example illustrated in Figure 3, the A4 cell has borders in it. The border element defines the left, right, top, bottom and the diagonal border styles and the colour of it.

Number format

In the xf element of the styles.xml it contains the numFmtId attribute which defines the number format of a particular cell value. For instance, numFmtId="171" relates to the <numFmt numFmtId="171" formatCode="0.000%"/> element. The format-number XSLT function can be used with this formatCode to format the cell-value (format-number($value, $format-code)).

Open xml documentation lists a set of format codes, the corresponding numFmt id is not explicitly save in the file, so when the styles.xml defines, a numFmtId and when the corresponding numFmt element is not present it needs to access the predefined format codes.

Column width

It is important to calculate the column width based on the cols element details in the input XML; it specifies the column width as the colwidth attribute of the colspec element in the output CXML.

The XLSX col element is not a mandatory element in the input sheet.xml. The following is an sample input of a table which has 11 columns, This means that from column number 3 to 6 the width is 10.8125 and when it does not mention the column width for 8 to 10 columns, the system considers the default column width which is 11.

                        
     <sheetFormatPr defaultColWidth="11" defaultRowHeight="30" customHeight="1"/>
     <cols>
         <col min="1" max="1" width="28.6875" customWidth="1"/>
         <col min="2" max="2" width="12.6875" customWidth="1"/>
         <col min="3" max="6" width="10.8125" customWidth="1"/>
         <col min="7" max="7" width="15.3125" customWidth="1"/>
         <col min="11" max="11" width="10.8125" customWidth="1"/>
     </cols>

The custom columns are defined as above, and all other columns use the default column width which is defined by the @defaultColWidth or the @baseColWidth in the sheet.xml.

The calculation of the colwidth = (column width/ total width) *100 % (eg: first column :- 28.6875/143.75 * 100% = 19.96). This calculation is done using an XSLT function which applies the xsl:iterate feature of the XSLT 3.0. The following function can be reused to calculate the total column width which needs to be considered in a particular table dimension.

                        
     <xsl:function name="tps:get-total-col-width" as="xs:decimal?">
         <xsl:param name="cols" as="element(excel:col)*"/>
         <xsl:param name="default-col-width" as="xs:decimal"/>
 
         <xsl:iterate select="$cols">
             <xsl:param name="total" as="xs:decimal" select="0"/>
             <xsl:param name="next-min" as="xs:decimal" select="0"/>
         
             <xsl:on-completion>
                 <xsl:sequence select="$total"/>
             </xsl:on-completion>
                         
             <!--Calculate the total width for a particular col element-->
             <xsl:variable name="cur-width" select="xs:decimal(@width * (@max - @min + 1))" as="xs:decimal"/>
                         
             <!--When the custom width is not available consider the default column width-->
             <xsl:variable name="default-col-width-total"
                              select="xs:decimal(if($next-min != @min and $next-min != 0)
                                                 then  $default-col-width * (@min - $next-min) else 0)" as="xs:decimal"/>
             <xsl:variable name="new-total" as="xs:decimal">
                 <xsl:sequence select="$total + $default-col-width-total + $cur-width"/>
             </xsl:variable>
         
             <xsl:next-iteration>
                 <xsl:with-param name="total" select="$new-total"/>
                 <xsl:with-param name="next-min" select="xs:decimal(@max + 1)"/>
             </xsl:next-iteration>
         </xsl:iterate>
     </xsl:function>

The sample output for these columns:

                        
         <tps:colspec colname="1" colwidth="19.96%"/>
         <tps:colspec colname="2" colwidth="8.83%"/>
         <tps:colspec colname="3" colwidth="7.52%"/>
         <tps:colspec colname="4" colwidth="7.52%"/>
         <tps:colspec colname="5" colwidth="7.52%"/>
         <tps:colspec colname="6" colwidth="7.52%"/>
         <tps:colspec colname="7" colwidth="10.65%"/>
         <tps:colspec colname="8" colwidth="7.65%"/>
         <tps:colspec colname="9" colwidth="7.65%"/>
         <tps:colspec colname="10" colwidth="7.65%"/>
         <tps:colspec colname="11" colwidth="7.52%"/>

When some sheet.xml files do not contain cols elements, the total column width is calculated as default column width * number of columns.

Custom number formats

The capability of adding custom number formats provides greater flexibility for an Excel user. The following example can be considered to explain this further, ';' is used to separate the format string into different sections.

The following can be considered as a sample customer number format string _("$"* #,##0.00_);_("$"* \(#,##0.00\);_("$"* "-"??_);_(@_)

This number format can include;

  • One section: this will be applicable for all the values.

  • Two sections: first section applies for positive numbers or zeros, and the second section applies negative numbers.

  • Four sections: first section applies to positive numbers, the second section applies to negative numbers, the third section applies to zeros, and the fourth section is for text.

Therefore, the above example has four sections and according to the cell value it considers the corresponding section. The following code segment demonstrates this implementation using XSLT.

                        
     <xsl:variable name="num-formats" select="tokenize(tps:get-number-format($style-index), ';')" as="xs:string*"/>
     <xsl:variable name="section-count" select="count($num-formats)" as="xs:integer"/>

     <xsl:when test="$section-count eq 4 and (. gt 0)">
         <xsl:value-of select="format-number(., $num-formats[1])"/>
     </xsl:when>
     <xsl:when test="$section-count eq 4 and (. < 0)">
         <xsl:value-of select="format-number(., $num-formats[2])"/>
     </xsl:when>
     <xsl:when test="$section-count eq 4 and (. eq 0)">
         <xsl:value-of select="format-number(., $num-formats[3])"/>
     </xsl:when>
     <xsl:when test="$section-count eq 2 and (. >= 0)">
         <xsl:value-of select="format-number(., $num-formats[1])"/>
     </xsl:when>
     <xsl:when test="$section-count eq 2 and (. < 0)">
         <xsl:value-of select="format-number(., $num-formats[2])"/>
     </xsl:when>

Excel provides support for basic colours in custom number format. The following eight colours are specified using the name: [black][white][red][green][blue][yellow][magenta][cyan]. Other colours are considered as index numbers in a custom number format, but this application does not consider those.

For instance in the "$"#,##0.00_);[Red]\("$"#,##0.00\), the second section specifies ‘[Red]\("$"#,##0.00\), this indicates if the cell contains a negative number it should be formatted as ($0.00) with a foreground-color=#FF0000 (the hex value corresponding to 'Red').

Date and time formatting

In the Excel input XML, the dates are indicated as a serial number, it considers 1899-12-31 as the base date and adds the specified serial number to get the date. Further, it uses decimal numbers to specify the time, for instance, 0.5 indicates 06:00. In order to process date and time information, this solution provides two functions as explained in Table 2.

Conditional formatting

This section explains how conditional formatting is applied using the XSLT transformation. In the following example, it applies conditional formatting where it compares the value of B22 according to the operator (greaterThan) with the formula value; this formula can be a static value as well. If the value matches the given conditions it will read the dxfs[0] according to the dxfId then apply the given foreground-color and the background-color to the B22 cell.

                        
     <conditionalFormatting sqref="B22">
         <cfRule type="cellIs" dxfId="0" priority="1" operator="greaterThan">
             <formula>$E$12</formula>
         </cfRule>
     </conditionalFormatting>

This dxfId refers to the dxfs element of the styles.xml.

                        
         <dxfs count="1">
             <dxf>
                 <font>
                     <color rgb="FF9C0006"/>
                 </font>
                 <fill>
                     <patternFill>
                         <bgColor rgb="FFFFC7CE"/>
                     </patternFill>
                 </fill>
             </dxf>
         </dxfs>

XLSX with charts and graphs

In order to process charts and graphs, the XLSX chart element is first converted into an SVG file which is then inserted using the @ref in the tps:image element of the CXML.

Figure 4: A flowchart to map the chart.xml

Figure 4 illustrates how the solution navigates through the available files to identify the corresponding chart.xml to be processed. When charts and graphs are inserted to the sheet.xml, the 'drawing' element can be found within the elements. It directs the _rels/sheet1.xml.rels to link with the corresponding drawing.xml. Drawing.xml contains the graphic element with the relevant relationship ID to link to the chart.xml. This solution provides separate functions to read each of these files when needed.

                  
     <a:graphic>
         <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/chart">
             <c:chart xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart"
                         xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId2"/>
         </a:graphicData>
     </a:graphic>

This research focuses on the chart.xml where it reads the available information to understand which conversion needs to be actioned—pie chart to SVG, line chart to SVG, or bar chart to SVG. Further, it is required to read the theme.xml to get the used colour information.

Table 5 demonstrates the common elements in a XLSX which are related to charts.

Table V

Description of the important chart element

Xpath Description
/c:chartSpace/c:chart/c:title/c:tx/c:rich/a:p/a:r/a:t Chart title
/c:chartSpace/c:chart/c:plotArea/c:pieChart Pie chart
/c:chartSpace/c:chart/c:plotArea/c:barChart Bar chart
/c:chartSpace/c:chart/c:plotArea/c:lineChart Line chart
/c:barChart/c:ser/c:cat/c:strRef/c:strCache/c:pt Categorical data
/c:barChart/c:ser/c:val/c:numRef/c:numCache/c:pt/c:v Numeric values
/c:barChart/c:ser/c:val/c:numRef/c:numCache/c:ptCount/@val Point count
/c:barChart/c:ser/c:spPr/a:solidFill/a:schemeClr/@val/c:pieChart/c:ser/c:dPt/c:spPr/a:solidFill/a:schemeClr/@val Colour value

The following is a sample solidFill element and it is used to read the colour information.

                  
         <a:solidFill>
             <a:schemeClr val="accent1"/>
         </a:solidFill>

The val="accent1" corresponds to the /a:accent1/a:srgbClr/@val in the theme.xml. This system provides access to the theme.xml and to the specific color details.

SVG (Scalable Vector Graphics)

Table 6 explains the elements that are available in an output .svg file.

Table VI

SVG element descriptions

Element Description
<defs> This is used to store the graphical element to be used in the following segments.
<path>
M x, y M x, y
A

Elliptical arc: A rx ry rotate large_arc_flag sweep_flag x y

This command will draw an elliptical arc from the current point to x, y.

  • rx, ry - radius (major/minor axis)

  • rotate - in degrees clockwise

  • large_arg_flag and sweep_flag - control which section of the ellipse to use (1, 0)

L x, y Line to x, y
<rotate> Rotate (<a> [<x> <y>]) - rotation by <a> degrees about a given point. If [<x> <y>] are not supplied, the rotation is about the origin of the current user coordinate system. y
<clipPath> A clipping path specifies the region to which paint should be applied. Then the parts of the drawing that lie outside of the region bounded by the clipping path are not drawn.
<circle cx="50" cy="50" r="50"/> This command will draw circles based on a centre point and a radius.
<use> This element takes nodes within the SVG document and duplicates them somewhere else.
<rect> This draws rectangles, defined by their position, width, and height.
<text> This element helps to draw graphics elements consisting of text
<line> This creates a line connecting two points

To investigate the possibility of supporting XLSX graphs and charts using XSLT, the solution has considered an input file with a simple pie chart, a line chart, and a bar chart.

XLSX with a pie chart

The XSLT transformation that creates the pie chart uses elliptical arcs and fills them with specific colours. The feature available with clipPath in the SVG is used here so only the information within the specified region is visualised to the user. The following demonstrates the sample input (Figure 5) chart used to implement the pie chart functionality and the output (Figure 6 - .svg file) it provides from the proposed XSLT transformation.

Figure 5: Input XLSX pie chart

Figure 6: Sample Output SVG of a pie chart

XLSX with a line chart

The information available in Table 5 can be used to map the input chart.xml (Figure 7) into the output SVG (Figure 8) which represents a line chart.

Figure 7: Sample XLSX line chart

Figure 8: Sample output SVG of a line chart

XLSX with a bar chart

Figure 9: Input XLSX bar chart

Figure 10: Sample output SVG of a bar chart

The following source code shows that the conversion of the bar chart and the line chart are similar, except in the line chart it uses a path and in the bar chart it uses a rect.

                     
     <xsl:template name="chart">
         <xsl:param name="is-line-chart" as="xs:boolean"/>
      
         <xsl:variable name="max-val" select="xs:decimal(max(c:ser/c:val/c:numRef/c:numCache/c:pt/c:v))" as="xs:decimal"/>
         <xsl:variable name="data-count" select="count(c:ser/c:val/c:numRef/c:numCache/c:pt)" as="xs:integer"/>
         <xsl:variable name="chart-height" select="$max-val + 50" as="xs:decimal"/>
      
         <svg>
             <g>
                 <--Chart title-->
                 <text x="{$data-count*50}" y="10" font-size="14px" text-anchor="end">
                     <xsl:value-of select="ancestor::c:chart/c:title/c:tx/c:rich/a:p/a:r/a:t"/>
                 </text>
      
                 <--Chart axis-->
                 <line id="axis-y" x1="40" y1="{$chart-height}" x2="40" y2="20" style="fill:none;stroke:rgb(0,0,0);stroke-width:2"/>
                 <line id="axis-x" x1="40" y1="{$chart-height}" x2="{$data-count*100}" y2="{$chart-height}"  style="fill:none;stroke:rgb(0,0,0);stroke-width:2"/>
      
                 <xsl:for-each select="c:ser/c:val/c:numRef/c:numCache/c:pt">
                     <xsl:variable name="pt" select="." as="element(c:pt)"/>
                     <xsl:variable name="color" select="'#' || tps:get-theme-xml()/a:theme/a:themeElements/a:clrScheme/a:*[local-name() eq tps:get-theme-color(current()/(ancestor::c:lineChart[$is-line-chart], ancestor::c:barChart)[1], $pt/@idx)]/a:srgbClr/@val" as="xs:string"/>
      
                     <--Y axis labels-->
                     <text x="30" y="{$chart-height - c:v}" font-size="12px" text-anchor="end">
                         <xsl:value-of select="c:v"/>
                     </text>
      
                     <xsl:choose>
                         <--Line chart-->
                         <xsl:when test="$is-line-chart">
                             <path d="M {50 + (position() - 1)*100},{$chart-height - c:v}
                                      L {50 + (position())*100}, {$chart-height - following-sibling::c:pt/c:v} Z" stroke="{$color}"/>
      
                         </xsl:when>
                              
                         <--Bar chart-->
                         <xsl:otherwise>
                             <rect x="{40 + (position() - 1)*100}" y="{$chart-height - c:v}" width="50" height="{c:v}" style="fill:{$color};stroke:rgb(0,0,0);stroke-width:0"/>
                         </xsl:otherwise>
                     </xsl:choose>
                 </xsl:for-each>
      
                 <--X axis labels-->
                 <xsl:for-each select="c:ser/c:cat/c:strRef/c:strCache/c:pt">
                     <text x="{100 + (position()-1)*100}" y="{$chart-height + 20}" font-size="12px" text-anchor="end">
                         <xsl:value-of select="c:v"/>
                     </text>
                 </xsl:for-each>
             </g>
         </svg>
     </xsl:template>

Application of the converter

The use case for Typefi is embedded Excel files within Word documents, but can also be used for standalone Excel documents.

This application considers some input parameters (Table 7.), and the following user interface prompts the user to fill them.

Figure 11: Dialog box of Typefi Writer to get input parameters

The following element will be added into the docx file with the user-provided information. <w:instrText>Sheet="Historical Performance" CellRange="C6:J22" FooterRows="0" HeaderRows="2" PreferredWidth="100" PreserveCellStyles="False" PreserveLocalOverrides="False"</w:instrText>

Table VII

Parameter descriptions

Category Sample data Description
Sheet Historical Performance Sheet name
CellRange C6:J22 Docx file contains the cell range of the table, which cells were included in the docx file as a table.
FooterRows 0 Number of footer rows
HeaderRows 2 Number of header rows
PreferredWidth 100 This relates to the amount that the table can take up when it is positioned in InDesign.
PreserveCellStyles False Using the Excel cell styles, this maps to the @type attribute on the <entry> in CXML which in turn maps to Cell styles in InDesign.
PreserveLocalOverrides False Add cell-format values to the entry/@style attribute.

The sheet name, cell range, footer/header rows and the referred width values are passed to the XSLT as parameters and will be used during the transformation.

                  
     <xsl:param name="sheet"           as="xs:string"  select="'HistoricalMonthly Performance'"/>
     <xsl:param name="cell_range"      as="xs:string"  select="'C6:J22'"/>
     <xsl:param name="footer_rows"     as="xs:integer" select="0"/>
     <xsl:param name="header_rows"     as="xs:integer" select="2"/>
     <xsl:param name="preferred_width" as="xs:integer" select="100"/>

The proposed solution will convert the table information to a tps:table element, and charts and image references to tps:image elements in the CXML. Images and charts are transformed into separate image files and the tps:image element will reference the image file.

Discussion

The previous approach, with the Apache POI, read the XLSX file using the streaming, as it was memory intensive when handling large or memory-intensive files. Ultimately, limitations with this approach prevented us from implementing specific features and maintaining the code with bug fixes. The existing application supports Word documents that are processed using XSLT features. Therefore, our goal is to use a similar approach in the Excel converters.

Meanwhile, the XSLT approach provides the flexibility to read the required cell range of a particular sheet.xml without considering the entire XLSX file. In other words, this custom XSLT implementation will read the XLSX files at the lowest level possible, giving the maximum control and flexibility possible to help bypass the restrictions of the Apache POI API.

Compared to the Java solution, an XSLT developer would find it easier to use XSLT to do these transformations, adding more flexibility by mapping the information into the output XML schema. Finally, XSLT is the globally accepted methodology for XML transformations. Therefore Java code is less suitable than XSLT for document format transforms and conversion to XML.

The XProc pipeline was a new research area that delivered a unique solution to read an Excel file and pass it as a collection to the XSLT without writing into the disk. An inexperienced user of XML Calabash has to spend a considerable amount of time grasping the concepts of XProc language with the available online resources. Even though there are several other alternatives to achieve the same objective, we believe this was an ideal problem to demonstrate the application of XProc in such a domain. The listed resources related to XProc provide guidelines to troubleshoot various errors which occurred in several instances during the development of the final XProc pipeline. Several processors implement XProc 3.0 features, but XML Calabash was compatible with the ANT script so this solution runs the pipeline using it.

This application provides a general approach by delivering the flexibility of converting the XLSX tables into any other XML schema since this research provides a solution around CALS tables. Moreover, the XLSX chart to SVG conversion stores the output chart.svg files in a separate chart folder so it can be easily reused with any other transformation or as a standalone application. Currently, the proposed solution converts the pie chart, line chart and bar chart to an SVG output. It can be improved to support any other chart formats and with complex data; however, as a proof of concept, this solution considers only these types.

Further, the solution provides a separate set of functions to read each file, such as the styles.xml, theme.xml, sharedStrings.xml, and relationship files. For instance, tps:get-total-col-width function is implemented to calculate the total column width in order to calculate the width proportion. In this case, it has used the XSLT 3.0 xsl:iterate feature, and it showcases a great application of that.

This paper presents the possibility of transforming an XLSX file into a different XML schema; however, this paper also identifies several limitations and future improvements. For example, the current solution does not consider shrinkToFit kind of alignment feature and formulas, but this will be implemented as a future improvement. Also, the current implementation considers the common input XLSX file structure to read the files in corresponding order, but an arbitrary file structure (considering relationship files) will be considered in future versions.

Conclusion

This research covers three research areas— XProc pipeline, XLSX table to CALS table transformation, and XLSX chart to SVG conversion. The solution uses XProc, XSLT, XPath and ANT scripts and provides a CXML file as the final output. This paper explains the input file element information and the required information to be fetched to fill the required XML output. CALS tables are derived from the information available in the sheet.xml and other related files. It further demonstrates how the solution converts the chart.xml into an SVG file by explaining the available features of the SVG schema.

The Excel file structure becomes too complex to be processed—with the size and available information of the input file. However, based on the customer files, we decided which features were required to be implemented. Ultimately, the solution will increase the scope in the future.

References

[1] Microsoft Corporation. February 15, 2022. Office Implementation Information for ISO/IEC 29500 Standards Support. https://interoperability.blob.core.windows.net/files/MS-OI29500/%5bMS-OI29500%5d.pdf

[2] Spreadsheet styles. http://officeopenxml.com/SSstyles.php

[3] Microsoft Office, API Reference. https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.spreadsheet.sheetformatproperties?view=openxml-2.8.1

[4] "Retrieving data from Excel XML using XSLT". https://stackoverflow.com/questions/33422652/retrieving-data-from-excel-xml-using-xslt

[5] "How to turn an XML file into SVG using XSL?" https://stackoverflow.com/questions/8056671/how-to-turn-an-xml-file-into-svg-using-xsl

[6] XSLT to generate SVG tutorial. 2016. http://edutechwiki.unige.ch/en/XSLT_to_generate_SVG_tutorial

[7] Florent Georges. 2008. Simple SVG chart generation with XSLT. http://fgeorges.blogspot.com/2008/04/simple-svg-chart-generation-with-xslt.html

[8] Max Froumentin, Vincent Hardy, W3C. Using XSLT and SVG together: a survey of case studies. https://www.w3.org/People/maxf/papers/2002-07-SVGOpen/svgopen.html

[9] SVG attribute reference. https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute

[10] A pure XSLT/SVG pie chart. https://www.sleepingdog.org.uk/svg/chart/pie/

[11] SvgCharts4XSL. https://franklinefrancis.github.io/SvgCharts4Xsl/

[12] Adrian McMenamin. 2011. Using XSLT to manipulate an SVG file. https://cartesianproduct.wordpress.com/2011/07/16/using-xslt-to-manipulate-an-svg-file/

[13] SVG: Scalable Vector Graphics. https://developer.mozilla.org/en-US/docs/Web/SVG

[14] Ben Hauser. Content XML 3.0 and earlier. https://help.typefi.com/hc/en-us/articles/228240428-Content-XML-3-0-and-earlier

[15] Laura Powers. 2021. "Local style overrides". https://help.typefi.com/hc/en-us/articles/360001491415#h_3be5e014-f255-49b8-8454-6c7681861b1a

[16] Erik Bruchez. "Unzipping an Excel 2007 .xlsx file and extracting data in XPL/XSLT". https://gist.github.com/ebruchez/1245692

[17] Dave Bruns. "Excel customer number formats". https://exceljet.net/custom-number-formats

[18] Appendix A: Full XML schema. https://docs.microsoft.com/en-us/openspecs/office_standards/ms-xlsx/6624db33-496c-47f7-a562-a54cb01b133f

[19] Apache POI - HSSF and XSSF Limitations. https://poi.apache.org/components/spreadsheet/limitations.html

[20] Harvey Bingham. OASIS Technical Resolution TR 9503:1995, Exchange Table Model Document Type Definition. https://www.oasis-open.org/specs/a503.htm

[21] Ari Nordström. 2020. "Pipelined XSLT Transformations". Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25. https://www.balisage.net/Proceedings/vol25/print/Nordstrom01/BalisageVol25-Nordstrom01.html. doi:https://doi.org/10.4242/BalisageVol25.Nordstrom01

[22] David Maus. 2018. XProc Step by Step: Implementing a DOCX to TEI step. https://dmaus.name/blog/2018.14/index.html

[23] XProc 3.0: An XML Pipeline Language, Community Group Report 1 July 2022. https://spec.xproc.org/master/head/xproc/

[24] Norman Walsh, Achim Berndzen. 2019. "XProc 3.0". Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23. https://www.balisage.net/Proceedings/vol23/print/Walsh02/BalisageVol23-Walsh02.html. doi:https://doi.org/10.4242/BalisageVol23.Walsh02

[25] Erik Siegel. 2020. XProc 3.0 - Strategies for merging documents. https://www.xml.com/articles/2020/11/16/xproc-30-strategies-merging-documents/#sect-pass-as-collection

[26] "How do I convert Excel Serial Date Numbers in an XML file to mm/dd/yyyy for SQL Server in an SSIS Package?" https://stackoverflow.com/questions/64018372/how-do-i-convert-excel-serial-date-numbers-in-an-xml-file-to-mm-dd-yyyy-for-sql

[27] "How to change date format in Excel and create custom formatting". https://www.ablebits.com/office-addins-blog/2015/03/11/change-date-format-excel/#excel-date-format

Gayanthika Udeshani

Typefi Systems Pty Ltd

Gayanthika is an Associate Architect at Typefi, where she leads the XSLT team and provides support for other software solutions as well.

Gayanthika holds a Masters of Science (MSc) in Software Architecture from the University of Moratuwa, Sri Lanka. She has over 11 years of experience as a developer and she joined Typefi in 2018.