Introduction
Consider you are handed hundreds or thousands of XML files in order to process them. You may be required to:
-
Convert from original SGML or XML to another XML format
-
Convert to HTML
-
Create PDF files
-
Create new datasets or products from the original data
-
Populate database(s) with the data
-
etc.
Performing data analysis on XML files can be a daunting task. As a data analyst you need to pull every trick in your in your bag of tricks out. One of the tricks that I almost always use is to create Excel spreadsheets in order to accomplish in-depth analysis of the data and tagging in an easy and understandable way. The spreadsheets provide quick snapshots of the whole dataset or a slice of the dataset.
Depending on what is uncovered by looking at different views of the data, new spreadsheets can be created to help delve deeper into the data.
Spreadsheets are also a good way to present data analysis issues for subject matter experts (SME). SMEs who may or may not be XML savvy and spreadsheets is an easily consumable format to convey information. Almost everyone has dealt with spreadsheets.
Anomalies are easily uncovered with this approach. It is a good way to see how prevasive a data problem may be.
For the purposes of this presentation I am going to describe how to create the spreadsheet using Excel 2003 XML format. You can also create the same spreadsheets using Excel 2007 and beyond. I wrote a paper entitled Convert Excel 2007 XML to OASIS code lists that describes how to create a 2007 format spreadsheet if you are interested. Excel 2007 XML is a more complicated approach to creating Excel. It involves multiple XML files and the necessity to create a zip file in order to Excel to consume the database. An Excel .xslt file is actually a zip archive file. If you are curious you can unzip a file and look through the file structure and look at all the XML files it contains.
The Balisage committee has been kind enough to allow me to use papers from the current and past Balisage conferences for doing a simple analysis for this paper.
Initial Analysis Spreadsheet
The first spreadsheet that I always run on a new dataset is to create a 3 worksheet (tab) spreadsheet of all elements and attributes used in the dataset. The rationale for creating this spreadsheet is to accurately determine what elements and attributes and attribute values are in the entire or subset of the data. I create the following worksheets in the spreadsheet. Worksheets are the tabs in spreadsheet.
-
List of unique elements and unique attributes for each element
-
List of all unique attributes
-
List of all unique attributes and attribute values.
This spreadsheet provides information the data that you need to look further into. For example, let's say you are working with files from Docbook or DITA. We know that both standards have hundreds of elements and attributes and attribute values. Most datasets only use a subset of the entire standard. For instance if your dataset doesn't use the element <procedure> then you can ignore doing any analysis or downstream processes on this element.
Also, both DocBook and DITA use the role attribute to further define many elements. The role attribute is defined in the schema as text which means that there can be many role values and some may be misspelled. Unless the authoring system restricts the values of the this and other attributes it would be impossible to accurately accomplish downstream processes based on each value.
The Balisage paper data is about the cleanest data I have had the honor to look at. However, the element/attribute analysis did provide a few interesting facts. There are 2 elements in the data that are currently not in the latest version (1.2) of the Balisage schema (lineage and para.level). This probably means that these elements have been eliminated from the schema but we still need to know about these elements in order to process them.
The second worksheet is used to get a list of all attribute name. This is useful to determine which attributes should be looked at further for their unique values. Obviously you don't want to get a list of attribute values for attributes such as identifiers, table and figure dimensions because that would be to much data and not useful to the analysis. Attributes such as 'role', 'class', 'type', etc. will provide important insight into the data.
The Balisage papers contain a total of 38 unique attributes. I chose the following attributes to look further at.
Table I
Attribute | Attribute Value |
align | center left right |
role | author bital bold ital rom under |
numeration | arabic loweralpha lowerroman |
spacing | compact |
There isn't anything that jumps out with these values except that the 'upperalpha' and 'upperroman' values are not used for the 'numeration' attribute, 'justify' and 'char' are not used for 'align' attribute and finally, the 'spacing' attribute is no longer in the schema. This means that we don't have to take these values into account with the current dataset.
Developing the XSLT for Analysis
Creating an XSLT for data analysis is really relatively easy. Looking at the code it might seem a little 'daunting' at first but the most difficult part is accurately iterating through the data to extract the information that is helpful.
Creating a Template
If this is the first time that you will be creating an Excel spreadsheet using XSLT, the easiest way to start is to create a template in the Excel application. The template should include all the columns that you will need and a single row of dummy data. You can also create multiple worksheets (tabs) in your template. Below is sample of what the template looks like.
If you want to be able to filter the data or freeze the columns set this capability inside the template. Once the template is completed save the template as Excel 2003 XML.
That is all there is to it. Now you are ready to create your XSLT around the template you just created. Open the template in your favorite XSLT editor.
Excel 2003 XML Structure
There are some modifications to the template that will be required because Excel. First, let's talk about the different sections in the XML. Below is XML file for the template created by Excel. For the most part you will only need to care about the second row because this is where the data will be populated.
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40"> <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"> <Author>Betty Harvey</Author> <LastAuthor>Betty Harvey</LastAuthor> <Created>2018-06-12T20:30:38Z</Created> <Version>16.00</Version> </DocumentProperties> <OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office"> <AllowPNG/> </OfficeDocumentSettings> <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel"> <WindowHeight>6648</WindowHeight> <WindowWidth>19200</WindowWidth> <WindowTopX>32767</WindowTopX> <WindowTopY>32767</WindowTopY> <ProtectStructure>False</ProtectStructure> <ProtectWindows>False</ProtectWindows> </ExcelWorkbook> <Styles> <Style ss:ID="Default" ss:Name="Normal"> <Alignment ss:Vertical="Bottom"/> <Borders/> <Font ss:FontName="Calibri" x:Family="Swiss" ss:Size="11" ss:Color="#000000"/> <Interior/> <NumberFormat/> <Protection/> </Style> <Style ss:ID="s68"> <Font ss:FontName="Calibri" x:Family="Swiss" ss:Size="11" ss:Color="#FFFFFF" ss:Bold="1"/> <Interior ss:Color="#305496" ss:Pattern="Solid"/> </Style> </Styles> <Worksheet ss:Name="List of Attributes"> <Names> <NamedRange ss:Name="_FilterDatabase" ss:RefersTo="=Sheet1!R1C1:R1C2" ss:Hidden="1"/> </Names> <Table ss:ExpandedColumnCount="2" ss:ExpandedRowCount="2" x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="50.4" ss:DefaultRowHeight="14.4"> <Column ss:AutoFitWidth="0" ss:Width="87.9"/> <Column ss:AutoFitWidth="0" ss:Width="115.80000000000001"/> <Row> <Cell ss:StyleID="s68"><Data ss:Type="String">List of Elements</Data><NamedCell ss:Name="_FilterDatabase"/></Cell> <Cell ss:StyleID="s68"><Data ss:Type="String">List of Attributes</Data><NamedCell ss:Name="_FilterDatabase"/></Cell> </Row> <Row> <Cell><Data ss:Type="String">element1</Data></Cell> <Cell><Data ss:Type="String">attributes for element1</Data></Cell> </Row> </Table> <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel"> <PageSetup> <Header x:Margin="0.3"/> <Footer x:Margin="0.3"/> <PageMargins x:Bottom="0.75" x:Left="0.7" x:Right="0.7" x:Top="0.75"/> </PageSetup> <Selected/> <FreezePanes/> <FrozenNoSplit/> <SplitHorizontal>1</SplitHorizontal> <TopRowBottomPane>1</TopRowBottomPane> <SplitVertical>1</SplitVertical> <LeftColumnRightPane>1</LeftColumnRightPane> <ActivePane>0</ActivePane> <Panes> <Pane> <Number>3</Number> </Pane> <Pane> <Number>1</Number> </Pane> <Pane> <Number>2</Number> </Pane> <Pane> <Number>0</Number> </Pane> </Panes> <ProtectObjects>False</ProtectObjects> <ProtectScenarios>False</ProtectScenarios> </WorksheetOptions> <AutoFilter x:Range="R1C1:R1C2" xmlns="urn:schemas-microsoft-com:office:excel"> </AutoFilter> </Worksheet> </Workbook>
Style Section
You shouldn't need to modify the style section. However, if you decide that you want to add columns to your spreadsheet and you copy and paste the new cell parameters from a temporary template, the style identifier may not be the same as the style in your original template. Excel is sneaky about changing style identifiers.
There is always a default style. If the <Cell> element has no style attribute then it uses the default style or no style. A cell that has styling uses the ss:StyleID, "<Cell ss:StyleID="s68">". The value associates with a style in the <Style> section. You can modify the styles in the XML file, if necessary.
Table Section
If you are familiar with either the OASIS-Open (previously CALS) or HTML table models, the table structure of the Excel XML will look pretty familiar to you.
You will need to remove the 'x:FullRows' and 'ss:ExpandedRowCount' attributes in the table. Excel uses these attributes to determine the size of the spreadsheet. Excel uses these to determine the number of rows of the spreadsheet. If the sizes don't match then Excel will throw an error. However, if the attributes are not there Excel calculates the size of the table automatically and Excel will open successfully and place these attributes into the spreadsheet.
I also remove the 'ss:ExpandedColumnCount' and x:FullColumns even though you don't need to if you don't increase the number of columns. I will often add columns to the spreadsheet after I have created the template based on information I get when I run the initial spreadsheet. Sometimes it is difficult to know where your analysis will lead you once you start looking at the data in-depth.
Before Modification
<Table ss:ExpandedColumnCount="2" ss:ExpandedRowCount="2" x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="50.4" ss:DefaultRowHeight="14.4"> </Table>
After Modification
<Table ss:DefaultColumnWidth="50.4" ss:DefaultRowHeight="14.4"> </Table>
If you do get an error when opening the spreadsheet up after you create your XSLT. Microsoft will point you to a log file. The log file doesn't provide a lot of information but should provide some clues about what the error is.
Row Section
The first row in our template is the table header row. You should leave this row 'as is'. There is no need to modify this row. The second row in the template is the row that you should wrap your XSLT around to create your data rows. The <NamedCell> element is used for filtering the data.
Header Row
<Row> <Cell ss:StyleID="s68"> <Data ss:Type="String">List of Elements</Data> <NamedCell ss:Name="_FilterDatabase"/> </Cell> <Cell ss:StyleID="s68"> <Data ss:Type="String">List of Attributes</Data> <NamedCell ss:Name="_FilterDatabase"/></Cell> </Row>
Data Row That Gets Modified
The second row is where your data will be created. Below is the original output from the template. You will want to remove the 'dummy' data from the <Data> element.
<Row> <Cell><Data ss:Type="String">element1</Data></Cell> <Cell><Data ss:Type="String">attributes for element1</Data></Cell> </Row>
At this point we can create the logic for creating the spreadsheet. There are 2 ways to process the data using XSLT. One way of accessing the data is using the 'collection()' function. The other way of accessing the data is by using 'xinclude'. For the example above I used 'xinclude' in order to look at the data as a whole because I want to get unique values. Below is an example of accessing the document is put together using xinclude' file. Later in this paper I will show an example using the collection() function.
<files xmlns:xi="http://www.w3.org/2001/XInclude"> <file><filename>vol1/xml/Altheim01/BalisageVol1-Altheim01.xml</filename><xi:include href="C:/projects/Presentations/Balisage2018/Papers/vol1/xml/Altheim01/BalisageVol1-Altheim01.xml"><xi:fallback>File not found</xi:fallback></xi:include></file> <file><filename>vol1/xml/Bauman01/BalisageVol1-Bauman01.xml</filename><xi:include href="C:/projects/Presentations/Balisage2018/Papers/vol1/xml/Bauman01/BalisageVol1-Bauman01.xml"><xi:fallback>File not found</xi:fallback></xi:include></file> ... </files>
Below is the XSLT snippet that creates the Rows of the Excel spreadsheet:
<xsl:for-each select="//*"> <xsl:variable name="el-name" select="name()"/> <Row> <Cell ss:StyleID="s63"> <Data ss:Type="String"> <xsl:choose> <xsl:when test="preceding::*[name() = $el-name]"/> <xsl:otherwise> <xsl:value-of select="$el-name"/> </xsl:otherwise> </xsl:choose> </Data> </Cell> <Cell ss:StyleID="s63"><Data ss:Type="String"> <xsl:value-of select="distinct-values(collection('../Papers?select=*.xml;recurse=yes')/descendant::*[name() = $el-name]/@*/name())"/> </Data> </Cell> </Row> </xsl:for-each>
Below is a snippet from the output from the above example code.
<Row> <Cell> <Data ss:Type="String">abstract</Data> </Cell> <Cell> <Data ss:Type="String"/> </Cell> </Row> <Row> <Cell> <Data ss:Type="String">affiliation</Data> </Cell> <Cell> <Data ss:Type="String"/> </Cell> </Row> <Row> <Cell> <Data ss:Type="String">appendix</Data> </Cell> <Cell> <Data ss:Type="String">xml:id xreflabel </Data> </Cell> </Row> . . .
WorksheetOptions and AutoFilter
The last 2 sections of the template setup information for the worksheet. The WorksheetOptions section contains standard information about the worksheet. You will also find information about freeze panes.
The Autofilter section defines the filtering of the columns. Filtering columns provide the capability of having a drop down menu to pick specific values in a column.
Autofilter Function
<AutoFilter x:Range="R1C1:R1C2" xmlns="urn:schemas-microsoft-com:office:excel">
If you add more columns to your spreadsheet increment the 'R1C2' by how many columns are in your spreadsheet.
More Analysis
Now that you have the building blocks for creating analysis spreadsheets the sky is the limit on what you can gleen from your data. For example, using the Balisage data we might want to get a list of all authors. The logic is pretty simple:
<xsl:for-each select="collection('../Papers?select=*.xml;recurse=yes')/doc:article"> <xsl:for-each select="descendant::doc:author"> <Row> <Cell><Data ss:Type="String"><xsl:value-of select="doc:personname/doc:firstname"/></Data></Cell> <Cell><Data ss:Type="String"><xsl:value-of select="doc:personname/doc:surname"/></Data></Cell> <Cell><Data ss:Type="String"><xsl:value-of select="ancestor::doc:article/doc:title"/></Data></Cell> <Cell><Data ss:Type="String"><xsl:value-of select="substring-after(ancestor::doc:article/descendant::doc:confdates, ', ')"/></Data></Cell> </Row> </xsl:for-each> </xsl:for-each>
Data Cleanup
At some point you may be required to perform data cleanup. This is a common occurrence when dealing with large amounts of data that are coming from different sources. You may want to normalize the data especially where attributes such as role and class are concerned. In order to perform the data cleanup it is important to understand what the data looks like and what you may encounter.
If you are cleaning up the data you may want to create a before and after spreadsheet just to make sure that your modifications didn't have unexpected consequences.
For this paper and demonstration purposes I did a search on the Balisage papers for all superscript and subscript elements. I got the value of each superscript and subscript, as well as before and after text to provide textual context. For the most part the data that came back looked reasonable. However, some values jumped out that may not be valid superscript or subscript content. There were 1744 instances of superscript and subscript elements in the complete dataset. Below are examples of where I thought the superscript or subscript values were problematic.
Having the ability to filter values in the spreadsheet allows you to drill down into problem areas pretty quickly. In a matter of minutes you are able to create a spreadsheet, find any problems and come up with the plan for fixing any problems that are in the data.
In the first entry above, the closing parenthesis is part of the subscript.
In the other entries, it looks like the superscript is not closed after before the para but is closed at the end of the block. It turns out that this data is actually a footnote in a table and in order to get the smaller font the author has used superscript.
Again, the logic to create this spreadsheet is relatively simple. In this example I am using the collection() function to iterate through the data.
<xsl:for-each select="collection('../Papers?select=*.xml;recurse=yes')/doc:article"> <xsl:for-each select="//*[ends-with(name(), 'script')]"> <Row> <Cell> <Data ss:Type="String"><xsl:value-of select="substring-after(base-uri(.), 'file:/C:/projects/Presentations/Balisage2018/Papers/')"/></Data> </Cell> <Cell> <Data ss:Type="String"><xsl:value-of select="name()"/></Data> </Cell> <Cell> <Data ss:Type="String"><xsl:value-of select="."/></Data> </Cell> <Cell> <Data ss:Type="String"><xsl:value-of select="preceding-sibling::text()[1]"/></Data> </Cell> <Cell> <Data ss:Type="String"><xsl:value-of select="following-sibling::text()[1]"/></Data> </Cell> </Row> </xsl:for-each> </xsl:for-each>
Conclusion
I hope that I have been able to convey how useful spreadsheets can be in the analysis of large (or small) data sets. It can provide clarity into the data and help every person who is touching the data to understand useful pieces of information. Along with other analysis processes the spreadsheet can be an amazing tool.
This approach is not limited to XSLT. The same analysis can be done using XQuery and databases. I have created spreadsheets using both technologies.