Balisage Paper: Characterizing ill-formed XML on the web
An analysis of the Amsterdam Corpus by document type
Copyright © 2012 by the author. Used with permission.
Abstract
This paper builds on the work of Steven Grijzenhout to analyze the Amsterdam XML Corpus in more detail. Where Grijzenhout focused primarily on XML validation, this paper focuses on well-formedness; in addition, rather than measuring error frequency by Internet domain or by country of origin, it measures error frequency by document type. The aim is to bring a more XML-centric view to that work and to inform efforts on error recovery in XML parsing.