Human Readability of Data Files

Michael Robert Gryk

Abstract

In this era of big data and FAIR data, data formats must be machine interpretable. XML, among other standards, satisfies this requirement. Yet many standardization initiatives cite human readability as a second, key property in data format development. Examples include the development of STAR in the field of structural biology, W3C PROV for provenance, and even the continuing development of XML. This raises two questions: what is meant by human readability, and can this property be measured for a given data format or compared between competing standards?

The broad topic of readability is considered with attention to the various aspects of written text which either foster or counter readability. Drawing on efforts in the educational system, a metric is proposed for estimating the relative human readability of structured data within an archival file format. Comparison is made between the same data represented in various formats, including JSON and XML, to help judge whether these standards have accomplished their simultaneous goals of machine interpretability and human readability.

Introduction and Motivation

I have been motivated to write about this topic after witnessing several years of conversations between colleagues, mentors and students. It is impossible to count the number of discussions which contained statements such as: "I prefer STAR to JSON because it is easier to read." Interestingly, sometimes a discussion within a different group would prompt the exact opposite preference, suggesting there is a subjective element to readability.

I admit to having my own bias. I favor XML over JSON, STAR, etc. due to many factors, including its structure, elegance, longevity and the tremendous technology stack built around XML. Perhaps due to this bias, I had not considered trying to measure the readability of data within XML versus other formats. My role was simply another participant in water cooler conversations: "I prefer XML, full stop."

Last year's submission to Balisage [Gryk, 2021] provided me the opportunity to roll up my sleeves and really engage with the STAR data structures. That contribution was an effort to deconstruct the structure of STAR from its syntax (serialization). An extra benefit of that exercise was the development of not only an XML serialization of STAR, but also a series of transformations (XSLTs) for round-tripping the XML representation back to STAR, for converting to JSON, and even creating a spreadsheet representation of the data. Generating, testing and trouble-shooting these various serializations provided a very tacit experience regarding which formats are more readable than others.

There are two important qualifications to add at this point which will be discussed again later. One, readability of data is somewhat dependent on the data. The STAR format example in this manuscript is used in the field of structural biology, and the pairing of data with format may produce yet another bias to the conversation. Two, the terms format and serialization have been used somewhat interchangeably up until now but it will be important to distinguish the readability of data as presented in a textual layout versus the ability of a format to be converted to a textual layout which is readable.

Background

Mark Twain is often cited for the quip: Everybody talks about the weather, but nobody does anything about it. A similar thing can be said of the human readability of data formats. Everyone talks about human readability, but hardly anyone defines precisely what they mean by it nor which properties make one format more readable than another.

Of course, in a different era the topic in front of us would not be what is meant by human readable. Documents such as books, magazines and pamphlets are written and printed specifically for human consumption. Fifty years ago, the important topic would be how to make documents machine readable. The acronym MARC emphasized MAchine Readable Cataloguing rather than human readable cataloguing. The FAIR data principles also require that scientific data be machine interoperable. FAIR data repositories support this requirement by providing data in machine interpretable formats.

Nevertheless, human readability is frequently cited alongside machine readability as a desirable property for data formats and as a major design consideration. A few examples should drive home the point.

STAR

The STAR file was introduced in 1991 to support scientific data and this format is still in use in the structural biology communities. Hall identifies the following requirements for the STAR format [Hall, 1991]:

A Universal Archive File should be simple to read and to access. — Bullet point 4 of requirements

The file is easy to read visually, or by machine. — Bullet point 5 of the properties of a STAR file

To facilitate access to a STAR File the names of data items must be defined to be as descriptive as possible.

NMR-STAR

A variant of the STAR format (called NMR-STAR) has been used by the BioMagResBank [Ulrich, et al., 2008] for archiving data related to the field of biomolecular nuclear magnetic resonance spectroscopy since the 1990s. The decision to use STAR as opposed to other available standards was made in part out of concern for human readability.

ASN.1, XML, and SGML formats did not meet the need for easy human readability and efficient manual editing with common text editing software.

— BMRB Internal Whitepaper

PROV

The W3C supports a set of standards for recording and reporting provenance, particularly within the context of the world wide web. Part of this family of standards is a specialized notation for provenance, PROV-N. Once again, human readability was a major design consideration.

A key goal of PROV is the specification of a machine-processable data model for provenance. However, communicating provenance between humans is also important when teaching, illustrating, formalizing, and discussing provenance-related issues. With these two requirements in mind, this document introduces PROV-N, the PROV notation, a syntax designed to write instances of the PROV data model according to the following design principles:

Technology independence. PROV-N provides a simple syntax that can be mapped to several technologies.

Human readability. PROV-N follows a functional syntax style that is meant to be easily human-readable so it can be used in illustrative examples, such as those presented in the PROV documents suite.

Formality. PROV-N is defined through a formal grammar amenable to be used with parser generators.

— https://www.w3.org/TR/2013/REC-prov-n-20130430/

YAML / JSON

YAML is a human-friendly data serialization language for all programming languages.

— https://yaml.org/

YAML is a strict superset of JSON. However, when comparing YAML with JSON, once again, the topic often returns to readability:

In practice, however, the two formats look different, as the YAML specification puts more emphasis on human readability by adding a lot more syntactic sugar and features on top of JSON.

— https://realpython.com/python-yaml/

XML

Even the design criteria for XML refer to human readability.

The design goals for XML are: ...

6. XML documents should be human-legible and reasonably clear.

— https://www.w3.org/TR/REC-xml/

Readability

The topic of this paper is the human readability of data as contained within scientific data formats. It is useful at this point to consider readability more generally, as much work has been done in developing metrics for measuring the readability of common texts. These include the Flesch Reading Ease [Flesch, 1948], the Fry Reading Formula [Fry, 1968], and the Simple Measure of Gobbledygook (SMOG) Formula [McLaughlin, 1969].

The Flesch Reading Ease formula is as follows: Reading Ease = 206.835 – 1.015 * (average words per sentence) – 84.6 * (average syllables per word)

A larger number is considered easier to read, a smaller number is more difficult to read. As an example, let's consider:

I do not like green eggs and ham. I do not like them, Sam I am.

— Seuss, 1960

The total number of words is 16. The total number of sentences is 2. Therefore, the average words per sentence is 8. Since they are all single syllable words, the second Flesch value is 1. The overall Reading Ease is 206.835 - 8.12 - 84.6 = 114.115. This is considered to be very easy to read, up to fourth grade reading level.

Let's contrast that with the following:

The broad topic of readability is discussed in this manuscript with attention to the various aspects of written text which either foster or counter readability. Drawing on efforts in the educational system, a metric is proposed for estimating the relative human readability of structured data within an archival file format.

— Draft of Abstract

In this case, we have 50 words in 2 sentences: 25 words per sentence. We have a total of 96 syllables for the 50 words, or an average of 1.92 syllables per word. The overall Reading Ease is 206.835 - 25.375 - 162.432 = 19.028. This is considered to be very difficult to read, at a college reading level.
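The two worked examples above can be checked with a few lines of Python. This is only a minimal sketch: the function name flesch_reading_ease is my own, and the word, sentence and syllable counts are entered by hand from the text rather than computed, since automatic syllable counting is itself a heuristic.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (counts are supplied by hand)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Counts taken from the two passages discussed in the text.
seuss = flesch_reading_ease(words=16, sentences=2, syllables=16)
abstract = flesch_reading_ease(words=50, sentences=2, syllables=96)

print(f"Seuss:    {seuss:.3f}")
print(f"Abstract: {abstract:.3f}")
```

Running the sketch should reproduce the values 114.115 and 19.028 derived by hand.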

Data Readability

How can we construct a formula similar to the Flesch formula for measuring the human readability of a data format? The first consideration is defining the contents of a data file in as general terms as possible. A data file generally contains data values which are associated with data identifiers. These data items are structured within the file using some type of syntactical characters such that the data file can be parsed by a machine^[1]. Machine readability is assumed as a prerequisite for a data file format. Let's consider these three components, data identifiers, data values and syntax as to how they affect human readability. In the end, I will propose that simply counting the number of identifier characters, value characters and syntactic characters can be used to define a general formula for the human readability of scientific data. I do not claim that this is the best nor the only way to measure human readability; with this exercise I hope to start a conversation on defining and measuring readability rather than ending the conversation.

Data Identifiers

Identifiers provide a name for the underlying data. In principle, the identifiers only need to be unique within whatever scope they are used. For example, the following table illustrates the same concept being expressed using different identifier conventions.

Table I

Verbose Identifier names.

Examples
a = l * w
area = length * width
rectangle.area = base.length * side.width

All three of the examples convey the same concept, that the area is given by the length multiplied by the width. However, the examples differ in their verboseness or their descriptive value. In the first example, a reader needs to either know that the identifiers refer to area, length and width, or be able to infer that relationship. In the third example, additional qualifiers are used which can be helpful in cases where there are multiple formulas for calculating the area of rectangles, triangles and parallelograms.

As a general rule, we notice that the more characters which are used, the more descriptive the identifier. Of course, this is just a generalization which helps justify counting characters as a measurement of readability, similar to the counting of syllables as a measure of the complexity of a word. "Shunt" may be less readable than "today" irrespective of the number of syllables. "I" might be just as readable as "the author of this manuscript", even though the latter has more characters.

At the other extreme of verboseness are fixed width file formats, such as the original pdb format of the Protein Data Bank [PDB format, 2012]. In the case of fixed width formats, there need not be any identifiers or any syntactical characters either. The documentation specifies how the various data values are ordered in the file (similar to binary data representations). These formats may still maintain a degree of human readability; however, edge cases where fixed width values bleed into each other can be onerous. (An example of this for the pdb format is between the chain identifier and residue number. The residue number is allotted the 4 characters following the chain ID. In the vast majority of cases, there are fewer than 1000 residues in a polymer and there is whitespace between the chain ID and residue number. However, if 1000 residues are reached, then there is no intervening whitespace. This is a common trip point for folks writing parsers for the old pdb format.)
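The chain ID and residue number edge case can be illustrated with a short Python sketch. The helper make_atom_prefix is hypothetical and builds only the first 26 columns of an illustrative ATOM record (the atom and residue names are made up); it assumes the column layout of the pdb format documentation, with the chain ID in column 22 and the residue number in columns 23 to 26.

```python
def make_atom_prefix(chain_id, res_seq):
    """Build columns 1-26 of an illustrative ATOM record (hypothetical helper)."""
    serial, atom_name, alt_loc, res_name = 1, " CA ", " ", "ALA"
    return f"ATOM  {serial:5d} {atom_name}{alt_loc}{res_name} {chain_id}{res_seq:4d}"


def parse_fixed_width(line):
    """Read chain ID (column 22) and residue number (columns 23-26) by position."""
    return line[21], int(line[22:26])


for res_seq in (999, 1000):
    line = make_atom_prefix("A", res_seq)
    # Splitting on whitespace works for 999 but fuses the fields at 1000.
    print(repr(line), line.split()[-2:], parse_fixed_width(line))
```

Splitting on whitespace separates the two fields only while the residue number has fewer than four digits, whereas slicing by column position recovers both fields in either case.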

Data Values

Just as data identifiers can benefit from verboseness, so can data values. However, in the case of scientific data values, there is a stronger impetus for quantitative values which can tip the scale more towards machine readability rather than human readability [Wrightson, 2005]. An example is given below.

Table II

Comparing Oranges to Oranges. Various methods of recording data about color. From a data perspective, the numerical quantities are more precise. However, the textual description is the most human readable.

Type	Value	Audience
Textual Description	Orange	Human (General)
Wavelength	600 nm	Human (Scientist) and Machine
RGB: Hexadecimal	#FFBE00	Machine
RGB: Decimal	255, 190, 0	Machine
CMYK	0%, 25%, 100%, 0%	Machine

The overarching purpose of a data file is to convey the data values using whatever representation the scientific community agrees is the most correct, precise, or important. Therefore, it is not my intention to suggest that the data values should be transcoded from machine readable values useful to the community to something more human readable which is less useful, as in 600 nm versus orange. Nevertheless, as Ann Wrightson pointed out in 2005, much XML is not human readable because the values stored within XML files are intended for machines, not humans [Wrightson, 2005].
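To make concrete that the machine-oriented rows of Table II are encodings of the same value, the following Python sketch converts the hexadecimal entry to the decimal RGB and CMYK entries. The function names are my own, and the CMYK conversion is the naive formula that ignores color profiles, so it is an assumption rather than the method used to build the table.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to a tuple of three integers in 0..255."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_cmyk_percent(rgb):
    """Naive RGB -> CMYK conversion (no color profile), in whole percents."""
    r, g, b = (channel / 255 for channel in rgb)
    k = 1 - max(r, g, b)
    if k == 1:  # pure black: avoid dividing by zero
        return (0, 0, 0, 100)
    c, m, y = ((1 - channel - k) / (1 - k) for channel in (r, g, b))
    return tuple(round(100 * v) for v in (c, m, y, k))


rgb = hex_to_rgb("#FFBE00")
print(rgb, rgb_to_cmyk_percent(rgb))
```

None of these encodings is any closer to the word "orange" for a human reader, which is the point of the table.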

As in the preceding section regarding data identifiers, it is a simple generalization that the more characters are used to represent a data value, the more human readable it can be. For example, "true" and "false" are more readable than "1" and "0". In fact, a simple machine encoding of a Boolean value can be used to represent true/false, on/off, up/down, and various other exclusive properties for which a more verbose description could assist in human readability. Counter examples would include lengthy numerical codes as proxies, such as the ISBN of "978-0-385-12167-5" rather than the book title, "The Shining".

Syntactical Characters

The final component of scientific data formats are the characters and expressions which are used to define the syntax. These are useful for both human readability and machine parsing. However, in this section I argue that a less verbose syntax leads to easier human readability. (A more verbose syntax often leads to easier machine parsing, as can be seen in programming languages such as ALGOL, where every if is closed with a fi and every do is closed with an od.)

Throughout the rest of this paper I will explicitly refer to three serializations of the STAR file format: the original serialization which is part of the STAR definition [Hall, 1991], an XML representation of the STAR format [Gryk, 2021], and a JSON serialization generated through an XSLT of the XML serialization. It should be noted that the XML schema for STAR [Gryk, 2021] was designed specifically to support transformation into other serializations; to support that goal, the XML schema explicitly defines and tracks aspects of the STAR format. As pointed out by one of the reviewers of this paper, that makes the final comparison between the readability of XML and the other serializations a bit unfair, as the XML version defines more of the data structure within the schema.

To quickly summarize the STAR format, a data file consists of two types of top level containers: data blocks, which have an associated identifier, and global blocks (named for their global scope). Within these containers a third type of container is allowed, called a save frame, which in some variations of STAR can be arbitrarily nested. Finally, the data itself is provided either through key/value pairs for which the keys must be unique within the scope of the container, or as tabular data with explicit column names along with tabular data values. An important point, which will be revisited, is that STAR uses whitespace as a basic delimiter between these keywords, identifiers and values, but other than that whitespace has no formal meaning. Because of this, tabular data can be formatted in very human readable ways or the whitespace can be used to obstruct human readability.

A comparison of the syntactical characters required for each of the three serializations is given below.

Table III

STAR, XML and JSON representations for the various STAR constructs. '...' is used to signify that additional content follows which belongs to a different STAR construct. For example, files are composed of data and global blocks, which in turn are composed of save frames, key/value pairs, and tables (called loops in STAR). The second line of each row indicates the number of extra syntactic characters required by each serialization format.

Block	STAR	XML	JSON
File	`...`	`<file>...</file>`	`{"file" : ...}`
File	0	13	9
Data Block	`data_identifier` `...`	`<data name="identifier">...</data>`	`{"data" : { "name" : "identifier", ...}`
Data Block	5	20	20
Global Block	`global_ ...`	`<global>...</global>`	`{"global" : ...}`
Global Block	7	17	11
Save Frame	`save_identifier` `... save_`	`<save name="identifier">...</save>`	`{"save" : { "name" : "identifier", ...}`
Save Frame	10	20	20
Pair	`_key value`	`<datum key="key">value</datum>`	`"key" : "value"`
Pair	1	21	5
Table (loop)	`loop_ _column1` `_column2` `value1` `value2`	`<loop><header><column key="column1"/><column key="column2"/></header><row><cell>value1</cell><cell>value2</cell></row></loop>`	`[["column1","column2"],["value1","value2"]]`
Table (loop)	5 + 1 per column	56 plus multiple of rows and columns	6 plus multiple of rows and columns

The above table provides the general syntax for each of the STAR constructs, serialized as canonical STAR, XML or as JSON. For each STAR construct, the syntax is given on the top row and below is given a tabulation of the number of characters required to define the syntax.

The first row emphasizes that STAR is itself a file format and implicitly uses the file as the top level container. In the case of XML or JSON, this root element is made explicit. Therefore, XML and JSON representations require an additional 13 and 9 characters to define this construct.

The other container constructs in STAR have short keywords to define data blocks, global blocks or save frames. The XML and JSON representations are similarly short; however, they require a few more syntactical characters.

The largest difference noted is for key value pairs and tables. STAR has an extremely concise manner for representing identifiers by preceding them with a single underscore. This is much more syntactically efficient than XML and slightly better than JSON. In the case of tabular data, it becomes impossible to define the difference between the serializations as a single number; the verboseness of both XML and JSON is proportional to the number of columns and the number of rows within the table.
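The character counts for the key/value pair row of Table III can be reproduced with a short Python sketch. The function names are my own, and the counting rule (all non-whitespace characters, minus those belonging to the key and the value) is my reading of how the table was tabulated.

```python
def non_whitespace_length(text):
    """Number of characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def syntax_overhead(serialized, data_tokens):
    """Non-whitespace characters that are neither identifier nor value."""
    data_chars = sum(non_whitespace_length(token) for token in data_tokens)
    return non_whitespace_length(serialized) - data_chars


pair = ["key", "value"]
examples = {
    "STAR": "_key value",
    "XML": '<datum key="key">value</datum>',
    "JSON": '"key" : "value"',
}
overheads = {name: syntax_overhead(text, pair) for name, text in examples.items()}
print(overheads)
```

Under this counting rule the sketch yields 1, 21 and 5 extra characters for STAR, XML and JSON respectively, matching the Pair row of the table.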

Readability Formula

I propose the following formula as a starting point for measuring the human readability of a scientific data file.

Human Readability = Number of characters used for identifiers and values / Number of total characters (including syntactical characters but excluding whitespace and comments)

This formula is intended to be simple to apply and agnostic to the underlying data, just as readability formulas for prose count syllables and words rather than consulting look-up tables for the difficulty of word comprehension (shunt vs today). As an example, a small section of a STAR file hosted at the BioMagResBank is provided in Appendix A, Appendix B and Appendix C in the three different serializations. The overall number of data characters, identifiers and values, is the same for each of the three representations (1238 characters). However, the STAR file is the most compact with a total of 1320 characters yielding a proposed readability value of 93.8%. The JSON serialization yields a readability value of 72.5% while the XML serialization is the least readable at 40.5%. While the proposed formula can certainly be critiqued, it does seem to correlate with the general appearance of readability. The JSON file is a bit tougher to read due to the brackets, curly braces, colons and especially the quotes required around the data identifiers and values^[2]. The largest difficulty of the XML file is the extremely verbose description for tabular data.
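As a minimal sketch of how the proposed formula might be applied, the following Python computes the ratio for a single key/value pair in each serialization. The function names are my own; the sketch assumes that the identifiers and values are already known as a list of tokens and that comments have been stripped beforehand, since comment syntax differs between formats.

```python
def count_non_whitespace(text):
    """Number of characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def readability(serialized, data_tokens):
    """Identifier and value characters divided by all non-whitespace characters.

    Assumes comments were removed from `serialized` before the call.
    """
    data_chars = sum(count_non_whitespace(token) for token in data_tokens)
    return data_chars / count_non_whitespace(serialized)


tokens = ["Entry.ID", "5208"]
samples = {
    "STAR": "_Entry.ID 5208",
    "JSON": '"Entry.ID" : "5208"',
    "XML": '<datum key="Entry.ID" >5208</datum>',
}
scores = {name: readability(text, tokens) for name, text in samples.items()}
for name, score in scores.items():
    print(f"{name}: {100 * score:.1f}%")

# The STAR value reported in the text, from the stated character counts.
reported_star = 1238 / 1320
print(f"Reported STAR: {100 * reported_star:.1f}%")
```

Even for a single pair the ordering of STAR, JSON and XML is the same as for the full example file, although the exact percentages naturally differ.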

Discussion

My goal in this paper is to explore a possibility of measuring the human readability of a scientific data file. The benefits of human readability are often cited as important design considerations for data standards and it has been noted that different XML files can vary on their readability [Wrightson, 2005]. Estimating the readability of books has been attempted by multiple sources with efforts focusing either on look-up tables of content or simple mathematical counting of syllables and words as an indicator of the complexity of the prose. This latter approach is taken to define a formula for readability of scientific data with a use case of the STAR format used in the fields of chemistry and structural biology.

There are several obvious caveats and critiques of this work. The first is the appearance of comments within a data file. Both STAR and the XML serialization (Appendix A and Appendix C) allow for comments while JSON does not. Comments are explicitly intended to aid in human comprehension. However, since they are not part of the machine parsable content (at least in STAR), it seemed unfair to include them as either improving or detracting from human readability.

A second important caveat concerns whitespace. STAR ranks the best in human readability according to this formula, in large part because whitespace is used as the natural delimiter. This is how natural language is also delimited. It is important to note that while Appendices B and C also use whitespace to aid in human readability, almost none of the whitespace is actually required for those serialization formats. (The only required whitespace is to separate element names and attributes in XML.) STAR on the other hand requires whitespace and uses this requirement as its main mechanism of achieving its stated goal of ensuring the file is easy to read visually or by machine.

However, whitespace can be challenging when used for both humans and machines, particularly because whitespace is invisible to humans but visible to machines. In this regard, if a file format distinguishes between the types and amounts of whitespace (as do Python and YAML), a single representation may be challenging to read both as a human and a machine. In other words, the difference between a tab and three spaces may be significant to the machine but indistinguishable to the human which, while not directly countering human readability, perhaps affects human/machine mutual understanding.
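Python itself offers a small demonstration of this point. In the sketch below (the function name try_compile and the two source strings are my own illustration), the two fragments look alike in an editor that displays tabs eight columns wide, yet only one of them is accepted by the interpreter.

```python
def try_compile(source):
    """Return 'ok' if the source compiles, else the name of the raised error."""
    try:
        compile(source, "<example>", "exec")
        return "ok"
    except SyntaxError as err:  # TabError and IndentationError are subclasses
        return type(err).__name__


# Both bodies are indented to the same visual column at a tab width of 8.
spaces_only = "if True:\n        x = 1\n        y = 2\n"
tab_then_spaces = "if True:\n\tx = 1\n        y = 2\n"

for label, source in (("spaces only", spaces_only),
                      ("tab then spaces", tab_then_spaces)):
    print(label, "->", try_compile(source))
```

The second fragment is rejected because the machine sees an inconsistent mix of tabs and spaces that a human reader cannot see on the page.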

In summary, I hope that this discussion and proposed metric are useful in attempting to more formally define the oft-cited concern of human readability in data formats.

Acknowledgments

This work was supported in part by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number GM-109046.

Appendix A. An example STAR file: 93.8% Human Readability^[3]

data_5208

#######################
#  Entry information  #
#######################

save_entry_information
   _Entry.Sf_category                    entry_information
   _Entry.Sf_framecode                   entry_information
   _Entry.ID                             5208
   _Entry.Title                         
;
1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb 
domain of DNA polymerase B
;
   _Entry.Type                           .
   _Entry.Version_type                   original
   _Entry.Submission_date                2001-11-14
   _Entry.Accession_date                 2001-11-14
   _Entry.Last_release_date              2002-05-07
   _Entry.Original_release_date          2002-05-07
   _Entry.Origination                    author
   _Entry.NMR_STAR_version               3.1.1.61
   _Entry.Original_NMR_STAR_version      2.1
   _Entry.Experimental_method            NMR
   _Entry.Experimental_method_subtype    .
   _Entry.Details                        .
   _Entry.BMRB_internal_directory_name   .

   loop_
      _Entry_author.Ordinal
      _Entry_author.Given_name
      _Entry_author.Family_name
      _Entry_author.First_initial
      _Entry_author.Middle_initials
      _Entry_author.Family_title
      _Entry_author.Entry_ID

      1 Michael Gryk        . R. . 5208 
      2 Mark    Maciejewski . W. . 5208 
      3 Anthony Robertson   . .  . 5208 
      4 Mary    Mullen      . A. . 5208 
      5 Samuel  Wilson      . H. . 5208 
      6 Gregory Mullen      . P. . 5208 

   loop_
      _Data_set.Type
      _Data_set.Count
      _Data_set.Entry_ID

      assigned_chemical_shifts 1 5208 

   loop_
      _Datum.Type
      _Datum.Count
      _Datum.Entry_ID

      '1H chemical shifts'  354 5208 
      '13C chemical shifts' 621 5208 
      '15N chemical shifts' 168 5208 

   loop_
      _Release.Release_number
      _Release.Format_type
      _Release.Format_version
      _Release.Date
      _Release.Submission_date
      _Release.Type
      _Release.Author
      _Release.Detail
      _Release.Entry_ID

      1 . . 2002-05-07 2001-11-14 original author . 5208 

save_

Appendix B. An example JSON serialization of the STAR file: 72.5% Human Readability

{"STAR-file" : 
 {"data" : { 
   "name" : "5208", 
   "save" : { "name" : "entry_information", 
    "Entry.Sf_category" : "entry_information",
    "Entry.Sf_framecode" : "entry_information",
    "Entry.ID" : "5208",
    "Entry.Title" : "1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb\ndomain of DNA polymerase B",
    "Entry.Type" : ".",
    "Entry.Version_type" : "original",
    "Entry.Submission_date" : "2001-11-14",
    "Entry.Accession_date" : "2001-11-14",
    "Entry.Last_release_date" : "2002-05-07",
    "Entry.Original_release_date" : "2002-05-07",
    "Entry.Origination" : "author",
    "Entry.NMR_STAR_version" : "3.1.1.61",
    "Entry.Original_NMR_STAR_version" : "2.1",
    "Entry.Experimental_method" : "NMR",
    "Entry.Experimental_method_subtype" : ".",
    "Entry.Details" : ".",
    "Entry.BMRB_internal_directory_name" : ".",
    "loop" : [["Entry_author.Ordinal","Entry_author.Given_name","Entry_author.Family_name","Entry_author.First_initial","Entry_author.Middle_initials","Entry_author.Family_title","Entry_author.Entry_ID"],
      ["1","Michael","Gryk",".","R.",".","5208"],
      ["2","Mark","Maciejewski",".","W.",".","5208"],
      ["3","Anthony","Robertson",".",".",".","5208"],
      ["4","Mary","Mullen",".","A.",".","5208"],
      ["5","Samuel","Wilson",".","H.",".","5208"],
      ["6","Gregory","Mullen",".","P.",".","5208"]],
    "loop" : [["Data_set.Type","Data_set.Count","Data_set.Entry_ID"],
      ["assigned_chemical_shifts","1","5208"]],
    "loop" : [["Datum.Type","Datum.Count","Datum.Entry_ID"],
      ["1H chemical shifts","354","5208"],
      ["13C chemical shifts","621","5208"],
      ["15N chemical shifts","168","5208"]],
    "loop" : [["Release.Release_number","Release.Format_type","Release.Format_version","Release.Date","Release.Submission_date","Release.Type","Release.Author","Release.Detail","Release.Entry_ID"],
      ["1",".",".","2002-05-07","2001-11-14","original","author",".","5208"]]
    }
  }
 }
}

Appendix C. An example XML serialization of the STAR file: 40.5% Human Readability

<?xml version="1.0" encoding="UTF-8"?>
<STAR-file version="Hall_96" xmlns="BMRB.STAR" xmlns:xsi="star.xsd">
 <data name="5208">
  <!-- ###################### -->
  <!--   Entry information  # -->
  <!-- ###################### -->
  <save name="entry_information">
     <datum key="Entry.Sf_category" >entry_information</datum>
     <datum key="Entry.Sf_framecode" >entry_information</datum>
     <datum key="Entry.ID" >5208</datum>
     <datum key="Entry.Title" delimiter="semi-colon">\n1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb \ndomain of DNA polymerase B</datum>
     <datum key="Entry.Type" >.</datum>
     <datum key="Entry.Version_type" >original</datum>
     <datum key="Entry.Submission_date" >2001-11-14</datum>
     <datum key="Entry.Accession_date" >2001-11-14</datum>
     <datum key="Entry.Last_release_date" >2002-05-07</datum>
     <datum key="Entry.Original_release_date" >2002-05-07</datum>
     <datum key="Entry.Origination" >author</datum>
     <datum key="Entry.NMR_STAR_version" >3.1.1.61</datum>
     <datum key="Entry.Original_NMR_STAR_version" >2.1</datum>
     <datum key="Entry.Experimental_method" >NMR</datum>
     <datum key="Entry.Experimental_method_subtype" >.</datum>
     <datum key="Entry.Details" >.</datum>
     <datum key="Entry.BMRB_internal_directory_name" >.</datum>
     <loop>
      <header>
        <column key="Entry_author.Ordinal"/>
        <column key="Entry_author.Given_name"/>
        <column key="Entry_author.Family_name"/>
        <column key="Entry_author.First_initial"/>
        <column key="Entry_author.Middle_initials"/>
        <column key="Entry_author.Family_title"/>
        <column key="Entry_author.Entry_ID"/>
      </header>
      <row>
        <cell>1</cell>
        <cell>Michael</cell>
        <cell>Gryk</cell>
        <cell>.</cell>
        <cell>R.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell>2</cell>
        <cell>Mark</cell>
        <cell>Maciejewski</cell>
        <cell>.</cell>
        <cell>W.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell>3</cell>
        <cell>Anthony</cell>
        <cell>Robertson</cell>
        <cell>.</cell>
        <cell>.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell>4</cell>
        <cell>Mary</cell>
        <cell>Mullen</cell>
        <cell>.</cell>
        <cell>A.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell>5</cell>
        <cell>Samuel</cell>
        <cell>Wilson</cell>
        <cell>.</cell>
        <cell>H.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell>6</cell>
        <cell>Gregory</cell>
        <cell>Mullen</cell>
        <cell>.</cell>
        <cell>P.</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
     </loop>
     <loop>
      <header>
        <column key="Data_set.Type"/>
        <column key="Data_set.Count"/>
        <column key="Data_set.Entry_ID"/>
      </header>
      <row>
        <cell>assigned_chemical_shifts</cell>
        <cell>1</cell>
        <cell>5208</cell>
      </row>
     </loop>
     <loop>
      <header>
        <column key="Datum.Type"/>
        <column key="Datum.Count"/>
        <column key="Datum.Entry_ID"/>
      </header>
      <row>
        <cell delimiter="single-quote">1H chemical shifts</cell>
        <cell>354</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell delimiter="single-quote">13C chemical shifts</cell>
        <cell>621</cell>
        <cell>5208</cell>
      </row>
      <row>
        <cell delimiter="single-quote">15N chemical shifts</cell>
        <cell>168</cell>
        <cell>5208</cell>
      </row>
     </loop>
     <loop>
      <header>
        <column key="Release.Release_number"/>
        <column key="Release.Format_type"/>
        <column key="Release.Format_version"/>
        <column key="Release.Date"/>
        <column key="Release.Submission_date"/>
        <column key="Release.Type"/>
        <column key="Release.Author"/>
        <column key="Release.Detail"/>
        <column key="Release.Entry_ID"/>
      </header>
      <row>
        <cell>1</cell>
        <cell>.</cell>
        <cell>.</cell>
        <cell>2002-05-07</cell>
        <cell>2001-11-14</cell>
        <cell>original</cell>
        <cell>author</cell>
        <cell>.</cell>
        <cell>5208</cell>
      </row>
     </loop>
   </save>
 </data>
</STAR-file>
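
To illustrate how the header and row elements of this serialization relate, the following minimal Python sketch pairs each cell with the column key at the same position. The function name `loop_to_records` and the embedded sample string are illustrative assumptions, not part of the original work; the sample copies the element layout used in this appendix.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the <loop> layout shown in this appendix.
SAMPLE = """
<loop>
  <header>
    <column key="Data_set.Type"/>
    <column key="Data_set.Count"/>
    <column key="Data_set.Entry_ID"/>
  </header>
  <row>
    <cell>assigned_chemical_shifts</cell>
    <cell>1</cell>
    <cell>5208</cell>
  </row>
</loop>
"""


def loop_to_records(loop):
    """Pair the cells of every <row> with the column keys, by position."""
    keys = [col.get("key") for col in loop.find("header").findall("column")]
    records = []
    for row in loop.findall("row"):
        cells = [cell.text for cell in row.findall("cell")]
        # A row is only interpretable if it has one cell per declared column.
        if len(cells) != len(keys):
            raise ValueError(
                f"row has {len(cells)} cells, header has {len(keys)} columns"
            )
        records.append(dict(zip(keys, cells)))
    return records


records = loop_to_records(ET.fromstring(SAMPLE))
print(records)
```

All values are kept as strings here, in keeping with the STAR convention that every value is a string. The length check reflects one cost of this layout for a human reader: a cell carries no name of its own, so its meaning depends entirely on counting its position against the header.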

References

[Gryk, 2021] Gryk, Michael R. Deconstructing the STAR File Format. Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). doi:https://doi.org/10.4242/BalisageVol26.Gryk01

[Hall, 1991] Hall, S.R. The STAR File: A New Format for Electronic Data Transfer and Archiving. J. Chem. Inf. Comput. Sci., 31, 326-333 (1991). doi:https://doi.org/10.1021/ci00002a020

[Wrightson, 2005] Wrightson, A. Semantics of Well Formed XML as a Human and Machine Readable Language: Why is some XML so difficult to read? Proceedings of Extreme Markup Languages 2005, 2005.

[Flesch, 1948] Flesch, R. A new readability yardstick. Journal of Applied Psychology, 32, 221–233 (1948). doi:https://doi.org/10.1037/h0057532

[Seuss, 1960] Seuss. Green Eggs and Ham. New York, NY: Beginner Books, 1960.

[Fry, 1968] Fry, Edward. A Readability Formula That Saves Time. Journal of Reading, 11, 513-578 (1968).

[McLaughlin, 1969] McLaughlin, G.H. SMOG Grading — A New Readability Formula. Journal of Reading, 12, 639-646, (1969).

[PDB format, 2012] Protein Data Bank (original pdb format). https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html

[Ulrich, et al., 2008] Ulrich, E.L., Akutsu, H., Doreleijers, J.F., Harano, Y., Ioannidis, Y.E., Lin, J., Livny, M., Mading, S., Maziuk, D., Miller, Z., Nakatani, E., Schulte, C.F., Tolmie, D.E., Wenger, R.K., Yao, H. & Markley, J.L. BioMagResBank. Nucleic Acids Research, 36, D402–D408 (2008). doi:https://doi.org/10.1093/nar/gkm957

^[1] Fixed-width file formats, which require no syntactical characters, can be used instead. The problems they pose for human readability will be discussed.

^[2] This serialization requires quotes for all values, even numerical values, as the original STAR specification defines all values as strings. Of course, some readability gain could be achieved by using proper data types in the JSON serialization.

^[3] This is a valid STAR file, but it is modified from the original NMR-STAR file. It is beyond the scope of this paper, but NMR-STAR enforces a stop_ keyword at the end of every loop, whereas STAR does not require one.

Author's keywords for this paper:

Human Readability; Data Formats; Metrics

Michael Robert Gryk

Associate Professor

Department of Molecular Biology and Biophysics, UCONN Health (US)

Doctoral Student

University of Illinois, Urbana-Champaign (US)

`<gryk@uchc.edu>`

Dr. Michael R. Gryk is Associate Professor of Molecular Biology and Biophysics at UCONN Health. At UCONN, Michael co-leads a technical research and discovery component of the NMRbox BTRR Center, the mission of which is to foster the computational reproducibility and scientific data re-use of bioNMR data. Michael is also the associate director of the BioMagResBank, the international repository for bioNMR research data. He is also a doctoral student at the School of Information Sciences at the University of Illinois, Urbana-Champaign, where his broad research interests are in provenance, workflows, digital curation and preservation, reproducibility, and scientific data re-use. Michael is also a participant in the W3C Invisible Markup group.

Balisage: The Markup Conference

Balisage Series on Markup Technologies
