Introduction and Motivation
In this era of big data and FAIR data, data formats must be machine interpretable. XML, among other standards, satisfies this requirement. Yet many standardization initiatives cite human readability as a second, key property in data format development. Examples include the development of STAR in the field of structural biology, W3C PROV for provenance, and even the continuing development of XML. This raises two questions: what is meant by human readability, and can this property be measured for a given data format or compared between competing standards?
The broad topic of readability is considered with attention to the various aspects of written text which either foster or counter readability. Drawing on efforts in the educational system, a metric is proposed for estimating the relative human readability of structured data within an archival file format. Comparison is made between the same data represented in various formats, including JSON and XML, to help judge whether these standards have accomplished their simultaneous goals of machine interpretability and human readability.
I have been motivated to write about this topic after witnessing several years of conversations between colleagues, mentors and students. It is impossible to count the number of discussions which contained statements such as: "I prefer STAR to JSON because it is easier to read." Interestingly, sometimes a discussion within a different group would prompt the exact opposite preference, suggesting there is a subjective element to readability.
I admit to having my own bias. I favor XML over JSON, STAR, etc. due to many factors, including its structure, elegance, longevity and the tremendous technology stack built around XML. Perhaps due to this bias, I had not considered trying to measure the readability of data within XML versus other formats. My role was simply another participant in water cooler conversations: "I prefer XML, full stop."
Last year's submission to Balisage [Gryk, 2021] provided me the opportunity to roll up my sleeves and really engage with the STAR data structures. That contribution was an effort to deconstruct the structure of STAR from its syntax (serialization). An extra benefit of that exercise was the development of not only an XML serialization of STAR, but also a series of transformations (XSLTs) for round-tripping the XML representation back to STAR, for converting to JSON, and even for creating a spreadsheet representation of the data. Generating, testing and troubleshooting these various serializations provided tacit, first-hand experience of which formats are more readable than others.
There are two important qualifications to add at this point which will be discussed again later. One, readability of data is somewhat dependent on the data. The STAR format example in this manuscript is used in the field of structural biology, and the pairing of data with format may produce yet another bias to the conversation. Two, the terms format and serialization have been used somewhat interchangeably up until now but it will be important to distinguish the readability of data as presented in a textual layout versus the ability of a format to be converted to a textual layout which is readable.
Background
Mark Twain is often cited for the quip: "Everybody talks about the weather, but nobody does anything about it." A similar thing can be said of the human readability of data formats. Everyone talks about human readability, but hardly anyone defines precisely what they mean by it or which properties make one format more readable than another.
Of course, in a different era the topic in front of us would not be what is meant by human readable. Documents such as books, magazines and pamphlets are written and printed specifically for human consumption. Fifty years ago, the important topic would be how to make documents machine readable. The acronym MARC emphasized MAchine Readable Cataloguing rather than human readable cataloguing. The FAIR data principles also require that scientific data be machine interoperable. FAIR data repositories support this requirement by providing data in machine interpretable formats.
Nevertheless, human readability is frequently cited alongside machine readability as a desirable property for data formats and as a major design consideration. A few examples should drive home the point.
STAR
The STAR file was introduced in 1991 to support scientific data and this format is still in use in the structural biology communities. Hall identifies the following requirements for the STAR format [Hall, 1991]:
A Universal Archive File should be simple to read and to access. — Bullet point 4 of requirements
The file is easy to read visually, or by machine. — Bullet point 5 of the properties of a STAR file
To facilitate access to a STAR File the names of data items must be defined to be as descriptive as possible.
NMR-STAR
A variant of the STAR format (called NMR-STAR) has been used by the BioMagResBank [Ulrich, et al., 2008] for archiving data related to the field of biomolecular nuclear magnetic resonance spectroscopy since the 1990s. The decision to use STAR as opposed to other available standards was made in part out of concern for human readability.
ASN.1, XML, and SGML formats did not meet the need for easy human readability and efficient manual editing with common text editing software.
— BMRB Internal Whitepaper
PROV
The W3C supports a set of standards for recording and reporting provenance, particularly within the context of the world wide web. Part of this family of standards is a specialized notation for provenance, PROV-N. Once again, human readability was a major design consideration.
A key goal of PROV is the specification of a machine-processable data model for provenance. However, communicating provenance between humans is also important when teaching, illustrating, formalizing, and discussing provenance-related issues. With these two requirements in mind, this document introduces PROV-N, the PROV notation, a syntax designed to write instances of the PROV data model according to the following design principles:
Technology independence. PROV-N provides a simple syntax that can be mapped to several technologies.
Human readability. PROV-N follows a functional syntax style that is meant to be easily human-readable so it can be used in illustrative examples, such as those presented in the PROV documents suite.
Formality. PROV-N is defined through a formal grammar amenable to be used with parser generators.
— https://www.w3.org/TR/2013/REC-prov-n-20130430/
YAML / JSON
YAML is a human-friendly data serialization language for all programming languages.
— https://yaml.org/
YAML is a strict superset of JSON. However, when comparing YAML with JSON, once again, the topic often returns to readability:
In practice, however, the two formats look different, as the YAML specification puts more emphasis on human readability by adding a lot more syntactic sugar and features on top of JSON.
— https://realpython.com/python-yaml/
XML
Even the design criteria for XML refer to human readability.
The design goals for XML are: ...
6. XML documents should be human-legible and reasonably clear.
— https://www.w3.org/TR/REC-xml/
Readability
The topic of this paper is the human readability of data as contained within scientific data formats. It is useful at this point to consider readability more generally, as much work has been done in developing metrics for measuring the readability of common texts. These include the Flesch Reading Ease [Flesch, 1948], the Fry Reading Formula [Fry, 1968], and the Simple Measure of Gobbledygook (SMOG) Formula [McLaughlin, 1969].
The Flesch Reading Ease formula is as follows: Reading Ease = 206.835 – 1.015 * (average words per sentence) – 84.6 * (average syllables per word)
A larger number indicates text that is easier to read; a smaller number indicates text that is more difficult to read. As an example, let's consider:
I do not like green eggs and ham. I do not like them, Sam I am. [Seuss, 1960]
Let's contrast that with the following:
The broad topic of readability is discussed in this manuscript with attention to the various aspects of written text which either foster or counter readability. Drawing on efforts in the educational system, a metric is proposed for estimating the relative human readability of structured data within an archival file format.
— Draft of Abstract
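As a rough illustration of how the Flesch formula behaves, the following Python sketch applies it to the first example (the two short sentences beginning "I do not like"). The counts are tallied by hand rather than by an automated syllable counter: sixteen words, two sentences, and sixteen syllables, since every word is a monosyllable. The function name is hypothetical and chosen only for this illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (a higher score means easier to read)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hand-tallied counts for the first example: 16 words, 2 sentences,
# and 16 syllables (every word is a monosyllable).
score = flesch_reading_ease(words=16, sentences=2, syllables=16)
print(round(score, 1))  # 114.1
```

Short sentences built from one-syllable words push the score above 100. Longer sentences and polysyllabic words, as in the second example, lower the score through both subtracted terms.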
Data Readability
How can we construct a formula similar to the Flesch formula for measuring the human readability of a data format? The first consideration is defining the contents of a data file in as general terms as possible. A data file generally contains data values which are associated with data identifiers. These data items are structured within the file using some type of syntactical characters such that the data file can be parsed by a machine[1]. Machine readability is assumed as a prerequisite for a data file format. Let's consider these three components, data identifiers, data values and syntax as to how they affect human readability. In the end, I will propose that simply counting the number of identifier characters, value characters and syntactic characters can be used to define a general formula for the human readability of scientific data. I do not claim that this is the best nor the only way to measure human readability; with this exercise I hope to start a conversation on defining and measuring readability rather than ending the conversation.
Data Identifiers
Identifiers provide a name for the underlying data. In principle, the identifiers only need to be unique within whatever scope they are used. For example, the following table illustrates the same concept being expressed using different identifier conventions.
Table I
Examples |
---|
a = l * w |
area = length * width |
rectangle.area = base.length * side.width |
All three of the examples convey the same concept, that the area is given by the length multiplied by the width. However, the examples differ in their verboseness or their descriptive value. In the first example, a reader needs to either know that the identifiers refer to area, length and width, or be able to infer that relationship. In the third example, additional qualifiers are used which can be helpful in cases where there are multiple formulas for calculating the area of rectangles, triangles and parallelograms.
As a general rule, we notice that the more characters which are used, the more descriptive the identifier. Of course, this is just a generalization which helps justify counting characters as a measurement of readability, similar to the counting of syllables as a measure of the complexity of a word. "Shunt" may be less readable than "today" irrespective of the number of syllables. "I" might be just as readable as "the author of this manuscript", even though the latter has more characters.
At the other extreme of verboseness are fixed width file formats, such as the original pdb format of the Protein Data Bank [PDB format, 2012]. In the case of fixed width formats, there need not be any identifiers or any syntactical characters either. It is part of the documentation how the various data values are ordered in the file (similar to binary data representations). These formats may still maintain a degree of human readability; however, edge cases where fixed width values bleed into each other can be onerous. (An example of this for the pdb format is between the chain identifier and residue number. Residue number is given 4 characters after the chain ID. In the vast majority of cases, there are fewer than 1000 residues in a polymer and there is whitespace between the chain ID and residue number. However, if 1000 residues are reached, then there is no intervening whitespace. This is a common trip point for folks writing parsers for the old pdb format.)
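The chain ID edge case can be sketched in a few lines of Python. The sketch assumes the documented column layout of an ATOM record (chain ID in column 22, residue number right-justified in columns 23 to 26) and builds only the first 26 columns, with simplified atom-name alignment. The helper names and the sample atom are hypothetical.

```python
def build_atom_prefix(serial: int, atom: str, res_name: str,
                      chain: str, res_seq: int) -> str:
    """Lay out the first 26 columns of an ATOM record in fixed-width form."""
    # Columns: 1-6 record name, 7-11 serial, 13-16 atom name, 17 alternate
    # location (blank here), 18-20 residue name, 22 chain ID, 23-26 residue number.
    return f"{'ATOM':<6}{serial:>5} {atom:<4} {res_name:>3} {chain}{res_seq:>4}"


def parse_by_columns(line: str) -> tuple:
    """Read the chain ID and residue number from their documented columns."""
    return line[21], int(line[22:26])


roomy = build_atom_prefix(1, "CA", "ALA", "A", 999)
crowded = build_atom_prefix(1, "CA", "ALA", "A", 1000)

# Splitting on whitespace works only while the residue number has 3 digits or fewer.
print(roomy.split())    # ['ATOM', '1', 'CA', 'ALA', 'A', '999']
print(crowded.split())  # ['ATOM', '1', 'CA', 'ALA', 'A1000']

# Slicing by column position recovers both fields in either case.
print(parse_by_columns(crowded))  # ('A', 1000)
```

A parser that treats the format as whitespace-delimited silently fuses the two fields at residue 1000, while a parser that follows the column documentation does not.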
Data Values
Just as data identifiers can benefit from verboseness, so can data values. However, in the case of scientific data values, there is a stronger impetus for quantitative values which can tip the scale more towards machine readability rather than human readability [Wrightson, 2005]. An example is given in Table II.
Table II
Type | Value | Audience |
---|---|---|
Textual Description | Orange | Human (General) |
Wavelength | 600 nm | Human (Scientist) and Machine |
RGB: Hexadecimal | #FFBE00 | Machine |
RGB: Decimal | 255, 190, 0 | Machine |
CMYK | 0%, 25%, 100%, 0% | Machine |
The overarching purpose of a data file is to convey the data values using whatever representation the scientific community agrees is the most correct, precise, or important. Therefore, it is not my intention to suggest that the data values should be transcoded from machine readable values useful to the community to something more human readable which is less useful, as in 600 nm versus orange. Nevertheless, as Ann Wrightson pointed out in 2005, much XML is not human readable because the values stored within XML files are intended for machines, not humans [Wrightson, 2005].
As in the preceding section regarding data identifiers, it is a simple generalization that the more characters are used to represent a data value, the more human readable it can be. For example, "true" and "false" are more readable than "1" and "0". In fact, a simple machine encoding of a Boolean value can be used to represent true/false, on/off, up/down, and various other exclusive properties for which a more verbose description could assist in human readability. Counter examples would include lengthy numerical codes as proxies, such as the ISBN of "978-0-385-12167-5" rather than the book title, "The Shining".
Syntactical Characters
The final component of a scientific data format is the set of characters and expressions used to define the syntax. These are useful for both human readability and machine parsing. However, in this section I argue that a less verbose syntax leads to easier human readability. (A more verbose syntax often leads to easier machine parsing, as can be seen in programming languages such as ALGOL where every if is closed with a fi and every do is closed with an od.)
Throughout the rest of this paper I will explicitly refer to three serializations of the STAR file format: the original serialization which is part of the STAR definition [Hall, 1991], an XML representation of the STAR format [Gryk, 2021], and a JSON serialization generated through an XSLT of the XML serialization. It should be noted that the XML schema for STAR [Gryk, 2021] was designed specifically to support transformation into other serializations; to support that goal, the XML schema explicitly defines and tracks aspects of the STAR format. As pointed out by one of the reviewers of this paper, that makes the final comparison between the readability of XML and the other serializations a bit unfair, as the XML version defines more of the data structure within the schema.
To quickly summarize the STAR format, a data file consists of two types of top level containers: data blocks, which have an associated identifier, and global blocks (named for their global scope). Within these containers a third type of container is allowed, called a save frame, which in some variations of STAR can be arbitrarily nested. Finally, the data itself is provided either through key/value pairs, for which the keys must be unique within the scope of the container, or as tabular data with explicit column names along with tabular data values. An important point, which will be revisited, is that STAR uses whitespace as a basic delimiter between these keywords, identifiers and values, but other than that whitespace has no formal meaning. Because of this, tabular data can be formatted in very human readable ways or the whitespace can be used to obstruct human readability.
A comparison of the syntactical characters required for each of the three serializations is given below.
Table III
Block | STAR | XML | JSON |
---|---|---|---|
File | ... | <file>...</file> | {"file" : ...} |
Syntax characters | 0 | 13 | 9 |
Data Block | data_identifier ... | <data name="identifier">...</data> | {"data" : { "name" : "identifier", ...} |
Syntax characters | 5 | 20 | 20 |
Global Block | global_ ... | <global>...</global> | {"global" : ...} |
Syntax characters | 7 | 17 | 11 |
Save Frame | save_identifier ... save_ | <save name="identifier">...</save> | {"save" : { "name" : "identifier", ...} |
Syntax characters | 10 | 20 | 20 |
Pair | _key value | <datum key="key">value</datum> | "key" : "value" |
Syntax characters | 1 | 21 | 5 |
Table (loop) | loop_ _column1 _column2 value1 value2 | <loop><header><column key="column1"/><column key="column2"/></header><row><cell>value1</cell><cell>value2</cell></row></loop> | [["column1","column2"],["value1","value2"]] |
Syntax characters | 5 + 1 per column | 56 plus multiple of rows and columns | 6 plus multiple of rows and columns |
The above table provides the general syntax for each of the STAR constructs, serialized as canonical STAR, XML or as JSON. For each STAR construct, the syntax is given on the top row and below is given a tabulation of the number of characters required to define the syntax.
The first row emphasizes that STAR is itself a file format and implicitly uses the file as the top level container. In the case of XML or JSON, this root element is made explicit. Therefore, XML and JSON representations require an additional 13 and 9 characters to define this construct.
The other container constructs in STAR have short keywords to define data blocks, global blocks or save frames. The XML and JSON representations are similarly short; however, they require a few more syntactical characters.
The largest difference noted is for key value pairs and tables. STAR has an extremely concise manner for representing identifiers by preceding them with a single underscore. This is much more syntactically efficient than XML and slightly better than JSON. In the case of tabular data, it becomes impossible to define the difference between the serializations as a single number; the verboseness of both XML and JSON is proportional to the number of columns and the number of rows within the table.
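The counts for a key/value pair can be checked with a short Python sketch. It serializes one pair taken from the appendices (Entry.ID with the value 5208) in each of the three forms given in Table III, then subtracts the identifier and value characters from the non-whitespace total. The helper names are hypothetical.

```python
def non_whitespace_length(text: str) -> int:
    """Count every character that is not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def syntax_characters(serialized: str, key: str, value: str) -> int:
    """Characters left over once the identifier and the value are removed."""
    data = non_whitespace_length(key) + non_whitespace_length(value)
    return non_whitespace_length(serialized) - data


key, value = "Entry.ID", "5208"
pair = {
    "STAR": f"_{key} {value}",
    "XML": f'<datum key="{key}">{value}</datum>',
    "JSON": f'"{key}" : "{value}"',
}

counts = {name: syntax_characters(text, key, value) for name, text in pair.items()}
print(counts)  # {'STAR': 1, 'XML': 21, 'JSON': 5}
```

The result matches the Pair row of Table III. Because whitespace is excluded, the space between the element name and the attribute in XML does not contribute to its count.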
Readability Formula
I propose the following formula as a starting point for measuring the human readability of a scientific data file.
Human Readability = Number of characters used for identifiers and values / Number of total characters (including syntactical characters but excluding whitespace and comments)
This formula is intended to be simple to apply and agnostic to the underlying data. It is analogous to counting syllables and words to estimate the readability of prose, as opposed to consulting look-up tables for the difficulty of word comprehension (shunt vs today). As an example, a small section of a STAR file hosted at the BioMagResBank is provided in Appendix A, Appendix B and Appendix C in the three different serializations. The overall number of data characters, identifiers and values, is the same for each of the three representations (1238 characters). However, the STAR file is the most compact with a total of 1320 characters, yielding a proposed readability value of 93.8%. The JSON serialization yields a readability value of 72.5% while the XML serialization is the least readable at 40.5%. While the proposed formula can certainly be critiqued, it does seem to correlate with the general appearance of readability. The JSON file is a bit tougher to read due to the brackets, curly braces, colons and especially the quotes required around the data identifiers and values[2]. The largest difficulty of the XML file is the extremely verbose description for tabular data.
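As a minimal sketch, the proposed formula is a single division. The Python below applies it to the two character counts reported above for the STAR serialization (1238 data characters out of 1320 total); the function name is hypothetical.

```python
def human_readability(data_characters: int, total_characters: int) -> float:
    """Fraction of non-whitespace, non-comment characters that carry data."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return data_characters / total_characters


# Counts reported in the text for the STAR serialization in Appendix A.
star = human_readability(data_characters=1238, total_characters=1320)
print(f"{star:.1%}")  # 93.8%
```

The value ranges from 0 (syntax only) to 1 (data only), so a fixed width format with no identifiers or delimiters would score 1 under this definition despite the problems noted earlier.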
Discussion
My goal in this paper is to explore a possibility of measuring the human readability of a scientific data file. The benefits of human readability are often cited as important design considerations for data standards and it has been noted that different XML files can vary on their readability [Wrightson, 2005]. Estimating the readability of books has been attempted by multiple sources with efforts focusing either on look-up tables of content or simple mathematical counting of syllables and words as an indicator of the complexity of the prose. This latter approach is taken to define a formula for readability of scientific data with a use case of the STAR format used in the fields of chemistry and structural biology.
There are several obvious caveats and critiques of this work. The first of which is the appearance of comments within a data file. Both STAR and the XML serialization (Appendix A and Appendix C) allow for comments while JSON does not. Comments are explicitly intended to aid in human comprehension. However, since they are not part of the machine parsable content (at least in STAR) it seemed unfair to include them as either improving or detracting from human readability.
A second important caveat is with regards to whitespace. STAR ranks the best in human readability according to this formula, in large part because whitespace is used as the natural delimiter. This is how natural language is also delimited. It is important to note that while Appendices B and C also use whitespace to aid in human readability, almost none of the whitespace is actually required for those serialization formats. (The only required whitespace is to separate element names and attributes in XML.) STAR on the other hand requires whitespace and uses this requirement as its main mechanism of achieving its stated goal of ensuring the file is easy to read visually or by machine.
However, whitespace can be challenging when used for both humans and machines, particularly because whitespace is invisible to humans but visible to machines. In this regard, if a file format distinguishes between the types and amounts of whitespace (as do Python and YAML), a single representation may be challenging to read both as a human and a machine. In other words, the difference between a tab and three spaces may be significant to the machine but indistinguishable to the human which, while not directly countering human readability, perhaps affects human/machine mutual understanding.
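The two roles of whitespace can be contrasted in a small Python sketch. It uses naive whitespace splitting as a stand-in for STAR tokenization, which is a simplification: quoted values and semicolon-delimited text may contain whitespace and would need a real parser. The sample strings are illustrative only.

```python
# Two layouts of the same STAR key/value pairs: aligned for a human reader,
# and collapsed onto one line.
aligned = "_Entry.ID        5208\n_Entry.Type      ."
compact = "_Entry.ID 5208 _Entry.Type ."

# When whitespace is only a delimiter, both layouts yield the same tokens.
print(aligned.split() == compact.split())  # True

# When the kind and amount of whitespace carries meaning, two lines that may
# look alike on screen are different inputs to the machine.
tab_indented = "\tkey: value"
space_indented = "   key: value"
print(tab_indented == space_indented)                  # False
print(tab_indented.split() == space_indented.split())  # True
```

In the first case the author of a file is free to use whitespace for the reader's benefit. In the second case the same freedom would change what the machine parses.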
In summary, I hope that this discussion and proposed metric are useful in attempting to more formally define the oft-cited concern of human readability in data formats.
Acknowledgments
This work was supported in part by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number GM-109046.
Appendix A. An example STAR file: 93.8% Human Readability[3]
data_5208

#######################
# Entry information #
#######################

save_entry_information
   _Entry.Sf_category                    entry_information
   _Entry.Sf_framecode                   entry_information
   _Entry.ID                             5208
   _Entry.Title
;
1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb
domain of DNA polymerase B
;
   _Entry.Type                           .
   _Entry.Version_type                   original
   _Entry.Submission_date                2001-11-14
   _Entry.Accession_date                 2001-11-14
   _Entry.Last_release_date              2002-05-07
   _Entry.Original_release_date          2002-05-07
   _Entry.Origination                    author
   _Entry.NMR_STAR_version               3.1.1.61
   _Entry.Original_NMR_STAR_version      2.1
   _Entry.Experimental_method            NMR
   _Entry.Experimental_method_subtype    .
   _Entry.Details                        .
   _Entry.BMRB_internal_directory_name   .

   loop_
      _Entry_author.Ordinal
      _Entry_author.Given_name
      _Entry_author.Family_name
      _Entry_author.First_initial
      _Entry_author.Middle_initials
      _Entry_author.Family_title
      _Entry_author.Entry_ID

      1   Michael   Gryk          .   R.   .   5208
      2   Mark      Maciejewski   .   W.   .   5208
      3   Anthony   Robertson     .   .    .   5208
      4   Mary      Mullen        .   A.   .   5208
      5   Samuel    Wilson        .   H.   .   5208
      6   Gregory   Mullen        .   P.   .   5208

   loop_
      _Data_set.Type
      _Data_set.Count
      _Data_set.Entry_ID

      assigned_chemical_shifts   1   5208

   loop_
      _Datum.Type
      _Datum.Count
      _Datum.Entry_ID

      '1H chemical shifts'    354   5208
      '13C chemical shifts'   621   5208
      '15N chemical shifts'   168   5208

   loop_
      _Release.Release_number
      _Release.Format_type
      _Release.Format_version
      _Release.Date
      _Release.Submission_date
      _Release.Type
      _Release.Author
      _Release.Detail
      _Release.Entry_ID

      1   .   .   2002-05-07   2001-11-14   original   author   .   5208

save_
Appendix B. An example JSON serialization of the STAR file: 72.5% Human Readability
{"STAR-file" :
  {"data" : {
    "name" : "5208",
    "save" : {
      "name" : "entry_information",
      "Entry.Sf_category" : "entry_information",
      "Entry.Sf_framecode" : "entry_information",
      "Entry.ID" : "5208",
      "Entry.Title" : "1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb\ndomain of DNA polymerase B",
      "Entry.Type" : ".",
      "Entry.Version_type" : "original",
      "Entry.Submission_date" : "2001-11-14",
      "Entry.Accession_date" : "2001-11-14",
      "Entry.Last_release_date" : "2002-05-07",
      "Entry.Original_release_date" : "2002-05-07",
      "Entry.Origination" : "author",
      "Entry.NMR_STAR_version" : "3.1.1.61",
      "Entry.Original_NMR_STAR_version" : "2.1",
      "Entry.Experimental_method" : "NMR",
      "Entry.Experimental_method_subtype" : ".",
      "Entry.Details" : ".",
      "Entry.BMRB_internal_directory_name" : ".",
      "loop" : [["Entry_author.Ordinal","Entry_author.Given_name","Entry_author.Family_name","Entry_author.First_initial","Entry_author.Middle_initials","Entry_author.Family_title","Entry_author.Entry_ID"],
        ["1","Michael","Gryk",".","R.",".","5208"],
        ["2","Mark","Maciejewski",".","W.",".","5208"],
        ["3","Anthony","Robertson",".",".",".","5208"],
        ["4","Mary","Mullen",".","A.",".","5208"],
        ["5","Samuel","Wilson",".","H.",".","5208"],
        ["6","Gregory","Mullen",".","P.","5208"]],
      "loop" : [["Data_set.Type","Data_set.Count","Data_set.Entry_ID"],
        ["assigned_chemical_shifts","1","5208"]],
      "loop" : [["Datum.Type","Datum.Count","Datum.Entry_ID"],
        ["1H chemical shifts","354","5208"],
        ["13C chemical shifts","621","5208"],
        ["15N chemical shifts","168","5208"]],
      "loop" : [["Release.Release_number","Release.Format_type","Release.Format_version","Release.Date","Release.Submission_date","Release.Type","Release.Author","Release.Detail","Release.Entry_ID"],
        ["1",".",".","2002-05-07","2001-11-14","original author",".","5208"]]
    }
  }
  }
}
Appendix C. An example XML serialization of the STAR file: 40.5% Human Readability
<?xml version="1.0" encoding="UTF-8"?>
<STAR-file version="Hall_96" xmlns="BMRB.STAR" xmlns:xsi="star.xsd">
  <data name="5208">
    <!-- ###################### -->
    <!-- Entry information # -->
    <!-- ###################### -->
    <save name="entry_information">
      <datum key="Entry.Sf_category">entry_information</datum>
      <datum key="Entry.Sf_framecode">entry_information</datum>
      <datum key="Entry.ID">5208</datum>
      <datum key="Entry.Title" delimiter="semi-colon">\n1H, 13C and 15N resonance assignments for the perdeuterated 22 kD palm-thumb \ndomain of DNA polymerase B</datum>
      <datum key="Entry.Type">.</datum>
      <datum key="Entry.Version_type">original</datum>
      <datum key="Entry.Submission_date">2001-11-14</datum>
      <datum key="Entry.Accession_date">2001-11-14</datum>
      <datum key="Entry.Last_release_date">2002-05-07</datum>
      <datum key="Entry.Original_release_date">2002-05-07</datum>
      <datum key="Entry.Origination">author</datum>
      <datum key="Entry.NMR_STAR_version">3.1.1.61</datum>
      <datum key="Entry.Original_NMR_STAR_version">2.1</datum>
      <datum key="Entry.Experimental_method">NMR</datum>
      <datum key="Entry.Experimental_method_subtype">.</datum>
      <datum key="Entry.Details">.</datum>
      <datum key="Entry.BMRB_internal_directory_name">.</datum>
      <loop>
        <header>
          <column key="Entry_author.Ordinal"/>
          <column key="Entry_author.Given_name"/>
          <column key="Entry_author.Family_name"/>
          <column key="Entry_author.First_initial"/>
          <column key="Entry_author.Middle_initials"/>
          <column key="Entry_author.Family_title"/>
          <column key="Entry_author.Entry_ID"/>
        </header>
        <row> <cell>1</cell> <cell>Michael</cell> <cell>Gryk</cell> <cell>.</cell> <cell>R.</cell> <cell>.</cell> <cell>5208</cell> </row>
        <row> <cell>2</cell> <cell>Mark</cell> <cell>Maciejewski</cell> <cell>.</cell> <cell>W.</cell> <cell>.</cell> <cell>5208</cell> </row>
        <row> <cell>3</cell> <cell>Anthony</cell> <cell>Robertson</cell> <cell>.</cell> <cell>.</cell> <cell>.</cell> <cell>5208</cell> </row>
        <row> <cell>4</cell> <cell>Mary</cell> <cell>Mullen</cell> <cell>.</cell> <cell>A.</cell> <cell>.</cell> <cell>5208</cell> </row>
        <row> <cell>5</cell> <cell>Samuel</cell> <cell>Wilson</cell> <cell>.</cell> <cell>H.</cell> <cell>.</cell> <cell>5208</cell> </row>
        <row> <cell>6</cell> <cell>Gregory</cell> <cell>Mullen</cell> <cell>.</cell> <cell>P.</cell> <cell>.</cell> <cell>5208</cell> </row>
      </loop>
      <loop>
        <header>
          <column key="Data_set.Type"/>
          <column key="Data_set.Count"/>
          <column key="Data_set.Entry_ID"/>
        </header>
        <row> <cell>assigned_chemical_shifts</cell> <cell>1</cell> <cell>5208</cell> </row>
      </loop>
      <loop>
        <header>
          <column key="Datum.Type"/>
          <column key="Datum.Count"/>
          <column key="Datum.Entry_ID"/>
        </header>
        <row>
          <cell delimiter="single-quote">1H chemical shifts</cell> <cell>354</cell> <cell>5208</cell>
          <cell delimiter="single-quote">13C chemical shifts</cell> <cell>621</cell> <cell>5208</cell>
          <cell delimiter="single-quote">15N chemical shifts</cell> <cell>168</cell> <cell>5208</cell>
        </row>
      </loop>
      <loop>
        <header>
          <column key="Release.Release_number"/>
          <column key="Release.Format_type"/>
          <column key="Release.Format_version"/>
          <column key="Release.Date"/>
          <column key="Release.Submission_date"/>
          <column key="Release.Type"/>
          <column key="Release.Author"/>
          <column key="Release.Detail"/>
          <column key="Release.Entry_ID"/>
        </header>
        <row> <cell>1</cell> <cell>.</cell> <cell>.</cell> <cell>2002-05-07</cell> <cell>2001-11-14</cell> <cell>original</cell> <cell>author</cell> <cell>.</cell> <cell>5208</cell> </row>
      </loop>
    </save>
  </data>
</STAR-file>
References
[Gryk, 2021] Gryk, Michael R. Deconstructing the STAR File Format. Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). doi:https://doi.org/10.4242/BalisageVol26.Gryk01
[Hall, 1991] Hall, S.R. The STAR File: A New Format for Electronic Data Transfer and Archiving. J. Chem. Inf. Comput. Sci., 31, 326-333 (1991). doi:https://doi.org/10.1021/ci00002a020
[Wrightson, 2005] Wrightson, A. Semantics of Well Formed XML as a Human and Machine Readable Language: Why is some XML so difficult to read? Proceedings of Extreme Markup Languages 2005, 2005.
[Flesch, 1948] Flesch, R. A new readability yardstick. Journal of Applied Psychology, 32, 221–233 (1948). doi:https://doi.org/10.1037/h0057532
[Seuss, 1960] Seuss. Green Eggs and Ham. New York, NY: Beginner Books, 1960.
[Fry, 1968] Fry, Edward. A Readability Formula That Saves Time. Journal of Reading, 11, 513-578 (1968).
[McLaughlin, 1969] McLaughlin, G.H. SMOG Grading — A New Readability Formula. Journal of Reading, 12, 639-646 (1969).
[PDB format, 2012] Protein Data Bank (original pdb format). https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html
[Ulrich, et al., 2008] Ulrich, E.L., Akutsu, H., Doreleijers, J.F., Harano, Y., Ioannidis, Y.E., Lin, J., Livny, M., Mading, S., Maziuk, D., Miller, Z., Nakatani, E., Schulte, C.F., Tolmie, D.E., Wenger, R.K., Yao, H. & Markley, J.L. BioMagResBank. Nucleic Acids Research, 36, D402–D408 (2008). doi:https://doi.org/10.1093/nar/gkm957
[1] Fixed width file formats can be used which require no syntactical characters. The problems for human readability will be discussed.
[2] This serialization requires quotes for all values, even numerical values, as the original STAR specification defines all values as strings. Of course, some readability gain could be achieved by using proper data types in the JSON serialization.
[3] This is a valid STAR file but is modified from the original NMR-STAR file. It is beyond the scope of this paper, but NMR-STAR enforces a stop_ keyword at the end of every loop, whereas STAR does not require one.