Christopher Kelly and Jeff Beck
National Center for Biotechnology Information, National Library of Medicine, US National Institutes of Health
beck@ncbi.nlm.nih.gov
Presented at Balisage, Montreal, QC, August 6, 2012
PMC is the US National Library of Medicine's electronic archive of full-text journal literature.
Content is stored in XML at the article level and is displayed dynamically from the archival XML each time a user retrieves an article.
Participation by publishers is voluntary, although we require that the content be submitted in SGML or XML.
So, publishers send XML
and we have all of these fancy XML tools
Our jobs are easy!
That XML can be well-formed
... and valid
... and make sense
... and not be true
or in our case not represent the article it is supposed to represent.
‡Bauman, Syd. (2010) "The 4 Levels of XML Rectitude", Balisage 2010, poster.
Editorial Comment: Best Balisage poster ever.
But we still need to get eyes on the articles.
There is no XML test or tool for Veracity.
The examples that follow are real, but the XML has been changed to protect me.
They are also all well-formed and valid.
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"> <article article-type="example"> <front> <journal-meta> <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id> <issn>1111-XXXX</issn> </journal-meta> <article-meta> <title-group> <article-title>Good Science Info for You</article-title> </title-group> <contrib-group> <contrib> <name> <surname>Snap</surname> <given-names>Ginger P</given-names> </name> </contrib> <contrib> <name> <surname>House</surname> <given-names>Toul</given-names> </name> </contrib> </contrib-group> <pub-date> <month>03</month> <year>2012</year> </pub-date> <volume>12</volume> <issue>14</issue> <fpage>155</fpage> <lpage>159</lpage> </article-meta> </front> </article>
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"> <article article-type="example"> <front> <journal-meta> <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id> <issn>1111-XXXX</issn> </journal-meta> <article-meta> <title-group> <article-title>Good Science Info for You</article-title> </title-group> <contrib-group> <contrib> <name> <surname>Taylor</surname> <given-names>Katy Rose</given-names> </name> </contrib> <contrib> <name> <surname>Hamelers</surname> <given-names>Audrey</given-names> </name> </contrib> </contrib-group> <pub-date> <month>03</month> <year>2012</year> </pub-date> <volume>12</volume> <issue>14</issue> <fpage>160</fpage> <lpage>164</lpage> </article-meta> </front> </article>
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"> <article article-type="example"> <front> <journal-meta> <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id> <issn>1111-XXXX</issn> </journal-meta> <article-meta> <title-group> <article-title>Good Science Info for You</article-title> </title-group> <contrib-group> <contrib> <name> <surname>Waters</surname> <given-names>Roger</given-names> </name> </contrib> <contrib> <name> <surname>Gilmour</surname> <given-names>David</given-names> </name> </contrib> <contrib> <name> <surname>Wright</surname> <given-names>Rick</given-names> </name> </contrib> <contrib> <name> <surname>Best</surname> <given-names>Pete</given-names> </name> </contrib> </contrib-group> <pub-date> <month>03</month> <year>2012</year> </pub-date> <volume>12</volume> <issue>14</issue> <fpage>165</fpage> <lpage>168</lpage> </article-meta> </front> </article>
Good Science Info for You
Ginger P Snap and Toul House
J Example Studies 2012, 12(14): 155–159.
Good Science Info for You
Katy Rose Taylor and Audrey Hamelers
J Example Studies 2012, 12(14): 160–164.
Good Science Info for You
Roger Waters, David Gilmour, Rick Wright, and Pete Best
J Example Studies 2012, 12(14): 165–168.
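A tool cannot decide which of these three articles, if any, really carries that title, but it can point a human at the batch. The following is a minimal sketch, not PMC's actual QA code: it assumes a hypothetical batch held as a dict of file name to XML text, and it flags any article title that appears in more than one file. Legitimately repeated titles (an "Editorial", say) would be flagged too, so the result is a list of things to look at, not a verdict.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TITLE_PATH = "./front/article-meta/title-group/article-title"


def titles_shared_across_articles(docs):
    """Return {title: [file names]} for every title used by more than one file."""
    seen = defaultdict(list)
    for key, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        title_el = root.find(TITLE_PATH)
        # itertext() also collects text inside inline markup such as <italic>.
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        seen[title].append(key)
    return {title: keys for title, keys in seen.items() if len(keys) > 1}


def make_doc(fpage, lpage):
    """Build a stripped-down, hypothetical article for the demonstration."""
    return (
        "<article><front><article-meta>"
        "<title-group><article-title>Good Science Info for You</article-title></title-group>"
        f"<fpage>{fpage}</fpage><lpage>{lpage}</lpage>"
        "</article-meta></front></article>"
    )


# File names are invented for illustration.
batch = {
    "a1.xml": make_doc(155, 159),
    "a2.xml": make_doc(160, 164),
    "a3.xml": make_doc(165, 168),
}
suspects = titles_shared_across_articles(batch)
print(suspects)
```

Every file in the sample batch shares one title, so all three come back as suspects for someone to open and read.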
It really happens ... still
In the early days, all participants had XML or SGML that was created for some other reason. We were simply going to reuse it, because that is one thing that marked-up content promised us.
Now over 70% of the content coming to PMC is in JATS (NLM DTD).
Because we convert incoming SGML or XML to our article model for loading to PMC, we need to find out, before a publisher sends us content, whether we can map their article model to ours.
Of course, we use XML tools to check for well-formedness and validity
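As a minimal sketch of the first of those checks, assuming only Python's standard library, a well-formedness test is little more than "does it parse?". The function name and the two sample strings are invented for illustration. Validity against a DTD needs a validating parser, which the standard library does not provide, so that step is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the second never closes <front>.
good = "<article><front><article-title>Good Science Info for You</article-title></front></article>"
bad = "<article><front></article>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Passing this check says nothing about whether the tagged content is the right content.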
But we have to put eyes on these articles, because we've seen things like ...
<html> <head> <title>My Article</title> </head> <body> <p><font size="-1"><i>J Example Studies</i> <b>12</b>(14):155-159.</font></p> <p> <b> <font size="+4">Good Science Info for You</font> </b> </p> <p> <i>Ginger P Snap PhD and Toul House, PhD</i> </p> <p> <b>Abstract</b> </p> <p><b>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </b></p> <p> <b>Introduction</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Materials and Methods</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Results</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Discussion</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> </body> </html>
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"> <article> <front> <journal-meta> <journal-id/> <journal-title-group> <journal-title>J Example Studies</journal-title> </journal-title-group> <issn/> </journal-meta> <article-meta> <title-group> <article-title>Good Science Info for You</article-title> </title-group> <pub-date> <year>2012</year> </pub-date> </article-meta> </front> <body> <p> <![CDATA[ <html> <head> <title>My Article</title> </head> <body> <p><font size="-1"><i>J Example Studies</i> <b>12</b>(14):155-159.</font></p> <p> <b> <font size="+4">Good Science Info for You</font> </b> </p> <p> <i>Ginger P Snap PhD and Toul House, PhD</i> </p> <p> <b>Abstract</b> </p> <p><b>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </b></p> <p> <b>Introduction</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Materials and Methods</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Results</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Discussion</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> </body> </html> ]]> </p> </body> </article>
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"> <article> <front> <journal-meta> <journal-id/> <journal-title-group> <journal-title>J Example Studies</journal-title> </journal-title-group> <issn/> </journal-meta> <article-meta> <title-group> <article-title>Good Science Info for You</article-title> </title-group> <pub-date> <year>2012</year> </pub-date> </article-meta> </front> <body> <p><![CDATA[ <html> <head> <title>My Article</title> </head> <body> <p><font size="-1"><i>J Example Studies</i> <b>12</b>(14):155-159.</font></p> <p> <b> <font size="+4">Good Science Info for You</font> </b> </p> <p> <i>Ginger P Snap PhD and Toul House, PhD</i> </p> <p> <b>Abstract</b> </p> <p><b>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </b></p> <p> <b>Introduction</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Materials and Methods</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Results</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p> <b>Discussion</b> </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p> </body> </html> ]]> </p> </body> </article>
<!DOCTYPE article SYSTEM "ourdtd.dtd"> <article> J Example Studies 12(14):155-159. Good Science Info for You Ginger P Snap PhD and Toul House, PhD Abstract Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Introduction Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Materials and Methods Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Results Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. 
Aenean orci mauris, scelerisque a sagittis Discussion Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </article>
And at least they shipped us their DTD.
<!ELEMENT article (#PCDATA) >
Documentation not necessary.
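Files like these pass the parser, but a few crude counts make them stand out. The sketch below is an assumption about how one might screen for them, not a description of PMC's tooling: it counts elements, text characters, and strings in the text that look like tags (which is what CDATA-wrapped HTML turns into once parsed). The regular expression and the sample documents are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Matches things that look like start or end tags inside character data.
TAG_LIKE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*[^<>]*>")


def structure_report(xml_text):
    """Count elements, text characters, and tag-like strings hiding in the text."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())  # CDATA content arrives here as plain text
    return {
        "elements": sum(1 for _ in root.iter()),
        "text_chars": len(text.strip()),
        "tag_like_strings": len(TAG_LIKE.findall(text)),
    }


# Hypothetical, shortened versions of the two patterns shown above.
cdata_doc = (
    "<article><body><p><![CDATA[<html><body><p><b>Abstract</b></p>"
    "</body></html>]]></p></body></article>"
)
flat_doc = "<article>J Example Studies 12(14):155-159. Good Science Info for You</article>"

cdata_report = structure_report(cdata_doc)
flat_report = structure_report(flat_doc)
print(cdata_report)
print(flat_report)
```

A single element holding the whole article, or text full of tag-like strings, is a reason to open the file and look; it is still a person who decides what is wrong with it.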
In Eval, every article from the sample set is checked by XML tools and by eye.
We don't have the staff to check every article once a journal has moved into production, but we've built a system to manage the QA work.
Once an article clears ingest, it moves into this system.
Allows us to quantify errors found in batches
Creates those nasty Word (well, almost Word) reports that we send out.
Reduces the level of expertise in XML needed to do QA.
Even though it was not the intent when PMC was created, we have built an XML publishing system
where we can't trust the content that is being sent to us.
Even with the power of XML in the palm of our hands, we still need to get eyes on articles.
Because there is no XML test or tool for Veracity.