Balisage Paper: Can LLMs help with XML?

Balisage: The Markup Conference 2024
July 29 - August 2, 2024

The materials listed below were provided by the speaker as supplements to a presentation at Balisage. These materials may include the slides or visuals used in the presentation; supplementary material, such as code samples or a demonstration application; and/or the paper accompanying the presentation (if it has not been provided in XML). These materials have been zipped for easy download and are identified by a brief description of the contents. The materials themselves are untouched, that is, they have not been tested or edited by Balisage: The Markup Conference or by Mulberry Technologies, Inc. As such, they are included on this website AS IS, i.e., as provided by the speaker, with no warranties, express or otherwise, made by Balisage or Mulberry.

Slides and Materials

Bal2024-DeRose-presentation-pdf.zip: Presentation slides in Adobe PDF
Bal2024-DeRose-presentation-pptx.zip: Presentation slides in Microsoft PowerPoint (pptx)

Bibliography

Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra

Anthropic. May 9, 2023. Claude’s Constitution. https://www.anthropic.com/news/claudes-constitution

Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation 43(3): 209–226. doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff 2015].

Bauman, S. 2016. The Hard Edges of Soft Hyphens. Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜[sic]. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21): 610–623. New York: Association for Computing Machinery. doi:https://doi.org/10.1145/3442188.3445922

Berg, D., Gonnet, G. and Tompa, F. 1988. The New Oxford English Dictionary Project at the University of Waterloo. Report number: OED-88-01. University of Waterloo Centre for the New Oxford English Dictionary. https://www.researchgate.net/publication/243451160

Bernstein, M. 2010. Card Sharks and Holy Scrollers. https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html

Burton, N. G. and Licklider, J. C. R. 1955. Long-Range Constraints in the Statistical Structure of Printed English. American Journal of Psychology 68: 650-653. doi:https://doi.org/10.2307/1418794

CCEL. Theological Markup Language (ThML). https://www.ccel.org/ThML/index.html

Chicago Tribune. December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’. https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/

Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Second Conference on Applied Natural Language Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260

Churchland, P. S. 1987. Epistemology in the Age of Neuroscience. Journal of Philosophy 84 (10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf. doi:https://doi.org/10.5840/jphil1987841026

Cole, D. 2020. The Chinese Room Argument. Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room

Dartmouth Dante Project. Longfellow, H. W. 1867. Translation of Dante, Paradiso. https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader

Darwin, C. 3 July 1881. Letter to William Graham. https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml

DeRose, S. J. 1988. Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/

DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Thesis. Providence: Brown University Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html

DeRose, S. J. 2004. Markup Overlap: A Review and a Horse. Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse

Edwards, B. 2024. ‘The king is dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/

Francis, W. N. and Kucera, H. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence: Department of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html

Greenstein, S. and Zhu, F. 2018. Do Experts or Crowd-Based Models Produce More Bias? Evidence from Encyclopedia Britannica and Wikipedia. MIS Quarterly 42(3), September 2018: 945–960. doi:https://doi.org/10.25300/MISQ/2018/14084

Horton, R. 2015. Offline: What is medicine’s 5 sigma? The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext. doi:https://doi.org/10.1016/S0140-6736(15)60696-1

HTTP Archive. 2022. Web Almanac: HTTP Archive’s annual state of the web report. https://almanac.httparchive.org/en/2022/table-of-contents

Koplenig, A. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities, 32(1), 169-188. doi:https://doi.org/10.1093/llc/fqv037

Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf

Marshall, C. C., and Irish, P. M. 1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext Intelligible for Readers. In Proceedings of the Second Annual ACM Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226

Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781

Miller, G. A. and Chomsky, N. 1963. Finitary Models of Language Users. In R. Duncan Luce, Robert R. Bush, and Eugene Galanter (eds.), Handbook of Mathematical Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf

Nunberg, G. 2009. Google’s Book Search: A disaster for scholars. The Chronicle of Higher Education, August 31, 2009. https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/

Pechenick, E. A., Danforth, C. M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041

Plantinga, A. 1993. Warrant and Proper Function. Oxford University Press.

Posner, M. and Keele, S. 1968. On the Genesis of Abstract Ideas. Journal of Experimental Psychology 77: 353-363. doi:https://doi.org/10.1037/h0025953

Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University Press.

Shannon, C. 1948. A Mathematical Theory of Communication. Bell System Technical Journal, July and October. doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Sharoff, S. 2015. Review of Roland Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan & Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN 978-1608459834. In Computational Linguistics 41(1). https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214

Simonite, T. Feb 4, 2021. AI and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words. Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/

Smith, B. 2024. Self-Attention Explained with Code: How Large Language Models Create Rich, Contextual Embeddings. Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e

Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange. P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762

Wikipedia. Scunthorpe problem. https://en.wikipedia.org/wiki/Scunthorpe_problem

Yonge, C. D. (tr). 1854-1855. The Works of Philo Judaeus. Electronic edition, 2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html

Author's keywords for this paper:

AI; LLMs; XML; Markup Systems