Balisage Paper: Can LLMs help with XML?

Balisage: The Markup Conference 2024
July 29 - August 2, 2024

Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra

Anthropic. May 9, 2023. Claude’s Constitution. https://www.anthropic.com/news/claudes-constitution

Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation 43(3): 209–226. doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff 2015].

Bauman, S. 2016. The Hard Edges of Soft Hyphens. Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜[sic]. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21): 610–623. New York: Association for Computing Machinery. doi:https://doi.org/10.1145/3442188.3445922

Berg, D., Gonnet, G. and Tompa, F. 1988. The New Oxford English Dictionary Project at the University of Waterloo. Report number: OED-88-01. University of Waterloo Centre for the New Oxford English Dictionary. https://www.researchgate.net/publication/243451160

Bernstein, M. 2010. Card Sharks and Holy Scrollers. https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html

Burton, N. G. and Licklider, J. C. R. 1955. Long-Range Constraints in the Statistical Structure of Printed English. American Journal of Psychology 68: 650-653. doi:https://doi.org/10.2307/1418794

CCEL. Theological Markup Language (ThML). https://www.ccel.org/ThML/index.html

Chicago Tribune. December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’. https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/

Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Second Conference on Applied Natural Language Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260

Churchland, P. S. 1987. Epistemology in the Age of Neuroscience. Journal of Philosophy 84 (10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf. doi:https://doi.org/10.5840/jphil1987841026

Cole, D. 2020. The Chinese Room Argument. Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room

Dartmouth Dante Project. Longfellow, H. W. 1867. Translation of Dante, Paradiso. https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader

Darwin, C. 3 July 1881. Letter to William Graham. https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml

DeRose, S. J. 1988. Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/

DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Thesis. Providence: Brown University Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html

DeRose, S. J. 2004. Markup Overlap: A Review and a Horse. Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse

Edwards, B. 2024. ‘The king is dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/

Francis, W. N. and Kucera, H. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence: Department of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html

Greenstein, S. and Zhu, F. 2018. Do Experts or Crowd-Based Models Produce More Bias? Evidence from Encyclopedia Britannica and Wikipedia. MIS Quarterly 42(3), September 2018: 945–960. doi:https://doi.org/10.25300/MISQ/2018/14084

Horton, R. 2015. Offline: What is medicine’s 5 sigma? The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext. doi:https://doi.org/10.1016/S0140-6736(15)60696-1

HTTP Archive. 2022. Web Almanac: HTTP Archive’s annual state of the web report. https://almanac.httparchive.org/en/2022/table-of-contents

Koplenig, A. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities, 32(1), 169-188. doi:https://doi.org/10.1093/llc/fqv037

Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf

Marshall, C. C., and Irish, P. M. 1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext Intelligible for Readers. In Proceedings of the Second Annual ACM Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226

Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781

Miller, G. A. and Chomsky, N. 1963. Finitary Models of Language Users. In R. Duncan Luce, Robert R. Bush, and Eugene Galanter (eds.), Handbook of Mathematical Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf

Nunberg, G. 2009. Google’s Book Search: A disaster for scholars. The Chronicle of Higher Education, August 31, 2009. https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/

Pechenick, E. A., Danforth, C. M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041

Plantinga, A. 1993. Warrant and Proper Function. Oxford University Press.

Posner, M. and Keele, S. 1968. On the Genesis of Abstract Ideas. Journal of Experimental Psychology 77: 353-363. doi:https://doi.org/10.1037/h0025953

Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University Press.

Shannon, C. 1948. A Mathematical Theory of Communication. Bell System Technical Journal, July and October. doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Sharoff, S. 2015. Review of Roland Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan & Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN 978-1608459834. In Computational Linguistics 41(1). https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214

Simonite, T. Feb 4, 2021. AI and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words. Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/

Smith, B. 2024. Self-Attention Explained with Code: How Large Language Models Create Rich, Contextual Embeddings. Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e

Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange. P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762

Yonge, C. D. (tr). 1854-1855. The Works of Philo Judaeus. Electronic edition, 2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html

Author's keywords for this paper:
AI; LLMs; XML; Markup Systems