Balisage Paper: Can LLMs help with XML?
July 29 - August 2, 2024
The materials listed below were provided by the speaker as supplements to a
presentation at Balisage. These materials may include the slides or visuals used in
the presentation; supplementary material, such as code samples or a demonstration
application; and/or the paper accompanying the presentation (if it has not been
provided in XML). These materials have been zipped for easy download and are
identified by a brief description of the contents. The materials themselves are
untouched, that is, they have not been tested or edited by Balisage: The Markup
Conference or by Mulberry Technologies, Inc. As such, they are included on this
website AS IS, i.e., as provided by the speaker, with no warranties, express or
otherwise, made by Balisage or Mulberry.
Slides and Materials
- Bal2024-DeRose-presentation-pdf.zip: Presentation slides in Adobe PDF.
- Bal2024-DeRose-presentation-pptx.zip: Presentation slides in Microsoft PowerPoint (pptx).
Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra
Anthropic. May 9, 2023. Claude’s
Constitution.
https://www.anthropic.com/news/claudes-constitution
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu,
Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini,
Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn
Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared
Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt,
Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma,
Robert
Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk,
Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas
Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI:
Harmlessness from AI Feedback.
https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073
Baroni, M., Bernardini, S., Ferraresi,
A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large
linguistically processed Web-crawled corpora.
Language Resources and Evaluation 43(3): 209–226.
doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff 2015].
Bauman, S. 2016. The Hard Edges of
Soft Hyphens.
Presented at Balisage: The Markup Conference 2016,
Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage:
The Markup Conference 2016. Balisage Series on Markup Technologies, vol.
17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01
Bender, E. M., Gebru, T.,
McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic
Parrots: Can Language Models Be Too Big? 🦜[sic].
In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency (FAccT ’21): 610–623. New York: Association for Computing
Machinery. doi:https://doi.org/10.1145/3442188.3445922
Berg, D., Gonnet, G. and Tompa, F. 1988.
The New Oxford English Dictionary Project at the University of
Waterloo.
Report number: OED-88-01. University of Waterloo Centre for the
New Oxford English Dictionary. https://www.researchgate.net/publication/243451160
Bernstein, M. 2010. Card
Sharks and Holy Scrollers.
https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html
Burton, N. G. and J. C. R. Licklider.
1955. Long-Range Constraints in the Statistical Structure of Printed
English.
American Journal of Psychology 68: 650-653.
doi:https://doi.org/10.2307/1418794
CCEL. Theological Markup Language
(ThML).
https://www.ccel.org/ThML/index.html
Chicago Tribune.
December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’.
https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/
Church, K. 1988. A Stochastic
Parts Program and Noun Phrase Parser for Unrestricted Text.
Second Conference on Applied Natural Language
Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260
Churchland, P. S. 1987.
Epistemology in the Age of Neuroscience.
Journal of Philosophy 84(10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf.
doi:https://doi.org/10.5840/jphil1987841026
Cole, D. 2020. The Chinese Room
Argument.
Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room
Dartmouth Dante Project. Longfellow, H. W. 1867. Translation of Dante, Paradiso. https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader
Darwin, C. 3 July 1881. Letter to William Graham. https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml
DeRose, S. J. 1988.
Grammatical Category Disambiguation by Statistical Optimization.
Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/
DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Thesis. Providence: Brown University Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html
DeRose, S. J. 2004. Markup
Overlap: A Review and a Horse.
Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse
Edwards, B. 2024. ‘The king is
dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time.
Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/
Francis, W. N. and Kucera, H. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence: Department of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html
Greenstein, S. and Feng Zhu.
2018. Do Experts or Crowd-Based Models Produce More Bias? Evidence from
Encyclopedia Britannica and Wikipedia.
MIS Quarterly 42(3), September 2018:
945–960. doi:https://doi.org/10.25300/MISQ/2018/14084
Horton, R. 2015. Offline: What
is medicine’s 5 sigma?
The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext.
doi:https://doi.org/10.1016/S0140-6736(15)60696-1
HTTP Archive. 2022. Web
Almanac: HTTP Archive’s annual state of the web report.
https://almanac.httparchive.org/en/2022/table-of-contents
Koplenig, A. 2017. The
impact of lacking metadata for the measurement of cultural and linguistic change
using the Google Ngram data sets—Reconstructing the composition of the German corpus
in times of WWII.
Digital Scholarship in the Humanities, 32(1), 169-188.
doi:https://doi.org/10.1093/llc/fqv037
Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf
Marshall, C. C., and Irish, P. M.
1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext
Intelligible for Readers.
In Proceedings of the Second Annual ACM
Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226
Mikolov, T., Chen, K., Corrado, G.,
and Dean, J. 2013. Efficient Estimation of Word Representations in Vector
Space.
arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781
Miller, G. A. and Chomsky, N. 1963.
Finitary Models of Language Users.
In R. Duncan Luce, Robert R. Bush,
and Eugene Galanter (eds.), Handbook of Mathematical
Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf
Nunberg, G. 2009. Google’s
Book Search: A disaster for scholars.
The Chronicle of Higher Education, August 31, 2009.
https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/
Pechenick, E. A., Danforth, C.
M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits
to inferences of socio-cultural and linguistic evolution.
PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041
Plantinga, A. 1993. Warrant and Proper Function. Oxford University Press.
Posner, M. and Keele, S. 1968.
On the Genesis of Abstract Ideas.
Journal of Experimental Psychology 77: 353-363.
doi:https://doi.org/10.1037/h0025953
Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University Press.
Shannon, C. 1948. A
Mathematical Theory of Communication.
Bell System Technical Journal, July and October.
doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Sharoff, S. 2015. Review of Roland Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan & Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN 978-1608459834. In Computational Linguistics 41(1). https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214
Simonite, T. Feb 4, 2021. AI
and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/
Smith, B. 2024. Self-Attention
Explained with Code: How Large Language Models Create Rich, Contextual
Embeddings.
Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e
Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange. P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html
Vaswani, A., Shazeer, N., Parmar, N.,
Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017.
Attention is All You Need.
Advances in Neural Information Processing Systems 30
(NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762
Wikipedia. Scunthorpe
problem.
https://en.wikipedia.org/wiki/Scunthorpe_problem
Yonge, C. D. (tr). 1854-1855. The Works of Philo Judaeus. Electronic edition, 2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html