Introduction
Part-of-speech tagging (hereafter PoS tagging; also called grammatical tagging or word-class disambiguation) is the process of annotating each linguistic unit in a text with the grammatical category to which it belongs. Most PoS tagging done today identifies both a broad PoS category (verb, noun, adverb) and more granular information about the word’s morphosyntactic category or lexicosemantic features (distinguishing past from present tense verbs, common from proper nouns, singular from plural pronouns, and so on) [Voutilainen2003]. Although PoS tagging can be performed by humans, either from scratch or on pre-processed texts whose machine-generated tags require human validation, the term is most commonly associated with computational methods in corpus linguistics and natural language processing (NLP). This paper specifically discusses the computational process. The units to be tagged are usually what we think of as words, but this is not always the case. In syntactic terms, a form such as won’t might be treated as two separate units (a form of the verb will plus the adverb not), while light bulb might be treated as a single unit (a compound noun, in conformity with the alternate spellings lightbulb and light-bulb). Instead of word, therefore, the broader term token is used for anything that the PoS tagger treats as a taggable unit. The input to a PoS tagger generally includes all the non-whitespace characters of a text, split into tokens. A token may therefore be a punctuation character, a numeral, or an emoji, as well as an alphabetic string [Manningetal2009].
Taggers tend to follow a similar series of steps: tokenization; lexicon lookup of tokens; guessing processes for unknown tokens; and disambiguation of tokens for which more than one tag has been found [Voutilainen2003], [Leechetal1994]. The greatest challenge for PoS tagging is disambiguating lexical tokens which share the same written form but belong to different categories [DeRose1988]. These may be words from the same root, such as number as a noun and as a verb meaning “enumerate”, or words whose identical spelling is coincidental, such as either of the foregoing examples alongside number as a comparative adjective meaning “more numb”. Ideally, a PoS tagger will be able to identify the form number in each of the following examples correctly (a minimal sketch of the lookup-and-disambiguation steps follows the examples):
- The number was chosen randomly (noun)
- Number the examples sequentially (verb)
- Her face felt number after the second injection (adjective)
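To make the lookup-and-disambiguation steps concrete, here is a minimal Python sketch. The lexicon entries and context rules are invented for illustration; they bear no relation to any real tagger's implementation, and a production system would use far richer context than a single neighbouring token.

```python
# A minimal sketch (not any real tagger's code) of the lookup-and-disambiguate
# pattern: every candidate tag is retrieved for a form, and crude context
# rules pick one. Lexicon and rules are hypothetical illustrations.

AMBIGUITY_LEXICON = {
    # every PoS analysis attested for the written form
    "number": {"NOUN", "VERB", "ADJ"},   # noun; "enumerate"; "more numb"
    "the": {"DET"},
}

def lookup(token: str) -> set[str]:
    """Return all candidate tags; unknown tokens get an open guess."""
    return AMBIGUITY_LEXICON.get(token.lower(), {"NOUN", "VERB", "ADJ", "X"})

def disambiguate(tokens: list[str]) -> list[str]:
    """Pick one tag per token using crude context rules (illustrative only)."""
    tags = []
    for i, tok in enumerate(tokens):
        candidates = lookup(tok)
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        prev = tokens[i - 1].lower() if i > 0 else None
        if prev == "the" and "NOUN" in candidates:
            # a determiner to the left strongly suggests a nominal reading
            tags.append("NOUN")
        elif i == 0 and "VERB" in candidates:
            # sentence-initial position allows an imperative reading
            tags.append("VERB")
        else:
            tags.append(sorted(candidates)[0])  # arbitrary fallback
    return tags

print(disambiguate(["The", "number", "was", "chosen", "randomly"]))
print(disambiguate(["Number", "the", "examples", "sequentially"]))
```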
Broadly speaking, PoS tagging can be either supervised or unsupervised, and either statistical or rule-based (although models which use a combination of approaches exist). Supervised tagging uses pre-tagged linguistic data as a training set to train the tagger; unsupervised tagging uses an untagged training set [ClarkLappin2010]. The earliest PoS taggers were rule-based, using disambiguation rules written by hand. These rules may take the form of finite-state automata based on regular expressions, which accept or reject potential sentence readings, or they may encode context-patterns and select or reject PoS analyses based on a token’s context frame [Voutilainen2003, DeRose1988]. Later development of rule-based tagging notably resulted in the constraint-grammar formalism, which allows detailed construction of fine-grained rules, taking into account short- and long-range context of a token as well as its structural relations with other tokens [BickDidriksen2015]. Both supervised and unsupervised rule-based tagging are possible, the latter most famously practised by Eric Brill, who used transformation-based error-driven learning to derive appropriate rules [Brill1995]. Statistical PoS tagging (also sometimes called probabilistic or stochastic) disambiguates by using the likelihood of occurrence of a particular token with a proposed PoS tag in the immediate context of its surrounding tokens, often by means of (Hidden) Markov Models. This likelihood may be statistically computed from either pre-tagged or untagged training sets, using a variety of different approaches [Abney2007, Merialdo1991].
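For concreteness, the standard bigram HMM formulation (a textbook statement, not the model of any particular tagger discussed here) selects the tag sequence for tokens $w_1 \ldots w_n$ that maximizes the product of emission and transition probabilities:

$$\hat{t}_{1 \ldots n} = \operatorname*{arg\,max}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

In a supervised setting these probabilities can be counted directly from a pre-tagged corpus; in an unsupervised setting they must be estimated from raw text, for example by maximum-likelihood re-estimation of the kind examined by [Merialdo1991].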
It is evident that PoS tagging is a markup problem in a very trivial sense. The process of annotating units in a text is precisely what markup is, after all. However, what I mean by calling PoS tagging “a markup problem” is not simply that PoS tagging is markup. What I want to investigate here is how the theoretical problems that have been identified in the domain of markup can throw light on the problem of grammatical category disambiguation.
Unless specified otherwise, the linguistic examples in this paper will be adapted from the corpus collected for the DERWen PoS tagger, which is an offshoot of my PhD research into Welsh-English bilingualism. My dissertation focuses specifically on the presence of English-origin items in Welsh discourse, and considers whether it is necessary or possible to distinguish borrowings (defined as words which have been adapted and adopted into the Welsh lexicon) from code-switches (words which belong exclusively to the English lexicon, and indicate that the speaker is, in some sense, switching between the two languages). DERWen is an attempt to produce a tagger capable of tagging mixed-language Welsh-English Twitter discourse. It is worth explaining briefly the particular problems posed by this kind of language, in order to provide context for the discussion that follows.
“Hwn yn textbook styff” (“this is textbook stuff”): The nature of mixed Welsh-English social-media data
The data to be analysed by DERWen is around a million words taken from Twitter. The tweets were selected using Welsh keyword searching, and cleaned to remove corporate accounts, obvious quotations (lines from certain famous Welsh songs are extremely popular on Twitter whenever the Welsh men’s rugby team plays an international), and false positives. Since the search criteria only used very common Welsh words as search terms, there was no attempt to control how many English-origin items would appear in the corpus.
Before social media provided the potential to access naturalistic, colloquial written data, large corpora of contemporary (or near-contemporary) language tended to be composed of published written texts and/or prepared transcripts of spoken discourse. In these cases, the standard conventions of written language (such as use of whitespace and punctuation) are usually maintained. Data from social media, in contrast, tend to be unconventional and unpredictable [Derczynskietal2013]. This provides opportunities to gather large samples of colloquial language which, in many ways, mirrors linguistic characteristics of speech (such as code-switching). However, it also means that NLP tasks become much more difficult [Owoputi2013]. NLP tasks often rely, at least partially, on pre-built dictionaries of known words with standardized spelling and on algorithms pre-trained with standard forms of language [Neunerdtetal2013]. Non-standard spelling, omitted punctuation, typos, abbreviations, neologisms, and playful distortions of language are all common in social media discourse.
Welsh Twitter is no exception to the linguistic messiness of social media, with the additional complication that it is often messy in two languages, as well as in a third, hybrid set of lexical items which belong to both languages. Welsh speakers are always at least bilingual, with fluency in English almost universal (apart from some members of the small Welsh community in Patagonia, who are instead fluent in Spanish). Intensive lexical borrowing by a sociopolitically subordinate language in close geographic and/or cultural contact with a dominant language is common (although by no means universal [Wohlgemuth2009]), and this is certainly the case for Welsh and English. Code-switching by bilinguals is also a common phenomenon, particularly when communicating with others who speak the same language pair. Although mixing languages in single sentences or clauses is often stigmatised, there is now substantial agreement that it is a sign of balanced skill in the two languages [MyersScotton1993]; speakers who have mastered the linguistic systems of both languages to an equal extent tend to be those who code-switch most intensively [Poplack1980]. Given this, it is no surprise that a Welsh-language Twitter corpus should feature numerous English-origin items (as illustrated in Figure 1).
Such words present one kind of challenge to PoS tagging. However, these words which retain their original English form are not the only kind of English-origin items to be found in the corpus, as seen in Figure 2 and Figure 3. The first of these figures shows English-origin words which have been adapted to Welsh orthography. Welsh has a fairly shallow orthography, especially in comparison with English, which means that there is little ambiguity about the sound represented by a sequence of written letters. However, the letters used by Welsh for a variety of sounds differ from those commonly used in English. For example, the vowel sound in English “but” can only be spelled with the letter y in Welsh, while the vowel sound in “boot” is always written w. The vowel sound in “beat” is the most variable in terms of possible Welsh spellings: it can be represented by i, u, or y. As a result of the significant differences between Welsh and English orthographic conventions, it is very common for loanwords into Welsh to change their spelling. Figure 2 shows an established loanword with a long history in Welsh (busnes / “business”) and three which appear to be off-the-cuff adaptations of English words (garantîd / “guaranteed”, findalŵ / “vindaloo”, styff / “stuff”)[1]. All are respelled using Welsh orthography.
In Figure 3, we see how English-origin words may be adapted morphologically when used in Welsh discourse, with Welsh plural endings (grwpiau / “groups”), verbal inflection (bownsiais / “I bounced”), and initial consonant mutation (chapsiwn / “caption”).
This kind of language mixing makes PoS tagging harder, both in terms of the identification of word forms themselves, and in terms of the data modelling that underpins the tagging process.
Putting somethings into computers: an overview of three key discussions in markup theory
Texts cannot be put into computers. — Michael Sperberg-McQueen
A full survey of markup theory would be impossible in the space available here. I will therefore only attempt a brief overview of a few foundational theoretical discussions that seem to me particularly relevant to PoS tagging.
Subjectivity and objectivity in markup vocabularies
Markup is, of course, always a layer (or layers) of information added to something. The nature of that something has been a matter of discussion for a number of theorists. For DeRose et al. (1990) and others, content elements constitute a document; if the content elements change, the document is no longer the same. Changing the markup, in contrast, may change how the document is interpreted, how it can be stored or shared, and what use can be made of its content; but regardless of such changes, the content — and therefore the document itself — remains the same [DeRoseetal1990]. It is important to consider here Sperberg-McQueen’s (1991) assertion that the document is not the text: the text is an abstraction, which is realized in one or more physical (analogue or digital) forms. A representation of a text is, in Sperberg-McQueen’s view, never impartial; it results from and is shaped by the creator’s inevitable biases and judgements [SperbergMcQueen1991]. Combining these two perspectives leads to the conclusion that a marked-up document is a palimpsest formed of three layers: the text itself; the text representation; and the markup. Each of these layers introduces uncertainty because each is the result of human cognitive processes, and human cognition is nothing if not unreliable.
A central task of a well-organized markup project is the preparation of a data model, represented by some kind of schema. Klein and Hirschheim (1987) describe a schema as the representation of a “Universe of Discourse”: some subset of existing objects and structures. The authors seek to unpack the sense in which these objects and structures “exist”, by considering how data modelling approaches both the ontology and the epistemology of the universe of discourse. They argue that the ontology may be approached from one of two philosophical standpoints: realism or nominalism. Realists see the universe of discourse as a representation of some immutable, objective set of objects and structures, which have empirical existence prior to the creation of the schema. Nominalists, in contrast, see reality as a subjective construct, whose representation in a schema is guided by the creator’s sociocultural assumptions and linguistic background [KleinHirschheim1987]. Sperberg-McQueen's assertion of the partiality of text representations aligns with this nominalist approach to markup, leading to the insight that both the text representation and the markup schema are (in different ways) partial or biased renderings of the text abstraction.
The epistemology of the universe of discourse, that is, our understanding of what we know about it and how we know it, may also be approached in one of two ways. A positivist approach explains observable phenomena by identifying causal relationships, and selecting the causal model that best fits those phenomena. An interpretivist approach, meanwhile, asserts that a causal model is inappropriate for understanding phenomena mediated through human action. The data modeller cannot avoid using socially conditioned pre-understanding of the subject, and so can only understand from a subjective point of view, not from some objective, outsider’s standpoint. Ontological and epistemological approaches tend to align in fixed pairings, creating an objective (realist-positivist) paradigm on the one hand, and a subjective (nominalist-interpretivist) paradigm on the other [KleinHirschheim1987]. We will see later how these paradigms can help us to think about the construction and application of PoS tagsets.
Markup as a theory of the text
Sperberg-McQueen imagines the markup scheme as a theory of the text for which it is intended. Markup schemes provide a particular view of a text, and shape what we are likely to do with that text by making certain tasks easier to conceive and perform than others [SperbergMcQueen1991]. Maximizing reusability of texts in a variety of ways is (or should be) a key aim of PoS tagging for corpus annotation, in particular, since corpus creation tends to be expensive in time, effort, and money [Kahreletal1997]. This aim is perhaps best served by a declarative markup scheme, in which we represent what the text is, not how it should be processed [SperbergMcQueen1994]. When the markup project takes as its object not the text abstraction but a specific text representation, however, it may be necessary for the markup scheme to make a distinction between the essence of the text and how it should be represented. This distinction may be conceptualized as that between the markup-object’s deep structure and its surface structure [IdeVeronis1995]. If the aim of the markup scheme is to allow creation of a facsimile of a particular text representation, then it will necessarily be to some extent procedural. It will also encode what Coombs et al. (1987) call presentational markup: the kind of markup which includes line breaks and page numbers, and which is conventionally used for physically printed text [Coombsetal1987]. These are not elements that markup would generally aim to capture since, to repeat Sperberg-McQueen’s maxim, “the text is not the same as the page” [SperbergMcQueen1994]. Of course, one way of reconciling facsimile markup schemes with this maxim is to conceive of the text-representation in question as a new text abstraction, different from the abstraction that generated the representation originally. The markup scheme is therefore a theory of the text-representation-as-text-abstraction, and not of the original text abstraction.
The problem of facsimile-markup schemes is illuminated by Birnbaum and Mundie (1999) who, like Ide and Véronis [IdeVeronis1995], consider markup schemata created for dictionaries [BirnbaumMundie1999]. While the abstract text of a dictionary — its deep structure — is evidently of significant interest to many users, there are equally those for whom the historical record of the dictionary’s physical form — its surface structure — is important information. That is to say, we may want to consult a dictionary’s text representation in order to access the abstract information contained in it (the definition or spelling of a word, or its date of first attestation, for example); but we may also want to know exactly how the Oxford English Dictionary (OED)’s entry for a word looked on the page when it first appeared in print. Alongside Ide and Véronis' deep/surface distinction, we might here consider Piez’ (2001) distinction between proleptic and metaleptic markup. The former looks to the document’s future, and the uses to which it may be put; the latter looks backwards, towards the pre-existing features of the data’s structure [Piez2001]. Proleptic markup facilitates future production and interchange, because it is focused primarily on what we might want to do with the data. Metaleptic markup facilitates accurate representation of structures that are derived from the data; it is not interested in how the data may be used, but in what it seeks to describe [Piez2001]. Whereas preparing a facsimile representation of the OED is clearly metaleptic, representing the OED’s content so that it can be queried, transformed, and distributed is proleptic. Piez’ distinction clarifies the important fact that both types of markup scheme have value and purpose; choosing which one to privilege may therefore pose difficulties.
Birnbaum and Mundie note that the structure of entries in the OED’s first and second editions is not always entirely regular. There are rules governing which elements are obligatory in an entry, and in what order they should appear. However, it is not entirely uncommon to find that errors have crept in, leading to incorrectly structured entries. A markup scheme author for such a text must find a way to resolve this problem: either the scheme must become extremely flexible, to allow for (in principle) any kind of violation; or the scheme must include what the authors term an “escape-hatch” structure for deviant data; or the text must be modified, to conform to the scheme [BirnbaumMundie1999][2]. These options all allow the marked-up document to be processed automatically as valid. However, all three also require compromise on the theory of the text represented by the schema[3]. In the case of editorial correction, the theory privileges the idealised text abstraction, and sidelines the text representation where it fails to conform. The flexible DTD theorises the text abstraction as a collection of loosely-structured information lacking structural specificity, and thereby fails to recognise that the text abstraction in fact has a strict intended structure, even if the text representation does not always convey that structure validly. Acknowledging the intended structure is an important aspect of representing the meaning of the abstract text: dictionary entries are presented in such a way as to maximize space and to guide the reader to an understanding of the entry’s information structure in an efficient and unambiguous manner. Finally, the escape hatch allows the representation of divergent data by offering alternative elements whose structure parallels, but is far looser than, that of the canonically-structured elements. The theory of the text implied here is that it is validly composed of strictly-structured data, interspersed with the occasional passage of deliberately unstructured data. Even if the escape-hatch elements are named in such a way as to make clear that the structures they contain are irregular or erroneous, this identification of error is purely semantic [BirnbaumMundie1999]. It elides the fact that such entries are not syntactically valid, according to the original text abstraction, even if they are validly a part of the text representation's abstraction.
The solution offered by Birnbaum and Mundie is to theorize the document as having (at least) two structural layers: the idealised one and the concrete one. These layers, they argue, are analogous to the distinction in descriptive linguistics between competence (the idealized abilities of a language user, including knowledge of linguistic rules) and performance (how the user actually uses the language, including errors, slips of the tongue, etc.). They propose that the relationship between these two layers may be represented as transformation rules. The valid, corrected version of the text is maintained for convenience, alongside a set of transformation rules which can be used to derive the invalid “facsimile” form of the text-representation [BirnbaumMundie1999]. This ingenious solution theorises the text as an abstraction, and the text-representation as a transformation of that abstraction into a concrete instance, which makes some attempt to represent the structure of the abstraction consistently, but may ultimately fail to do so. Nevertheless, even an imperfect text representation becomes its own text abstraction when it is the object of an attempt to mark it up. The text behind a text-representation is therefore theorised, not as a single abstraction, but as two: the text abstraction that generated the representation; and the text abstraction that the representation generates. This theoretical approach to anomalous data will become important later, as we consider the abilities of PoS tagsets to theorise non-standard and mixed language varieties.
The semantics of a markup vocabulary
Renear et al. (2002, 2003) wrestle with the problem of markup vocabulary semantics. Whereas the syntax of a markup vocabulary can be specified in a schema document, there is no comparable way of specifying the semantics of a vocabulary. As a result, users have to conjecture what the vocabulary designer may have intended, or — at best — rely on prose documentation, which is not formally verifiable and is prone to human error [Renearetal2002, Renearetal2003]. The disconnection between designers and users, between vocabulary and instance, leads to the use of tags to “mean” things that were unintended by the modellers [Piez2001]. Taken to an extreme, it might mean that two instances using the “same” vocabulary are not, in fact, representing data in comparable or compatible ways. Renear et al. propose that some means of specifying the semantics intended by a markup vocabulary would reduce ambiguity and tag abuse, as well as making processing easier to automate. A formal specification of a vocabulary’s semantics is, after all, the only way for a non-human interpreter to “understand”, for example, that a <title> element child of a <head> is the title of the <document>, but that a <title> child of a <chapter> is the title of its parent element, and that both of these are different from the <title> in a bibliographic reference [Renearetal2002]. Other types of information that, according to the authors, can only be understood semantically include class relationships between elements and attributes, the propagation of attributes and their values from parent to child elements, and ontological variation in the reference of an element. Ontological variation is the problem that a single element may in fact be a conflation of a number of different referents. A <sentence> element, for example, may refer to the sentence as an abstraction, to the proposition expressed by the sentence, and to the concrete character data used to render the sentence. Each of these referents of the element might be addressed by different attributes, which (in a syntactic analysis) would be interpreted simply as attributes of the same element [Renearetal2002].
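The point can be demonstrated mechanically. In the following Python sketch (a hypothetical vocabulary, not a real schema), nothing in the syntax distinguishes the two <title> elements; the differing referents have to be supplied by an interpretation rule written by someone who knows the intended semantics.

```python
# A sketch of why markup semantics resists purely syntactic treatment: the
# *referent* of a <title> depends on its parent. The vocabulary and the
# interpretation rule below are hypothetical, not drawn from any real schema.
import xml.etree.ElementTree as ET

DOC = """
<document>
  <head><title>Whole-Document Title</title></head>
  <chapter><title>Chapter Title</title><p>...</p></chapter>
</document>
"""

def title_referents(root: ET.Element) -> dict[str, str]:
    """Map each <title>'s text to what the markup implicitly says it titles."""
    referents = {}
    for parent in root.iter():
        for child in parent:
            if child.tag == "title":
                # identical element type, different licensed inference
                what = ("the document" if parent.tag == "head"
                        else f"its parent <{parent.tag}>")
                referents[child.text] = what
    return referents

print(title_referents(ET.fromstring(DOC)))
# {'Whole-Document Title': 'the document', 'Chapter Title': 'its parent <chapter>'}
```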
Tennison (2002) discusses the importance of combining both syntactic and semantic understanding of markup in attempting to automate transformations between different markup vocabularies. An effective transformation application needs to be able to measure the distance between languages, and to determine from that information that the best way to transform vocabulary A into vocabulary B is to perform an intermediate transformation into vocabulary C or D or E [Tennison2002]. Tennison proposes that a potentially useful measure of the distance between vocabularies is how much information is lost and/or gained in the transformation from one to the other. Vocabularies show asymmetry in terms of which information they choose to represent, to what level of specificity, and with what kinds of labels or structures. In a similar vein, Sperberg-McQueen (2011) is concerned with how to measure the success of data format conversion, in the context of digital preservation. He explores the case of conversion between markup vocabularies to outline a model of “noise-free lossless conversion”. According to this model (drawing on prior work by e.g. [Renearetal2002], [Renearetal2003], [Marcoux2006]) the meaning of a document’s markup is the sum of the inferences licensed by that markup (that is, the things that are accepted as being true as a result of the markup). Noise-free lossless conversion between vocabularies can be said to have occurred if the output format licenses the same (and only the same) inferences as the input does. Lossless conversion requires that all the input inferences be present in the output; noise-free conversion requires that no inferences be present in the output that were not present in the input [SperbergMcQueen2011]. The problems of semantic and syntactic differences in tagsets and of (in)compatibility between descriptive models for different languages will be the subject of much of what follows in this paper.
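Sperberg-McQueen's criterion lends itself to a set-algebra formulation. The sketch below (with invented inference strings) treats the meaning of a document's markup as a set of licensed inferences and tests a conversion for losslessness and noise-freedom; it is an illustration of the model, not code from any of the works cited.

```python
# Model a document's markup meaning as the set of inferences it licenses,
# then compare input and output of a conversion. Inference strings invented.

def conversion_report(inferences_in: set[str], inferences_out: set[str]) -> dict:
    """Judge a format conversion by the inferences its markup licenses."""
    return {
        "lossless": inferences_in <= inferences_out,    # every input inference survives
        "noise_free": inferences_out <= inferences_in,  # no unlicensed inference added
        "lost": inferences_in - inferences_out,
        "noise": inferences_out - inferences_in,
    }

rich_format = {"token 1 is a pronoun", "token 1 is third person", "token 1 is singular"}
poor_format = {"token 1 is a pronoun"}

print(conversion_report(rich_format, poor_format))
# lossless: False (two inferences lost); noise_free: True (nothing added)
```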
The language of lions: whose universe of discourse is represented in PoS tagging?
Wenn ein Löwe sprechen könnte, wir könnten ihn nicht verstehen. (“If a lion could speak, we could not understand him.”) — Ludwig Wittgenstein
Whether relying on pre-tagged natural-language data, dictionaries, or even on untagged data, PoS-tagging approaches generally have in common that they presuppose a tagset. The process of assigning a PoS tag, or of disambiguating assigned tags, is successful when it selects a PoS tag that is (at worst, only plausibly) correct according to a human linguistic analysis, and belongs to a PoS category that occurs in the language of the text being tagged[4]. Even unsupervised models need to have some knowledge of the categories to be assigned to the tokens in output. This may not immediately seem to be a particularly thorny issue. However, although ten word classes[5] are traditionally accepted, their ability to account for the nuances of grammatical description, or for the grammar of all human languages, is not clear [Haspelmath2009]. Furthermore, once we go beyond these basic categories and consider the possibilities for representing subcategories in the tagset, we begin to see significant differences in the conception of the lexico-semantic inventory of a language. As an illustration of this issue, let us consider the case of the Welsh grammatical category traditionally known as the berfenw (“verbnoun” or “verbal noun”).
The Welsh berfenw: a case study in linguistic tag abuse?
According to many, if not most, grammars of Welsh, the berfenw is a non-finite verb. It is used in combination with finite verbs to express past, future, and present time. It also has other functions, which are generally translated into English using a present participle. Figure 4 gives examples of some of the major uses of the berfenw (an italicised term in the second line of each example indicates a berfenw in the original text).
The participation of the berfenw in constructions which, in English, are accomplished with verbs (4a–4d), as well as the ability to translate other uses with English forms derived from verbs, such as participial adjectives (4e) and gerunds (4f), contributes to the understanding of the form as a kind of verb. However, the “enw” component of berfenw means “noun”, and reflects the fact that the berfenw is very commonly used as in example 4e. Furthermore, decomposing these constructions with the berfenw shows that, in every case, it is structurally better understood as a noun than as a verb. Figure 5 repeats the previous set of examples, but this time with an unidiomatic English translation that reflects the actual grammatical structure of the original with respect to the function of the berfenw.
The constructions in a–d are what is known as “light-verb” constructions, in which the main verb of the sentence (usually a verb meaning something like be or do) carries little semantic content. The semantics of the event or action are instead carried by another element. This is not a usual way of expressing such semantics in most varieties of English. However, such constructions are common in the Celtic languages; echoes of them may be heard in a quintessentially Irish-English way of talking about something that has been done (a construction sometimes known as the after perfect): “it’s after upsetting him” (meaning “it has upset him”); “I’m after being in at the mart” (meaning “I’ve just been in at the mart”) [Carey2016]. It is the verb be alone which performs the grammatical function of the verb in these sentences; the semantic function of expressing the action that has occurred is performed by the gerund (i.e. a noun which is derived from a verb). Nonetheless, the berfenw-type forms are predominantly labelled as verbs in work on all the Celtic languages [Jeffers1978], [Li2004]. Despite persuasive analysis by [Willis1988], who argues for classifying the berfenw exclusively as a noun, it is still almost always called a verb in modern Welsh linguistics.
The early linguistics of vernacular Indo-European languages was heavily influenced by the linguistics of Latin and Greek, with categories from these languages either carried over wholesale, or adapted to the needs of the vernaculars. While undeniably an artificial constriction of language description, this nonetheless also provided a coherent framework in which knowledge of the vernaculars could be codified and exchanged [RabyAndrieu2018]. A potential problem with describing a language with grammatical categories derived from different languages is that, if one knows that there are “nouns” and “verbs”, one approaches linguistic data by trying to find the nouns and the verbs. It is difficult, if not impossible, to conceptualise the structure of a new language without reference to already known linguistic terms and concepts. This, I would argue, is why the Welsh berfenw is called a “verb”: linguistics, like history, is written by the victors; and English established itself as victor over Welsh early on. Welsh linguistics has, until very recently, essentially been the province of people who received their formal education in some other language. From the early history of public education in the British Isles through the latter half of the twentieth century, formally educated Welsh people received their education through the medium of English. English was considered the best route to educational and social achievement, and the point of entry into modernity, even for children who spoke nothing but Welsh at home, and whose communities were primarily or entirely Welsh-speaking [ADavies2003]. It is only in the last half century that significant numbers of children have been able to receive an education through the medium of Welsh. Curricula and learning materials, however, have generally been centrally mandated by the English government and adapted or translated for the Welsh-language context. As a result, there is no Welsh-first tradition of education or of scholarship in linguistics or the sciences that might challenge the use of English-derived linguistic description for Welsh.
Traditional English grammatical categories have no place for a noun that carries the semantic content of a verb, other than calling it a non-finite verb form performing the functions of a noun. Klein and Hirschheim’s distinction between entity-based and rule-based modelling [KleinHirschheim1987] helps us to focus on this issue as a problem of data modelling. Are the members of a PoS tagset conceived as entities or as the sum of a set of rules? That is, does the presence of a tag <non-finite-verb> in the tagset indicate a belief that the non-finite verb has an objective existence as a linguistic entity, and that the purpose of tagging is to find the tokens that “are” non-finite verbs? Or does it express belief in the existence of a set of conventionalized linguistic rules, which justify tagging a token as a non-finite verb if it satisfies those rules? The former approach adheres to an objectivist conception of the universe of discourse represented by the tagset, and the latter to a subjectivist conception. The tagging of the Welsh berfenw as a non-finite verb requires a conception of “verb” and “noun” that ignores the grammatical rules used to identify these categories in English and other languages, and instead insists that the Welsh non-finite verb exists, despite all the evidence that the berfenw behaves like a noun[6]. The berfenw-as-verb is an entity. The berfenw-as-noun, in contrast, is the product of a rule-based construction of the universe of discourse, in which the approach when tagging tokens is not “where are the non-finite verbs?”, but “the tag <non-finite-verb> will be used for any token which satisfies the conventional criteria by which we identify tokens as non-finite verbs”. The tagset, although pre-created in the sense that the rules for named categories are known, is not pre-assumed. It is possible for words that are traditionally called “verbs”, or which perform the semantic functions fulfilled in other languages by verbs, to be tagged as nouns because they follow the rules for nouns.
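The contrast can be stated in code. In the rule-based conception sketched below, a tag is not the name of a pre-existing entity to be found, but a label earned by satisfying conventional criteria; the criteria themselves are invented placeholders, not an actual grammar of Welsh or English.

```python
# Rule-based (subjectivist) tagging, as a sketch: the category is constituted
# by its criteria, not presupposed as an entity. All criteria are invented
# placeholders for illustration only.
from typing import Callable

Criteria = list[Callable[[dict], bool]]

NOUN_RULES: Criteria = [
    lambda tok: tok.get("takes_possessive_determiner", False),
    lambda tok: tok.get("heads_light_verb_object", False),
]

NON_FINITE_VERB_RULES: Criteria = [
    lambda tok: tok.get("complements_finite_verb", False),
    lambda tok: not tok.get("inflects_for_tense", True),
]

def rule_based_tag(tok: dict) -> str:
    """Assign whichever tag's criteria the token satisfies; checking the
    noun rules first reflects the noun analysis argued for above."""
    if all(rule(tok) for rule in NOUN_RULES):
        return "noun"
    if all(rule(tok) for rule in NON_FINITE_VERB_RULES):
        return "non-finite-verb"
    return "unknown"
```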
Encoding information and licensing inferences about English pronouns
A significant problem in all kinds of linguistic annotation of data is that different projects develop or adapt markup vocabularies which differ more or less substantially from those used by other projects or for other languages. The problem described above, where there is disagreement about the appropriate category to assign to a token, is paralleled by the problem of disagreement on how to model categories upon which there is agreement. Even on the level of supposedly unifying ontologies for linguistic annotation, at least three major projects exist, each of which takes a different approach and draws on different pools of expertise [ChiarcosSukhareva2015]. For English, a well-resourced language with a wealth of scholarly discussion informing NLP, major tagsets vary in size from the Penn Treebank tagset, with 36 PoS tags [Tayloretal2003], through the Oxford English Corpus tagset (101 PoS tags) [SketchEngine], to the CLAWS tagset (at least 150 tags; the exact number dependent on the tagset version) [CLAWS]. While these tagsets do, at least, broadly agree with each other on what the parts of speech are (there are no “verbnoun”-type problems here), they nonetheless disagree on how to categorize them, to what level of detail, and with what labels — in short, they disagree in the inferences that the markup licenses [SperbergMcQueen2011].
The Penn Treebank tagset (hereafter Penn), for example, conflates prepositions and subordinating conjunctions: the token “for” would therefore be given the same tag, <IN>, in the sentences “It was for my mother” and “She went, for she had no reason not to”. Subject and object pronouns are also tagged with a single tag in Penn (<PRP>), and distinguished from possessive pronouns (<PRP$>), whereas the Oxford English Corpus tagset (hereafter OEC) distinguishes objective personal or possessive pronouns (tagged <OPP>) from subjective personal or possessive pronouns (<SPP>). Meanwhile, the latest CLAWS tagset, version C7, distinguishes the possessive pronouns (<PPGE>) from the (non-reflexive) personal pronouns, which are then subdivided into ten categories. These distinguish the neuter it as either subject or object (<PPH1>) and the second-person singular/plural subject/object pronoun you (<PPY>) from the third-person and first-person singular and plural subject and object pronouns. Figure 6 shows how the three tagsets categorize eight pronouns as used in three example sentences.
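The divergence can also be read as data. In the sketch below, the Penn and OEC tags follow the prose above; the CLAWS C7 labels for the third-person forms reflect my reading of the published tagset documentation, and the syntactic roles are invented for illustration, so the table should be checked against the tagsets themselves before reuse.

```python
# How the three tagsets label a subset of the pronoun tokens discussed here.
# Penn and OEC tags as described in the text; CLAWS C7 labels as I read the
# published documentation (illustrative, to be verified against the tagset).

PRONOUN_TAGS = {
    # (token, role in its sentence): tag in each scheme
    ("she", "subject"):    {"penn": "PRP", "oec": "SPP", "claws": "PPHS1"},
    ("him", "object"):     {"penn": "PRP", "oec": "OPP", "claws": "PPHO1"},
    ("it", "subject"):     {"penn": "PRP", "oec": "SPP", "claws": "PPH1"},
    ("you", "subject"):    {"penn": "PRP", "oec": "SPP", "claws": "PPY"},
    ("you", "object"):     {"penn": "PRP", "oec": "OPP", "claws": "PPY"},  # CLAWS collapses these
    ("them", "object"):    {"penn": "PRP", "oec": "OPP", "claws": "PPHO2"},
    ("theirs", "object"):  {"penn": "PRP", "oec": "OPP", "claws": "PPGE"},
]   if False else None  # (table above is the payload; no processing needed)
```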
As we can see, the three tagsets categorise the tokens differently. CLAWS provides the most fine-grained analysis. However, it is not capable of distinguishing the two instances of you. OEC, which is generally much less informative, does encode the information about the token’s syntactic role as subject or object that provides one way to distinguish the you tokens. The main types of information that could be encoded about these pronouns are shown in Figure 7.
The encoding scheme is a theory of the text it encodes; electronic representations embody ideas of what is important in a text [SperbergMcQueen1991]. As we would expect, all of the tagsets encode the various pronouns’ basic PoS; whether they conceive of the eight tokens as belonging to a single category (Penn) or split them into two categories (OEC) or four (CLAWS), the labels for those categories all indicate that the token is unambiguously a pronoun. In contrast, none of the tagsets encodes information about the grammatical gender of gendered pronouns. (Ungendered pronouns are indicated by a dash in the relevant table cells.) CLAWS is the only tagset that encodes grammatical number, but only does so for those tokens which have a different form in singular and plural; the same is true for its encoding of the subject/object distinction. This means that it is not capable of capturing number or subject/object information about “you” tokens. It does, however, distinguish between the personal pronoun “you” and the possessive pronoun “yours”, unlike either of the others. Penn is the only tagset that fails entirely to distinguish between pronouns used as subject and pronouns used as object. The only distinction asserted by the tagset, in fact, is that these pronouns are distinct from “adjectival possessive” pronouns (tagged <PRP$>) [Santorini1990], such as his and my. These forms, I would argue, are not pronouns at all, but determiners. Penn therefore gives no extra categorial information about personal pronouns, except that they are personal pronouns and not possessive determiners.
The set of potentially interesting features in a text is infinite [SperbergMcQueen1991], and which are deemed worthy of representation in a markup scheme will vary according to many factors, including whose interest provides the defining context of “interesting”. There will often be a tension between general applicability and precision in tagset creation [IdeVeronis1995]. A focus on what is lost when translating between markup vocabularies, as recommended by both Sperberg-McQueen and Tennison (and discussed above), highlights the problems of the three tagsets in focus here, as regards the precision with which they are able to encode linguistic data. If we were to mark up the three sentences shown in Figure 6 according to either Penn or OEC, conversion to CLAWS would be impossible, since neither of the smaller tagsets encodes enough information to select CLAWS tags correctly. Conversion from Penn to OEC is also impossible, since Penn does not encode the subject/object distinction encoded by OEC. Meanwhile, conversion from CLAWS to OEC would not only sacrifice the more granular information encoded by CLAWS; it would also be made impossible by the fact that OEC needs to know whether the very first token is a subject or an object pronoun, and CLAWS does not encode this information. The only lossless conversions possible between these tagsets (assuming that the conversion is being done automatically, without reanalysis of the text content in order to access information missing in the original encoding) are from either CLAWS or OEC to Penn, and these are only possible because Penn would throw away all of the extra information about these pronouns encoded by the other two tagsets. The rationale behind the decision not to encode this information was a desire to reduce redundancy: the Penn designers modified the tagset previously used for the Brown Corpus, conflating tags where the differences they encoded were recoverable either from the character data they contained or from the parse tree in the alternative, parsed version of the Penn Treebank corpus [Tayloretal2003]. The distinction between the two “you” tokens in Figure 6 would require the latter approach, while distinctions between “she”, “him”, “them”, and “theirs” on the basis of number, person, gender, subject/object role, and the possessive/personal distinction can all be retrieved from the actual word forms in question (e.g. “him” can only be a singular, third-person, masculine, object personal pronoun).
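The directionality of lossless conversion can be seen in miniature in the following sketch, with the tag inventories abbreviated to the pronoun tags under discussion: mapping into Penn is a total, many-to-one function, whereas the reverse “mapping” is a one-to-many relation that cannot be resolved without reanalysing the text.

```python
# Many-to-one conversion into Penn is a function; the reverse is a relation.
# Inventories abbreviated to the pronoun tags discussed in the text.

CLAWS_TO_PENN = {
    "PPHS1": "PRP", "PPHO1": "PRP", "PPHS2": "PRP", "PPHO2": "PRP",
    "PPH1": "PRP", "PPY": "PRP", "PPGE": "PRP",
}

def convert(tags: list[str], mapping: dict[str, str]) -> list[str]:
    """Lossy but total: every CLAWS tag here has exactly one Penn image."""
    return [mapping[t] for t in tags]

# Inverting the mapping yields one-to-many correspondences, so choosing a
# single CLAWS tag for a Penn-tagged token requires reanalysing the token.
PENN_TO_CLAWS: dict[str, set[str]] = {}
for claws_tag, penn_tag in CLAWS_TO_PENN.items():
    PENN_TO_CLAWS.setdefault(penn_tag, set()).add(claws_tag)

print(convert(["PPHS1", "PPGE"], CLAWS_TO_PENN))  # ['PRP', 'PRP']
print(PENN_TO_CLAWS["PRP"])                       # seven candidates
```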
It is perhaps not terribly surprising that a tagset developed for a specific project has idiosyncrasies that tie it to the needs of that project. For a language with a fairly limited inventory of personal pronouns, there is some sense in not multiplying categories too far. Penn would be entirely inadequate for representing Welsh, which has a significantly more complex inventory of personal pronouns. Penn’s theory of the text (the text, in this case, being the linguistic data to which the tagset is applied) is oddly antipathetic to the usual purposes of descriptive markup or of PoS tagging. Markup is intended to license inferences [SperbergMcQueen2011], to codify information that is not made explicit by the text content. Linguistic information is, of course, recoverable from the text content; that, after all, is what PoS tagging does. But the tagging process should remove the need for prospective users of the text to repeat the analysis themselves. Penn is a particularly proleptic tagset, in Piez’ terms. It assumes that the marked-up text will only be used by those who additionally have access to the Penn Treebank parsed corpus, and who are able to do the work of enriching the sparse PoS tag for the pronouns analysed above using both information from that corpus and from the text data itself. As much as a theory of the text, Penn is a theory of the text representation’s end-uses and end-users.
Weirding language: PoS tagging of mixed-language data
“Verbing weirds language.” — Calvin and Hobbes
Standardized language models and linguistic hegemony
The argument that Penn’s personal/possessive pronoun category avoids the proliferation of individual tags which are only ever used for a single wordform seems at first to have some legitimacy on its own terms, at least. After all, as a glance at the CLAWS tagset confirms, a more granular tagset which distinguishes categories such as person, number, etc. might end up with individual tags for the first person singular subjective pronoun (“I”) and its objective counterpart (“me”), and for their plural counterparts (“we”, “us”), and so on. The intended text to be marked up by the tagset, according to this argument, must be theorized as containing entirely regular language in a standardized dialect of English, such as British Standard English or General American. Other dialects often use the pronouns differently, or use different pronouns altogether. I is used as an objective pronoun in some forms of Caribbean English, and I and I is used by some Rastafarian speakers as a singular or plural, subjective or objective first-person form. Us is used in some dialects as a singular first-person pronoun, and in others as the plural subjective form rather than the objective. Forms such as we-all and we-uns are used as first person plurals in some U.S. dialects, while myself (in standard usage, a reflexive first-person pronoun) is used as a subjective and objective first-person singular pronoun especially in Irish English. (All examples are taken from [OED].) Not only does Penn strain our understanding of how markup should license inferences about the text, it also conceives of the abstract text to which the markup might be applied as an instance of English as the authorities say it should be used, not as it actually is used.
The DERWen PoS tagger began with open-source code created for the CorCenCC modern Welsh corpus project [Nealeetal2018], which was written with monolingual Welsh texts in mind. DERWen’s first step was to add an English lexicon alongside CorCenCC’s Welsh one, for initial lookup and naive assignment of PoS categories to be fed into the constraint grammar for disambiguation. The next step was to adapt the tokenizer with rules for English tokenization. A problem which became obvious early on was that both the tokenizer and the naive PoS tagger were making assumptions based on a standardized model of language which was inadequate for unnormalized Twitter data. In English, the tokens its and it’s, for example, pose a significant problem. The tokenizer splits it’s into the pronoun it and the clitic ’s, whereas it does not split its, which is assumed to be the neuter form of the possessive pronoun. Yet, predictably, these two forms are often used in ways that are inconsistent with the standardized English model. In Welsh, colloquial written language often features forms which compound a series of tokens, often a verb, pronoun, and (optionally) a negative modifier (e.g. allaim, which would canonically be written as “alla’ i ddim”, literally “can I not”, meaning “I can’t”). These forms are entirely unstandardized and, although to some extent predictable, are not easy to identify reliably during tokenization. Errors at the tokenization stage, of course, leave a text representation which is impossible to PoS tag correctly. I have not yet settled on a solution for these forms. If we think of the abstract text as a series of tokens, represented by the written data, we might theorize the use of compounds and other structures (deliberate or erroneous) which obscure the token boundaries in unexpected ways as a kind of anomalous representation of the text abstraction, following Birnbaum and Mundie (1999). It would then be possible to develop something like the “escape hatch” structure they describe from the TEI’s tagset for dictionaries, perhaps by adding a special tag for otherwise unknown tokens which match the common morphological features of the compound forms, which could be used to “warn” the constraint grammar that a token may represent a series of PoS categories rather than a single one. However, theorizing these written forms as anomalous risks implying that there is a non-anomalous way of representing the text abstraction, and that it is represented by the standardized written form(s) of the language, which is predictable and can therefore usually be formalized in a way that is automatically tokenizable. The prescriptivist implications of such a theorization are somewhat at odds with the aims and principles of descriptive linguistics [7].
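The following sketch illustrates both problems with invented rules that are emphatically not DERWen's implementation: an English clitic-splitting rule keyed to the apostrophe, and a hypothetical “escape hatch” flag for unknown tokens that match the morphological shape of the Welsh verb-pronoun(-negative) compounds.

```python
# A sketch (not DERWen's code) of clitic splitting and a hypothetical
# escape-hatch flag for suspected Welsh compound tokens. The stem list and
# heuristic are invented placeholders.
import re

KNOWN_WELSH = {"alla", "i", "ddim", "hwn", "yn"}
COMPOUND_STEMS = ("alla",)  # a single invented verb stem for illustration

def looks_like_compound(tok: str) -> bool:
    """Hypothetical heuristic: a known verb stem with fused trailing material
    that could be a pronoun and/or a negative."""
    return any(tok.lower().startswith(stem) and len(tok) > len(stem)
               for stem in COMPOUND_STEMS)

def tokenize(text: str) -> list[tuple[str, str | None]]:
    """Return (form, flag) pairs; flag 'unk-compound' warns the constraint
    grammar that one token may cover several PoS categories."""
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"\w+['’]s", tok):
            tokens.append((tok[:-2], None))   # "it's" -> "it" + "'s"
            tokens.append((tok[-2:], None))
        elif tok.lower() not in KNOWN_WELSH and looks_like_compound(tok):
            tokens.append((tok, "unk-compound"))
        else:
            tokens.append((tok, None))        # "its" passes through intact
    return tokens

print(tokenize("allaim gweld it's its"))
# [('allaim', 'unk-compound'), ('gweld', None), ('it', None), ("'s", None), ('its', None)]
```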
One lesson that can be learned from any attempt to perform PoS tagging on colloquial language is that it is never safe to assume that any token is limited as to its PoS classification. Certainly, there are some words that are exceedingly likely to belong to the category assigned them in a standard dictionary of the language in question. But humans are endlessly inventive and creative with language (even if prescriptivists might like to pretend that non-standard language use is in some meaningful, abstract sense “incorrect”, rather than simply following rules that are not those of the dominant linguistic authorities). One recent example is the word because, which was quite clearly a subordinating conjunction (with the odd, rare use as a noun or an adverb) until fairly recently, and yet is now regularly used as a preposition, as reflected in the title of McCulloch’s (2019) groundbreaking study of internet linguistics, Because Internet. As discussed above, data from sources such as Twitter can be particularly noisy and messy, both intentionally and unintentionally. A markup scheme for even monolingual Twitter data must start with a theory of the text abstraction that conceives of it as a far looser, more playful, more error-prone, and far more pluricentric system than any standardized model of a language can encompass. The safest approach to tagging any such data must be to assume that at least the linguistic categories and distinctions which are known to exist in standardized forms of the language will be present in the data, that they will all probably be of some linguistic interest, and that they may be instantiated in unexpected ways using unpredictable tokens.
Proleptic conclusions: looking towards the future
As the discussion of the berfenw above suggests, it is necessary to be careful when mixing tagsets originally intended for different languages, because it may be that similarities in semantics between two tagsets (e.g. a “verb” category) obscure differences in the underlying conception of what that category should contain. Nonetheless, it is also important to generate a tagset that can be used for both languages, in order to avoid implying a neat kind of linguistic separation that does not exist in bilingual (or multilingual) reality. The universe of discourse of the bilingual language user is not simply the union of the universes of discourse of monolinguals in each of the relevant languages. Rather, it is a complex system which draws on both languages but has features which belong to neither. The English-origin tokens with Welsh orthography and/or morphology discussed above are one example of such a feature, although there is slippage between the innovative forms here and the conventionalized forms of borrowings from English. (Since there is no such thing as a monolingual Welsh speaker, it is impossible to say for certain that a feature used in mixed Welsh-English data is entirely missing from Welsh.) Approaches to PoS tagging for mixed data have included a number of systems which tag the text in two passes, using monolingual taggers for the two languages involved, achieving a fairly high degree of accuracy [Jamatiaetal2015]. This approach is theoretically very unsatisfactory, however, and on a practical level seems likely to fail on forms that belong to the bilingual language system rather than to either of the monolingual systems.
There have been attempts to create “universal”, or at least multilingual, PoS tagsets. One of the most promising currently is that of the [UniversalDependencies] (UD) project. The outline of the project given in [Nivreetal2016] is extremely encouraging from the perspective of markup theory. The authors note that the UD project is a merger of a number of separate initiatives, which suggests an approach that prioritizes interoperability and actively seeks to integrate previous tagsets rather than simply adding yet another notional “standard”. They also emphasize the extensibility of their tagset, satisfying Sperberg-McQueen’s requirement that markup schemes must be extensible, because the set of features worth marking up in a text and the set of texts to be studied are both infinite [SperbergMcQueen1991] (a requirement never more pressing than when the text-as-abstraction in question is the sum of all possible utterances in any human language, which must be the notional object of a universal PoS tagset). Moreover, both the paper authors and the UD website are clear about the theoretical underpinning of UD’s approach to tokenization (and, therefore, to the creation of the text representation that will ultimately be marked up by the tagger). This implies their awareness of the subjective nature of their markup scheme, which — although aiming for universality in its application to all human language — does not lay claim to objectivity. UD imagines the text abstraction as consisting of syntactically-defined tokens, more than one of which may be represented by a single orthographic object (like the Welsh compounds discussed above).
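UD's CoNLL-U serialization makes this token/word distinction concrete: a surface token spanning several syntactic words receives a range ID, while the syntactic words beneath it get their own lines and PoS tags. The segmentation and lemmas for allaim below are my own hypothetical analysis for illustration, not an attested UD annotation of Welsh.

```python
# CoNLL-U represents one orthographic token spanning several syntactic words
# with a range ID (here 1-3). The Welsh segmentation and lemmas below are a
# hypothetical analysis for illustration, not an attested UD annotation.
conllu_fragment = (
    "1-3\tallaim\t_\t_\t_\t_\t_\t_\t_\t_\n"   # surface token, tagged per word below
    "1\talla\tgallu\tVERB\t_\t_\t_\t_\t_\t_\n"
    "2\ti\ti\tPRON\t_\t_\t_\t_\t_\t_\n"
    "3\tm\tddim\tPART\t_\t_\t_\t_\t_\t_\n"
)

for line in conllu_fragment.splitlines():
    cols = line.split("\t")
    kind = "surface token" if "-" in cols[0] else "syntactic word"
    print(f"{cols[0]:>4}  {cols[1]:<8} {kind}")
```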
Conversion of the DERWen PoS tagger to use the UD tagset is at a very early stage. It currently uses an extended version of the CorCenCC tagset used by the code on which DERWen is based, with the addition of tags needed for English categories as well as some Twitter-specific categories (such as emoticons and hashtags). Nonetheless, the apparent theoretical engagement of UD with the issues of markup theory that have been explored here (whether grounded in markup theory itself or not) is encouraging. I hope to be able to report soon on some successful experiments in using the UD tagset for mixed-language Welsh-English PoS tagging.
References
[Abney2007] Abney, Steven. Semisupervised learning for computational linguistics. Chapman and Hall/CRC. 2007.
[BickDidriksen2015] Bick, Eckhard, Tino Didriksen. “CG-3—beyond classical constraint grammar”. pp 31—39. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015. 2015.
[BirnbaumMundie1999] Birnbaum, David J, David A Mundie. “The problem of anomalous data: A transformational approach”. pp 1—19. In Markup Languages: Theory and Practice. v1. 1999. doi:https://doi.org/10.1162/109966299760283157.
[Brill1995] Brill, Eric. “Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging”. pp 543—565. In Computational Linguistics. v21. 1995.
[Carey2016] Carey, Stan. The hot news or after perfect in Irish English. 2016. URI: https://stancarey.wordpress.com/2016/03/14/the-hot-news-or-after-perfect-in-irish-english/.
[ClarkLappin2010] Clark, Alexander, Shalom Lappin. Unsupervised learning and grammar induction. Clark, Alexander, C Fox, Shalom Lappin. The handbook of computational linguistics and natural language processing. Wiley-Blackwell. 2010.
[CLAWS] CLAWS part-of-speech tagger for English. URI: http://ucrel.lancs.ac.uk/claws/.
[Coombsetal1987] Coombs, James H, Allen H Renear, Steven J DeRose. “Markup systems and the future of scholarly text processing”. pp 933—947. In Communications of the ACM. v30. 1987. doi:https://doi.org/10.1145/32206.32209.
[ChiarcosSukhareva2015] Chiarcos, Christian, Maria Sukhareva. “OLiA—Ontologies of Linguistic Annotation”. pp 379—386. In Semantic Web. v6. 2015. doi:https://doi.org/10.3233/SW-140167.
[ADavies2003] Davies, Alan. The native speaker: Myth and reality. Multilingual Matters. 2003.
[DeRoseetal1990] DeRose, Steven J, David G Durand, Elli Mylonas, Allen H Renear. “What is text, really?”. pp 3—26. In Journal of Computing in Higher Education. v1. 1990. doi:https://doi.org/10.1007/BF02941632.
[DeRose1988] DeRose, Steven J. “Grammatical category disambiguation by statistical optimization”. pp 31—39. In Computational Linguistics. v14. 1988.
[Derczynskietal2013] Derczynski, Leon, Alan Ritter, Sam Clark, Kalina Bontcheva. “Twitter part-of-speech tagging for all: Overcoming sparse and noisy data”. pp 198—206. In Proceedings of the International Conference: Recent Advances in Natural Language Processing. 2013.
[Haspelmath2009] Haspelmath, Martin. Lexical borrowing: concepts and issues. Haspelmath, Martin, Uri Tadmor. Loanwords in the world’s languages: a comparative handbook. Walter de Gruyter. 2009.
[IdeVeronis1995] Ide, Nancy, Jean Véronis. Encoding dictionaries. Ide, Nancy, Jean Véronis. Text Encoding Initiative. Springer. 1995.
[Jeffers1978] Jeffers, Robert J. “Old Irish verbal-nouns”. pp 1—12. In Ériu. v29. 1978.
[Jamatiaetal2015] Jamatia, Anupam, Björn Gambäck, Amitava Das. “Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages”. pp 239—248. In Proceedings of the International Conference: Recent Advances in Natural Language Processing. 2015. doi:https://doi.org/10.13140/RG.2.1.1222.0640.
[Kahreletal1997] Kahrel, Peter, Ruthanna Barnett, Geoffrey Leech. Towards cross-linguistic standards or guidelines for the annotation of corpora. Garside, Roger, Geoffrey N Leech, Tony McEnery. Corpus annotation: linguistic information from computer text corpora. Taylor & Francis. 1997.
[KleinHirschheim1987] Klein, Heinz K., RA Hirschheim. “A comparative framework of data modelling paradigms and approaches”. pp 8—15. In The computer journal. v30. 1987. doi:https://doi.org/10.1093/comjnl/30.1.8.
[Leechetal1994] Leech, Geoffrey, Roger Garside, Michael Bryant. “CLAWS4: the tagging of the British National Corpus”. pp 622—628. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. 1994. doi:https://doi.org/10.3115/991886.991996.
[Li2004] Li, Chao. “On verbal nouns in Celtic languages”. pp 163—194. In Proceedings of the Harvard Celtic Colloquium. 2004.
[Lynnetal2015] Lynn, Teresa, Kevin Scannell, Eimear Maguire. “Minority language Twitter: Part-of-speech tagging and analysis of Irish tweets”. pp 1—8. In Proceedings of the Workshop on Noisy User-generated Text. 2015. doi:https://doi.org/10.18653/v1/W15-4301.
[Marcoux2006] Marcoux, Yves. “A natural-language approach to modeling: Why is some XML so difficult to write?”. In Proceedings of the Extreme Markup Languages 2006 Conference. 2006.
[Merialdo1991] Merialdo, Bernard. “Tagging text with a probabilistic model”. pp 809—812. In ICASSP 91: International Conference on Acoustics, Speech, and Signal Processing. 1991. doi:https://doi.org/10.1109/ICASSP.1991.150460.
[Mitkov2003] Mitkov, Ruslan, editor. The Oxford handbook of computational linguistics. Oxford University Press. 2003.
[Manningetal2009] Manning, Christopher D, Prabhakar Raghavan, Hinrich Schütze. Introduction to information retrieval. Cambridge University Press. 2009.
[MyersScotton1993] Myers-Scotton, Carol. Duelling languages: Grammatical structure in codeswitching. Oxford University Press. 1993.
[Nivreetal2016] Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. “Universal dependencies v1: A multilingual treebank collection”. pp 1659—1666. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 2016.
[Nealeetal2018] Neale, Steven, Kevin Donnelly, Gareth Watkins, Dawn Knight. “Leveraging lexical resources and constraint grammar for rule-based part-of-speech tagging in Welsh”. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). 2018.
[Neunerdtetal2013] Neunerdt, Melanie, Bianka Trevisan, Michael Reyer, Rudolf Mathar. “Part-of-speech tagging for social media texts”. pp 139—150. In Language Processing and Knowledge in the Web. 2013. doi:https://doi.org/10.1007/978-3-642-40722-2_15.
[OED] Oxford English Dictionary online. URI: http://www.oed.com.
[Owoputi2013] Owoputi, Olutobi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, Noah A Smith. “Improved part-of-speech tagging for online conversational text with word clusters”. pp 380—390. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
[Pilliereetal2018] Pillière, Linda, Wilfrid Andrieu, Valérie Kerfelec, Diana Lewis, editors. Standardising English: norms and margins in the history of the English language. Cambridge University Press. 2018.
[Piez2001] Piez, Wendell. “Beyond the ‘descriptive vs. procedural’ distinction”. pp 141—172. In Markup Languages: Theory and Practice. v3. iss 2. 2001. doi:https://doi.org/10.1162/109966201317356380.
[Poplack1980] Poplack, Shana. “Sometimes I’ll start a sentence in Spanish y termino en Español: toward a typology of code-switching”. pp 581—618. In Linguistics. v18. 1980. doi:https://doi.org/10.1515/ling.1980.18.7-8.581.
[RabyAndrieu2018] Raby, Valérie, Wilfrid Andrieu. Norms and rules in the history of grammar: French and English handbooks in the seventeenth century. Pillière, Linda, Wilfrid Andrieu, Valérie Kerfelec, Diana Lewis, editors. Standardising English: norms and margins in the history of the English language. Cambridge University Press. 2018.
[Renearetal2002] Renear, Allen, David Dubin, C Michael Sperberg-McQueen, Claus Huitfeldt. “Towards a semantics for XML markup”. pp 119—126. In Proceedings of the 2002 ACM Symposium on Document Engineering. 2002. doi:https://doi.org/10.1145/585058.585081.
[Renearetal2003] Renear, Allen, David Dubin, C Michael Sperberg-McQueen, Claus Huitfeldt. “XML semantics and digital libraries”. pp 303—305. In 2003 Joint Conference on Digital Libraries, 2003. 2003. doi:https://doi.org/10.1109/JCDL.2003.1204879.
[Santorini1990] Santorini, Beatrice. Part-of-speech Tagging Guidelines for the Penn Treebank Project. Technical report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania. 1990.
[SketchEngine] Oxford English part-of-speech tagset. URI: https://www.sketchengine.eu/oxford-english-corpus-tagset/.
[SperbergMcQueen1991] Sperberg-McQueen, C Michael. “Text in the electronic age: Textual study and text encoding, with examples from medieval texts”. pp 34—46. In Literary and Linguistic Computing. v6. 1991. doi:https://doi.org/10.1093/llc/6.1.34.
[SperbergMcQueen1994] Sperberg-McQueen, C Michael. “The Text Encoding Initiative: Electronic text markup for research”. pp 35—55. In Literary texts in an electronic age: Scholarly implications and library services. (Papers presented at the 1994 Clinic on Library Applications of Data Processing). 1994.
[SperbergMcQueen2011] Sperberg-McQueen, C Michael. “What constitutes successful format conversion? Towards a formalization of ‘intellectual content’”. pp 153—164. In The International Journal of Digital Curation. v6. 2011. doi:https://doi.org/10.2218/ijdc.v6i1.179.
[Tennison2002] Tennison, Jenni. “Comparing markup languages”. In Proceedings of the Extreme Markup Languages 2002 Conference. 2002.
[Tayloretal2003] Taylor, Ann, Mitchell Marcus, Beatrice Santorini. The Penn Treebank: An overview. Abeillé, Anne, editor. Treebanks: Building and Using Parsed Corpora. Springer. 2003.
[UniversalDependencies] Universal dependencies. URI: https://universaldependencies.org/.
[Voutilainen2003] Voutilainen, Atro. Part-of-speech tagging. Mitkov, Ruslan, editor. The Oxford handbook of computational linguistics. Oxford University Press. 2003.
[Willis1988] Willis, Penny. “Is the Welsh verbal noun a verb or a noun?”. pp 201—224. In Word. v39. 1988. doi:https://doi.org/10.1080/00437956.1988.11435790.
[Wohlgemuth2009] Wohlgemuth, Jan. A typology of verbal borrowings. Walter de Gruyter. 2009.
[1] Welsh has an established borrowed form of “stuff”, which is spelled stwff and pronounced to rhyme with English “woof”; styff is not a standard form, but is an accurate rendering of the English pronunciation into Welsh orthography.
[2] Birnbaum and Mundie’s fourth option is to keep the text precisely as it is, and produce an invalid document. Although this is evidently an option, it seems to me rather less interesting for understanding the possibilities and constraints of markup as applied to text representations. For better or for worse, therefore, I will not discuss it further here.
[3] The following discussion draws on Birnbaum and Mundie’s appraisal of the relative drawbacks of these three solutions, and frames them in terms of the markup schema as a theory of the text. Although my discussion is therefore indebted to [BirnbaumMundie1999], errors or misjudgements in reinterpreting their arguments in light of Sperberg-McQueen's understanding of the text are my own.
[4] Words belonging to categories that do not occur in the language of the text may appear, but only as words from a different language: a word with foreign morphology is either adapted to the grammatical system of the language in question, or remains as it is, in which case it is treated as an insertion from the source language. Tagsets generally have some variant of a <foreign> tag to mark words that cannot be categorised as belonging to the language of the text, as in the sketch below.
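As a rough illustration (the <foreign> element and xml:lang attribute follow TEI conventions, but the pos values here are invented for this example rather than drawn from any published Welsh tagset), the phrase discussed in note 6 might be marked so that the unadapted insertion socket is flagged as English, while the adapted borrowing chargeo is tagged as an ordinary Welsh token:

<!-- "socket" is an unadapted English insertion, so it is marked as foreign -->
<!-- rather than receiving a Welsh PoS tag; the adapted borrowing "chargeo" -->
<!-- is tagged as Welsh (see note 6 for a more precise tag for it) -->
<s xml:lang="cy">
  <w pos="DET">dim</w>
  <foreign xml:lang="en">socket</foreign>
  <w pos="ADP">i</w>
  <w pos="NOUN">chargeo</w>
  <w pos="NOUN">ffôn</w>
</s>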
[5] (noun, verb, adjective, adverb, pronoun, adposition, conjunction, numeral, article, and interjection)
[6] Undoubtedly, the berfenw must be thought of as a special type of noun. Two of its most obvious linguistic features are that it cannot be pluralized and that it is associated with a specific morphological ending (-io or -o). This ending is frequently used to convert a borrowed English verb into a Welsh berfenw: “dim socket i chargeo ffôn” (“no socket for charging a phone”). The ending is generally added only to root forms, which are usually shared by an inflected verb with the same meaning (a notable exception is the berfenw mynd (“go”), whose corresponding verb has the root a-). A tagset for Welsh should ideally model the berfenw as a kind of noun, but give it its own tag to distinguish it from other nouns, as sketched below. (See [Lynnetal2015] for a similar decision made in Irish PoS tagging.)
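A minimal sketch of such a tag, assuming a hypothetical attribute-based scheme in which the type value berfenw is my own invention rather than part of any existing standard:

<!-- the berfenw is modelled as a subtype of noun: a query for all nouns -->
<!-- includes it, while a query on the type attribute retrieves it alone -->
<w pos="NOUN" type="berfenw" lemma="chargeo">chargeo</w>
<w pos="NOUN" type="common" lemma="ffôn">ffôn</w>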
[7] Attention to the semantics of the tag might mitigate this somewhat, of course, but (as discussed above) the semantics of markup are not formalizable and may therefore be considered to carry less weight for the theory of the text abstraction than the syntax does. The very fact of having an “escape hatch” implies a model of the token as anomalous, regardless of the tag’s label.
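Universal Dependencies’ X (“other”) tag is one real instance of such an escape hatch; in the hypothetical fragment below, applying a tag of this kind to the unassimilated form styff from note 1 asserts the token’s anomalousness whatever gloss the tag’s documentation supplies:

<!-- a catch-all tag applied to a form the tagset cannot otherwise place -->
<w pos="X">styff</w>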