How to cite this paper

Hillman, Tomos, C. M. Sperberg-McQueen, Bethan Tovey-Walsh and Norm Tovey-Walsh. “Designing for change: Pragmas in Invisible XML as an extensibility mechanism.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Sperberg-McQueen01.

Balisage: The Markup Conference 2022
August 1 - 5, 2022

Balisage Paper: Designing for change

Pragmas in Invisible XML as an extensibility mechanism

Tomos Hillman

eXpertML Ltd

Tom Hillman has worked as an XML practitioner and consultant for fifteen years, doing everything from documentation to IT support and administration to workflows for digital publishing to conference organization to XML database management and consultancy.

C. M. Sperberg-McQueen

Black Mesa Technologies LLC

C. M. Sperberg-McQueen is the founder of Black Mesa Technologies LLC, a consultancy specializing in the use of descriptive markup to help memory institutions preserve cultural heritage information. He co-edited the XML 1.0 specification, the Guidelines of the Text Encoding Initiative, and the XML Schema Definition Language (XSDL) 1.1 specification.

Bethan Tovey-Walsh

Swansea University

Bethan Tovey-Walsh is a PhD student in Applied Linguistics and Welsh at Swansea University. She is funded by the CorCenCC corpus of modern Welsh, and created the Welsh part-of-speech tagger now used by the project. She previously worked for OUP as a content architect and as a researcher for the Oxford English Dictionary.

Norm Tovey-Walsh

Senior Software Developer

Saxonica

Norm Tovey-Walsh is currently a senior software developer at Saxonica Ltd, working from his home in Swansea, Wales. Previously, he was employed by MarkLogic Corporation, Sun Microsystems, Arbortext, and O’Reilly Media (then O’Reilly & Associates).

Designing for change: Pragmas in Invisible XML as an extensibility mechanism copyright © 2022 by Tomos Hillman, C. M. Sperberg-McQueen, Bethan Tovey-Walsh, and Norman Tovey-Walsh is licensed under CC BY-NC-SA 4.0.

Abstract

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML. The 1.0 specification for Invisible XML was announced in June 2022. No technology foresees all of its use cases, especially in 1.0. How can ixml allow experimentation, and channel experimentation in useful ways, to allow ideas to be expressed in ixml grammars that go beyond what is foreseen, without compromising interoperability or the value of strict conformance to the specification?

Many programming languages (C, JavaScript, Pascal, XQuery, etc.) address this question with pragmas. A pragma is a semi-formal way to instruct a processor/compiler/interpreter how it should operate. Typical pragmas extend a specification but are not a part of it. We propose pragmas as an optional add-on to ixml to allow implementation of non-standardized functionality in a way that does not interfere with standard ixml processing. We describe our general framework for pragmas, some specific pragmas (to illustrate how pragmas can be used), and a few pragmatic implementations.

Table of Contents

Introduction
What is a pragma?
Some approaches to extensibility
Requirements, desiderata, and use cases
Use cases
Requirements and desiderata
Requirements
Desiderata
Design questions
Pragmas proposal
Syntax in ixml
Internal syntax of pragmas
External syntax: where pragmas may appear
Syntax in XML
Pragma scope
Operational semantics
Conformance requirements for pragmas
Examples
Renaming
Name indirection
Rule rewriting
Tokenization annotation and alternative formulations
Text injection
What next?
Appendix A. Modified ixml syntax

Introduction

Strictly limiting the scope of a specification helps keep the technology simple; prohibiting variation among conforming processors helps implementers achieve interoperability. Simplicity and interoperability may lead to success, success to a broader user community, a broader user community to demands for broader functionality and further development of the specification. This is the virtuous spiral many technology developers hope to achieve.

Successful extension of a technology to address new use cases and incorporate new functionality will, in general, require some experimental implementations of the new functionality. If the initial specification is tightly focused on its core use cases and very strict about prohibiting non-conforming behavior, however, any such experimentation will be non-conforming, which brings two risks: implementers may be reluctant to experiment with new behavior, which means later versions of the spec may lack a firm grounding in experience, or implementers and users may come to regard conformance to the specification as irrelevant to the really interesting work of solving particular problems and providing useful capabilities. If the initial specification is too lax on conformance requirements, on the other hand, interoperability is likely to suffer and user communities will form (if they form at all) around particular implementations rather than around the technology as specified.

We present a concrete design for extensibility in Invisible XML (Pemberton 2022), in the form of a proposal for pragmas, a mechanism designed to allow out-of-band communication between a grammar writer and an ixml processor. An author, for example, might know that a particular rule is amenable to some optimization, or that they would prefer ambiguity to be resolved in a particular way, or that they wish to employ a processor extension of some sort.

We begin with a description of what we mean by the term pragma (section “What is a pragma?”), followed by a short description of some different approaches to the general problem of extensibility in different technologies and specifications (section “Some approaches to extensibility”). We then proceed to a sketch of the requirements (as we understand them) for pragmas in ixml, illustrated with several specific use cases (section “Requirements, desiderata, and use cases”). Then we present the pragmas proposal itself. A few worked examples illustrate how the pragmas proposal outlined here could in principle be used in practice (section “Examples”). We conclude (section “What next?”) with some speculations on future developments.

The proposal described here has grown out of work in the World Wide Web Consortium's community group on Invisible XML, and we thank our colleagues in the community group for discussions of pragmas, extensibility, and related topics.

What is a pragma?

By pragma we mean, in general, a construct in a formal language which conveys non-standard or out-of-band information to processing software in a way not defined by the specification of the language in which the pragma is embedded.

That description may need some unpacking:

  • A pragma is a syntactic construct. That is, it is defined by the grammar of the language, so that any parser for the language can and should recognize pragmas when they are encountered, if only for the purpose of ignoring them.

  • It conveys information to processing software. That is, pragmas are not typically intended solely for human consumption.

    Note that it is impossible to enforce a strict separation between information intended for humans and information intended for software, and so this point must be taken as a description of a general tendency and not as a testable or enforceable rule. But one of the key differences between pragmas and comments is that in general comments are directed at human beings and are to be ignored by software, while in normal usage pragmas serve to convey information to a processor and are thus typically less free-form than comments.

  • The information conveyed by a pragma is typically non-standard.

    This too describes a tendency rather than an enforceable rule. Nothing can prevent someone from using a pragma to convey information which could be conveyed by the standard mechanisms of the language as defined. But if the information in question can be expressed without a pragma, it would be unnecessary, verging on eccentric, to go to the effort of expressing it in a non-standard way.

    Because the interpretation of pragmas is not defined by the specification of the language, the usual rule is that pragmas have no effect on the standard meaning of the document in which they are embedded and can be ignored (e.g., by software which does not understand them).

The term appears to have entered the vocabulary of computing from Algol 68 (van Wijngaarden et al. 1976), which defines a construct it calls a pragmat (apparently short for pragmatic remark or pragmatic comment).[1]

A pragment is a comment or a pragmat. No semantics of pragments is given and therefore the meaning ... of any program is quite unaffected by their presence. It is indeed the intention that comments should be entirely ignored by the implementation, their sole purpose being the enlightenment of the human interpreter of the program.

Pragmats may, on the other hand, convey to the implementation some piece of information affecting some aspect of the meaning of the program which is not defined by this Report, for example: ....

They may also be used to convey to the implementation that the source text is to be augmented with some other text, or edited in some way, for example: ....

The interpretation of pragmats is not defined in this Report, but is left to the discretion of the implementer, who ought, at least, to provide some means whereby all further pragmats may be ignored, ....

Many but not all programming languages defined more recently provide for pragmas, sometimes under other names (directives, declarations); in others, the comment construct is used to convey pragmatic information. The Wikipedia article on Directive (programming) has an unsystematic but informative survey. Typical use cases for programming-language pragmas include hints that a certain kind of optimization might usefully be applied.

Some approaches to extensibility

Designs and specifications for earlier computing technologies have taken a variety of approaches to extensions and to the provision of extensibility mechanisms, with a variety of outcomes. It should be noted that this section presents a series of examples illustrating some points in the abstract design space. It is not a historical survey and should not be misunderstood as attempting to be one.

Sometimes, a major functional area of the technology was left undefined, in the expectation that implementers would fill the gap, and sometimes perhaps in the belief that only implementers working in a particular environment would be in a position to work out the necessary details fully.

The Algol 60 report (Naur et al. 1960) provided no mechanisms for input or output; it was expected that implementations would extend the language in ways suitable for the I/O facilities of the host environment. The designers of C famously made the same decision; the C compiler developed by Kernighan and Ritchie provided a standard I/O (stdio) library but expected (apparently in a sort of “let a hundred flowers bloom, let a hundred schools of thought contend” frame of mind) that different implementers would choose different ways of managing I/O, with different libraries reflecting different ways of handling the task. Pressure from users (i.e., programmers using C compilers) eventually forced all C compilers to provide a version of stdio, and forced the relevant standards committees to standardize that library.

The ISO Pascal standard includes an interesting provision in the list of things a Pascal compiler must do to comply with ISO 7185 (quoted from Jensen and Wirth 1974/1985):

It [must be] able to process as an error any use of an extension or of an implementation-dependent feature.

Two things seem striking here: first the requirement that it be possible to turn off all extensions, which allows users to check that their program does not depend on vendor extensions, and second the quiet assumption (without any discussion that we have found) that there will of course be extensions to the language, in some processors if not in all. The balancing of interests here seems worth bearing in mind: implementers may have an interest in extending the language, and so extensions are implicitly tolerated in a conforming processor.[2] Users, on the other hand, have an interest in portability and in avoiding lock-in, so conforming processors must be able to turn extensions off.

SGML took a different approach (ISO 8879:1986): with its processing instructions, ISO 8879 provided a mechanism that allowed users (and SGML editors) to insert non-standard information into documents and mark it as such, which allows other applications to ignore the information so marked. By requiring that processing instructions begin with a defined name, XML attempted to make it a little easier for processors which use processing instructions to know at a glance whether the instruction is one they should pay attention to or one they should ignore.

Programming-language processors have often felt a need for some similar mechanism for inserting processor-specific annotations into programs. Because programming language syntaxes often lack anything analogous to processing instructions, these processor-specific (or at least non-standardized) annotations are often embedded in what syntactically are comments. Thus a Pascal program[3] might begin with the comment:

{SC+: distinguish between upper and lower case}

In the absence of any inter-implementer agreement on how to distinguish one implementation's annotations from another's, of course, such mechanisms may lead to collisions.

A specification frequently mentioned as having found a successful formula for extensibility is the original HTML specification, which defined a set of element types and required that if an HTML processor encountered an element of an unknown type, it should ignore that element's tags. This provision allowed browser makers to experiment with support for new elements, which in turn allowed for swift development of new functionalities, both good and bad (the blink element is seldom regarded as a triumph of good markup design), although it also tended to make the actual specification of HTML less important than whatever browser makers were supporting on any given day. The HTML rule works less well in cases where the best approach would be for the entire element to be ignored, rather than just its start- and end-tags. But this flaw illustrates an important point about extensibility: finding some path for extensibility can be very useful, even if it is manifestly imperfect.

Some XML-based syntaxes have taken a similar, though less flamboyantly anarchic, approach to extensibility and non-standard content. XSLT (Kay 2017), for example, allows XSLT stylesheets to contain extension elements whose syntax and semantics are implementation-defined. It also allows attributes in any non-XSL namespace to appear on any element in the XSL namespace.

XSLT demonstrates that it is possible to give the author even more control. XSLT provides an explicit fallback mechanism that allows a stylesheet to use later (e.g., version 2.0) constructs when relevant while still telling a processor what to do if it does not understand the base expression. It also provides a “use-when” mechanism that allows the stylesheet author to delimit areas of the stylesheet where extensions are used so that they are targeted only at specific processors that are known to understand them.

The XML Schema Definition language similarly allows foreign attributes on all elements, and for more complex annotations it provides an appinfo element available at key locations, into which schema authors can insert arbitrarily complex material. The namespace-qualified names built into the XML stack in the interests of distributed extensibility are also useful here.

Because XPath (Robie et al. 2017a) and XQuery (Robie et al. 2017b) do not use an XML-based syntax, providing for such extensibility is somewhat harder for them. But namespace-qualified names (QNames, for short) do provide a simple mechanism that allows non-standard functions to be available in a processor, and compile-time and run-time facilities for testing the availability of a function make it possible for users of XSLT and XQuery to adjust to the set of available functions. XQuery also provides extension expressions, which consist of a series of pragmas followed by a fallback expression. The pragmas, each guarded with a qualified name, can contain expressions using extensions to the base language; a processor which understands none of the pragmas will evaluate the fallback expression. The XQuery specification is unusual in disavowing any expectation that the pragmas and the fallback expression will always produce the same result; the extensions used in the pragmas may provide functionality not available in XQuery. The standard interpretation of a query is of course unaffected by extension expressions, but what a processor actually does may well be affected. Since there is no way to prevent this happening in any case (short of solving the Halting Problem), XQuery's clear-eyed realism on the topic seems to us to take the right approach.

These are not the only possible approaches. There is a continuum between the most restrictive possible interpretation (all extensions are errors) and the most liberal (anything that doesn’t conform to the specification in any way can be interpreted however the implementation likes). Different languages appear at different places along this continuum.

From this unsystematic survey, we think several lessons may be drawn:

  • Providing a mechanism for non-standard information can be useful, whether it is used for setting options in a processor or extending the base language.

    It is important enough that it is often better to have an imperfect extension mechanism than to have none at all.

  • When extensions are tolerated, interoperability can be preserved if implementations are required to support a mode in which all extensions are ignored.

  • It's helpful if there is a simple way for processors to identify extensions in materials they are processing and decide reliably whether they are extensions supported by the processor or not.

  • It's important to be clear about what processors are to do if they don't understand an extension. The ability to specify fallback behaviors case by case can be helpful.

These examples illustrate, we hope, the design space within which we believe the pragmas proposal presented here is to be situated. Our proposal is inspired in part by the xsl:fallback and use-when mechanisms of XSLT and the extension expression and annotation mechanisms of XQuery. SGML and XML processing instructions have also contributed to our thinking. Because the ixml specification itself has no provision for pragmas, we follow the common practice of conveying non-standardized information as magic comments: that is, strings which are treated as comments by standard processors, but which have a specific structure which allows processors to recognize them as pragmas.[4]

Because pragmas as described here will be handled by standard ixml processors as comments and ignored, the use of pragmas does not in itself make any ixml grammar non-conformant.

Requirements, desiderata, and use cases

In this section, we discuss what requirements we think a proposal for pragmas must meet. We also identify some concrete examples of information not provided for by ixml as specified, but of potential interest to users or implementations. In some cases, there is external evidence that the information is of interest, because there have been proposals to integrate it into the ixml specification itself.

As was explained above, the general idea of pragmas is to provide a channel for information that is not a required part of the ixml specification but can be used by some implementations to provide useful behavior, without interfering with the operation of other implementations for which the information is irrelevant. The additional information contained in pragmas may be used to control options in a processor, in roughly the same way as pragmas and structured comments in C or Pascal programs may be used to control optimization levels in some compilers, or to extend the specification and provide additional functionality, just as extension expressions in XQuery can be used to invoke non-standard functionality in an XQuery processor and just as extension elements in XSLT can be used to specify non-standard behavior in an XSLT processor.

On this view, pragmas are a form of annotation, and we use the terms pragma and annotation accordingly.

Use cases

Among the use cases that motivate the proposal are these.

Note that some of these use cases may in practice be handled by future changes to the core syntax of ixml (and one has in fact been handled by a change already made). We include them in the list of use cases for pragmas not because we think pragmas are the best imaginable way to handle them but because they are (a) plausible ideas for things one might want to do which are (b) not supported by ixml in its current form (or in one case, its earlier form), and thus (c) natural examples of the kinds of things an extension mechanism like pragmas ought ideally to be able to support.

  • Renaming

    Using pragmas to specify that an element or attribute name serializing a nonterminal should be given a name different from the nonterminal itself.

  • Name indirection

    Using pragmas to specify that an element or attribute name should be taken not from the grammar but from the input, specifically from the string value of a given nonterminal.

  • Rule rewriting

    Using pragmas to specify that a rule as given is shorthand for a set of other rules, which can be obtained by rewriting the rule as given.

  • Tokenization annotation

    Using pragmas to annotate nonterminals in an ixml grammar to indicate that they (a) define a regular language and (b) can be safely recognized by a greedy regular-expression match.

  • Alternative formulations

    Using pragmas to provide alternative formulations of rules in an ixml grammar to allow different annotation or better optimization.

  • Text injection

    Using pragmas to indicate that a particular string should be injected into the XML representation of the input as (part of) a text node or attribute value. (This can help make the output of an ixml parser conform to a pre-existing schema.)

    After the preparation of this pragmas proposal, the ixml specification was changed to support text injection, which illustrates the point that what is described and implemented at one point as a non-standard extension to a specification may later become standard.

  • Attribute grammar specification

    Using pragmas to annotate a grammar with information about grammatical attributes to be associated with nodes of the parse tree, whether they are inherited from an ancestor or an elder sibling or synthesized from the children of a node, and what values should be assigned to them. Grammatical attributes are not to be confused with XML attributes, although in particular cases it may be helpful to render a grammatical attribute as an XML attribute.

Some of these use cases seem most naturally handled by annotations which apply to a grammar as a whole, some by annotations which apply to individual rules, and some by annotations which apply to individual symbols in the grammar.

We do not currently see a strong use case for annotations which apply to arbitrary expressions in a grammar.

Requirements and desiderata

Our tentative list of requirements and desiderata is as follows.

By requirement we mean a property or functionality which must be achieved for a pragmas proposal to be worth adopting. By desideratum we mean a property or functionality that should be included if possible, but which does not doom the proposal to pointlessness if it proves impossible to achieve.

Requirements

  • It must be straightforward for processors to ignore pragmas they do not understand, and to determine whether they understand a given pragma or not.

  • It must be clear to human readers and software which expressions in standard ixml notation are and are not affected or overridden by a given pragma.

  • For any occurrence of a pragma in a grammar, it must be clear both what should be done by a processor that understands and processes the pragma and what should be done by a processor that does not understand and process the pragma. We refer to the latter as the fallback expression.

Desiderata

  • Ideally, the result of evaluating the fallback expression should be a useful and meaningful result, but this is more a matter for the individual writing a grammar than for this proposal. The desideratum for a pragmas proposal is to make it easy (or at least not unnecessarily hard) to write useful fallbacks.

  • It should ideally be possible to specify pragmas as annotations applying to a symbol, a rule, or a grammar as a whole, and it should be possible to know which is which. It is not required that the distinction be a syntactic one, however, since it can also be expressed by the semantics of the particular pragma.

  • It should ideally be possible for processors to generate the XML representation of an ixml grammar containing pragmas, even if they do not understand the pragmas contained. And conversely it should ideally be possible for processors to write out the ixml form of an XML grammar containing pragmas, even if the processor does not understand the pragmas appearing in the grammar.

Design questions

Several design questions can be distinguished; they are not completely orthogonal.

  • What information should be encodable with pragmas?

  • What syntax should pragmas have in Invisible XML?

  • What representation should pragmas have in the XML form of a grammar?

  • Where can pragmas appear?

Pragmas proposal

Pragmas are a syntactic device to allow grammar writers to communicate with processors in non-standard ways without interfering with the operation of other processors. To avoid interference with other processors, two requirements arise:

  • Pragmas must be syntactically identifiable as such.

  • Also, it must be possible for processors to distinguish pragmas directed at them from other pragmas. This proposal uses URIs to allow grammar writers and implementations to avoid collisions.

Pragmas may affect the behavior of a processor in any way, either in ways that leave the meaning of a grammar unchanged or in ways that change the meaning of the grammar in which the pragmas appear.

Syntax in ixml

Extensibility mechanisms are designed to facilitate independent invention. At the same time, a processor which recognizes an extension pragma may behave differently because of that pragma. It follows that pragmas will benefit from some form of distributed naming mechanism. In an XML context, the obvious candidate for distributed naming is the namespace-qualified name or QName. The TEI “p” element is distinct from the XHTML “p” element because they are in different namespaces.

Invisible XML doesn’t provide any support for namespaces, so we must look elsewhere. In principle, the pragmas proposal could invent a pragma-based mechanism for defining namespace prefixes and then use QNames in pragmas. But such a mechanism wouldn’t extend to the nonterminals in a grammar without breaking syntactic compatibility with Invisible XML 1.0. There are at least some voices in the community that favor adding a namespace mechanism to Invisible XML, so it seems wise to leave that space open for future versions of Invisible XML.

The part of qualified names that guarantees distributed naming and thus distributed extensibility is the use of URIs to identify namespaces. As long as people coin names only in the parts of URI space where they have the authority to construct names, name collisions can be avoided. So we can take a step back from qualified names and employ the URI directly for distributed naming.

Internal syntax of pragmas

Comments in Invisible XML are enclosed in braces, { a comment }. Pragmas are enclosed in braces and square brackets, {[a pragma]}, to make them appear as comments to a processor that doesn’t understand pragmas and at the same time to distinguish them from “ordinary comments” to a processor that does understand pragmas.

Pragmas contain a name, and optionally additional data, which takes the form of a sequence of brace-balanced characters. The relevant part of the ixml grammar is:

       pragma: -"{[", @pname, (whitespace, pragma-data)?, -"]}". 
       @pname: name.
  pragma-data: (-pragma-char; -bracket-pair)*.
 -pragma-char: ~["{}"].
-bracket-pair: '{', -pragma-data, '}'.

For example, the following are both syntactically well formed pragmas:

  • {[blue]}

  • {[color blue]}
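
As an illustration of the internal syntax, the following Python sketch recognizes a single pragma and splits it into its name and data. It is not part of the proposal. The function name parse_pragma and the ASCII-only name pattern are assumptions made for brevity (real ixml names admit many more characters), and the “{[+” form discussed later is not handled.

```python
import re

# Assumption: a simplified, ASCII-only approximation of an ixml name.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def parse_pragma(text):
    """Split a string of the form {[name data]} into (name, data).

    Returns None if the text is not a well-formed pragma under the
    simplified rules of this sketch.
    """
    if not (text.startswith("{[") and text.endswith("]}")):
        return None
    body = text[2:-2]
    match = NAME.match(body)
    if match is None:
        return None
    name, rest = match.group(0), body[match.end():]
    if rest and not rest[0].isspace():
        return None  # data must be separated from the name by whitespace
    data = rest[1:]  # drop the single separating whitespace character
    # pragma-data may contain braces only in balanced pairs.
    depth = 0
    for ch in data:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return None
    if depth != 0:
        return None
    return name, data
```

Under these assumptions, parse_pragma("{[color blue]}") yields the name “color” with the data “blue”, while an ordinary comment such as { a comment } is rejected.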

Here we must pause and consider what mechanism we will use to establish that a pragma name (for example, “blue” or “color”) is associated with a URI.

We assert that the pragma named “pragma” is special (in a manner entirely analogous to the way that Namespaces in XML (Bray et al. 2009) asserts that the namespace prefix “xmlns” is special). This pragma is used to associate a pragma name with a URI:

{[+pragma myPragma "https://example.com/pragmas/mine"]}

(We shall come back to the significance of the leading “+” shortly; briefly, it is a way to distinguish a pragma that appears in the prolog, and applies to the entire grammar, from one that merely appears before the first rule.)

From this point forth, the pragma named myPragma is taken to be the one identified by the URI specified. Like namespace prefixes in QNames, the in-grammar name of the pragma is arbitrary; it is the association with the URI that identifies it. The pragma data that follows the name, if there is any, is interpreted according to the rules for that pragma, as specified by the inventor of the pragma. It is regarded as an error if a pragma is used before a URI association is made. A pragma-aware processor should report this error to the author.
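
The binding rule can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: it assumes the pragmas have already been split into (name, data) pairs in document order, and that the data of the special pragma has the form shown above (a local name, a space, and a quoted URI). The names resolve_pragmas and PragmaError are hypothetical.

```python
class PragmaError(Exception):
    """Raised when a pragma name is used before it is bound to a URI."""


def resolve_pragmas(pragmas):
    """Replace each pragma's local name with the URI it is bound to.

    `pragmas` is a list of (name, data) pairs in document order.
    The special name "pragma" binds a local name to a URI; its data is
    assumed to look like: localName "uri".
    """
    bindings = {}
    resolved = []
    for name, data in pragmas:
        if name == "pragma":
            local, _, quoted = data.partition(" ")
            bindings[local] = quoted.strip().strip('"')
        elif name in bindings:
            resolved.append((bindings[name], data))
        else:
            raise PragmaError(
                f"pragma {name!r} used before a URI was associated with it"
            )
    return resolved
```

Because only the URI survives resolution, two grammars that bind different local names to the same URI are treated alike, just as with namespace prefixes.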

An Invisible XML grammar might define an arbitrary number of pragmas this way. It is worth observing that for cases where it might be inconvenient for authors to define a great many pragmas with distinct URIs, there’s nothing that prevents an implementation from specifying a single pragma and using the pragma data to distinguish between different effects, much as many modern command line programs use “subcommands” (for example, git checkout, git status, git push etc.) instead of having many distinct commands.

It is a consequence of the syntax that pragmas can contain nested pragmas, as shown here:

{[rewrite
    comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
    {[token]} -cchars:  cchar*. 
]}

Here, in fact, the pragma contains a nested pragma, though the nesting is only apparent to a processor which understands the rewrite pragma and knows to parse its pragma data as a sequence of rules in ixml notation. A processor which does not understand the rewrite pragma will merely know that the pragma data here contains a sequence of characters, which happens to include two nested pairs of braces. That suffices. And of course a processor which does not handle pragmas at all will treat the entire thing as a comment, containing two nested comments.
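
The view of a processor that does not handle pragmas can be sketched in a few lines of Python: it only needs to find the brace that closes the comment, counting nested braces on the way. The function names comment_end and nested_pairs are hypothetical, and the sketch assumes, as in ixml comments, that every brace counts toward nesting even when it appears inside a quoted string.

```python
def comment_end(text, start=0):
    """Return the index just past the comment that opens at text[start].

    Braces nest, so the comment ends at the brace that brings the
    nesting depth back to zero.
    """
    if text[start] != "{":
        raise ValueError("no comment starts here")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unterminated comment")


def nested_pairs(text):
    """Count the brace pairs nested directly inside the outermost pair."""
    count, i = 0, 1
    end = comment_end(text)
    while i < end - 1:
        if text[i] == "{":
            i = comment_end(text, i)  # jump past the nested pair
            count += 1
        else:
            i += 1
    return count


EXAMPLE = '''{[rewrite
    comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
    {[token]} -cchars:  cchar*.
]}'''
```

Applied to the rewrite example, comment_end consumes the whole pragma as one comment, and nested_pairs finds the two nested pairs of braces mentioned above.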

External syntax: where pragmas may appear

Pragmas may appear:

  • immediately before a terminal or nonterminal symbol in the right-hand side of a rule, before or after its mark if any, or

  • immediately before the nonterminal symbol on the left-hand side of a rule, before or after its mark if any, or

  • after the final alternative of a rule, before the full stop ending the rule, or

  • before the first rule of the grammar.

In the final case, it must be possible to distinguish between a pragma that applies to the first rule of a grammar and a pragma that precedes it but applies to the grammar as a whole. We do that by adding one more syntactic convention: a pragma that begins “{[+” can only appear at the beginning of a grammar and applies to the grammar as a whole.

Changes to the grammar of ixml

We allow pragmas to appear in specific places, where we interpret them as applying to specific parts of the grammar. Each of these requires some changes to the grammar of ixml. To allow pragmas immediately before symbols, we change the grammatical definitions of symbols. First, the changes for nonterminals:

 nonterminal: annotation, name, s.
 -annotation: (pragma, sp)?, (mark, sp)?.
         -sp: (whitespace; comment; pragma)*.

Pragmas and marks are grouped together as annotation, and the nonterminal sp is defined for whitespace that may contain pragmas. The changes for terminals are similar; since terminal marks are distinct from those for nonterminals, the additional nonterminals tmark and tannotation are needed.

     -quoted: tannotation, string, s.
    -encoded: tannotation, -"#", @hex, s.
   inclusion: tannotation,          set.
   exclusion: tannotation, -"~", s, set.
-tannotation: (pragma, sp)?, (tmark, sp)?.

To allow pragmas on the left-hand side of a rule and before its closing full stop, we modify the definition of rule:

        rule: annotation, name, s, 
              -["=:"], s, -alts, (pragma, sp)?, -".".

To distinguish pragmas which apply to the entire grammar from pragmas occurring on the left-hand side of the first rule, we modify the definition of prolog to include prolog pragmas (ppragma for short), which are distinguished from normal pragmas by having a plus sign as part of their starting delimiter.

      prolog: version, s, (ppragma++s, s)?; ppragma++s, s.
     ppragma: -"{[+", @pname, (whitespace, pragma-data)?, -"]}".

Why not just allow pragmas to appear where comments can appear?

At this point, some readers may be asking why we don't take the apparently simpler approach of just defining pragmas as whitespace, like comments, and allowing them wherever comments can appear. After all, pragmas can be viewed as a kind of comment, can they not?

Yes, pragmas can be viewed as a kind of comment, inasmuch as, like comments, you can ignore them if you don’t care about pragmas, or if you encounter a pragma you don’t recognize, or if the moon is full.

But at the same time no, pragmas cannot really be viewed that way in practice. Implementations which don't recognize pragmas will parse them as comments, but for implementations which actually implement any pragmas, it’s not sufficient to just leave them as comments in the grammar. It’s easy to demonstrate why with an example. Consider:

{[+pragma my "http://example.com/pragmas/g342"]}

{[my example rule pragma]}
symbol: A .

A: {[my example symbol 'a' pragma]} 'a',
   {[my example symbol B pragma]} B.
B: .

If you parse this with an ixml grammar that knows nothing about pragmas, those are comments, and the result is:

<ixml>
   <comment>[+pragma my "http://example.com/pragmas/g342"]</comment>
   <comment>[my example rule pragma]</comment>
   <rule name="symbol">
      <alt>
         <nonterminal name="A"/>
      </alt>
   </rule>
   <rule name="A">
      <comment>[my example symbol 'a' pragma]</comment>
      <alt>
         <literal string="a"/>
         <comment>[my example symbol B pragma]</comment>
         <nonterminal name="B"/>
      </alt>
   </rule>
   <rule name="B">
      <alt/>
   </rule>
</ixml>

This is unsatisfactory in a couple of ways. First, it’s necessary to resort to re-parsing the comment to distinguish between the prolog pragmas that are intended to apply to the grammar as a whole and the pragmas that are supposed to apply to the first rule. Second, the pragmas are not reliably associated with their targets. Two of the pragmas are the immediate left siblings of their targets (my example rule pragma and my example symbol B pragma), so perhaps we could say that pragmas apply to the next construct, but that doesn’t work for the ‘a’ pragma because its immediate right sibling is the <alt>. And the prolog pragma is different again: it's the child of its target.

By extending the ixml grammar to distinguish pragmas from comments, we can do much better:

<ixml>
   <prolog>
      <ppragma pname="pragma">
         <pragma-data>my "http://example.com/pragmas/g342"</pragma-data>
      </ppragma>
   </prolog>
   <rule name="symbol">
      <pragma pname="my">
         <pragma-data>example rule pragma</pragma-data>
      </pragma>
      <alt>
         <nonterminal name="A"/>
      </alt>
   </rule>
   <rule name="A">
      <alt>
         <literal string="a">
            <pragma pname="my">
               <pragma-data>example symbol 'a' pragma</pragma-data>
            </pragma>
         </literal>
         <nonterminal name="B">
            <pragma pname="my">
               <pragma-data>example symbol B pragma</pragma-data>
            </pragma>
         </nonterminal>
      </alt>
   </rule>
   <rule name="B">
      <alt/>
   </rule>
</ixml>

Now each pragma is a child (or in the case of prolog pragmas, the grandchild) of the element to which it applies.
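To show how simple the association becomes, here is a minimal Python sketch that walks the XML form of a grammar and reports the element each pragma applies to. The function name pragma_targets is hypothetical, and the sketch assumes the XML structure shown in the example above (pragma as a child of its target, ppragma as a grandchild).

```python
import xml.etree.ElementTree as ET

GRAMMAR_XML = """
<ixml>
   <prolog>
      <ppragma pname="pragma">
         <pragma-data>my "http://example.com/pragmas/g342"</pragma-data>
      </ppragma>
   </prolog>
   <rule name="symbol">
      <pragma pname="my"><pragma-data>example rule pragma</pragma-data></pragma>
      <alt><nonterminal name="A"/></alt>
   </rule>
   <rule name="A">
      <alt>
         <literal string="a">
            <pragma pname="my"><pragma-data>example symbol 'a' pragma</pragma-data></pragma>
         </literal>
         <nonterminal name="B">
            <pragma pname="my"><pragma-data>example symbol B pragma</pragma-data></pragma>
         </nonterminal>
      </alt>
   </rule>
</ixml>
"""


def pragma_targets(root):
    """Return (pname, pragma_data, target_element_name) for each pragma."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent = {child: p for p in root.iter() for child in p}
    found = []
    for el in root.iter():
        if el.tag == "pragma":
            target = parent[el]            # ordinary pragma: child of its target
        elif el.tag == "ppragma":
            target = parent[parent[el]]    # prolog pragma: grandchild of its target
        else:
            continue
        found.append((el.get("pname"), el.findtext("pragma-data", default=""), target.tag))
    return found


targets = pragma_targets(ET.fromstring(GRAMMAR_XML.strip()))
```

No re-parsing of comment text and no sibling heuristics are needed; the tree structure alone identifies the target.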

In order to make the XML form of grammars with pragmas more useful, therefore, the proposal here modifies the grammar of ixml as described. The changes made guarantee that every input which matches the modified grammar also matches the standard ixml specification grammar, and every conforming ixml grammar which uses no pragmas has the same XML structure in a pragma-aware processor as in a standard ixml processor.[5]

Syntax in XML

Following the normal rules of ixml, pragmas are serialized as elements named pragma or ppragma (for prolog pragmas), with an attribute named pname and an optional child element named pragma-data. In addition, in XML grammars pragma elements may contain any number of XML elements following the pragma-data element.

For example:

<pragma pname="blue"/>
or
<pragma pname="color">
    <pragma-data>blue</pragma-data>
</pragma>
or
<pragma pname="rewrite">
    <pragma-data>
    comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
    {[token]} -cchars:  cchar*.  
</pragma-data>
</pragma>

Processors which do not implement the pragma in question will as a matter of course produce pragma elements with just the one child element (or none). But processors which implement a given pragma are free to inject additional XML elements into the XML form of the pragma. It is to be assumed that the XML elements contain no additional information, only a mechanically derived XML form which makes the information in the pragma easier to process. It is to be expected that any software to serialize XML grammars in ixml form will discard the additional XML elements.

For example, note that a processor which understands the rewrite pragma (shown above in an example) might prefer to produce a different XML representation for it, e.g., one in which the embedded grammar rules are parsed into their normal XML representation.[6] For such a processor, the XML representation might be:

<pragma pname="rewrite">
    <pragma-data>
    comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
    {[token]} -cchars:  cchar*. 
</pragma-data>
<rule name="comment">
   <alt>
      <literal tmark="-" string="{"/>
      <nonterminal name="cchars"/>
      <option>
         <alts>
            <alt>
               <repeat1>
                  <nonterminal name="comment"/>
                  <sep>
                     <nonterminal name="cchars"/>
                  </sep>
               </repeat1>
               <nonterminal name="cchars"/>
            </alt>
         </alts>
      </option>
      <literal tmark="-" string="}"/>
   </alt>
</rule>
<rule mark="-" name="cchars">
   <pragma pname="token"/>
   <alt>
      <repeat0>
         <nonterminal name="cchar"/>
      </repeat0>
   </alt>
</rule>
</pragma>

Note that because the additional XML elements within the pragma are just redundant XML representations of the pragma data, an application to rewrite XML grammars in ixml form will lose no information when transcribing this XML pragma as

{[rewrite
              comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
              {[token]} -cchars:  cchar*. 
]}
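A serializer of this kind can be sketched as follows. This is a minimal illustration, assuming the XML form described in this section; the function name pragma_to_ixml is hypothetical. It reads only the pname attribute and the pragma-data child and ignores any further child elements.

```python
import xml.etree.ElementTree as ET


def pragma_to_ixml(el):
    """Serialize a pragma or ppragma element in ixml notation.

    Any child elements after pragma-data are ignored, on the assumption
    (stated in the text) that they carry no additional information.
    """
    opener = "{[+" if el.tag == "ppragma" else "{["
    data = el.findtext("pragma-data")
    if data:
        return f"{opener}{el.get('pname')} {data}]}}"
    return f"{opener}{el.get('pname')}]}}"


example = ET.fromstring(
    '<pragma pname="color"><pragma-data>blue</pragma-data><extra/></pragma>'
)
```

The extra element in the example stands in for any processor-specific XML that an implementation may have injected.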

Pragma scope

In this proposal, pragmas always apply explicitly to some part of a grammar:

  • to a symbol occurrence in the right-hand side of a rule, or

  • to a rule, or

  • to the grammar as a whole.

The relation between a pragma and the part of the grammar to which it applies is reflected in the XML form of a grammar: ordinary pragmas appear as child elements and prolog pragmas as grandchild elements of the part of the grammar they apply to (an element named ixml, rule, nonterminal, literal, inclusion, or exclusion).

These associations between pragmas and parts of grammars are specified here for clarity and to enable clearer discussion of pragmas, but they have no effect on the operational semantics of ixml processors. If a processor does not implement a given pragma, or any pragmas at all, it will not be affected by the pragmas, regardless of what they apply to, and a processor that does understand a given pragma may be able to tell from its definition what changes in behavior it requests and what it applies to. The associations given above are thus of most direct use to those specifying the meaning of specific pragmas.

Operational semantics

In describing the operational semantics of pragmas, we distinguish different classes of ixml processor:

  • standard ixml processors treat pragmas syntactically as comments and ignore them in the same way as they ignore all comments. Informally, they do not understand any pragmas, and their only obligation is not to trip over pragmas when they encounter them.

  • pragma-aware processors recognize pragmas syntactically and modify their behavior in accordance with some pragmas. Informally, they understand some pragmas but not all. For each pragma they recognize, they must determine whether it is one they understand and implement, or not.

With regard to a given pragma, processors either implement that pragma or they do not. A processor implements a pragma if and only if it adjusts its behavior as specified by that pragma. In the ideal case there will be some written specification of the pragma which describes the operational effect of the pragma clearly. This proposal assumes that a processor can use the URI of a pragma, possibly in conjunction with the pragma data, to determine whether the processor implements the pragma or not and thus decide whether to modify its normal operation or not.

Pragma-aware processors MUST accept pragmas when they occur in the ixml form of a grammar, and (if they are producing an XML form of the grammar) MUST produce the correct XML form of each pragma, just as they produce the corresponding XML form for any construct in the grammar.

Conformance requirements for pragmas

The conformance requirements mentioned in this section apply to pragma-aware processors; the qualifier pragma-aware is sometimes omitted for brevity.

Pragma-aware processors MUST be capable, at user option, of ignoring all pragmas and processing a grammar using the standard rules of ixml.

Processors which accept ixml grammars MUST accept pragmas in the ixml form of a grammar, whether they understand or implement the specific pragmas or not.

Processors which accept XML grammars MUST accept pragmas in the XML form of a grammar, whether they understand or implement the specific pragmas or not.

If a pragma which the processor does not understand or implement is present in a grammar used to parse input, the processor MUST process the grammar in the same way as if the pragma were not present.

When ixml grammars are processed as input using the processor's built-in grammar, processors MUST produce the correct XML form of each pragma, just as they produce the corresponding XML form for any construct in the grammar, except as the processor's behavior is affected by the presence of pragmas in the grammar for ixml used to parse the input.
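The core of these requirements, that unimplemented pragmas have no effect and that the user can switch all pragmas off, can be sketched as a simple filter. The names below (effective_pragmas, implemented_uris, ignore_all) are hypothetical, and the sketch assumes that pragma names have already been resolved to URIs.

```python
def effective_pragmas(pragmas, implemented_uris, ignore_all=False):
    """Select the pragmas that are allowed to influence processing.

    `pragmas` is a list of (uri, pragma_data) pairs. Pragmas whose URI the
    processor does not implement are dropped, so the grammar is processed
    as if they were not present. With ignore_all=True (the user option),
    the processor falls back to the standard rules of ixml.
    """
    if ignore_all:
        return []
    return [(uri, data) for uri, data in pragmas if uri in implemented_uris]


KNOWN = {"http://example.com/pragmas/g342"}
SEEN = [
    ("http://example.com/pragmas/g342", "example rule pragma"),
    ("http://example.com/pragmas/unknown", "something else"),
]
```

A real processor might also consult the pragma data when deciding whether it implements a pragma; this sketch decides on the URI alone.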

Examples

The examples in this section describe some scenarios in which we can imagine an implementation wanting to support behavior that goes beyond what is in the current version of the ixml specification. They illustrate how the pragma mechanisms described above could be used to invoke the behavior in question.

They are thus intended to persuade the reader that the mechanisms described above suffice for some plausible use cases. They are not intended as full specifications of the syntax and semantics of the pragmas described, although some of them have in fact been implemented.

Note

In the future, we expect to elaborate the description of some of these pragmas and publish them as specifications of particular pragmas which may be implemented by more than one processor. We anticipate doing this by describing pragmas in the vendor-neutral namespace https://gyfre.org/ns with the conventional name gyfre. Gyfre is the name of the invisible servant in the Middle English poem Sir Launfal.

Renaming

Use case: Using pragmas to specify that an element or attribute name serializing a nonterminal should be given a name different from the nonterminal itself.

In the grammar below, the two forms of month have different syntaxes, so they need different nonterminal names; in standard ixml they are therefore serialized using different XML element names.

We define a renaming pragma which specifies the name to be used when serializing a nonterminal as XML. A parser which does not support the pragma will produce results in which some months are named month and others nmonth; a parser which does support the pragma will call them all month.

    {[+pragma rename
    "https://lists.w3.org/Archives/Public/public-ixml/2021Oct/0014.html"]}
    
    date: day, " ", month, " ", year.
    day: d, d?.
    month: "January"; "February"; "March"; 
           "April"; "May"; "June";
           "July"; "August"; "September";
           "October"; "November"; "December".
    year: d, d, d, d.
    
    iso: year, "-", {[rename month]} nmonth, "-", day.
    nmonth: d, d.

The fallback behavior of a parser that does not support these pragmas will be to produce output using both the element name month and the element name nmonth.
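The difference between the two behaviors can be illustrated by post-processing the fallback output. This Python sketch is a simplification under stated assumptions: the function name apply_renames is hypothetical, the sample output assumes that the undefined nonterminal d is hidden, and the renaming is applied to every element with the given name rather than to one symbol occurrence as the pragma itself would specify.

```python
import xml.etree.ElementTree as ET


def apply_renames(root, renames):
    """Rename elements in a parse result according to a {old: new} map."""
    for el in root.iter():
        el.tag = renames.get(el.tag, el.tag)
    return root


# Hypothetical fallback output for the input "2022-08-01".
FALLBACK = "<iso><year>2022</year>-<nmonth>08</nmonth>-<day>01</day></iso>"

renamed = ET.tostring(
    apply_renames(ET.fromstring(FALLBACK), {"nmonth": "month"}),
    encoding="unicode",
)
```

A processor that implements the pragma would of course produce the month element directly during serialization; the sketch only shows what the end result looks like.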

Name indirection

Use case: Using pragmas to specify that an element or attribute name should be taken not from the grammar but from the string value of a given nonterminal.

Consider the following grammar which recognizes a superset of a simple subset of XML. It's a subset of XML for simplicity, and it's a superset of the subset because a grammar written at this level cannot enforce all of the well-formedness constraints of XML.

{ A grammar for a small subset of XML, as an illustration. }

document: ws?, element, ws? .
    
element: starttag, content, endtag; soletag .
    
-starttag:  -"<", @gi, (ws, attribute)*, ws?, -">".
-endtag:  -"</", @gi2, (ws, attribute)*, ws?, -">".
-soletag:  -"<", @gi, (ws, attribute)*, ws?, -"/>".

attribute:  @name, ws?, -"=", ws?, @value.
@value: dqstring; sqstring.
-dqstring: dq, ~['"']*, dq.
-sqstring: sq, ~["'"]*, sq.
-dq: -['"'].
-sq: -["'"].

{ allow at most one PCDATA block between pieces of markup }
-content:  PCDATA?,
           ((processing-instruction; comment; element)++(PCDATA?),
	   PCDATA?)?.

PCDATA:  (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
processing-instruction:  "<?", @name, ws, @pi-data, "?>".
comment:  "<!--", comment-data, "-->".

gi: name.
gi2: name.
{ name is left as an exercise for the reader. }

-ws:  (#20; #A; #D; #9)+.

Among the input sequences which should be accepted by this grammar is the following XML representation of a haiku.

<haiku author="Basho" date="1686">
    <line>When the old pond</line>
    <line>gets a new frog</line>
    <line>it's a new pond.</line>
</haiku>

We might like an ixml processor to read this and produce the same XML that any XML parser would produce. (This desire makes sense only when the ixml processor's results are supplied to a user in a DOM or XDM or SAX or other XML API or model. If they are supplied as an XML character stream, we might as well feed the XML straight to the downstream user; we don't need to parse it.) What the grammar above will produce has a clear structural similarity to the input XML, but it is not the same:

<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <element gi="haiku" gi2="haiku"> 
      <attribute name="author" value="Basho"/> 
      <attribute name="date" value="1686"/>
      <PCDATA>
    </PCDATA>
      <element gi="line" gi2="line">
         <PCDATA>When the old pond</PCDATA>
      </element>
      <PCDATA>
    </PCDATA>
      <element gi="line" gi2="line">
         <PCDATA>gets a new frog</PCDATA>
      </element>
      <PCDATA>
    </PCDATA>
      <element gi="line" gi2="line">
         <PCDATA>it's a new pond.</PCDATA>
      </element>
      <PCDATA>
</PCDATA>
   </element>
</document>

We can invent suitable pragmas to allow ourselves to obtain normal XML from parsing with the grammar:

  • name expression - specifies that the name under which a nonterminal is to be serialized is given by the string value of the supplied XPath expression, interpreted with the standard ixml result element as the context node.

  • serialize keyword - specifies that the nonterminal is to be serialized as specified by the keyword (which is assumed to be attribute, element, or the name of some other XPath node test).

  • drop - specifies that the nonterminal so annotated is to be suppressed entirely, along with the entire parse tree dominated by the nonterminal.

With these pragmas, we can annotate the element and attribute rules appropriately:

^ {[name @gi]} element:  starttag, content, endtag; soletag.
...
-endtag:  -"</", {[drop]} @gi2, (ws, attribute)*, ws?, -">".
...
^ {[serialize attribute]}
  {[name @name]}
  attribute:  @name, ws?, -"=", ws?, @value.
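The combined effect of these three pragmas can be approximated by a transformation of the standard parse result. The Python sketch below is an illustration only: the function name rebuild is hypothetical, and a pragma-aware processor would apply the pragmas during serialization rather than as a separate pass. It takes the name of each element from its gi attribute (name), turns attribute children into real attributes (serialize and name), and never copies gi2 (drop).

```python
import xml.etree.ElementTree as ET


def rebuild(el):
    """Turn a generic <element gi="..."> node into an element named by @gi."""
    out = ET.Element(el.get("gi"))       # {[name @gi]}; gi2 is never copied
    last = None                          # most recently appended child element
    for child in el:
        if child.tag == "attribute":     # {[serialize attribute]} {[name @name]}
            out.set(child.get("name"), child.get("value"))
        elif child.tag == "PCDATA":      # character data becomes text content
            text = child.text or ""
            if last is None:
                out.text = (out.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        elif child.tag == "element":
            last = rebuild(child)
            out.append(last)
    return out


PARSE_RESULT = (
    '<document><element gi="haiku" gi2="haiku">'
    '<attribute name="author" value="Basho"/>'
    '<attribute name="date" value="1686"/>'
    '<element gi="line" gi2="line"><PCDATA>When the old pond</PCDATA></element>'
    '<element gi="line" gi2="line"><PCDATA>gets a new frog</PCDATA></element>'
    '</element></document>'
)
haiku = rebuild(ET.fromstring(PARSE_RESULT).find("element"))
```

The sample input is an abbreviated form of the parse result shown above, with the whitespace-only PCDATA elements left out for brevity.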

Rule rewriting

Use case: Using pragmas to specify that a rule as given is shorthand for a set of other rules. Consider the following simple grammar for arithmetic expressions.

expr: term; expr, addop, term.
term: factor; term, mulop, factor.
factor: number; var; -'(', -expr, -')'.
...

We might find it inconvenient that the number 42 is represented with an XML element tree four elements deep:

<expr>
    <term>
        <factor>
            <number>42</number>
        </factor>
    </term>
</expr>

We might prefer a shallower tree.[7]

One simple rule to simplify the XML representation of sentences in this language is to specify that if an element E has only one child, E should not be tagged and only the child should appear in the XML.

We can do this in ixml by expanding the grammar, splitting each nonterminal into two rules, one producing a visible serialization and one hiding the nonterminal on serialization.

-EXPR: TERM; expr.
expr: EXPR, addop, TERM. 
-TERM: FACTOR; term.
term: TERM, mulop, FACTOR. 
-FACTOR: number; var; -'(', EXPR, -')'. 
...

Now 42 parses more simply as <number>42</number>.

The rewrite is mechanical enough that we can automate it, and error-prone enough that it is worth automating. If a rule has some right-hand sides guaranteed to produce at most one child each and some guaranteed to produce at least two children each, it's split into two rules. The first gets a new nonterminal and has the original single-child right-hand sides as alternatives, as well as a reference to the original nonterminal. It's marked hidden. The second rule gets the original nonterminal. All references to the original nonterminal are changed to be references to the new nonterminal.
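The splitting step can be sketched in Python. This is a simplified illustration, not the implementation of any processor: the function name split_unit_rules is hypothetical, a grammar is represented as a dictionary from nonterminal names to lists of alternatives, an alternative with exactly one symbol is assumed to produce at most one child, and any longer alternative is assumed to produce at least two. Marks are not modeled; the new upper-case nonterminals are the ones that would be marked hidden.

```python
def split_unit_rules(grammar, targets):
    """Split each rule named in `targets` into a hidden rule and a visible rule.

    `grammar` maps nonterminal names to lists of alternatives; each
    alternative is a list of symbol names. The new (hidden) nonterminal
    is the upper-case form of the original name.
    """
    new_name = {name: name.upper() for name in targets}

    def redirect(alt):
        # All references to a split nonterminal now point at the new one.
        return [new_name.get(sym, sym) for sym in alt]

    result = {}
    for name, alts in grammar.items():
        if name in targets:
            single = [redirect(a) for a in alts if len(a) == 1]
            multi = [redirect(a) for a in alts if len(a) > 1]
            # New rule: single-child alternatives plus the original nonterminal.
            result[new_name[name]] = single + [[name]]
            # Original nonterminal keeps only the multi-child alternatives.
            result[name] = multi
        else:
            result[name] = [redirect(a) for a in alts]
    return result


GRAMMAR = {
    "expr": [["term"], ["expr", "addop", "term"]],
    "term": [["factor"], ["term", "mulop", "factor"]],
    "factor": [["number"], ["var"], ["(", "expr", ")"]],
}
rewritten = split_unit_rules(GRAMMAR, {"expr", "term"})
```

In this sketch factor is not split; only its reference to expr is redirected to the new nonterminal.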

If we call the relevant pragma no-unit-rules, or more briefly nur, the grammar takes the following form. In practice, we also need a second pragma, which we call ref, meaning: do not split this rule, but do replace its references to rules rewritten using nur.

^ {[nur]} expr: term; expr, addop, term.
^ {[nur]} term: factor; term, mulop, factor.
- {[ref]} factor: number; var; -'(', -expr, -')'.
...

The XML representation of this grammar can plausibly exploit the ability of extension elements to contain an XML representation of the new rules. Both the nur and the ref pragmas within a rule instruct the implementation to replace the enclosing rule with the rules appearing as children of the pragma elements.

    <ixml>
      <rule name="expr" mark="^">
    
        <pragma pname="nur">
          <pragma-data/>
      
          <rule name="EXPR" mark="-">
            <alt><nonterminal name="TERM"/></alt>
            <alt><nonterminal name="expr"/></alt>
          </rule>
      
          <rule name="expr" mark="^">
            <alt>
              <nonterminal name="EXPR"/>
              <nonterminal name="addop"/>
              <nonterminal name="TERM"/>
            </alt>
          </rule> 
        </pragma>
    
        <alt><nonterminal name="term"/></alt>
        <alt>
          <nonterminal name="expr"/>
          <nonterminal name="addop"/>
          <nonterminal name="term"/>
        </alt>
      </rule>
  
      <rule name="term" mark="^">
        <pragma pname="nur">
          <pragma-data/>
          
          <rule name="TERM" mark="-">
            <alt><nonterminal name="factor"/></alt>
            <alt><nonterminal name="term"/></alt>
          </rule>
          
          <rule name="term" mark="^">
            <alt>
              <nonterminal name="TERM"/>
              <nonterminal name="mulop"/>
              <nonterminal name="factor"/>
            </alt>
          </rule>
        </pragma>
        
        <alt><nonterminal name="factor"/></alt>
        <alt>
          <nonterminal name="term"/>
          <nonterminal name="mulop"/>
          <nonterminal name="factor"/>
        </alt>
      </rule>
      
      <rule name="factor" mark="-">
        <pragma pname="ref">
          <pragma-data/>
          <rule name="factor" mark="-">
            <alt><nonterminal name="number"/></alt>
            <alt><nonterminal name="var"/></alt>
            <alt>
              <literal string="(" tmark="-"/>
              <nonterminal name="EXPR" mark="-"/>
              <literal string=")" tmark="-"/>
            </alt>
          </rule>
        </pragma>
        <alt><nonterminal name="number"/></alt>
        <alt><nonterminal name="var"/></alt>
        <alt>
          <literal string="(" tmark="-"/>
          <nonterminal name="expr" mark="-"/>
          <literal string=")" tmark="-"/>
        </alt>
      </rule>
      ...
    </ixml>

The fallback behavior of a processor that doesn't support these pragmas will be to serialize expr and term elements even when they have only one child.

Tokenization annotation and alternative formulations

Use case: We can use pragmas to annotate nonterminals in an ixml grammar to provide a hint to the processor indicating that they define a regular language and can be safely recognized by a greedy regular-expression match.

For example, consider the grammar for a simple programming language. A processor might read programs a little faster if it could read each identifier in a single operation. That is safe whenever an identifier, once encountered, always consists of the longest available sequence of characters legal in an identifier. In the toy Program.ixml grammar used as a running example in Hillman 2020, the rule for identifiers is:

identifier: letter+, S.

We can annotate identifier to signal that it's safe to consume an identifier using a single regular-expression match by using a pragma in a lexical scanning (ls) namespace:

{[token]} identifier:  letter+, S.
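What a processor might do with this hint can be sketched with a greedy regular-expression match. The sketch is illustrative only and rests on assumptions not found in the grammar fragment above: that letter matches ASCII letters and that S matches optional whitespace. The names IDENTIFIER and read_identifier are hypothetical.

```python
import re

# Assumption: letter is [A-Za-z] and S is optional whitespace.
IDENTIFIER = re.compile(r"[A-Za-z]+[ \t\r\n]*")


def read_identifier(text, pos):
    """Consume one identifier (and trailing whitespace) in a single match.

    Returns (identifier, next_position), or None if no identifier starts
    at `pos`. The match is greedy: no shorter alternatives are explored.
    """
    m = IDENTIFIER.match(text, pos)
    if m is None:
        return None
    return m.group().rstrip(), m.end()
```

Because the match always takes the longest run of letters, it is a valid shortcut only where the grammar never needs a shorter identifier at that point, which is exactly what the pragma asserts.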

The rules for comments in ixml itself offer another wrinkle.

      comment: -"{", (cchar; comment)*, -"}".
      -cchar: ~["{}"].

Within a comment, any sequence of characters matching cchar can be recognized in a single operation; there is no need to look for alternate parses that consume only some of the characters. But there is no nonterminal here that matches all and only non-empty sequences of cchar. In order to use the token annotation here, we must first rewrite the grammar at this point. So we introduce an annotation named rewrite to be attached to a single grammar rule with the meaning that the pragma data provide an alternate form of the rule.

We can now annotate the grammar and supply an alternative formulation of comment that replaces it with two new rules:

      ^ {[rewrite
            comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
            {[token]} -cchars:  cchar*. 
        ]}
      comment: -"{", (cchar; comment)*, -"}".
      -cchar: ~["{}"].

Or we may find it easier to read if we inject the alternative formulation after, not before, the existing rule:

      comment: -"{", (cchar; comment)*, -"}"
      {[rewrite 
          comment: -"{", cchars, (comment++cchars, cchars)?, -"}". 
          - {[token]} cchars:  cchar*. 
      ]}.
      -cchar: ~["{}"].

Either way, the rewrite contains an alternative formulation of the grammar which recognizes the same sentences and provides the same XML representation but may be processed faster by some processors.

The fallback behavior of a processor that doesn't support these pragmas will be to parse as usual using the grammar as specified.

Note however that there is no way to guarantee or impose an effective requirement that the alternate rules in a rewrite pragma be equivalent to the fallback rules: pragmas may change the behavior of a processor, and they may change the meaning of an expression (or here the meaning of a grammar or part of it).
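An author who wants some reassurance that a rewrite agrees with its fallback can at least compare the two on sample inputs. The following Python sketch hand-codes one recognizer for each formulation of comment; the function names are hypothetical, and agreement on a set of samples is of course evidence, not a proof of equivalence.

```python
import re

CCHARS = re.compile(r"[^{}]*")   # cchars: cchar*, read as a single token


def match_original(text, pos=0):
    """comment: "{", (cchar; comment)*, "}". Returns the end index or None."""
    if pos >= len(text) or text[pos] != "{":
        return None
    pos += 1
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            pos = match_original(text, pos)     # nested comment
            if pos is None:
                return None
        else:
            pos += 1                            # one cchar at a time
    return pos + 1 if pos < len(text) else None


def match_rewritten(text, pos=0):
    """comment: "{", cchars, (comment++cchars, cchars)?, "}"."""
    if pos >= len(text) or text[pos] != "{":
        return None
    pos = CCHARS.match(text, pos + 1).end()     # leading cchars
    while pos < len(text) and text[pos] == "{":
        pos = match_rewritten(text, pos)        # nested comment
        if pos is None:
            return None
        pos = CCHARS.match(text, pos).end()     # separator or trailing cchars
    return pos + 1 if pos < len(text) and text[pos] == "}" else None


SAMPLES = ["{}", "{a}", "{a{b}c}", "{{}{}}", "{a", "a}", "{a{b}", "{ab{c{d}}e}"]
agree = all(match_original(s) == match_rewritten(s) for s in SAMPLES)
```

The second recognizer consumes each run of comment characters in one operation, which is the saving the token pragma is meant to permit.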

Text injection

Use case: Using pragmas to specify that additional text should be injected into the output at a particular point (as part of a text node, or attribute value).

The text injection use case stands as an example of how a language may evolve to incorporate features that make some pragmas unnecessary or obsolete. The insertions feature in Invisible XML 1.0 was a relatively late addition to the language. Work on a proposal for pragmas began more than a year earlier. The text injection pragma use case explored the question of whether the pragma mechanism could be used to inject text into the output. And indeed it could. But the insertions feature has made it obsolete.

Pragmas offer implementers and designers an opportunity to experiment with, and test designs for, functionality that may eventually become part of the specification.

What next?

As noted above, the first versions of the pragmas proposal described here were developed and discussed within the Invisible XML community group. After it became clear that the group would not integrate pragmas into Invisible XML 1.0, the proposal was re-formulated as an optional add-on layered on top of ixml, rather than as a part of the ixml specification.

The next steps now are

  • to draft a formal specification of the pragmas framework,

  • to draft stand-alone specifications of some pragmas which appear to be of general interest (both as examples, and in the case of pragmas of general interest to avoid multiple incompatible implementations of the same additional functionality), and

  • to integrate support for the pragmas framework into processors, optionally with support for selected pragmas.

Appendix A. Modified ixml syntax

The ways in which the pragmas proposal changes the syntax of ixml were outlined in the main body of the text; this appendix presents the modified grammar in complete form. Insertions and modifications are given in bold.

{ixml grammar version 2022-06-07, modified for pragmas 2022-07-15}
         ixml: s, prolog?, rule++RS, s.

           -s: (whitespace; comment)*. {Optional spacing}
          -RS: (whitespace; comment)+. {Required spacing}
          -sp: (whitespace; comment; pragma)*.  {Spacing with pragmas}

  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
          -cr: -#d.
      comment: -"{", ((comment; ~["[]{}"]), (cchar; comment)*)?, -"}".
       -cchar: ~["{}"].

       prolog: version, s, (ppragma++s, s)?; ppragma++s, s.
      version: -"ixml", RS, -"version", RS, string, s, -'.' .
      ppragma: -"{[+", @pname, (whitespace, pragma-data)?, -"]}". 

         rule: annotation, name, s, -["=:"], s, -alts, (pragma, sp)?, -".".

  -annotation: (pragma, sp)?, (mark, sp)?.
       pragma: -"{[", @pname, (whitespace, pragma-data)?, -"]}". 
       @pname: name.
  pragma-data: (-pragma-char; -bracket-pair)*.
 -pragma-char: ~["{}"].
-bracket-pair: '{', -pragma-data, '}'.

        @mark: ["@^-"].
         alts: alt++(-[";|"], s).
          alt: term**(-",", s).
        -term: factor;
               option;
               repeat0;
               repeat1.
      -factor: terminal;
               nonterminal;
               insertion;
               -"(", s, alts, -")", s.
      repeat0: factor, (-"*", s; -"**", s, sep).
      repeat1: factor, (-"+", s; -"++", s, sep).
       option: factor, -"?", s.
          sep: factor.
  nonterminal: annotation, name, s.

        @name: namestart, namefollower*.
   -namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

    -terminal: literal; 
               charset.
      literal: quoted;
               encoded.
      -quoted: tannotation, string, s.
 -tannotation: (pragma, sp)?, (tmark, sp)?.

       @tmark: ["^-"].
      @string: -'"', dchar+, -'"';
               -"'", schar+, -"'".
        dchar: ~['"'; #a; #d];
               '"', -'"'. {all characters except line breaks; quotes must be doubled}
        schar: ~["'"; #a; #d];
               "'", -"'". {all characters except line breaks; quotes must be doubled}
     -encoded: tannotation, -"#", hex, s.
         @hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.

     -charset: inclusion; 
               exclusion.
    inclusion: tannotation,          set.
    exclusion: tannotation, -"~", s, set.
         -set: -"[", s,  (member, s)**(-[";|"], s), -"]", s.
       member: string;
               -"#", hex;
               range;
               class.
       -range: from, s, -"-", s, to.
        @from: character.
          @to: character.
   -character: -'"', dchar, -'"';
               -"'", schar, -"'";
               "#", hex.
       -class: code.
        @code: capital, letter?.
     -capital: ["A"-"Z"].
      -letter: ["a"-"z"].
    insertion: -"+", s, (string; -"#", hex), s.

References

[Bray et al. 2009] Bray, T. et al. eds., 2009. Namespaces in XML 1.0 (Third Edition). W3C Recommendation, 8 December 2009.

[Grune/Jacobs 1990/2008] Grune, Dick, and Ceriel J. H. Jacobs. 1990/2008. Parsing techniques: a practical guide. First edition New York et al.: Ellis Horwood, 1990. Second edition [New York]: Springer, 2008.

[Hillman 2020] Hillman, Tomos. XSLT Earley: First Steps to a Declarative Parser Generator. Presented at XML Prague, 2020, Prague, Czech Republic. In XML Prague 2020 Conference Proceedings, pp. 231-249.

[Ichbiah et al. 1986] Ichbiah, Jean D., John G. P. Barnes, Robert J. Firth, and Mike Woodger. 1986. Rationale for the design of the Ada programming language. Ada Joint Program Office: U. S. Government.

[ISO 8879:1986] International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.

[Jensen and Wirth 1974/1985] Jensen, Kathleen, and Niklaus Wirth. 1974, 3d ed. 1985. Pascal user manual and report, revised for the ISO Pascal standard. Third edition. New York, Berlin, Heidelberg, Tokyo: Springer, 1985.

[Kay 2017] Kay, M., ed., 2017. XSL Transformations (XSLT) Version 3.0. W3C Recommendation, 21 March 2017.

[Lindsey 1996] Lindsey, C. H., 1996. A history of ALGOL 68. In Thomas J. Bergin and Richard G. Gibson (eds.) History of Programming Languages II. New York: ACM.

[Melton et al. 2017] Melton, J. et al. eds., 2017. XQueryX 3.1. W3C Recommendation, 21 March 2017.

[Naur et al. 1960] Naur, Peter, ed., et al. 1960. Report on the algorithmic language Algol 60. Communications of the Association for Computing Machinery 3.5 (May 1960): 299-314. doi:https://doi.org/10.1145/367236.367262. (Also published simultaneously in Numerische Mathematik.)

[Pemberton 2013] Pemberton, Steven. Invisible XML. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Pemberton01. On the web at http://www.balisage.net/Proceedings/vol10/html/Pemberton01/BalisageVol10-Pemberton01.html. Revised version (January 2014) at https://homepages.cwi.nl/~steven/Talks/2013/08-07-invisible-xml/invisible-xml-3.html

[Pemberton 2022] Pemberton, Steven. Invisible XML Specification. Published by the Invisible Markup Community Group on the web at https://invisiblexml.org/1.0/

[Robie et al. 2017a] Robie, J., et al., eds., 2017. XML Path Language (XPath) 3.1. W3C Recommendation, 21 March 2017.

[Robie et al. 2017b] Robie, J., et al., eds., 2017. XQuery 3.1: An XML Query Language. W3C Recommendation, 21 March 2017.

[van Wijngaarden et al. 1976] van Wijngaarden, A., et al., ed. 1976. Revised report on the algorithmic language Algol 68. Heidelberg, New York: Springer, 1976.



[1] As with many of the other terminological innovations of Algol 68, the Report offers no explicit explanation of the origin of the name. The Report does include what it calls "pragmatic remarks", which are not part of the definition of the language but help the reader to understand the intentions and implications of the definitions, thus serving roughly the same purpose as non-normative notes in some standards and specifications. In a paper on the history of Algol 68, C. H. Lindsey explained that "A 'pragmatic remark' is to the Report as a comment is to a program" (Lindsey 1996). While not conclusive, this evidence suggests that the pragmatic remark may have provided the motivation for the technical term "pragmat", positioning pragmats as information which helps the compiler in its interpretation of the constructs of the language proper.

In adopting the term "pragma", later languages may have been influenced by the desire to present a clearer and more easily explained derivation: the authors of Ada, for example, state that "A pragma (from the Greek word meaning action) is used to direct the actions of the compiler in particular ways" (Ichbiah et al. 1986).

[2] It may be noted also that the Pascal standard does not require that strict conformance be the default behavior of the compiler, only that it be possible.

[3] For example, the one in sec. 6.6.2 of the first edition of Grune/Jacobs 1990/2008.

[4] Whether pragmas are, by nature, a special kind of comment or a distinct class of things is an ontological question we do not propose to address here. As indicated above, in this paper we follow the distinction made by van Wijngaarden et al. 1976: we use the term pragma to denote objects which convey non-standardized information in a form usefully processable by machine and often with meaningful internal structure, and the term comment to denote such information in a form not usefully processable by machine, typically expressed as remarks in a natural language and addressed to human readers.

[5] For an implementation which is not pragma-aware, the phrase "uses no pragmas" means, in effect, "does not begin any comments with a square bracket".

[6] As noted above, pragmas may affect the behavior of a processor in any way.

[7] The reader who believes this example is artificial is referred to the XQueryX spec (Melton et al. 2017) and its XML representation of an XPath expression like section/title.

Author's keywords for this paper:
Invisible XML; spec development; extensibility