How to cite this paper
Hillman, Tomos, C. M. Sperberg-McQueen, Bethan Tovey-Walsh and Norm Tovey-Walsh. “Designing for change: Pragmas in Invisible XML as an extensibility mechanism.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Sperberg-McQueen01.
Balisage: The Markup Conference 2022
August 1 - 5, 2022
Balisage Paper: Designing for change
Pragmas in Invisible XML as an extensibility mechanism
Tomos Hillman
Tom Hillman has worked as an XML practitioner and
consultant for fifteen years, doing everything from
documentation to IT support and administration to workflows
for digital publishing to conference organization to XML
database management and consultancy.
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is the founder of Black Mesa Technologies LLC,
a consultancy specializing in the use of descriptive markup to help
memory institutions preserve cultural heritage information. He co-edited
the XML 1.0 specification, the Guidelines of the Text Encoding Initiative,
and the XML Schema Definition Language (XSDL) 1.1 specification.
Bethan Tovey-Walsh
Bethan Tovey-Walsh is a PhD student in Applied Linguistics and Welsh at Swansea
University. She is funded by the CorCenCC corpus of modern Welsh, and created the
Welsh part-of-speech tagger now used by the project. She previously worked for OUP as a
content architect and as a researcher for the Oxford English Dictionary.
Norm Tovey-Walsh
Senior Software Developer
Saxonica
Norm Tovey-Walsh is currently a senior software developer at
Saxonica Ltd, working from his home in Swansea,
Wales. Previously, he was employed by MarkLogic Corporation,
Sun Microsystems, Arbortext, and O’Reilly Media (then O’Reilly
& Associates).
Designing for change: Pragmas in Invisible XML as an extensibility mechanism
copyright © 2022 by Tomos Hillman, C. M. Sperberg-McQueen, Bethan Tovey-Walsh, and
Norman Tovey-Walsh is licensed under CC BY-NC-SA 4.0.
Abstract
Invisible XML (ixml) is a method for treating non-XML
documents as if they were XML. The 1.0 specification for
Invisible XML was announced in June of this year. No technology
foresees all of its use cases, especially in 1.0. How can ixml
allow experimentation, and channel experimentation in useful
ways, to allow ideas to be expressed in ixml grammars that go
beyond what is foreseen, without compromising interoperability
or the value of strict conformance to the specification?
Many programming languages (C, JavaScript, Pascal, XQuery,
etc.) address this question with pragmas. A pragma is a
semi-formal way to instruct a processor/compiler/interpreter how
it should operate. Typical pragmas extend a specification but
are not a part of it. We propose pragmas as an optional add-on
to ixml to allow implementation of non-standardized
functionality in a way that does not interfere with standard
ixml processing. We describe our general framework for pragmas,
some specific pragmas (to illustrate how pragmas can be used),
and a few pragmatic implementations.
Table of Contents
- Introduction
- What is a pragma?
- Some approaches to extensibility
- Requirements, desiderata, and use cases
  - Use cases
  - Requirements and desiderata
    - Requirements
    - Desiderata
  - Design questions
- Pragmas proposal
  - Syntax in ixml
    - Internal syntax of pragmas
    - External syntax: where pragmas may appear
  - Syntax in XML
  - Pragma scope
  - Operational semantics
  - Conformance requirements for pragmas
- Examples
  - Renaming
  - Name indirection
  - Rule rewriting
  - Tokenization annotation and alternative formulations
  - Text injection
- What next?
- Appendix A. Modified ixml syntax
Introduction
Strictly limiting the scope of a specification helps keep
the technology simple; prohibiting variation among conforming
processors helps implementers achieve interoperability.
Simplicity and interoperability may lead to success, success to a
broader user community, a broader user community to demands for
broader functionality and further development of the
specification. This is the virtuous spiral many technology
developers hope to achieve.
Successful extension of a technology to address new use
cases and incorporate new functionality will, in general, require
some experimental implementations of the new functionality.
If the initial specification is tightly focused on its core use
cases and very strict about prohibiting non-conforming behavior,
however, any such experimentation will be non-conforming, which
brings two risks: implementers may be reluctant to experiment with
new behavior, which means later versions of the spec may
lack a firm grounding in experience, or implementers and users may
come to regard conformance to the specification as irrelevant to
the really interesting work of solving particular problems and
providing useful capabilities. If the initial specification is
too lax on conformance requirements, on the other hand,
interoperability is likely to suffer and user communities will
form (if they form at all) around particular implementations
rather than around the technology as specified.
We present a concrete design for extensibility in Invisible
XML (Pemberton 2022), in the form of a proposal for pragmas, a
mechanism designed to allow out-of-band communication between a
grammar writer and an ixml processor. An author, for example,
might know that a particular rule is amenable to some optimization, or
that they would prefer ambiguity to be resolved in a particular
way, or that they wish to employ a processor extension of some
sort.
We begin with a description of what we mean by the term
pragma (section “What is a pragma?”),
followed by a short description of some different approaches to
the general problem of extensibility in different technologies and
specifications (section “Some approaches to extensibility”). We then proceed to a
sketch of the requirements (as we understand them) for pragmas in
ixml, illustrated with several specific use cases (section “Requirements, desiderata, and use cases”). Then we present the pragmas proposal itself.
A few worked examples illustrate how the pragmas proposal outlined
here could in principle be used in practice (section “Examples”). We conclude (section “What next?”) with
some speculations on future developments.
The proposal described here has grown out of work in the
World Wide Web Consortium's community group on Invisible XML, and
we thank our colleagues in the community group for discussions of
pragmas, extensibility, and related topics.
What is a pragma?
By pragma we mean, in
general, a construct in a formal language which conveys
non-standard or out-of-band information to processing software in
a way not defined by the specification of the language in which
the pragma is embedded.
That description may need some unpacking:
-
A pragma is a syntactic construct. That is, it is
defined by the grammar of the language, so that any parser for
the language can and should recognize pragmas when they are
encountered, if only for the purpose of ignoring them.
-
It conveys information to processing software. That is,
pragmas are not typically intended solely for human
consumption.
Note that it is impossible to enforce a strict
separation between information intended for humans and
information intended for software, and so this point must be
taken as a description of a general tendency and not as a
testable or enforceable rule. But one of the key differences
between pragmas and comments is that in general comments are
directed at human beings and are to be ignored by software,
while in normal usage pragmas serve to convey information to a
processor and are thus typically less free-form than
comments.
-
The information conveyed by a pragma is typically
non-standard.
This too describes a tendency rather than an enforceable
rule. Nothing can prevent someone from using a pragma to
convey information which could be conveyed by the standard
mechanisms of the language as defined. But if the information
in question can be expressed without a pragma, it would be
unnecessary, verging on eccentric, to go to the effort of
expressing it in a non-standard way.
Because the interpretation of pragmas is not defined by
the specification of the language, the usual rule is that
pragmas have no effect on the standard meaning of the document
in which they are embedded and can be ignored (e.g., by
software which does not understand
them).
The term appears to have entered the vocabulary of computing
from Algol 68 (van Wijngaarden et al. 1976), which defines a
construct it calls a pragmat (apparently short for “pragmatic remark”
or “pragmatic comment”).
A pragment is a comment or a pragmat. No semantics of pragments is given and therefore the
meaning ... of any program is
quite unaffected by their presence. It is indeed the intention
that comments should be
entirely ignored by the implementation, their sole purpose being
the enlightenment of the human interpreter of the program.
Pragmats may, on the
other hand, convey to the implementation some piece of
information affecting some aspect of the meaning of the
program which is not defined by
this Report, for example: ....
They may also be used to convey to the implementation that
the source text is to be augmented with some other text, or
edited in some way, for example: ....
The interpretation of pragmats is not defined in this Report,
but is left to the discretion of the implementer, who ought, at
least, to provide some means whereby all further pragmats may be ignored, ....
Many but not all programming languages defined more recently
provide for pragmas, sometimes under other names (directives,
declarations); in others, the comment construct is used to convey
pragmatic information. The Wikipedia article on “Directive
(programming)” has an unsystematic but informative survey.
Typical use cases for programming-language pragmas include hints
that a certain kind of optimization might usefully be
applied.
Some approaches to extensibility
Designs and specifications for earlier computing
technologies have taken a variety of approaches to extensions and
to the provision of extensibility mechanisms, with a variety of
outcomes. It should be noted that this section presents a series
of examples illustrating some points in the abstract design space.
It is not a historical survey and should not be misunderstood as
attempting to be one.
Sometimes, a major functional area of the technology was
left undefined, in the expectation that implementers would fill
the gap, and sometimes perhaps in the belief that only
implementers working in a particular environment would be in a
position to work out the necessary details fully.
The Algol 60 report (Naur et al. 1960) provided no
mechanisms for input or output; it was expected that
implementations would extend the language in ways suitable for the
I/O facilities of the host environment. The designers of C
famously made the same decision; the C compiler developed by
Kernighan and Ritchie provided a standard I/O
(stdio) library but expected
(apparently in a sort of “let a hundred flowers bloom, let a
hundred schools of thought contend” frame of mind) that
different implementers would choose different ways of managing
I/O, with different libraries reflecting different ways of
handling the task. Pressure from users (i.e., programmers using C
compilers) eventually forced all C compilers to provide a version
of stdio, and forced the relevant standards committees to
standardize that library.
The ISO Pascal standard includes an interesting provision in
the list of things a Pascal compiler must do to comply with ISO
7185 (quoted from Jensen and Wirth 1974/1985):
It [must be] able to process as an error any use
of an extension or of an implementation-dependent
feature.
Two things seem striking here: first the requirement that it be
possible to turn off all extensions, which allows users to check
to make sure their program does not depend on vendor extensions,
and second the quiet assumption (without any discussion that I
have found) that there will of course be extensions to the language, in some
processors if not in all. The balancing of interests here seems
worth bearing in mind: implementers may have an interest in
extending the language, and so extensions are implicitly tolerated
in a conforming processor.
Users, on the other hand, have an interest in
portability and in avoiding lock-in, so conforming processors must
be able to turn extensions off.
SGML took a different approach (ISO 8879:1986):
with its processing instructions, ISO 8879 provided a mechanism
that allowed users (and SGML editors) to insert non-standard
information into documents and mark it as such, which allows other
applications to ignore the information so marked. By requiring
that processing instructions begin with a defined name, XML
attempted to make it a little easier for processors which use
processing instructions to know at a glance whether the
instruction is one they should pay attention to or one they should
ignore.
Programming-language processors have often felt a need for
some similar mechanism for inserting processor-specific
annotations into programs. Because programming language
syntaxes often lack anything analogous to processing instructions, these
processor-specific (or at least non-standardized) annotations are
often embedded in what syntactically are comments. Thus a Pascal
program might begin with the comment:
{SC+: distinguish between upper and lower case}
In the absence of any inter-implementer agreement
on how to distinguish one implementation's annotations from
another's, of course, such mechanisms may lead to
collisions.
A specification frequently mentioned as having found a successful
formula for extensibility is the original HTML specification,
which defined a set of element types and required that if an HTML
processor encountered an element of an unknown type, it should
ignore that element's tags. This provision allowed browser makers
to experiment with support for new elements, which in turn allowed for
swift development of new functionalities, both good and bad (the
blink element is seldom regarded as a triumph of good
markup design), although it also tended to make the actual
specification of HTML less important than whatever browser makers
were supporting on any given day. The HTML rule works less well
in cases where the best approach would be for the entire element
to be ignored, rather than just its start- and end-tags. But this
flaw illustrates an important point about extensibility: finding
some path for extensibility can be very useful, even if it is
manifestly imperfect.
Some XML-based syntaxes have taken a similar, though less
flamboyantly anarchic, approach to extensibility and non-standard
content. XSLT (Kay 2017), for example, allows XSLT stylesheets to contain
extension elements whose syntax and semantics are
implementation-defined. It also allows attributes in any non-XSL
namespace to appear on any element in the XSL namespace.
XSLT demonstrates that it is possible to give the author
even more control. XSLT provides an explicit fallback mechanism that
allows a stylesheet to use later (e.g., version 2.0) constructs when relevant while
still telling a processor what to do if it does not understand the
base expression. It also provides a “use-when” mechanism that allows
the stylesheet author to delimit areas of the stylesheet where
extensions are used so that they are targeted only at specific processors
that are known to understand them.
The XML Schema Definition language similarly allows foreign
attributes on all elements, and for more complex annotations it
provides an appinfo element available at key
locations, into which schema authors can insert arbitrarily
complex material. The namespace-qualified names built into the
XML stack in the interests of distributed extensibility are also
useful here.
Because XPath (Robie et al. 2017a) and
XQuery (Robie et al. 2017b) do not use an XML-based syntax,
providing for such extensibility is somewhat harder for them. But
namespace-qualified names (QNames, for short)
do provide a simple mechanism that allows non-standard
functions to be available in a processor, and compile-time and
run-time facilities for testing the availability of a function
make it possible for users of XSLT and XQuery to adjust to the set
of available functions. XQuery also provides extension expressions, which consist of a
series of pragmas followed by a
fallback expression. The pragmas, each guarded with a qualified
name, can contain expressions using extensions to the base
language; a processor which understands none of the pragmas will
evaluate the fallback expression. The XQuery specification is
unusual in disavowing any expectation that the pragmas and the
fallback expression will always produce the same result; the
extensions used in the pragmas may provide functionality not
available in XQuery. The standard interpretation of a query is of
course unaffected by extension expressions, but what a processor
actually does may well be affected. Since there is no way to
prevent this happening in any case (short of solving the Halting
Problem), XQuery's clear-eyed realism on the topic seems to us to
take the right approach.
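The fallback behavior of extension expressions can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the XQuery evaluation rules themselves: the function `evaluate_extension`, the `handlers` table, and the pragma name `ex:fast-sum` are all hypothetical names invented for this example, and the choice to let the first recognized pragma decide the result is an assumption of the sketch.

```python
from typing import Callable, Dict, List, Tuple

# A pragma is modeled as (qualified name, content); both are plain strings.
Pragma = Tuple[str, str]
Handler = Callable[[str, Callable[[], object]], object]


def evaluate_extension(
    pragmas: List[Pragma],
    fallback: Callable[[], object],
    handlers: Dict[str, Handler],
) -> object:
    """Evaluate an extension expression in the spirit of XQuery.

    If the processor recognizes at least one pragma, the first recognized
    pragma decides the result (an assumption of this sketch). If it
    recognizes none, the fallback expression is evaluated.
    """
    for qname, content in pragmas:
        handler = handlers.get(qname)
        if handler is not None:
            # The handler is free to use or to ignore the fallback.
            return handler(content, fallback)
    # No pragma understood: behave exactly as a standard processor would.
    return fallback()


def fast_sum(content: str, fallback: Callable[[], object]) -> object:
    """Hypothetical extension: sum the integers listed in the pragma content."""
    return sum(int(n) for n in content.split())


def slow_sum() -> object:
    """Fallback expression: the standard way of computing the same thing."""
    total = 0
    for n in (1, 2, 3):
        total += n
    return total
```

A processor with an empty `handlers` table always evaluates `slow_sum`, while one that knows `ex:fast-sum` may return something else entirely, which is the situation the XQuery specification openly acknowledges.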
These are not the only possible approaches. There is a
continuum between the most restrictive possible interpretation
(all extensions are errors) and the most liberal (anything that
doesn’t conform to the specification in any way can be interpreted
however the implementation likes). Different languages appear at
different places along this continuum.
From this unsystematic survey, we think several lessons may
be drawn:
-
Providing a mechanism for non-standard
information can be useful, whether it is used for setting
options in a processor or extending the base
language.
It is important enough that it is often
better to have an imperfect extension mechanism than to have
none at all.
-
When extensions are tolerated, interoperability
can be preserved if implementations are required to support a mode
in which all extensions are ignored.
-
It's helpful if there is a simple way for
processors to identify extensions in materials they are
processing and decide reliably whether they are extensions
supported by the processor or not.
-
It's important to be clear about what processors
are to do if they don't understand an extension. The ability to
specify fallback behaviors case by case can be
helpful.
These examples illustrate, we hope, the design space within
which we believe the pragmas proposal presented here is to be
situated. Our proposal is inspired in part by the
xsl:fallback and use-when mechanisms of
XSLT and the extension expression and annotation mechanisms of
XQuery. SGML and XML processing instructions have also contributed
to our thinking.
Because the ixml specification itself has no provision for
pragmas, we follow the common practice of conveying non-standardized
information as magic comments: that is, strings
which are treated as comments by standard processors, but which have
a specific structure which allows processors to recognize them
as pragmas.
Because pragmas as described here will be handled by
standard ixml processors as comments and ignored, the use of
pragmas does not in itself make any ixml grammar non-conformant.
Requirements, desiderata, and use cases
In this section, we discuss
what requirements we think a proposal for pragmas must meet. We
also identify some concrete examples of information not provided
for by ixml as specified, but of potential interest to users or
implementations. In some cases, there is external evidence that
the information is of interest, because there have been proposals
to integrate it into the ixml specification itself.
As was explained above,
the general idea of pragmas is to provide a channel for
information that is not a required part of the ixml specification
but can be used by some implementations to provide useful
behavior, without interfering with the operation of other
implementations for which the information is irrelevant. The additional information contained in
pragmas may be used to control options in a processor, in roughly
the same way as pragmas and structured comments in C or Pascal
programs may be used to control optimization levels in some
compilers, or to extend the specification and provide additional
functionality, just as extension expressions in XQuery can be used
to invoke non-standard functionality in an XQuery processor and
just as extension elements in XSLT can be used to specify
non-standard behavior in an XSLT processor.
On this view, pragmas are a form of annotation, and we use
the terms pragma and annotation accordingly.
Use cases
Among the use cases that motivate the proposal are
these.
Note that some of these use cases may in practice be
handled by future changes to the core syntax of ixml (and one has in
fact been handled by a change already made).
We include them in the list of use cases
for pragmas not because we think pragmas are the best imaginable way to
handle them but because they are (a) plausible ideas for things
one might want to do which are (b) not supported by ixml in its
current form (or in one case, its earlier form), and thus (c)
natural examples of the kinds of things an extension mechanism
like pragmas ought ideally to be able to support.
-
Renaming
Using pragmas to specify that an element or attribute
name serializing a nonterminal should be given a name
different from the nonterminal itself.
-
Name indirection
Using pragmas to specify that an element or attribute
name should be taken not from the grammar but from the
input, specifically from the string value of a given
nonterminal.
-
Rule rewriting
Using pragmas to specify that a rule as given is
shorthand for a set of other rules, which can be obtained by
rewriting the rule as given.
-
Tokenization annotation
Using pragmas to annotate nonterminals in an ixml
grammar to indicate that they (a) define a regular language
and (b) can be safely recognized by a greedy
regular-expression match.
-
Alternative formulations
Using pragmas to provide alternative formulations of
rules in an ixml grammar to allow different annotation or
better optimization.
-
Text injection
Using pragmas to indicate that a particular string
should be injected into the XML representation of the input
as (part of) a text node or attribute value.
(This can help make the output of an ixml parser
conform to a pre-existing schema.)
After the preparation of this pragmas proposal, the
ixml specification was changed to support text injection,
which illustrates the point that what is described and
implemented at one point as a non-standard extension to a
specification may later become standard.
-
Attribute grammar specification
Using pragmas to annotate a grammar with information
about grammatical attributes to be associated with nodes of
the parse tree, whether they are inherited from an ancestor
or an elder sibling or synthesized from the children of a
node, and what values should be assigned to
them. Grammatical attributes are not to be confused with XML
attributes, although in particular cases it may be helpful
to render a grammatical attribute as an XML attribute.
Some of these use cases seem most naturally handled by
annotations which apply to a grammar as a whole, some by
annotations which apply to individual rules, and some by
annotations which apply to individual symbols in the
grammar.
We do not currently see a strong use case for annotations
which apply to arbitrary expressions in a grammar.
Requirements and desiderata
Our tentative list of requirements and desiderata is as
follows.
By requirement we mean a
property or functionality which must be achieved for a pragmas
proposal to be worth adopting. By desideratum we mean a property or
functionality that should be included if possible, but which
does not doom the proposal to pointlessness if it proves
impossible to achieve.
Requirements
-
It must be straightforward for processors to ignore
pragmas they do not understand, and to determine whether
they understand a given pragma or not.
-
It must be clear to human readers and software which
expressions in standard ixml notation are and are not
affected or overridden by a given pragma.
-
For any occurrence of a pragma in a grammar, it must
be clear both what should be done by a processor that
understands and processes the pragma and what should be done
by a processor that does not understand and process the
pragma. We refer to the latter as the fallback expression.
Desiderata
-
Ideally, the result of evaluating the fallback
expression should be a useful and meaningful result, but
this is more a matter for the individual writing a grammar
than for this proposal. The desideratum for a pragmas
proposal is to make it easy (or at least not unnecessarily
hard) to write useful fallbacks.
-
It should ideally be possible to specify pragmas as
annotations applying to a symbol, a rule, or a grammar as a
whole, and it should be possible to know which is which. It
is not required that the distinction be a syntactic one,
however, since it can also be expressed by the semantics of
the particular pragma.
-
It should ideally be possible for processors to
generate the XML representation of an ixml grammar
containing pragmas, even if they do not understand the
pragmas contained. And conversely it should ideally be
possible for processors to write out the ixml form of an XML
grammar containing pragmas, even if the processor does not
understand the pragmas appearing in the grammar.
Design questions
Several design questions can be distinguished; they are
not completely orthogonal.
-
What information should be encodable with pragmas?
-
What syntax should pragmas have in Invisible XML?
-
What representation should pragmas have in the XML
form of a grammar?
-
Where can pragmas appear?
Pragmas proposal
Pragmas are a syntactic device to allow grammar writers to
communicate with processors in non-standard ways without
interfering with the operation of other processors. To avoid
interference with other processors, two requirements arise:
-
Pragmas must be syntactically identifiable as
such.
-
Also, it must be possible for processors to distinguish
pragmas directed at them from other pragmas. This proposal
uses URIs to allow grammar writers and implementations to
avoid collisions.
Pragmas may affect the behavior of a processor in any way,
either in ways that leave the meaning of a grammar unchanged or in
ways that change the meaning of the grammar in which the pragmas
appear.
Syntax in ixml
Extensibility mechanisms are designed to facilitate independent
invention. At the same time, a processor which recognizes an extension
pragma may behave differently because of that pragma. It follows that
pragmas will benefit from some form of distributed naming
mechanism. In an XML context, the obvious candidate for distributed
naming is the namespace-qualified name or QName. The TEI
“p” element is distinct from the XHTML “p”
element because they are in different namespaces.
Invisible XML doesn’t provide any support for namespaces, so we
must look elsewhere. In principle, the pragmas proposal could
invent a pragma-based mechanism for defining namespace prefixes
and then use QNames in pragmas. But such a mechanism wouldn’t
extend to the nonterminals in a grammar without breaking
syntactic compatibility with Invisible XML 1.0. There are at
least some voices in the community that favor adding a namespace
mechanism to Invisible XML, so it seems wise to leave that space
open for future versions of Invisible XML.
The part of qualified names that guarantees distributed naming
and thus distributed extensibility is the use of URIs to identify
namespaces. As long as people coin names only in the
parts of URI space where they have the authority to construct
names, name collisions can be avoided. So we can take a step back
from qualified names and employ the URI directly for distributed
naming.
Internal syntax of pragmas
Comments in Invisible XML are enclosed in braces, { a comment }.
Pragmas are enclosed in braces and square
brackets, {[a pragma]}, to make them appear as comments to a processor that doesn’t
understand pragmas and at the same time to distinguish them from
“ordinary comments” to a processor that does understand pragmas.
Pragmas contain a name, and
optionally additional data, which takes the form of a sequence
of brace-balanced characters. The relevant part of
the ixml grammar is:
pragma: -"{[", @pname, (whitespace, pragma-data)?, -"]}".
@pname: name.
pragma-data: (-pragma-char; -bracket-pair)*.
-pragma-char: ~["{}"].
-bracket-pair: '{', -pragma-data, '}'.
For example, the following are both syntactically well
formed pragmas:
Here we must pause and consider what mechanism we will use to establish that
a pragma name (for example, “blue” or “color”) is associated with a URI.
We assert that the pragma named “pragma” is special (in a manner
entirely analogous to the way that Namespaces in XML
(Bray et al. 2009) asserts that the
namespace prefix “xmlns” is special). This pragma is used to
associate a pragma name with a URI:
{[+pragma myPragma "https://example.com/pragmas/mine"]}
(We shall come back to the significance of the leading “+” shortly; briefly, it is a way to distinguish a pragma that
appears in the prolog, and applies to the entire grammar, from one
that merely appears before the first rule.)
From this point forth, the pragma named myPragma is
taken to be the one identified by the URI specified. Like namespace
prefixes in QNames, the in-grammar name of the pragma is arbitrary; it
is the association with the URI that identifies it. The pragma data
that follows the name, if there is any, is interpreted according to
the rules for that pragma, as specified by the inventor of the
pragma. It is regarded as an error if a pragma is used before a URI
association is made. A pragma-aware processor should report this error
to the author.
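The association between pragma names and URIs, and the error raised when a pragma is used before it has been declared, can be sketched as follows. This is an illustrative approximation only: the function `resolve_pragmas` and both regular expressions are hypothetical, the pattern handles only pragmas whose data contains no nested braces, pragma names are restricted to a simplified character set, and no attempt is made to skip string literals in the grammar that happen to contain the characters `{[`.

```python
import re

# Simplified pattern: optional '+', a name, optional data without braces.
PRAGMA = re.compile(r"\{\[(\+?)([A-Za-z_][\w.\-]*)(?:\s+([^{}]*?))?\s*\]\}")
# Data of the special "pragma" pragma: a local name and a quoted URI.
DECLARATION = re.compile(r'([A-Za-z_][\w.\-]*)\s+"([^"]*)"\s*$')


def resolve_pragmas(grammar: str):
    """Return (resolved, errors) for the pragmas found in a grammar.

    resolved is a list of (uri, data) pairs in document order;
    errors is a list of messages meant for the grammar author.
    """
    bindings = {}   # local pragma name -> URI
    resolved = []
    errors = []
    for match in PRAGMA.finditer(grammar):
        name = match.group(2)
        data = match.group(3) or ""
        if name == "pragma":
            # The special pragma that associates a name with a URI.
            declaration = DECLARATION.match(data)
            if declaration:
                bindings[declaration.group(1)] = declaration.group(2)
            else:
                errors.append(f"malformed pragma declaration: {match.group(0)}")
            continue
        if name not in bindings:
            errors.append(f"pragma '{name}' used before a URI was associated with it")
            continue
        resolved.append((bindings[name], data))
    return resolved, errors
```

Because pragmas are identified by the resolved URI rather than by the local name, two grammars that bind different names to the same URI yield the same `resolved` list, just as two XML documents may use different prefixes for the same namespace.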
An Invisible XML grammar might define an arbitrary number of
pragmas this way. It is worth observing that for cases where it might
be inconvenient for authors to define a great many pragmas with
distinct URIs, there’s nothing that prevents an implementation from
specifying a single pragma and using the pragma data to distinguish
between different effects, much as many modern command line programs
use “subcommands” (for example, git checkout, git
status, git push, etc.) instead of having many
distinct commands.
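A single pragma with subcommands might be handled along the following lines. The sketch assumes that the first word of the pragma data names the subcommand; the function `dispatch`, the `HANDLERS` table, and the subcommand names `rename` and `token` are hypothetical, as is the decision to ignore unknown subcommands silently.

```python
from typing import Callable, Dict, Optional


def dispatch(data: str, handlers: Dict[str, Callable[[str], str]]) -> Optional[str]:
    """Route pragma data to a handler chosen by its first word."""
    words = data.split(None, 1)
    if not words:
        return None  # empty pragma data: nothing to do
    subcommand = words[0]
    argument = words[1].strip() if len(words) > 1 else ""
    handler = handlers.get(subcommand)
    if handler is None:
        return None  # unknown subcommand: ignored in this sketch
    return handler(argument)


# Hypothetical subcommands of one implementation-specific pragma.
HANDLERS = {
    "rename": lambda arg: f"rename to {arg}",
    "token": lambda arg: "treat as token",
}
```

With this arrangement a grammar needs only one URI association per implementation, at the cost of moving part of the naming problem into the pragma data.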
It is a consequence of the syntax that pragmas can contain
nested pragmas, as shown here:
{[rewrite
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
{[token]} -cchars: cchar*.
]}
Here, in fact, the pragma contains a nested pragma,
though the nesting is only apparent to a processor which
understands the rewrite pragma and knows to parse its pragma data as a sequence of
rules in ixml notation. A processor which does not understand the rewrite pragma will merely know that
the pragma data here contains a sequence of characters, which
happens to include two nested pairs of braces. That suffices.
And of course a processor which does not handle pragmas at all
will treat the entire thing as a comment, containing two
nested comments.
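The view of a processor that recognizes pragma syntax but does not understand a particular pragma can be sketched in Python. The code below loosely mirrors the pragma rule given earlier: it separates the name from the data and checks only that the data is brace-balanced, without interpreting it. The function names `split_pragma` and `count_brace_pairs` are hypothetical, and the name is not checked against the full ixml definition of a name.

```python
def count_brace_pairs(data: str) -> int:
    """Return the number of outermost brace pairs in pragma data.

    Raises ValueError if the braces are not balanced.
    """
    depth = 0
    pairs = 0
    for ch in data:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("unbalanced '}' in pragma data")
            depth -= 1
            if depth == 0:
                pairs += 1
    if depth != 0:
        raise ValueError("unbalanced '{' in pragma data")
    return pairs


def split_pragma(text: str):
    """Split one pragma into (name, data) without interpreting the data."""
    if len(text) < 4 or not (text.startswith("{[") and text.endswith("]}")):
        raise ValueError("a pragma must be delimited by '{[' and ']}'")
    body = text[2:-2]
    if not body or body[0].isspace():
        raise ValueError("the pragma name must follow '{[' directly")
    pieces = body.split(None, 1)
    name = pieces[0]
    data = pieces[1] if len(pieces) > 1 else ""
    if "{" in name or "}" in name:
        raise ValueError("braces are not allowed in a pragma name")
    count_brace_pairs(data)  # validates balance; the count itself is unused here
    return name, data
```

Applied to the rewrite example, `split_pragma` yields the name `rewrite` and leaves the data as an opaque string; whether that string is parsed as ixml rules is up to a processor that understands the pragma.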
External syntax: where pragmas may appear
Pragmas may appear:
-
immediately before a terminal or nonterminal symbol
in the right-hand side of a rule, before or after its mark
if any, or
-
immediately before the nonterminal symbol on the
left-hand side of a rule, before or after its mark if any,
or
-
after the final alternative of a rule, before the
full stop ending the rule, or
-
before the first rule of the grammar.
In the final case, it must be possible to distinguish between a pragma that
applies to the first rule of a grammar and a pragma that
precedes it but applies to the grammar as a whole. We do that
by adding one more syntactic convention: a pragma that begins “{[+”
can only appear at the beginning of a grammar and applies to the grammar as a whole.
Changes to the grammar of ixml
We allow pragmas to appear in specific places, where
we interpret them as applying to specific
parts of the grammar. Each of these requires some changes to the grammar of
ixml. To allow pragmas immediately before symbols, we change the
grammatical definitions of symbols. First, the changes for nonterminals:
nonterminal: annotation, name, s.
-annotation: (pragma, sp)?, (mark, sp)?.
-sp: (whitespace; comment; pragma)*.
Pragmas and marks are grouped together as
annotation, and the nonterminal
sp is defined for whitespace that may
contain pragmas.
The changes for terminals are similar; since terminal marks are
distinct from those for nonterminals, the additional nonterminals
tmark and
tannotation are needed.
-quoted: tannotation, string, s.
-encoded: tannotation, -"#", @hex, s.
inclusion: tannotation, set.
exclusion: tannotation, -"~", s, set.
-tannotation: (pragma, sp)?, (tmark, sp)?.
To allow pragmas on the left-hand side of a rule and
before its closing full stop, we modify the definition of
rule:
rule: annotation, name, s,
-["=:"], s, -alts, (pragma, sp)?, -".".
To distinguish pragmas which apply to the entire grammar
from pragmas occurring on the left-hand side
of the first rule, we modify the definition of prolog to include
prolog pragmas (ppragma for short), which are distinguished from
normal pragmas by having a plus sign as part of their starting delimiter.
prolog: version, s, (ppragma++s, s)?; ppragma++s, s.
ppragma: -"{[+", @pname, (whitespace, pragma-data)?, -"]}".
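The three delimiters can be told apart with a simple check on an already-delimited braced span. The sketch below is an approximation under stated assumptions: the function name `classify` is hypothetical, the pname pattern covers only ASCII names (the ixml `name` rule admits many more characters), and brace balance is assumed to have been checked elsewhere.

```python
import re

# Approximate pattern for "{[", optional "+", a pname, optional data, "]}".
_PRAGMA = re.compile(
    r"\{\[(\+?)([A-Za-z_][A-Za-z0-9_.-]*)(?:\s(.*))?\]\}\Z",
    re.DOTALL,
)

def classify(braced: str):
    """Return (kind, pname, data) for a braced span.

    kind is "ppragma", "pragma", or "comment"; pname and data are None
    for comments, and data is None when the pragma has no pragma data.
    """
    m = _PRAGMA.match(braced)
    if m is None:
        return ("comment", None, None)
    kind = "ppragma" if m.group(1) else "pragma"
    return (kind, m.group(2), m.group(3))
```

For instance, `classify("{[token]}")` yields a pragma named `token` with no data, while a span that does not begin with `{[` followed by a name is reported as an ordinary comment.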
Why not just allow pragmas to appear where comments can appear?
At this point, some readers may be asking why we don't
take the apparently simpler approach of just defining
pragmas as whitespace, like comments, and allowing them
wherever comments can appear. After all, pragmas can
be viewed as a kind of comment, can they not?
Yes, pragmas can be viewed as a kind of comment,
inasmuch as, like comments, you can ignore them if you don’t
care about pragmas, or if you encounter a pragma you don’t
recognize, or if the moon is full.
But at the same time no, pragmas cannot really be
viewed that way in practice. Implementations which don't
recognize pragmas will parse them as comments, but for
implementations which actually implement any pragmas, it’s
not sufficient to just leave them as comments in the
grammar. It’s easy to demonstrate why with an example.
Consider:
{[+pragma my "http://example.com/pragmas/g342"]}
{[my example rule pragma]}
symbol: A .
A: {[my example symbol 'a' pragma]} 'a',
{[my example symbol B pragma]} B.
B: .
If you parse this with an ixml grammar that knows
nothing about pragmas, those are comments, and the result
is:
<ixml>
<comment>[+pragma my "http://example.com/pragmas/g342"]</comment>
<comment>[my example rule pragma]</comment>
<rule name="symbol">
<alt>
<nonterminal name="A"/>
</alt>
</rule>
<rule name="A">
<comment>[my example symbol 'a' pragma]</comment>
<alt>
<literal string="a"/>
<comment>[my example symbol B pragma]</comment>
<nonterminal name="B"/>
</alt>
</rule>
<rule name="B">
<alt/>
</rule>
</ixml>
This is unsatisfactory in a couple of ways.
First, it’s necessary to resort to re-parsing the comment to
distinguish between the prolog pragmas that are intended to
apply to the grammar as a whole and the pragmas that are
supposed to apply to the first rule.
Second, the pragmas are not reliably associated with their
targets.
Two of the pragmas are the immediate left siblings of their
targets (my example rule pragma and my example symbol B pragma),
so perhaps we could say that
pragmas apply to the next construct, but that doesn’t work
for the ‘a’ pragma because its immediate right
sibling is the <alt>. And the prolog pragma
is different again: it's the child of its target.
By extending the ixml grammar to distinguish pragmas
from comments, we can do much better:
<ixml>
<prolog>
<ppragma pname="pragma">
<pragma-data>my "http://example.com/pragmas/g342"</pragma-data>
</ppragma>
</prolog>
<rule name="symbol">
<pragma pname="my">
<pragma-data>example rule pragma</pragma-data>
</pragma>
<alt>
<nonterminal name="A"/>
</alt>
</rule>
<rule name="A">
<alt>
<literal string="a">
<pragma pname="my">
<pragma-data>example symbol 'a' pragma</pragma-data>
</pragma>
</literal>
<nonterminal name="B">
<pragma pname="my">
<pragma-data>example symbol B pragma</pragma-data>
</pragma>
</nonterminal>
</alt>
</rule>
<rule name="B">
<alt/>
</rule>
</ixml>
Now each pragma is a child (or in the case of prolog
pragmas, the grandchild) of the element to which it
applies.
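With this structure, finding the target of each pragma is a matter of walking up the tree. The following sketch assumes the XML form shown above (abbreviated here) and uses a hypothetical helper `pragma_targets`; it is not taken from any existing processor.

```python
import xml.etree.ElementTree as ET

GRAMMAR = """<ixml>
  <prolog>
    <ppragma pname="pragma">
      <pragma-data>my "http://example.com/pragmas/g342"</pragma-data>
    </ppragma>
  </prolog>
  <rule name="symbol">
    <pragma pname="my"><pragma-data>example rule pragma</pragma-data></pragma>
    <alt><nonterminal name="A"/></alt>
  </rule>
  <rule name="A">
    <alt>
      <literal string="a">
        <pragma pname="my"><pragma-data>example symbol 'a' pragma</pragma-data></pragma>
      </literal>
      <nonterminal name="B">
        <pragma pname="my"><pragma-data>example symbol B pragma</pragma-data></pragma>
      </nonterminal>
    </alt>
  </rule>
</ixml>"""

def pragma_targets(grammar_xml: str):
    """List (pname, target element name, target name or string) per pragma."""
    root = ET.fromstring(grammar_xml)
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent = {child: node for node in root.iter() for child in node}
    result = []
    for node in root.iter():
        if node.tag == "pragma":
            target = parent[node]            # ordinary pragma: the parent
        elif node.tag == "ppragma":
            target = parent[parent[node]]    # prolog pragma: the grandparent
        else:
            continue
        result.append((node.get("pname"), target.tag,
                       target.get("name") or target.get("string")))
    return result
```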
In order to make the XML form of grammars with pragmas
more useful, therefore, the proposal here modifies the
grammar of ixml as described. The changes made guarantee
that every input which matches the modified grammar also
matches the standard ixml specification grammar, and every
conforming ixml grammar which uses no pragmas has the same
XML structure in a pragma-aware processor as in a standard
ixml processor.
Syntax in XML
Following the normal rules of ixml, pragmas are serialized
as elements named pragma or ppragma (for prolog pragmas),
with an attribute named pname and an optional child element
named pragma-data. In addition, in XML grammars pragma elements may
contain any number of XML elements following the
pragma-data element.
For example:
<pragma pname="blue"/>
or
<pragma pname="color">
<pragma-data>blue</pragma-data>
</pragma>
or
<pragma pname="rewrite">
<pragma-data>
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
{[token]} -cchars: cchar*.
</pragma-data>
</pragma>
Processors which do not implement the pragma in question
will as a matter of course produce pragma elements
with just the one child element (or none). But processors which
implement a given pragma are free to inject additional XML
elements into the XML form of the pragma. It is to be assumed
that the XML elements contain no additional information, only a
mechanically derived XML form which makes the information in the
pragma easier to process. It is to be expected that any software
to serialize XML grammars in ixml form will discard the
additional XML elements.
For example, note that a processor which understands the
rewrite pragma (shown above
in an example) might prefer to produce a different XML
representation for it, e.g., one in which the embedded grammar
rules are parsed into their normal XML representation. For such a processor,
the XML representation might be:
<pragma pname="rewrite">
<pragma-data>
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
{[token]} -cchars: cchar*.
</pragma-data>
<rule name="comment">
<alt>
<literal tmark="-" string="{"/>
<nonterminal name="cchars"/>
<option>
<alts>
<alt>
<repeat1>
<nonterminal name="comment"/>
<sep>
<nonterminal name="cchars"/>
</sep>
</repeat1>
<nonterminal name="cchars"/>
</alt>
</alts>
</option>
<literal tmark="-" string="}"/>
</alt>
</rule>
<rule mark="-" name="cchars">
<pragma pname="token"/>
<alt>
<repeat0>
<nonterminal name="cchar"/>
</repeat0>
</alt>
</rule>
</pragma>
Note that because the additional XML elements within the
pragma are just redundant XML representations of the pragma
data, an application to rewrite XML grammars in
ixml form will lose no information when transcribing this XML
pragma as
{[rewrite
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
{[token]} -cchars: cchar*.
]}
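The transcription back to ixml form can be sketched in a few lines. The function name `pragma_to_ixml` is hypothetical; the sketch assumes only what the text states, namely that the pname attribute and the pragma-data child carry all the information and that any further child elements are redundant.

```python
import xml.etree.ElementTree as ET

def pragma_to_ixml(elem: ET.Element) -> str:
    """Write a pragma or ppragma element in ixml notation.

    Only pname and pragma-data are used; any additional XML children
    are treated as redundant and dropped.
    """
    opener = "{[+" if elem.tag == "ppragma" else "{["
    data = elem.findtext("pragma-data")
    body = elem.get("pname") + (" " + data if data else "")
    return opener + body + "]}"

plain = ET.fromstring('<pragma pname="blue"/>')
extended = ET.fromstring(
    '<pragma pname="color"><pragma-data>blue</pragma-data>'
    '<extra-form value="blue"/></pragma>'  # hypothetical extension element
)
```

Here `pragma_to_ixml(plain)` gives `{[blue]}` and `pragma_to_ixml(extended)` gives `{[color blue]}`; the extension element leaves no trace in the ixml form.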
Pragma scope
In this proposal, pragmas always apply explicitly to some
part of a grammar: the grammar as a whole, a rule, or a symbol
on the right-hand side of a rule.
The relation between a pragma and the part of the grammar
to which it applies is reflected in the XML form of a grammar:
ordinary pragmas appear as child elements and prolog pragmas as
grandchild elements of the part of the grammar they apply to (an
element named ixml, rule, nonterminal, literal, inclusion, or
exclusion).
These associations between pragmas and parts of grammars
are specified here for clarity and to enable clearer discussion
of pragmas, but they have no effect on the operational semantics
of ixml processors. If a processor does not implement a given
pragma, or any pragmas at all, it will not be affected by the
pragmas, regardless of what they apply to, and a processor that
does understand a given pragma may be able to tell from its
definition what changes in behavior it requests and what it
applies to. The associations given above are thus of most
direct use to those specifying the meaning of specific pragmas.
Operational semantics
In describing the operational semantics of pragmas, we
distinguish different classes of ixml processor:
-
standard ixml processors treat pragmas syntactically as
comments and ignore them in the same way as they ignore all
comments. Informally, they do not understand
any pragmas, and their only obligation is not to trip over
pragmas when they encounter them.
-
pragma-aware processors recognize pragmas syntactically and modify their
behavior in accordance with some pragmas. Informally, they
understand some pragmas but not all. For each
pragma they recognize, they must determine whether it is one
they understand and implement, or not.
With regard to a given pragma, processors either implement that pragma or they do not. A
processor implements a pragma
if and only if it adjusts its behavior as specified by that
pragma. In the ideal case there will be some written
specification of the pragma which describes the operational
effect of the pragma clearly. This proposal assumes that a
processor can use the URI of a pragma, possibly in conjunction
with the pragma data, to determine
whether the processor implements the pragma or not and thus
decide whether to modify its normal operation or not.
Pragma-aware processors MUST accept pragmas when they
occur in the ixml form of a grammar, and (if they are producing
an XML form of the grammar) MUST produce the correct XML form of
each pragma, just as they produce the corresponding XML form for
any construct in the grammar.
Conformance requirements for pragmas
The conformance requirements mentioned in this section
apply to pragma-aware processors; the qualifier
“pragma-aware” is sometimes omitted for brevity.
Pragma-aware processors MUST be capable, at user option,
of ignoring all pragmas and processing a grammar using the
standard rules of ixml.
Processors which accept ixml grammars MUST accept pragmas
in the ixml form of a grammar, whether they understand or
implement the specific pragmas or not.
Processors which accept XML grammars MUST accept pragmas
in the XML form of a grammar, whether they understand or
implement the specific pragmas or not.
If a pragma which the processor does not understand or
implement is present in a grammar used to parse input, the
processor MUST process the grammar in the same way as if the
pragma were not present.
When ixml grammars are processed as input using the
processor's built-in grammar, processors MUST produce the
correct XML form of each pragma, just as they produce the
corresponding XML form for any construct in the grammar,
except as the processor's
behavior is affected by the presence of pragmas in the grammar
for ixml used to parse the input.
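The decision a pragma-aware processor makes for each pragma can be sketched as follows. Everything named here is an assumption for illustration: the `PragmaPolicy` class, the `select_pragmas` function, the representation of prolog declarations as a mapping from pname to URI, and the choice to report no errors at all when the user has asked for pragmas to be ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class PragmaPolicy:
    implemented: Set[str]      # URIs of the pragmas this processor implements
    ignore_all: bool = False   # user option: process with standard ixml rules

def select_pragmas(
    policy: PragmaPolicy,
    declared: Dict[str, str],        # pname -> URI, taken from prolog pragmas
    used: List[Tuple[str, str]],     # (pname, pragma data) in document order
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return the pragmas to act on, as (URI, data), plus error messages."""
    if policy.ignore_all:
        return [], []
    to_apply: List[Tuple[str, str]] = []
    errors: List[str] = []
    for pname, data in used:
        uri = declared.get(pname)
        if uri is None:
            errors.append(f"pragma '{pname}' used without a URI association")
        elif uri in policy.implemented:
            to_apply.append((uri, data))
        # otherwise: not implemented, so behave as if the pragma were absent
    return to_apply, errors
```

The point of the sketch is that unimplemented pragmas fall through silently, so the grammar is processed exactly as it would be without them.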
Examples
The examples in this section describe some scenarios
in which we can imagine an implementation wanting to support
behavior that goes beyond what is in the current version of
the ixml specification. They illustrate how the pragma
mechanisms described above could be used to invoke the
behavior in question.
They are thus intended to persuade the reader that
the mechanisms described above suffice for some plausible
use cases. They are not
intended as full specifications of the syntax and semantics
of the pragmas described, although some of them have in
fact been implemented.
Note
In the future, we expect to elaborate the description of
some of these pragmas and publish them as specifications of
particular pragmas which may be implemented by more than one
processor. We anticipate doing this by describing pragmas in the
vendor-neutral namespace https://gyfre.org/ns with the
conventional name gyfre. Gyfre is the
name of the invisible servant in the Middle English poem
Sir Launfal.
Renaming
Use case: Using pragmas
to specify that an element or attribute name serializing a
nonterminal should be given a name different from the
nonterminal itself.
In the grammar below, the two forms of month have
different syntaxes, so they are required to have different
nonterminal names, and so they are required to be serialized
using different XML element names.
We define a renaming pragma which specifies the name to be
used when serializing a nonterminal as XML. A parser which does
not support the pragma will produce results in which some months
are named month and others nmonth; a
parser which does support the pragma will call them all month.
{[+pragma rename
"https://lists.w3.org/Archives/Public/public-ixml/2021Oct/0014.html"]}
date: day, " ", month, " ", year.
day: d, d?.
month: "January"; "February"; "March";
"April"; "May"; "June";
"July"; "August"; "September";
"October"; "November"; "December".
year: d, d, d, d.
iso: year, "-", {[rename month]} nmonth, "-", day.
nmonth: d, d.
The fallback behavior of a parser that does not support
these pragmas will be to produce output using both the element
name month and the element name nmonth.
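One way to picture the effect of the renaming pragma is as a relabeling step applied to the fallback output. The sketch below is an assumption-laden illustration: the function `apply_renames`, the mapping keyed by (parent element name, child element name), and the hand-written, simplified sample output are not taken from the proposal or from any implementation.

```python
import xml.etree.ElementTree as ET

def apply_renames(root: ET.Element, renames) -> ET.Element:
    """Rename child elements according to (parent name, child name) pairs."""
    # Collect first, then rename, so that lookups use the original names.
    pending = [
        (child, renames[(parent.tag, child.tag)])
        for parent in root.iter()
        for child in parent
        if (parent.tag, child.tag) in renames
    ]
    for child, new_name in pending:
        child.tag = new_name
    return root

# Simplified fallback output for an ISO-style date (hand-written sample).
fallback = ET.fromstring(
    "<iso><year>2022</year>-<nmonth>08</nmonth>-<day>01</day></iso>"
)
renamed = apply_renames(fallback, {("iso", "nmonth"): "month"})
result = ET.tostring(renamed, encoding="unicode")
```

Keying the mapping on the parent as well as the child reflects the fact that the pragma in the grammar annotates one occurrence of nmonth (in the iso rule), not the nonterminal everywhere.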
Name indirection
Use case: Using pragmas
to specify that an element or attribute name should be taken not
from the grammar but from the string value of a given
nonterminal.
Consider the following grammar which recognizes a superset
of a simple subset of XML. It's a subset of XML for simplicity,
and it's a superset of the subset because a grammar written at
this level cannot enforce all of the well-formedness constraints of
XML.
{ A grammar for a small subset of XML, as an illustration. }
document: ws?, element, ws? .
element: starttag, content, endtag; soletag .
-starttag: -"<", @gi, (ws, attribute)*, ws?, -">".
-endtag: -"</", @gi2, (ws, attribute)*, ws?, -">".
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
attribute: @name, ws?, -"=", ws?, @value.
@value: dqstring; sqstring.
-dqstring: dq, ~['"']*, dq.
-sqstring: sq, ~["'"]*, sq.
-dq: -['"'].
-sq: -["'"].
{ allow at most one PCDATA block between pieces of markup }
-content: PCDATA?,
((processing-instruction; comment; element)++(PCDATA?),
PCDATA?)?.
PCDATA: (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
processing-instruction: "<?", @name, ws, @pi-data, "?>".
comment: "<!--", comment-data, "-->".
gi: name.
gi2: name.
{ name is left as an exercise for the reader. }
ws: (#20; #A; #C; #9)+.
Among the input sequences which should be accepted by this
grammar is the following XML representation of a haiku.
<haiku author="Basho" date="1686">
<line>When the old pond</line>
<line>gets a new frog</line>
<line>it's a new pond.</line>
</haiku>
We might like an ixml processor to read this and produce
the same XML that any XML parser would produce. (This desire
makes sense only when the ixml processor's results are supplied
to a user in a DOM or XDM or SAX or other XML API or model. If
they are supplied as an XML character stream, we might as well
feed the XML straight to the downstream user; we don't need to
parse it.) What the grammar above will produce has a clear
structural similarity to
the input XML, but it is not the same:
<document xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<element gi="haiku" gi2="haiku">
<attribute name="author" value="Basho"/>
<attribute name="date" value="1686"/>
<PCDATA>
</PCDATA>
<element gi="line" gi2="line">
<PCDATA>When the old pond</PCDATA>
</element>
<PCDATA>
</PCDATA>
<element gi="line" gi2="line">
<PCDATA>gets a new frog</PCDATA>
</element>
<PCDATA>
</PCDATA>
<element gi="line" gi2="line">
<PCDATA>it's a new pond.</PCDATA>
</element>
<PCDATA>
</PCDATA>
</element>
</document>
We can invent suitable pragmas to allow ourselves to
obtain normal XML from parsing with the grammar:
-
name expression - specifies that the
name under which a nonterminal is to be serialized is
given by the string value of the supplied XPath expression,
interpreted with the standard ixml result element as the
context node.
-
serialize keyword - specifies that the
nonterminal is to be serialized as specified by the
keyword (which is assumed to be attribute, element,
or the name of some other XPath node test).
-
drop - specifies that the nonterminal so
annotated is to be suppressed entirely, along with the
entire parse tree dominated by the nonterminal.
With these pragmas, we can annotate the element and attribute rules appropriately:
^ {[name @gi]} element: starttag, content, endtag; soletag.
...
-endtag: -"</", {[drop]} @gi2, (ws, attribute)*, ws?, -">".
...
^ {[serialize attribute]}
{[name @name]}
attribute: @name, ws?, -"=", ws?, @value.
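The intended result can be approximated by a post-processing step over the fallback output shown earlier. This is a sketch under assumptions, not the pragmas' specified semantics: the function `normalize` is hypothetical, and the handling of PCDATA elements as plain text is an extra assumption that none of the three pragmas covers.

```python
import xml.etree.ElementTree as ET

def normalize(raw: ET.Element) -> ET.Element:
    """Rebuild a fallback <element> node as ordinary XML."""
    out = ET.Element(raw.get("gi"))            # name taken from @gi
    last = None                                # most recent child element
    for child in raw:
        if child.tag == "attribute":           # serialized as an attribute,
            out.set(child.get("name"), child.get("value"))  # named by @name
        elif child.tag == "element":
            last = normalize(child)
            out.append(last)
        elif child.tag == "PCDATA":            # assumption: unwrap as text
            if last is None:
                out.text = (out.text or "") + (child.text or "")
            else:
                last.tail = (last.tail or "") + (child.text or "")
    return out                                 # gi2 is never copied: dropped

raw = ET.fromstring(
    '<document><element gi="haiku" gi2="haiku">'
    '<attribute name="author" value="Basho"/>'
    '<attribute name="date" value="1686"/>'
    '<element gi="line" gi2="line"><PCDATA>When the old pond</PCDATA></element>'
    '</element></document>'
)
haiku = ET.tostring(normalize(raw.find("element")), encoding="unicode")
```

A processor that implements the pragmas would produce this shape directly during serialization; the two-step version here only makes the mapping visible.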
Rule rewriting
Use case: Using pragmas
to specify that a rule as given is shorthand for a set of other
rules. Consider the following simple grammar for arithmetic
expressions.
expr: term; expr, addop, term.
term: factor; term, mulop, factor.
factor: number; var; -'(', -expr, -')'.
...
We might find it inconvenient that the number 42 is
represented with an XML element tree four elements deep:
<expr>
<term>
<factor>
<number>42</number>
</factor>
</term>
</expr>
We might prefer a shallower tree.
One simple rule to simplify the XML representation of
sentences in this language is to specify that if an element
E has only one child, E should not be tagged and only the child
should appear in the XML.
We can do this in ixml by expanding the grammar, splitting
each nonterminal into two rules, one producing a visible
serialization and one hiding the nonterminal on serialization.
-EXPR: TERM; expr.
expr: EXPR, addop, TERM.
-TERM: FACTOR; term.
term: TERM, mulop, FACTOR.
-FACTOR: number; var; -'(', EXPR, -')'.
...
Now 42 parses more simply as <number>42</number>.
The rewrite is mechanical enough that we can automate it,
and error-prone enough that it is worth automating. If a rule
has some right-hand sides guaranteed to produce at most one
child each and some guaranteed to produce at least two children
each, it's split into two rules. The first gets a new
nonterminal and has the original single-child right-hand sides
as alternatives, as well as a reference to the original
nonterminal. It's marked hidden. The second rule gets the
original nonterminal. All references to the original
nonterminal are changed to be references to the new
nonterminal.
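The splitting procedure just described can be sketched as follows. The representation is an assumption made for the example: a grammar is a mapping from rule name to a list of alternatives, each a list of symbol strings, with a leading "-" for hidden symbols and hidden rules. The upper-case naming of new nonterminals and the crude child count (every unhidden symbol is taken to yield exactly one child) are likewise simplifications.

```python
from typing import Dict, List, Set

Alt = List[str]
Grammar = Dict[str, List[Alt]]

def children(alt: Alt) -> int:
    # Crude estimate: each symbol not marked hidden yields one child.
    return sum(1 for sym in alt if not sym.startswith("-"))

def rewrite_nur(grammar: Grammar, nur: Set[str]) -> Grammar:
    """Split each rule in `nur` that mixes single- and multi-child alternatives."""
    split = {
        n for n in nur
        if any(children(a) <= 1 for a in grammar[n])
        and any(children(a) >= 2 for a in grammar[n])
    }
    new_name = {n: n.upper() for n in split}   # hypothetical naming scheme

    def ref(sym: str) -> str:
        # Redirect references (hidden or not) to the new nonterminal.
        if sym in new_name:
            return new_name[sym]
        if sym.startswith("-") and sym[1:] in new_name:
            return "-" + new_name[sym[1:]]
        return sym

    out: Grammar = {}
    for name, alts in grammar.items():
        alts = [[ref(s) for s in alt] for alt in alts]
        if name in split:
            # Hidden rule: the single-child alternatives plus the original name.
            out["-" + new_name[name]] = [a for a in alts if children(a) <= 1] + [[name]]
            # Visible rule: keeps the original name and the larger alternatives.
            out[name] = [a for a in alts if children(a) >= 2]
        else:
            out[name] = alts
    return out

EXPRESSIONS: Grammar = {
    "expr": [["term"], ["expr", "addop", "term"]],
    "term": [["factor"], ["term", "mulop", "factor"]],
    "factor": [["number"], ["var"], ["-'('", "-expr", "-')'"]],
}
rewritten = rewrite_nur(EXPRESSIONS, {"expr", "term"})
```

In this run factor is left unsplit and only has its reference to expr redirected, which differs from the hand-expanded grammar above, where FACTOR is introduced as well.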
If we call the relevant pragma no-unit-rules, or more briefly
nur, the grammar takes the
following form. In practice, we also need a
pragma that means “don't rewrite the entire rule, but
replace references to rules rewritten using nur”;
we call this second pragma ref.
^ {[nur]} expr: term; expr, addop, term.
^ {[nur]} term: factor; term, mulop, factor.
- {[ref]} factor: number; var; -'(', -expr, -')'.
...
The XML representation of this grammar can plausibly
exploit the ability of extension elements to contain an XML
representation of the new rules. Both the nur
and the ref pragmas within a rule instruct the
implementation to replace the enclosing rule with the rules
appearing as children of the pragma elements.
<ixml>
<rule name="expr" mark="^">
<pragma pname="nur">
<pragma-data/>
<rule name="EXPR" mark="-">
<alt><nonterminal name="TERM"/></alt>
<alt><nonterminal name="expr"/></alt>
</rule>
<rule name="expr" mark="^">
<alt>
<nonterminal name="EXPR"/>
<nonterminal name="addop"/>
<nonterminal name="TERM"/>
</alt>
</rule>
</pragma>
<alt><nonterminal name="term"/></alt>
<alt>
<nonterminal name="expr"/>
<nonterminal name="addop"/>
<nonterminal name="term"/>
</alt>
</rule>
<rule name="term" mark="^">
<pragma pname="nur">
<pragma-data/>
<rule name="TERM" mark="-">
<alt><nonterminal name="factor"/></alt>
<alt><nonterminal name="term"/></alt>
</rule>
<rule name="term" mark="^">
<alt>
<nonterminal name="TERM"/>
<nonterminal name="mulop"/>
<nonterminal name="factor"/>
</alt>
</rule>
</pragma>
<alt><nonterminal name="factor"/></alt>
<alt>
<nonterminal name="term"/>
<nonterminal name="mulop"/>
<nonterminal name="factor"/>
</alt>
</rule>
<rule name="factor" mark="-">
<pragma pname="ref">
<pragma-data/>
<rule name="factor" mark="-">
<alt><nonterminal name="number"/></alt>
<alt><nonterminal name="var"/></alt>
<alt>
<literal string="(" tmark="-"/>
<nonterminal name="EXPR" mark="-"/>
<literal string=")" tmark="-"/>
</alt>
</rule>
</pragma>
<alt><nonterminal name="number"/></alt>
<alt><nonterminal name="var"/></alt>
<alt>
<literal string="(" tmark="-"/>
<nonterminal name="expr" mark="-"/>
<literal string=")" tmark="-"/>
</alt>
</rule>
...
</ixml>
The fallback behavior of a processor that doesn't support
these pragmas will be to serialize expr and term
elements even when they have only one child.
Tokenization annotation and alternative formulations
Use case: We can use
pragmas to annotate nonterminals in an ixml grammar to provide a
hint to the processor indicating that they define a regular
language and can be safely recognized by a greedy
regular-expression match.
For example, consider the grammar for a simple programming
language. A processor might read programs a little faster if it
could read identifiers in a single operation; this is safe
if, whenever an identifier is encountered, it always
consists of the longest available sequence of characters legal in
an identifier. In the toy Program.ixml grammar used as a running
example in Hillman 2020,
the rule for identifiers is:
identifier: letter+, S.
We can annotate identifier to signal that it's safe to
consume an identifier using a single regular-expression match by
using a pragma in a lexical scanning (ls) namespace:
{[token]} identifier: letter+, S.
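A processor that honors the hint might replace character-by-character parsing of identifier with one greedy match, roughly as below. The definitions of `letter` and `S` are not given in this excerpt, so the sketch assumes ASCII letters and optional whitespace; the function name `scan_identifier` is hypothetical.

```python
import re
from typing import Optional

# Assumed definitions: letter = [A-Za-z], S = optional whitespace.
IDENTIFIER = re.compile(r"[A-Za-z]+\s*")

def scan_identifier(text: str, pos: int) -> Optional[int]:
    """Return the position just after the longest identifier match at pos,
    or None if no identifier starts there."""
    m = IDENTIFIER.match(text, pos)
    return m.end() if m else None
```

Because the match is greedy and never reconsidered, the processor does not explore parses in which the identifier ends early, which is exactly what the annotation declares to be safe.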
The rules for comments in ixml itself offer another
wrinkle.
comment: -"{", (cchar; comment)*, -"}".
-cchar: ~["{}"].
Within a comment, any sequence of characters matching
cchar can be recognized in a
single operation; there is no need to look for alternate parses
that consume only some of the characters. But there is no
nonterminal here that matches all and only non-empty sequences
of cchar. In order to use the
token annotation here, we
must first rewrite the grammar at this point. So we introduce
an annotation named rewrite
to be attached to a single grammar rule with the meaning that
the pragma data provide an alternate form of the rule.
We can now annotate the grammar and supply an alternative
formulation of comment that
replaces it with two new rules:
^ {[rewrite
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
{[token]} -cchars: cchar*.
]}
comment: -"{", (cchar; comment)*, -"}".
-cchar: ~["{}"].
Or we may find it easier to read if we inject the
alternative formulation after, not before, the existing rule:
comment: -"{", (cchar; comment)*, -"}"
{[rewrite
comment: -"{", cchars, (comment++cchars, cchars)?, -"}".
- {[token]} cchars: cchar*.
]}.
-cchar: ~["{}"].
Either way, the rewrite contains an alternative
formulation of the grammar which recognizes the same sentences
and provides the same XML representation but may be processed
faster by some processors.
The fallback behavior of a processor that doesn't support
these pragmas will be to parse as usual using the grammar as
specified.
Note however that there is no way to guarantee or impose
an effective requirement that the alternate rules in a
rewrite pragma be equivalent
to the fallback rules: pragmas may change the behavior of a
processor, and they may change the meaning of an expression (or
here the meaning of a grammar or part of it).
Text injection
Use case: Using pragmas
to specify that additional text should be injected into the output
at a particular point (as part of a text node, or attribute value).
The text injection use case stands as an example of how a
language may evolve to incorporate features that make some
pragmas unnecessary or obsolete. The insertions feature in Invisible
XML 1.0 was a relatively late addition to the language. Work on
a proposal for pragmas began more than a year
earlier. The text injection pragma use case explored the
question of whether the pragma mechanism could be used to inject
text into the output. And indeed it could. But the insertions
feature has made it obsolete.
Pragmas offer implementers and designers an opportunity to
experiment with, and test designs for, functionality that may
eventually become part of the specification.
What next?
As noted above, the first versions of the pragmas proposal
described here were developed and discussed within the Invisible
XML community group. After it became clear that the group would
not integrate pragmas into Invisible XML 1.0, the proposal was
re-formulated as an optional add-on layered on top of ixml, rather
than as a part of the ixml specification.
The next steps now are
-
to draft a formal specification of
the pragmas framework,
-
to draft stand-alone specifications of some pragmas
which appear to be of general interest (both as examples, and
in the case of pragmas of general interest to avoid multiple
incompatible implementations of the same additional
functionality), and
-
to integrate support for the pragmas framework into
processors, optionally with support for selected
pragmas.
Appendix A. Modified ixml syntax
The ways in which the pragmas proposal changes the syntax
of ixml were outlined in the main body of the text; this appendix
presents the modified grammar in complete form. Insertions
and modifications are given in bold.
{ixml grammar version 2022-06-07, modified for pragmas 2022-07-15}
ixml: s, prolog?, rule++RS, s.
-s: (whitespace; comment)*. {Optional spacing}
-RS: (whitespace; comment)+. {Required spacing}
-sp: (whitespace; comment; pragma)*. {Spacing with pragmas}
-whitespace: -[Zs]; tab; lf; cr.
-tab: -#9.
-lf: -#a.
-cr: -#d.
comment: -"{", ((comment; ~["[]{}"]), (cchar; comment)*)?, -"}".
-cchar: ~["{}"].
prolog: version, s, (ppragma++s, s)?; ppragma++s, s.
version: -"ixml", RS, -"version", RS, string, s, -'.' .
ppragma: -"{[+", @pname, (whitespace, pragma-data)?, -"]}".
rule: annotation, name, s, -["=:"], s, -alts, (pragma, sp)?, -".".
-annotation: (pragma, sp)?, (mark, sp)?.
pragma: -"{[", @pname, (whitespace, pragma-data)?, -"]}".
@pname: name.
pragma-data: (-pragma-char; -bracket-pair)*.
-pragma-char: ~["{}"].
-bracket-pair: '{', -pragma-data, '}'.
@mark: ["@^-"].
alts: alt++(-[";|"], s).
alt: term**(-",", s).
-term: factor;
option;
repeat0;
repeat1.
-factor: terminal;
nonterminal;
insertion;
-"(", s, alts, -")", s.
repeat0: factor, (-"*", s; -"**", s, sep).
repeat1: factor, (-"+", s; -"++", s, sep).
option: factor, -"?", s.
sep: factor.
nonterminal: annotation, name, s.
@name: namestart, namefollower*.
-namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
-terminal: literal;
charset.
literal: quoted;
encoded.
-quoted: tannotation, string, s.
-tannotation: (pragma, sp)?, (tmark, sp)?.
@tmark: ["^-"].
@string: -'"', dchar+, -'"';
-"'", schar+, -"'".
dchar: ~['"'; #a; #d];
'"', -'"'. {all characters except line breaks; quotes must be doubled}
schar: ~["'"; #a; #d];
"'", -"'". {all characters except line breaks; quotes must be doubled}
-encoded: tannotation, -"#", hex, s.
@hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
-charset: inclusion;
exclusion.
inclusion: tannotation, set.
exclusion: tannotation, -"~", s, set.
-set: -"[", s, (member, s)**(-[";|"], s), -"]", s.
member: string;
-"#", hex;
range;
class.
-range: from, s, -"-", s, to.
@from: character.
@to: character.
-character: -'"', dchar, -'"';
-"'", schar, -"'";
"#", hex.
-class: code.
@code: capital, letter?.
-capital: ["A"-"Z"].
-letter: ["a"-"z"].
insertion: -"+", s, (string; -"#", hex), s.
References
[Bray et al. 2009]
Bray, T. et al. eds., 2009.
Namespaces in XML 1.0 (Third Edition).
W3C Recommendation, 8 December 2009.
[Grune/Jacobs 1990/2008]
Grune, Dick, and Ceriel J. H. Jacobs.
1990/2008.
Parsing techniques: a practical guide.
First edition New York et al.: Ellis Horwood, 1990.
Second edition [New York]: Springer, 2008.
[Hillman 2020]
Hillman, Tomos.
XSLT Earley: First Steps to a Declarative Parser Generator.
Presented at XML Prague, 2020, Prague, Czech Republic.
In XML Prague 2020 Conference Proceedings, pp. 231-249.
[Ichbiah et al. 1986]
Ichbiah, Jean D., John G. P. Barnes, Robert J. Firth, and Mike Woodger. 1986.
Rationale for the design of the Ada programming language.
Ada Joint Program Office: U. S. Government.
[ISO 8879:1986]
International Organization for Standardization (ISO).
1986.
ISO 8879-1986
(E). Information processing — Text and Office Systems —
Standard Generalized Markup Language (SGML). International
Organization for Standardization, Geneva, 1986.
[Jensen and Wirth 1974/1985]
Jensen, Kathleen, and Niklaus Wirth.
1974, 3d ed. 1985.
Pascal user manual and report,
revised for the ISO Pascal standard.
Third edition.
New York, Berlin, Heidelberg, Tokyo: Springer, 1985.
[Kay 2017]
Kay, M. ed., 2017.
XSL Transformations (XSLT) Version 3.0.
W3C Recommendation, 21 March 2017.
[Lindsey 1996]
Lindsey, C. H., 1996.
A history of ALGOL 68.
In Thomas J. Bergin and Richard G. Gibson (eds.)
History of Programming Languages II. New York: ACM.
[Melton et al. 2017]
Melton, J. et al. eds., 2017.
XQueryX 3.1.
W3C Recommendation, 21 March 2017.
[Naur et al. 1960]
Naur, Peter, ed., et al.
1960.
Report on the algorithmic language Algol 60.
Communications of the Association for
Computing Machinery 3.5 (May 1960): 299-314.
doi:https://doi.org/10.1145/367236.367262.
(Also published simultaneously in
Numerische Mathematik.)
[Pemberton 2013]
Pemberton, Steven.
Invisible XML.
Presented at Balisage: The Markup Conference 2013,
Montréal, Canada, August 6 - 9, 2013.
In Proceedings of Balisage: The Markup Conference 2013.
Balisage Series on Markup Technologies, vol. 10 (2013).
doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.
On the web at
http://www.balisage.net/Proceedings/vol10/html/Pemberton01/BalisageVol10-Pemberton01.html.
Revised version (January 2014) at
https://homepages.cwi.nl/~steven/Talks/2013/08-07-invisible-xml/invisible-xml-3.html
[Pemberton 2022]
Pemberton, Steven. Invisible XML Specification.
Published by the Invisible Markup Community Group
on the web at
https://invisiblexml.org/1.0/
[Robie et al. 2017a]
Robie, J, et al. eds., 2017.
XML Path Language (XPath) 3.1.
W3C Recommendation, 21 March 2017.
[Robie et al. 2017b]
Robie, J, et al. eds., 2017.
XQuery 3.1: An XML Query Language.
W3C Recommendation, 21 March 2017.
[van Wijngaarden et al. 1976]
van Wijngaarden, A., et al., ed.
1976.
Revised report on the algorithmic language Algol 68.
Heidelberg, New York: Springer, 1976.