Introduction
Whereas the computing community invents a new programming language almost every week, and the best ideas from these many experiments find their way into perhaps one programming language a year that sees the light of day outside the project that conceived it, new markup languages are rather rare, and most attempts to create them (such as the MicroXML project1) have self-imposed constraints of compatibility that limit the freedom of the designers to find new ways of doing things, even in areas where existing designs are universally acknowledged to be problematic.
Invited to run a course at a summer school in August 2012 for a high-achieving group of German undergraduates, I decided to take the opportunity to remedy this. While enjoying the thin air of the Swiss Alps in the Romansch-speaking village of Ftan at 1700m above sea level, the students spent the first week learning the technologies in the XML stack, and the second week designing a replacement. The result was FtanML. [1]
FtanML Goals
Some of the design goals the students set themselves at the end of the first week were:
-
The language would be as good as JSON3 in handling typed data, and as good as XML in handling documents.
-
The language would be more concise than XML, while still being human-readable.
-
Both a syntax and a data model would be defined; the data model must map readily to data structures available in most modern programming languages.
Perhaps as important was a non-goal of which we had to remind ourselves frequently: compatibility with the past was not an objective. We did not want to repeat other people's mistakes for the sake of compatibility, whether at the level of documents, parsers, APIs, editing tools, or simply user expectations. Associated with this goal was the implicit decision that we would not compromise technical quality in the interests of market acceptance. The aim was to do it right, and we would not measure success by the level of adoption. Having said that, there was no point in being needlessly different when there was nothing wrong with existing designs.
During the second week of the course we defined the FtanML markup language and object model, and implemented a parser using JavaCC. In the weeks after the course, some of the students rewrote the parser in Scala, and together we worked on extending the system with a type/constraint language. Inevitably, with the students dispersed to their various institutions, momentum was lost, but I decided that there were enough good ideas that it was worth bringing the design to some kind of completion. This paper provides an overview of the language rather than a complete specification (which remains as work to be done). A Scala implementation covering a significant subset is available at [2].
Requirements Background
XML has been remarkably successful and is widely used. It meets a wide variety of needs, achieves a high level of interoperability, and is not expensive to implement.
Nevertheless, over a period of 15 years' use, the drawbacks and limitations of XML have become well known, and are acknowledged by XML's critics and enthusiasts alike. Perhaps the most notable limitations and frustrations are:
-
XML has been widely adopted as a serialization format for structured data, but its data model has a poor fit to the type systems of most popular programming languages. Hence alternatives such as JSON and YAML4.
-
XML is over-complex. Many of its features are rarely used, or used only in very simple ways, but still make everything more complicated. Hence MicroXML.
-
XML cannot handle overlap or graph structures. Hence LMNL5 and GODDAG6.
-
XML is verbose and inefficient. Hence the various Binary XML contenders, including Fast Infoset7 and EXI8, as well as the adoption of custom non-XML syntax for various applications such as RelaxNG and RDF in direct competition with an XML syntax for the same information.
-
XML is syntax without an agreed data model. No-one knows, for example, whether CDATA sections should be treated as information-bearing or not. Similarly for comments. Hence the myriad XML data models such as DOM and XDM, all of them different.
So there's clearly room for improvement. A standard, once entrenched, rarely gives way to a technically superior alternative: the QWERTY keyboard is an oft-cited example, and XML will probably be no exception. However, there's room for diversity, and the aim of this exercise is to explore what is possible. It doesn't tackle all the problems noted above (for example, there's nothing on overlap or graph structures); but it tries to address most of them.
FtanML: The Markup Language
This section presents the syntax of FtanML. We'll present the "data-only" core of the language at this stage, but with some forwards references to how the language is subsequently extended to enable active scripting of documents.
A document (the unit of input to the parser) is a sequence of Unicode characters conforming to the grammar defined in this section. The encoding of characters as octets (or as scratches on clay tablets) is out of scope — it belongs in a different layer of the protocol stack. But if in doubt, UTF-8 is recommended.
The document must match the value
production.
value ::= null | boolean | number | string | list | element | richText
As this production shows, there are seven kinds of value, which we will present in turn, starting with the simplest. The term "rich text" means text with interspersed markup: what the markup community traditionally calls "mixed content".
Later we will introduce an eighth kind of value, namely functions. But first, let's start with an example.
FtanML Example: the Purchase Order
This is what the purchase order from the XML Schema Primer2 might look like in FtanML.
<purchaseOrder orderDate="1999-10-20" shipTo = <country="US" [ <name "Alice Smith"> <street "123 Maple Street"> <city "Mill Valley"> <state "CA"> <zip 90952> ]> billTo = <country="US" [ <name "Robert Smith"> <street "8 Oak Avenue"> <city "Old Town"> <state "PA"> <zip 95819> ]> comment = |<emph |Hurry|>, my lawn is going wild| items = [ < partNum="872-AA" productName="Lawnmower" quantity=1 USPrice=148.95 comment=|Confirm this is <strong |electric|>| > < partNum="926-AA" productName="Baby Monitor" quantity=1 USPrice=39.98 shipDate="1999-05-21" > ] >
This example follows the example given in the XML Schema Primer very closely; I've only made one change, which is to use rich text in the comment fields. Let's compare it with the XML version:
-
End tags reduce to a simple ">".
-
The content of an element, and the content of an attribute, can be either a string (in single or double quotes), a number, a boolean (not used in this example), rich text (delimited with vertical bars), an element, or a list of elements (inter alia). When elements have element content, the child elements are enclosed in a list marked by square brackets.
-
Since an attribute can contain anything an element can contain, it's possible to use structured attributes, and I have taken advantage of this. I have chosen to use attributes rather than child elements in cases where ordering does not matter, and where there is only one child of the parent element with a given name: specifically for the top-level properties of a purchase order, and for the properties of each item. Where there is some significance in the ordering, as with the components of an address, I chose to use child elements.
-
In the list of items, the original XML has an element named
items
, whose children are all elements nameditem
. Since the name of the child element is always the same, it is redundant, so I chose to leave it out: the content of theitems
attribute is now a list of anonymous elements. -
There's a difference between a singleton and a list of length one. Lists are always explicitly marked with square brackets. That might be a little inconvenient for authors, but it makes life a lot easier for the programmer at the receiving end. (You could choose to allow the
items
attribute to contain a single item rather than a list if only one item has been ordered, but the program reading the data would then have to cater for both possibilities.) -
For the purpose of the example I have followed the XML Schema Primer in defining the ZIP code as a number, though in reality it should be a string of digits, which is not the same thing.
-
There's no ambiguity about where whitespace is and is not significant. It's only significant if it appears in a string, or in rich text.
If we compare this with how it might have been done in JSON, there are two main differences. Firstly, JSON provides no satisfactory way to handle the mixed content comments. Secondly, with JSON we would have to make a choice how to represent the addresses: either use an object (i.e. a map), in which case ordering information is lost, or use an array in which case the components have no names. A minor difference with JSON, or at least with official JSON, is that we don't need quotes around the element and attribute names.
Now let's look at the individual constructs of FtanML.
The null value
null ::= "null"
There is a single value in this class, denoted by the keyword "null". It is borrowed directly from JSON, but plays a wider part in the data model.
Boolean values
boolean ::= "true" | "false"
There are two boolean values, denoted by the keywords "true" and "false".
Numeric values
number ::= "-"? digits ("." digits)? ([eE] [+-]? digits)? digits ::= [0-9]+
The production rule for numbers is a little different from both the
DoubleLiteral
of XPath 2.0 (it requires digits both before and
after the decimal point), and the equivalent in JSON (it allows leading zeros). The
value space is not binary floating point, but decimal. Specifically, it is the set
of values that can be represented in the form N * 10^M
where
N
and M
are integers, and N
is not a
multiple of ten. Implementations may impose limits on this infinite set.
Why decimals? Because that's what most human beings on the planet use in their everyday lives. Floating-point binary is designed for machines, not for humans. Also, because it survives round-trip conversion to and from text without ambiguity or loss of precision. However, the use of decimals gives a problem with the design goal that it should be easy to program using conventional programming languages. We take a hit here: in the case of programming languages with no decimal data type, numbers may be converted to whatever number system that language uses. But the native language for processing FtanML, namely FtanSkrit, treats the values as decimals.
Strings
string ::= ('"' (charRep - '"')* '"') | ("'" (charRep - "'")* "'") charRep ::= (char - "\") | escape char ::= (any Unicode character) escape ::= (see prose)
Strings are enclosed in either double or single quotes. The value space for strings is the set of all sequences of Unicode characters. In the FtanML representation of a string, these characters are represented as themselves, except in the case of characters that have a special meaning, notably the string delimiter, and the escape character "\".
Escape sequences fall into a number of categories:
-
Whitespace escapes:
\n
,\r
,\t
,\s
, and\S
represent newline, carriage return, tab, space, and non-breaking space respectively. -
Formatting escapes:
\
followed by a sequence of whitespace characters represents nothing. This means that a FtanML editor can reformat the text for display purposes by inserting or removing escaped newlines without changing the actual content. -
Special character escapes:
\\
for backslash,\"
for quotation mark,\'
for apostrophe,\|
for vertical bar,\`
for backtick,\<
for a left angle bracket,\[
for a left square bracket,\{
for a left curly brace. -
Unicode codepoint escapes:
\xHHHHH;
represents the Unicode codepoint whose hexadecimal value isHHHHH
. This may be any number of digits, followed by a semicolon. (Unlike JSON, non-BMP characters are represented by the actual codepoint, not by a surrogate pair.) -
Cells:
\[§....§]
where§
is any character that does not appear in the string. This is analogous to XML's CDATA section, except that it can also be used in attributes: it allows a literal string to appear without escaping of special characters. For example a sequence of four backslashes might be written\[⟡\\\\⟡]
. Cells are handy for things such as regular expressions and Windows filenames, and for authoring papers that describe new markup languages.[2]
The only characters that must be escaped in strings are \
,
{
, and the character used as the string delimiter ("
or '
). We'll come on to the significance of curly braces
later.
Lists
A list is a sequence of values. The values may be of any of the seven kinds (null, boolean, number, string, list, element, or rich text).
The unabbreviated syntax is the same as for arrays in JSON:
list ::= "[" (value ("," value)* )? "]"
For example, [1, 3, "London", null]
Two abbreviations are allowed:
-
Commas may be omitted, so
[1 2 3]
is equivalent to[1,2,3]
and[<first><last>]
is equivalent to[<first>,<last>]
. -
The value
null
is implicit if there is nothing between two commas, or before the first comma, or after the last. So[,,]
is equivalent to[null,null,null]
The effect of these two rules is that the abbreviated syntax for lists becomes:
lists ::= "[" ( value | ",")* "]"
Whitespace is needed between two values only where necessary to terminate a token; specifically, when one value ends with an alphanumeric character and the next starts with an alphanumeric character.
Elements
Elements serve the same purpose as objects (maps) in JSON and elements in XML.
element ::= "<" name? (name "=" value)* content? ">" content ::= value
Elements have three parts: an optional name, a set of name/value pairs called attributes, and an optional value referred to as the element's content.
The values of attributes can be of any type: not just strings as in XML, but numbers, booleans, lists, elements, rich text. An attribute with the value null is deemed equivalent to omitting the attribute.[3]
Attribute names within an element must be distinct.
Like attributes, the content value can be of any type.
As with lists, whitespace is needed only where necessary to terminate a token.
We'll have more to say on element and attribute names later. For the moment, suffice it to say that the name can be any non-empty string. If the name contains special characters it can be written within backticks (a convention borrowed from the SQL world).
Here are some examples of elements. (We haven't explained rich text yet, so we won't use it in any of our examples):
Table I
Example | Explanation |
---|---|
<> | An empty element (no name, attributes, or content) |
<br> | An empty element named br |
<age 23> | An element whose name is age and whose content
value is the number 23
|
<colors ["red", "green", "blue"]> | An element whose name is colors and whose
content value is a list of three strings
|
<x=0.13 y=0.57> | An unnamed element containing two attributes, both numeric |
<polygon coords=[[1,1], [1,3], [3,2]]> | An element named polygon with an attribute named
coords whose content value is a list; the list
contains three sublists, and each sublist contains two
numbers.
|
<[<i><j><k>]> | An unnamed element whose content value is a list of three elements. Note the omission of the optional commas. |
<`Graduate Trainee` `date of birth`="1995-01-01"> | An element where both the element name and attribute name contain spaces. |
Rich Text
Rich Text (known in the XML world as mixed content) consists of characters and markup, or more specifically a sequence whose members are either characters or elements.
richText ::= "|" (charRep | element) "|"
Rich text is written between vertical bars.[4]
For example: |H<sub|2|>O|
represents text consisting of the
character "H", an element whose name is sub
and whose content is the
rich text "2", and the character "O".
Escapes can be used in rich text just as they can in strings. Any recognized escape sequence may be used; the only characters that must be escaped are "\", "|", "{", and "<".
Whitespace
FtanML (unlike XML) is explicit about the difference between significant and insignificant whitespace.
Whitespace appearing directly within a string or within rich text is significant and is retained in the data model — except that a sequence of whitespace characters preceded by a backslash is ignored (this is formatting whitespace, used only to make the text more easily readable on screen or paper). Whitespace between tokens in a list or element is insignificant and is not retained. Whitespace is never required between tokens unless necessary for disambiguation.
Note that because elements may be embedded in rich text, these rules apply recursively. Whitespace characters appearing between the tokens of an element that itself appears within rich text are not significant; it is the immediate container that matters. Support for rich text means that unlike JSON, this is not a two-level grammar where it makes sense to think of a tokenization phase followed by a syntax analysis phase, with whitespace being discarded during tokenization.
Names and Namespaces
As stated earlier, the name of an element or attribute may be any string. Names without special characters are called simple names; those containing special characters must be written with enclosing backticks (grave accent, x60), and are called quoted names.
The rule for a simple name is that it must begin with a letter or underscore, and continue with letters, digits, or underscore. The terms "letter" and "digit" are defined by reference to Unicode character categories.
A quoted name may use escaped characters in the same way as a string literal. The only characters that must be escaped are the backslash and backtick.
name ::= simpleName | quotedName simpleName ::= [\p{L}_][\p{L}\p{D}_]* quotedName ::= "`" ((charRep - "`") | Escape)+ "`"
A name written in a FtanML document, with or without backticks, cannot be zero length; in the data model, however, the content value is modelled as an attribute with a zero-length name.
There are no namespaces in FtanML.
As a matter of convention, it is recommended that an element or attribute intended
to be used in an alien context, that is, a context where the containing element is
part of a different vocabulary defined by a different specification,
should be made unique by use of a "reverse-DNS" qualified name along the lines of
org_w3c_xsl_transform
.
By contrast, in the normal case where an element or attribute
always has a containing element whose name is defined as part of the same vocabulary,
short names such as status
or
name
are perfectly adequate and cause no ambiguity.
For interoperability with XML, there may be cases where it is desirable to use the same names for elements and attributes as defined in an XML vocabulary. There are two ways this might be done:
-
The XML expanded name can be used in Clark notation, enclosed in backticks. For example:
[<`{http://www.w3.org/1999/XSL/Transform}stylesheet` version="2.0"...>
-
Prefixes and namespace declaration attributes may be used, following XML conventions:
<`xsl:stylesheet` `xmlns:xsl`="http://www.w3.org/1999/XSL/Transform" version="2.0"...>
. The FtanML system will not attach any meaning to such namespace declaration attributes, but it is capable of representing them if required. Note that any name containing a colon (or various other characters such as ".") needs to be backtick-quoted.
Data Model
The data model for FtanML corresponds closely to the syntactic structure.
Null values, booleans, strings, and numbers need no further explanation.
A list is an ordered sequence of values; a list of length one is not the same thing as its singleton member.
An element comprises a name (which is a string, or absent) and a set of zero or more name/value pairs, the element's attributes. The content value of the element is modelled as an attribute whose name is the zero-length string. Attributes whose value is null are treated as absent.
Rich text is modelled as a sequence of strings and elements, in which no string is zero-length, and no two strings are immediately adjacent. But note that rich text is a distinct data type and is distinguishable from a list of strings and elements. [5]
All values in the model are immutable; modifications always involve creating new
values rather than modifying existing values. There is no notion of identity; it is
not
meaningful to ask whether two lists both containing the values [1,2,3]
are
"the same list", and this is also true for elements.
These concepts have mappings to the data structures of popular programming language
that in most cases are fairly obvious. There are a few exceptions: some languages
do not
have a natural way of representing decimal numbers; others have difficulty representing
Unicode strings, especially strings in which the NUL character (x00
) is
permitted. The way in which such conflicts are resolved is outside the scope of this
paper.
A noteworthy feature of the data model is that there are no "parent pointers". It is not possible to navigate from a value to its container. Closely related to this, values have no "identity" in the sense of object-oriented data models. In this respect the data model follows JSON rather than the various models used to represent XML. The absence of parent references and object identity creates some challenges, but has many benefits in establishing a purely functional semantics for the processing language, and in enabling efficient transformation: it means, for example, that copying a subtree from one element to another is a very cheap operation, because the physical data can be shared. [6]
The Schema Language: FtanGram
The schema language can be used to define constraints on values, including constraints on entire documents. This is the only purpose of a schema; validation returns a true or false answer, perhaps with a stream of error messages as a side effect, but it does not change the data being validated in any way, except perhaps as an internal optimization.
A type is thus a predicate; it distinguishes values that match the type from those that do not.
A schema is a set of named types. The seven named types null
,
boolean
, number
, string
, list
,
element
, and text
are always available; other types are
user-defined.
Types have a representation as FtanML elements, and we will use this representation in discussing types. However, the element used to represent a type must not be confused with the type itself.
The convention for type representations is to use elements such as
<number gt=0 le=1000>
, where number
is the
name of a base type, and attributes such as gt=0
and le=1000
define constraints. These attributes are referred to as facets. If there are multiple attributes, they define multiple
constraints, which are independent and orthogonal. In this example, the gt
facet defines a minimum value (exclusive), while the le
facet defines a
maximum value (inclusive). Specifying a base type is often unnecessary — in this example
every value that can be greater than zero is necessarily a number, so every value
that
satisfies the predicate will also satisfy the base type. However, including the base
type can still be useful to aid clarity.
Although we speak of "base type" here, there is no type hierarchy. One value can belong to any number of types, and although it may be true that one type subsumes another, the language makes no use of the fact. Naming a base type in a type representation merely indicates that to satisfy the type, a value must satisfy all the constraints imposed by the base type in addition to the facets explicitly listed.
Before we get into a detailed exposition, we'll again start with an example.
FtanGram Example: the Purchase Order Schema
In this section we present the schema for the purchase order shown earlier. This is based on the example schema in the XSD primer, modified to correspond with the way we restructured the instance document to take advantage of FtanML.
<org_ftanml_schema [ <import "ftan_calendar.ftg"> <types purchaseOrderType = <element form=<purchaseOrder shipTo=<addressType> billTo=<addressType> comment=<nullable<text elements=<inlineType>>> items=<occurs=[1,] <itemType>> > addressType = <element form=< country=<eq="US"> <seq [ <element form=<name <string>>>, <element form=<street <string>>>, <element form=<city <string>>>, <element form=<state <string>>>, <element form=<zip <number>>>]> > itemType = <element form=< partNum=<SKUType> productName=<string> quantity=<number ge=1 lt=100 step=1> USPrice=<number ge=0 step=0.01> comment=<nullable<text elements=<inlineType>>> shipDate=<nullable<org_ftanml_calendar_dateType>> > > inlineType = <element elemName=<enum=["ital", "bold"]> form=<<inlineType>> > SKUType = <string pattern="\[#\d{3}-[A-Z]{2}#]"> > ]>
Looking at this in a little detail, we see:
-
A schema is a set of named types. Some of these types are defined inline, some (in this case
org_ftanml_calendar_dateType
) are imported from an external type library. -
Elements are defined using the
form
attribute. The value of this attribute is a proforma element. The name of the proforma element matches the name of the instance element; the attributes of the proforma element define the types of the attributes of the instance element; and the content value of the proforma element defines the type of the content value of the instance element. -
An optional attribute is given a type such as
<nullable<T>>
. This reflects the fact that an absent attribute is equivalent to an attribute that has the explicit value of null; so as well as the normal type of the attribute, the schema must also allow it to take the value null. -
Note the use of a "cell" for escaping the regular expression in the pattern facet for
SKUType
. This helps to avoid clutter in a string that makes generous use of special characters, especially backslashes.
Constructing Types
The construct <value>
represents a type that matches
every value.
Given types T
, U
, V
, the construct
<anyOf [T, U, V]>
represents the union of these types,
while <allOf [T, U, V]>
represents their intersection.
For example, <anyOf [<number>, <string>]>
allows numbers
and strings, while <allOf [<positive>, <even>]>
allows
values provided they satisfy both the (user-defined) types positive
and
even
.
For convenience, the construct <nullable <T>>
is equivalent
to <anyOf [<T>, <null>]>
: that is, either T
or
null. Thus <nullable <number>>
matches either a number, or
null.
An enumeration type can be defined using the construct
<enum=[A,B,C,...]>
. For example,
<enum=["red", "green", "blue"]>
matches the three
specified strings and nothing else. A singleton enumeration can be defined with the
eq
facet: for example <eq="">
matches the
zero-length string only.
The construct <not <T>>
denotes a type that matches all
values that are not instances of T
. This can be useful in constructing
more complex types; for example <not<eq="">>
matches all
non-empty strings, while <allOf [<number>, <not <eq=0>>]>
matches values that are numbers and that are not equal to zero.
The most general way of defining a restriction is with an assertion facet, for
example: <assert={$.startsWith("abc")}>
. To understand
assertions, however, we need to look at the scripting language, which comes later
in
the paper. (The curly braces signal that the value is a function; this represents
an
extension to the base FtanML syntax which is used only in scripts.)
Restricting numbers
Numeric ranges may be defined using the four attributes ge
,
gt
, le
, and lt
, corresponding to the XML
Schema facets minInclusive
, minExclusive
,
maxInclusive
, and maxExclusive
, together with the
facets eq
and ne
which are applicable to all values. For
example, the type consisting of numbers in the range 0 to 100 inclusive may defined
as <number ge=0 le=100>
. (As mentioned earlier, the element
name number
is redundant, because only a number can satisfy the other
constraints.)
A step
facet constrains the number to be an integer multiple of the
given increment. The most common values (both found in our example schema) are 1,
which requires the value to be an integer, and 0.01, which is often suitable for
currency amounts. Specifying step=17.2
would be unusual, but is
perfectly legal. The facet does not constrain the way the value is written, for
example an integer can be validly written as 1.00000
.
Restricting strings
Strings may be restricted using a regular expression, for example
<string pattern="[A-Z]*">
. There are no special facets
for defining a minimum, fixed, or maximum length, since regular expressions are
sufficient for this purpose.
Restricting lists
A list can be constrained with a grammar. A grammar is a facet like any other:
just another way of defining a restriction on the content, and it is defined in the
same way: <list grammar=....>
. A simple grammar might allow a list
to consist of a sequence of zero or more numbers. This would be defined like
this:
<list grammar=<number occurs=[0,]>>
To take another example, a grammar might require a value to be a list comprising a string, a number, and a boolean. Here is the definition:
<list grammar=<seq [<string>, <number>, <boolean>]>>
Unlike most schema languages in the XML world, grammars can constrain any sequence of values, not only a sequence of elements. In principle, if there are subtypes of string representing nouns, verbs, and so on, then a grammar could constrain a list to contain a sequence of words making up an English sentence.
The "alphabet" of the grammar — the set of tokens it recognizes — is the set of types. The fact that a value might belong to more than one of these types does not matter. The grammar exists not to define an unambiguous parse tree of the input, but only to determine whether the input is valid against the type definition or not.
A grammar can be represented as a tree of particles. Each particle consists of a term (what does it match?), and a repetition indicator (how often does it match?). For leaf particles, the term is a type. Non-leaf particles are either sequence particles or choice particles, and in each case the term is the set of child particles in the tree.
The value of the grammar
facet is an element representing
the root particle in this tree.
The three kinds of particle are represented as follows:
-
A sequence particle is represented by an element named
seq
; an optionaloccurs
attribute; and content which is a list containing the child particles in the tree. For example:<seq occurs=[0,] [<white>,<black>]>
, which matches an alternating sequence of values of types<white>
and<black>
. -
A choice particle is represented by an element named
choice
; an optionaloccurs
attribute; and content which is a list containing the child particles in the tree. For example:<choice occurs=[0,] [<white>,<black>]>
, which matches sequence of values, each of which can be either of<white>
or<black>
type. -
A leaf particle is represented by the same element used to describe the type, augmented if necessary with an
occurs
attribute. For example<number>
, or<number occurs=10>
. Theoccurs
attribute defaults to 1; it appears alongside the attributes defining facets of the type, though it is not really a property of the type, but rather of the particle referring to the type.
The value of the occurs
attribute is either an integer (indicating a
fixed number of occurrences), or a list of size two (indicating a range with a
minimum and maximum). The first item must be an integer, the second can be either
another integer, or null to indicate an unbounded range. For example
[0,1]
indicates an optional particle (zero or one occurrences),
[0,]
indicates zero or more, and [1,]
indicates one or
more. The default is occurs=1
.
Some further examples of grammars are shown in the table below:
Table II
Example | Explanation |
---|---|
<seq [<string>, <number>, <number>]> | A string followed by two numbers |
<seq [<string>, <number occurs=2>]> | A string followed by two numbers |
<occurs=[0,] <seq [<string>, <number>]>> | An alternating sequence of strings and numbers |
<enum=["red", "green", "blue"] occurs=[1,]> | A sequence of one or more strings each taken from a defined set of colour values |
<occurs=[0,100] <choice [<string>, <number>]>> | A list of up to 100 items, each of which may be either a
string or a number. Note that when the sub-particles of a choice
are leaf particles, an alternative approach is to define a union
type using <anyOf> |
Many of these examples serve the purpose that in XML Schema would be achieved using simple types of variety list or union. But of course, in the document markup tradition, grammars are commonly used to define sequences of elements, and we will see examples of this in the next section.
Restricting elements
The simplest way to place restrictions on elements is by use of the form
facet.
Its value is an element, known as a proforma, which works as follows:
-
The name of the proforma element constrains the name of the target element.
-
The attributes of the proforma element constrain the attributes of the target element.
-
The content value of the proforma element constrains the content value of the target element.
For example, the proforma:
<img height=<number> width=<number> <null>>
represents an element whose name must be "img", whose height
and
width
attributes must be numbers, and whose content value must be
absent (null).
This proforma can be used to define an element type like this:
<element form=<img height=<number> width=<number> <null>>>
Like all facets, a proforma can only define restrictions. If the proforma includes no element name, then it places no restrictions on the element name. If a particular attribute is not present in the proforma, then it places no restrictions on the presence or content of that attribute. If the proforma has no content value, then the content value of the target element is unconstrained.
If an attribute is to be optional, this can be indicated by permitting null as the
value: for example writing height=<nullable<number>>
indicates that the height
attribute must either be a number, or null.
Recall that omitting an attribute is the same as giving it a value of null.
Some additional facets are available for elements for use where the proforma construct is insufficiently expressive:
-
The
elemName
facet defines the type of the element name.For example
<element elemName=<enum=["i", "b", "u"]>>
constrains the element name to be one of the names listed. -
The
attName
facet defines the type that all attribute names must conform to (for example, as an enumeration, or by means of a pattern). This is the easiest way of prohibiting attributes from appearing (the other way is to constrain the value to be null). For example,attName=<ne="xmlns">
would disallow the attribute namexmlns
; this constraint could also be expressed in the proforma asxmlns=<null>
. -
For convenience, as an alternative to using a proforma, the content of the element can be constrained using the
content
facet. The value is a type. For example,content=<boolean>
constrains the content to be the boolean value true or false, whilecontent=<null>
constrains it to be null (which can be achieved either by omitting the content, or using the FtanML keywordnull
).
We can now see how to define an element type that participates in the content model
of another
element type. Suppose we have an element named items
whose children are
elements named item
with string content. We can define the type of items
like this:
<element form= <items grammar=<element form=<item <string>> occurs=[0,]> > >
(I find it useful when writing such constructs to ensure that every angle bracket is aligned either vertically or horizontally with its partner, and to limit the nesting of angle brackets on a single line to about 3.)
Content models like this would quickly become unwieldy if the whole structure had to be defined inline. In addition, it would not be possible to reuse types in different parts of the model. It is therefore possible for the definition of one type to refer to other types by name. The above example could be expressed using named types in a schema, thus:
<types itemsType = <element form=<items <grammar=<itemType occurs=[0,]>>>> itemType = <element form=<item <string>>> >
Restricting Rich Text
Most of the time, the only restriction that needs to be placed on rich
text is to define what elements may appear within it. This is done with
an elements
facet, whose value is a type.
All elements appearing in the text must conform to this type.
We don't expect it to be used very often, but FtanML also allows rich text to
be constrained with a grammar. The rules for defining a grammar are exactly the same
as for lists, and they define the grammar when the text is considered as a list
containing strings and elements. For example, a grammar might define that the first
thing to appear is a headword
element, and after that there are no
constraints.
Uniqueness and Referential Constraints
As with XML Schema, definition of constraints takes advantage of the processing language, so this section contains some forward references to facilities not yet introduced.
Uniqueness is enforced by a function-valued facet. For example:
<list unique={$@id}>
expresses a contraint on a list of elements stating that among the elements in
this list, all attributes named id
must have distinct values. Null
values are excluded. This facet can be applied to lists and elements; in each case
the supplied function is used as a mapping function, and is applied to each item in
the list or each attribute of the element, as if by the !
operator; the
value is invalid if the resulting list contains any duplicates. So a simple
constraint that all the numbers in a list of numbers be unique can be expressed as
unique={$}
; a constraint that the names of the attributes in an
element should each start with a different letter can be written
unique={substring($, 0, 1)}
, and a constraint that all the non-null
attributes of an element should have distinct values can be expressed as
unique={$2}
(when a mapping function is applied to an element, the
first argument $
is the attribute name, and the second argument
$2
is the attribute value).
Referential constraints are enforced by a similar facet whose value is a pair of functions, one of which selects the references (foreign keys) and one the target identifiers (primary keys):
<ref=<from={$@ref} to={$@id}>>
The rule is that the set of values selected by the from
function
(again excluding any nulls) must be a subset of the set of values selected by the
to
function.
Queries and Transformations: the FtanSkrit Processing Language
FtanSkrit is a functional, weakly-typed, Turing-complete programming language for manipulating instances of the FtanML data model. It is an expression language with full orthogonality: any expression can be used as an operand of any other expression, subject only to rules on operator precedence and type constraints.
A program in FtanSkrit is written as a function (a function which typically takes a source document as input and produces a result document as its result). The body of a function is an expression, and this exposition of the language will focus on the different kinds of expression that can be written.
The data model that can result from parsing a FtanML document, as we saw earlier, can contain seven types of value: null, boolean, number, string, list, element, and text. We also mentioned an eighth type of value, namely a function. Functions can appear in the data model anywhere that the other seven types of value can appear, for example as the value of an attribute in an element, or as the value of an item in a list.
Because expressions can be nested arbitrarily, it's not easy to define the different classes of expression without forward references to concepts that haven't been explained yet, and it's also rather difficult to know where to begin. But because functions are so important and central, that's where I'll start.
Functions
There are two important kinds of expression associated with functions: function declarations and function calls.
Function Declarations
A function is written as an expression enclosed in curly braces. Here's a
function that computes the sum of its two arguments: {$1 +
$2}
References to parameters are written $1
, $2
etc, where
$N
refers to the Nth supplied argument in the function call.
The expression $
can be used in place of $1
to refer
to the first argument, and is particularly useful for functions that expect a
single argument. It can be used in rather the same way as .
(the
context item) in XPath, and plays a similar role to _
in languages
such as Perl or Scala.
For example, a function that returns true if the supplied element has a name
might be written {name($) != null}
.
Functions have no name, but can be bound to named variables, in which case the variable name serves effectively as a function name. Functions in the system library are bound to predefined variables.
A function does not have a fixed arity. The example function {$1 +
$2}
expects two arguments, but it can be called with more than two
arguments (excess arguments are ignored), or with fewer than two (unsupplied
arguments default to null).
The expression $$
returns all the supplied arguments in the form
of a list. This makes it possible to write functions that take a variable number
of arguments: the actual number is accessible as count($$)
.
Functions can refer to variables defined outside the function body, which become part of the closure of the function, to be used when it is evaluated.
Within the body of a function, the variable self
is bound to the
function itself. This makes it easy to write anonymous recursive functions: for
example a function to compute the sum of its arguments can be written as
{if empty($$) then 0 else $ + self(tail($$))}
. We'll see later
how to write mutually-recursive functions.
Because a function is an expression, it can be used anywhere an expression can appear; for example as the value of an attribute in an element. This allows an element to be used as a map from strings to functions, which is very like Javascript's notion of an object. This enables a kind of polymorphism.
Sometimes it is useful to design a function so that parameters are supplied by
name rather than positionally. The can be achieved by writing the function to
accept an element as its argument. The caller might supply the arguments like
this: f(<x=2 y=3>)
; and in the function body the supplied values
can then be referenced as $@x
or $@y
.
Functions do not declare their arguments explicitly. As a matter of convention,
when writing a public function it is good practice to bind the supplied parameters
to variables along with a type check. For example the following implementation
of the indexOf
function starts by giving names to its arguments and checking their type,
which simultaneously makes the function more robust and more readable.
[7]
let indexOf = { let Array = $1.as(<occurs=[0,] <number>>); let Search = $2.as(<number>); 0..count(Array)-1?{Array[$]=Search} }
Because argument types are not declared, it's up to the implementor of a function what to do when the caller supplies arguments of the wrong type. There are no implicit conversions defined as part of the call mechanism. The preferred approach is to throw an error, which can be readily achieved using the coding style in the above example.
Function Calls
If F
is an expression that evaluates to a function, then the
function may be called with arguments x
and y
using
the expression F(x, y)
.
If f
is a variable whose value is a function, and if the function
has at least one argument, then a function call can be written either as
f(x,y)
or as x.f(y)
.
As in XPath 3.0, partial function application (currying) is possible by
supplying ?
for one of the arguments: contains(?, ':')
returns a function whose effect is to test whether its first argument is a
string that contains a colon.
Some built-in functions can also be invoked using an infix operator. For
example the +
operator corresponds to the plus
function; a + b
has the same meaning as plus(a, b)
or
a.plus(b)
. All the operators in the language, including higher-order operators, are defined
in terms of functions, to allow them to be passed as arguments to higher-order
functions.
The names of built-in functions always use the ASCII alphabet; for some operators we have allowed ourselves the luxury of reaching beyond ASCII, but users can always avoid relying on such operators and can use the function name instead.
List and Element Constructors
The syntax of lists and elements is extended so that expressions may appear anywhere the FtanML syntax allows a value.
For example, the expression {[$, $+1, $+2]}(5)
returns the list [5,
6, 7].
Lists in FtanML are not automatically flattened, so the expression [1 to 5,
10]
produces the length-2 list [[1,2,3,4,5],10]
rather than
the length-6 list [1,2,3,4,5,10]
. The latter result can be achieved
either by applying the flatten()
function explicitly, or by using list
concatenation/append operators: for example (1..5).append(10)
.
In an element constructor, expressions can be used to compute the values of
attributes, but cannot be used to compute their names. The value can be expressed
either as a parenthesized expression, or using a string or text value containing
expressions embedded in curly braces: <img size=(x+1) caption="Figure
{n}">
. The same applies to the content value. Note that curly braces are
used only for inline expansion of strings and text (and for writing functions); to
compute general structured content, parenthesized expressions should be used. The
expression:
<job-titles (distinct-values(employee@job-title))>
might generate
<job-titles ["Manager", "Programmer", "Bottle-Washer"]>
A null value for an attribute indicates the effective absence of the attribute, so
the expression <size x=(a+1) y=(if a=2 then 3 else null)>
might
produce the output <size x=3>
.
More specifically, in an element constructor, the value of an attribute, or of the content value, can take any of the following forms:
-
A literal value, for example
a=3
ora="blue"
ora=false
. -
A string or text value with embedded expressions enclosed between curly braces, for example
a="Chapter {n}"
The value of the attribute is obtained by evaluating the embedded expressions as strings and inserting the resulting strings into the text. -
A list constructor, for example
a=[n, n+1, n+2]
. -
An element constructor, for example
a=<x=(n+1) y=(n+2)>
-
A parenthesized expression, for example
a=(n+1)
-
A function, for example
a={$+1}
. In this case the value of the attribute is the function, not the result of evaluating the function.
Where element constructors cannot be used because the element or attribute names are not known statically, functions can be used to construct an element. For example:
element("img").add("x", 3).add("y", 5).add("", "An image") }
Here the function call element("img")
constructs an element with a
given name, and the add()
function adds an attribute with a given name
and value (copying the element to create a new element). The last call adds the
element content, represented as an attribute with an empty name. It should be
remembered that although we use the term "element", FtanML elements will not only
be
used in the way that XML elements are traditionally used, but also in the way that
maps are used in other programming languages, where the keys (attribute names) are
highly dynamic: indeed, to satisfy the kind of use cases for which maps are being
added to XSLT 3.0.
Rich text (mixed content) is constructed as a list of strings and elements, which
is then converted to rich text by applying the toText()
function.
Conditional Expressions
The syntax for a conditional expression is:
"if" expression "then" expression "else" expression
There is no need for parentheses (though you can use them if you like, for old time's sake). The "else" branch is mandatory, partly to avoid choosing an arbitrary default (null?) and partly to prevent the dangling-else ambiguity when conditional expressions are nested. For example:
if $ = 0 then x else y
A simple try/catch construct is provided:
"try" expression "catch" function
which returns the result of the expression unless an error occurs during its evaluation, in which case the catch function is called, supplying error information as its argument, in the form of an element with attributes representing the error code and error description.
For example, the following catches a divide-by-zero error (we assume use of the XPath error codes), and returns null if it occurs; otherwise the error is re-thrown:
try (x.div(n)) catch {if $@code="FOAR0001" then null else error($)}
A function orElse
allows a default to be substituted when a value
is null. For example a.orElse(0)
returns the value of a
unless it is null, in which case it returns zero. This function could be defined as
{if $1=null then $2 else $1}
.
Variables
Variables have simple names (no "$" prefix, no backticks). The names
true
, false
, and null
are reserved: they
are used syntactically like variables, but have fixed predefined values. Language
keywords such as if
and let
are also reserved: unlike
XPath, this is possible because bare names are not used to refer to elements and
attributes in input documents.
Variables may be declared using the construct:
LetExpression ::= "let" name "=" expression; expression
which evaluates the second expression with the named variable bound to the
value of the first expression; for example let x=2; x+x
returns
4.
Variables declared in this way are available only after they have been declared. An alternative style of declaration allows forwards references to variables, which is necessary when writing recursive functions. This style uses element notation:
let < even = {if $=0 then true else odd(abs($)-1)} odd = {if $=0 then false else even(abs($)-1)} >; even(32)
With this approach, all the variables declared as attributes of the same element are in scope within each others' declarations, failing dynamically (or in the worst case, failing to finish) if this results in non-terminating recursion.
Equality and Other Comparisons
The eq
function (operator =
) is defined over all
values. To be equal, two values must have the same fundamental type (this means, for
example, that the string "John"
is not equal to the rich text
|John|
). Strings are compared codepoint-by-codepoint. Lists are
equal if they have the same size and their items are pairwise equal. Elements are
equal if they have the same name, and if there is a one-to-one correspondence of
attributes in which both the attribute names and the corresponding values are equal
(the content value is treated here as an attribute). Two texts are equal if the two
sequences of strings and elements making up the texts are pairwise equal.
Defining equality for functions requires further work. Some languages such as
ML and Haskell make equality of functions undefined, but this would mean that
equality of lists and elements containing functions also becomes undefined.
Currently my preference is to make equality of functions implementation-defined,
subject to the proviso that two functions can only be equal if all invocations are
guaranteed to return equal results. It would be useful to attempt a more careful
definition, for example one that guarantees that the result of the expression
let a=b; a=b
is always true, but formalizing this is not easy
without introducing some notion of identity.
The ne
function (operator !=
) is the inverse of
eq
.
Ordering (specifically, the functions le
, lt
,
ge
, gt
, and their corresponding operators
<=
, <
, >=
,
>=
) is defined over numbers and strings only. Strings are
sorted in Unicode codepoint sequence.
Testing whether a value V
is present in a list A
(the equivalent of the =
operator in XPath) is sufficiently common that
we provide a function, in(V, A)
with corresponding operator ∊ (x220A).
The function in(V, A)
can be defined as {let V=$1; let A=$2;
exists(A?{$=V})}
. (This uses a filter operator which we will introduce in
due course.)
A collation is modelled as a set of functions. Specifically, a collation for a
particular language, say Swedish, is obtained using the function call
collation(<lang="se">)
. This returns an element, whose
attributes are functions. One of these functions is a sort function, so to sort a
list of strings using Swedish collation, one can write
collation(<lang="se">)@sort(input)
. Other functions available as
attributes of a collation include comparison functions eq
,
le
, etc, and collation-sensitive versions of other functions that
involve string comparison such as in
, min
,
max
, indexOf
, contains
,
startsWith
, endsWith
, substringAfter
,
substringBefore
.
Comparing A = null
returns true if A
is null,
false otherwise. (This is not as obvious as it might seem, given the semantics in
some
other languages.)
Operations involving Types
The function isA
tests whether a value belongs to a given type.
Types are represented using the element representation introduced in an earlier
section. For example, x.isA(<number ge=0>)
returns true if the
value of x
is a number and satisfies the facet ge=0
.
Recall that a type is a predicate, not a label associated with a value, so this
tests whether the value meets all the constraints defined by the type, not
whether the value carries any particular type label.
For convenience, the functions isNull
, isBoolean
,
isNumber
, isString
, isList
,
isElement
, isText
, and isFunction
are
available to test for membership of the primitive types.
The function as
is similar to isA
, but instead of
returning a boolean indicating whether or not the value is a member of the type,
it returns the value unchanged if this is the case, and throws an error
otherwise. We saw this function used to check the arguments to a function call.
For convenience, the functions asNull
, asBoolean
,
asNumber
, asString
, asList
,
asElement
, asText
, and asFunction
are
available to test for membership of the primitive types. In each case, they return
the argument unchanged if it matches the corresponding type, or throw an error
otherwise.
Functions for conversion of values have names such as toString
,
toList
, and toText
. There are no general rules here;
as in other languages, the rules for what can be converted to what are inherently
ad-hoc.
The parse
function takes a FtanML lexical representation of a value
and returns the corresponding value; conversely, serialize
takes a
value and returns its FtanML lexical representation.
Boolean Functions and Operators
The functions and
, or
, and not
are
available. The first two have equivalent operators &&
and ||
.
The argument must either be a boolean or null; there is no implicit conversion
to boolean as in XPath. If an argument is null, the operators implement three-valued
logic as in SQL, for example (null||true)
is true
.
Order of evaluation is not defined; programmers should not assume that the second argument will only be evaluated if it is required. (This rule might seem unnecessary in the absence of side-effects, but it becomes important when defining the terminating conditions of a recursive function. Like XPath, we choose to allow optimizers the freedom to re-order the terms in a predicate, which can be important when indexes are available on large data sets.)
Numeric Functions and Operators
The functions plus
, minus
, times
,
div
, idiv
, and mod
are defined; the first
four have corresponding operators +
, -
, *
,
and /
. Arithmetic is performed in decimal. Division by zero is an
error; the precision of the result of division is implementation-defined, as are
limits on the value space.
Additional functions abs
, round
, floor
,
ceiling
have the same effect as in XPath.
The function parse
may be used to convert a string to a number.
Writing parse(X).asA(<number>)
checks that the value was
indeed numeric.
Supplying null to an arithmetic function or operator results in the value null. Supplying any other non-numeric value causes an error. There is no implicit conversion of strings to numbers.
Aggregation functions sum
, avg
, min
,
max
work as in XPath.
The function to
or its equivalent operator ..
returns a list of consecutive integers, for example 1..5
returns the
list [1,2,3,4,5]
.
String Functions and Operators
The toString
function can be applied to any value without
exception to convert it to a string. If the input is already a string, it is
returned unchanged. If the input is a boolean, number, or null, the result is the
same as the FtanML literal representation of the value. If the input is a list, the
result is string-join($!toString, " ")
, that is, the space-separated
concatenation of the string values of the members of the array. If the input is an
element, the result is string(content($))
, that is, the string value of
the element's content. If the input is rich text, the result is
string-join($.toList()!toString)
, that is, the concatenation
(without space separation) of the string-values of the strings and elements making
up the text. If the value is a function, the resulting string begins with "{" and
ends with "}", and is otherwise implementation-defined; it is not necessarily a
string that can be parsed and executed as a function.
String concatenation can be achieved conveniently using string templates, for
example "See section {s} in chapter {n}"
. This mechanism can be used
wherever a string literal is permitted. The result of the enclosed expressions is
converted to a string if necessary, by using the toString
function.
Generally FtanSkrit avoids implicit conversion. For example, if rich text is to be compared to a string, it must be converted to a string explicitly. When null is supplied to a function that expects a string, it is generally treated as a zero-length string (but this is a convention adopted by the functions in the built-in function library; it is not a feature of the language).
Other functions are available as in XPath. Counting of characters in a string, however, starts at zero. The basic built-in functions use codepoint collation; equivalents using a different collation can be obtained as attributes of the collation.
Functions and Operators on Lists
The number of items in a list A
is obtained using
count(A)
(or equivalently, A.count()
).
The construct A[n]
selects the n'th item in list A
.
This construct is never used for filtering, only for subscripting. If n
is out of range, the expression returns null. This operation is also available as
a
function itemAt(A, n)
.
The function cat(a, a)
(operator ++
) concatenates
two lists. The function append(a, i)
(operator +~
) appends
an item to a list, while prepend(i, a)
(operator ~+
)
prepends. Thus for example 0 ~+ [1,2,3] ++ [4,5,6] +~ 7
returns the
list [0, 1, 2, 3, 4, 5, 6, 7]
.
The function head(a)
is equivalent to a[0]
, while
tail(a)
equates to remove(a, 0)
.
The function flatten
flattens a list: it creates a new list in
which any non-list items in the argument list are copied to the new list, and any
list items are processed by copying their contents. This only works one level deep.
So flatten([[1,2],[3,[4,5]]])
returns
[1,2,3,[4,5]]
.
Functions index-of
, remove
, subsequence
work as in XPath, except that indexing starts at zero. The
insert-before
function inserts a single item at a specified
position; if the supplied item is a list, it becomes a nested list (there is no
flattening).
The function toList
works as follows: if the argument is a list,
it is returned unchanged. If the argument is rich text, it is converted to a list
whose members are (non-zero-length) strings and elements. In other cases, the
function creates and returns a singleton list in which the argument is the only
item. This function is useful because it makes it easier to process different types
of content in the same way: a single element looks the same as a list of elements
of
length one, which looks the same as mixed content comprising a single element; a
single string looks the same as mixed content containing no elements.
The function forEach
, or the equivalent operator !
,
applies a function to every item in a list. So forEach([1,2,3], {$+1})
returns [2,3,4]
; this can also be written [1,2,3] ! {$+1}
.
Similarly, [1,2,3]!toString
returns ["1", "2", "3"]
. Note
that this is a non-flattening mapping operation; the result list will contain
exactly the same number of items as the input.
Another example: (1..5)!{<br>}
returns
[<br>, <br>, <br>, <br>, <br>]
The function select
, or the equivalent operator ?
,
applies a function to every item in a list and returns a list containing those items
for which the function returns true. So select([1,2,3], {$>=2})
returns
[2,3]
; this can also be written [1,2,3]?{$>=2}
.
Similarly, [1,2,"London"]?isNumber
returns [1,2]
.
The result of the select
operator or function is always a list,
even if only one item is selected. If it is known that the predicate will select
exactly one item, it is necessary to extract that item from the result, typically
by
a call on head
, or by using the subscript operation
(A?P)[0]
. Because this is a common operation, the operator
??
is provided, equivalent to ?
followed by
head()
: it selects the first item found, or null if nothing was
matched. A query to find a singleton can now be written, for example
items??{$@id='xyz'}
.
The functions all
and some
can be used in
conjunction with the forEach
(!
) operator to perform
universal and existential quantification: they test whether a list consists entirely
of boolean true values (all), or contains at least one true value (some). So, for
example all([1,2,3]!{$>0})
returns true, while
some([1,2,3]!{$=0})
returns false.
Functions fold-left
and fold-right
are available as
in XPath 3.0.
Functions and Operators on Elements
The function name(E)
returns the name of an element
E
, or null if it is unnamed. The syntax E.name()
can also be used, of course.
The mapping and filtering operators (!
and ?
) apply to
elements as well as to lists. In this case they expect a two argument function to
be
supplied, and call this with the attribute name as the first argument and the
attribute value as the second. The mapping operator returns a list with as many
members as there are non-null attributes in the input; the filter operator returns
an element with the same name as original, and with a subset of its attributes.
These operators treat the content value as just another attribute. They provide the
most general and powerful means of processing elements, and other operations can be
defined in terms of these two. For example, the function content(E), which returns
the content of an element, could be defined as E?{$1=""}!{$2}.
For example E?{$.in("id", "code", "status"}
returns a copy of element
E, retaining only the three specified attributes.
If E
is an element and xyz
is the name of an attribute
(known at the time the program is written), then E@xyz
returns the
value of the attribute. It returns the value only, not an "attribute node" as in
XPath; if the attribute is not present, it returns null. If the name needs to be
dynamically computed, this can be achieved using an expression in parentheses, for
example E@(X@name)
returns the attribute of E
whose name
is given by the expression (X@name)
. The construct E^
is
an abbreviation for E@``
— it returns the value of the attribute whose
name is the zero-length string, that is, the content value.
Filtering a list of elements to select those with a given name is likely to be a
common operation. The syntax L?{name($)=N}
achieves this but is a
little cumbersome, and becomes more so if the list can also include values other
than elements. So we provide the construct :N
, where N
is
a name, which represents a function that returns true when its first argument is an
element with the name N
, and false otherwise. So given a list of
elements L
, we can now select those having the name N by writing
L?:N
. If we know there will be only one such element, we can select
it using L??:N
.
So if PO
is the purchase order presented in section 2.1, then
PO@shipTo^??:name
gives the value "Alice Smith"
, while
PO@items[0]@partNum
gives "872-AA"
.
The following example selects from a list of elements those having a particular
attribute value: PO@items?{$@USprice > 20.00}
.
The @
operator performs implicit mapping. Specifically: if the
left-hand operand is a list L
, then any lists contained in this list
are expanded recursively. Any values in the expanded list that are not elements are
ignored, so we end up with a list of elements; call this LL
. The value
of the expression L@name
is then defined as LL!{$@name}
.
Note that the result may be a list of lists; it is not flattened.
Returning again to the purchase order example, this means that
PO@items@partNum
returns the list of strings ["872-AA",
"926-AA"]
.
The postfix operator //
represents the deepContent()
function, which is the flattened transitive closure of the content()
function. Specifically, if the result of content()
is a list, then any
element in that list has its own descendants inserted immediately after it. So the
function descendants(E)
can be defined as content(E)!{$ ~+ (if
$.isA(<element>) then $.descendants() else [])}
. So if E
is an element, then E//?:status
will select all descendant elements
named status
, and E//?[$@id=12]
will select all descendant
elements having the id
attribute equal to 12.
The postfix operator @@
similarly gives the transitive closure of
attributes()
.
As already mentioned, the function forEach
and the equivalent
operator !
are overloaded for elements to process all the attributes of
an element (including the content). The second operand is a function which is called
with two arguments, the attribute name and the attribute value. For example, given
an element E the expression E!{$}
returns the names of its
attributes.
Similarly the function select
and the equivalent operator
?
are overloaded to process all the attributes, and return an
element in which only a subset of the original attributes are present.
The function element(name)
constructs an element with a given
name (which may be null). It can also be called with two arguments:
element(name, attributes)
. The second argument is a list of
attributes, each attribute being represented by a two-member list containing the
name and the value.
The function add(element, name, value)
takes an existing element,
and adds an attribute with the given name and value, replacing any existing
attribute with the same value. A new element is returned. Calls to this function can
conveniently be chained: element("rect").add("id", "a001").add("color",
"black")
.
For convenience, the function addContent
is defined as
add(?, "", ?)
.
So, for example, we can convert attributes to a list of child elements like this:
let elementsToAttributes = { let E = $.asElement(); element(E.name()).setContent(E!{element($1).setContent($2)}) }
The semantics of these constructions in FtanSkrit are different from the corresponding operations in XPath, but hopefully they will have a familiar feel.
Future Features
FtanML as presented in this paper packs a large amount of functionality into a small language. It doesn't offer everything that anyone might ask for, and nor should it: keeping it small is important. Nevertheless, there are things one might want to add, and which have not been ruled out.
-
An extensible mechanism for data types is needed: for example, representing dates as values. Schema validation can confirm that a date is a valid string, but for processing one would like to manipulate it as a date, not just as a string. Similarly, support for binary data is important to some applications; and it would be nice if URIs were recognizable as such. A general mechanism for extending the set of types (perhaps along the lines suggested in [Jeni Tennison]), would be undoubtedly useful.
-
What about pointers and links? XML has a sorry tale to tell in this area, but that doesn't mean it can't be done better. Arguably links and anchors should be first-class constructs marked as such in the syntax, rather than a semantic overlay affecting the interpretation of strings. Both intra-document and inter-document links are needed, and they should ideally be handled using a single mechanism. Support for the kind of referential integrity found in relational databases is as important as support for the hyperlinking traditions of the markup community, and there is no good reason why the two mechanisms should be distinct.
-
In the scripting language, there is an obvious need for rule-based processing in the style popularised by XSLT. In this paper, I have concentrated on presenting a small functional core for the scripting language, but I would like to see rule-based processing superimposed, and I see no reason why this should not be achievable.
Conclusions and Summary
We have introduced three languages as replacements for the central pillars of the markup edifice: FtanML as the markup language, FtanGram as its schema language, and FtanSkrit as its query and transformation language. Let's try now to assess what we've achieved.
Firstly, FtanML compared to XML. FtanML is considerably smaller as a specification, but it's also more powerful. It gets rid of the same unwanted things that MicroXML gets rid of (namespaces, comments, processing instructions, entities, DTDs), but by allowing attributes and element content to be any value, the data model is much richer, more orthogonal, and more expressive. It also solves the whitespace issue (which whitespace is significant?). By dropping end tags, the language is a lot less verbose, which is particularly noticeable when it is used for highly structured information, as in the FtanGram syntax. There's a lot of general tidying-up in little areas like escaping of special characters.
Does verbosity matter? We think it does. The fact that XML is bulky and hard to read is a significant factor leading to the adoption of alternative syntaxes for languages such as RDF and RelaxNG, and is a big turn-off for people coming newly to XSLT. Even if specialist editors can reduce the burden of entering the markup, the amount of noise on the page affects the ability of a human reader to absorb information quickly. This is not to say that the most concise syntax is optimal, of course: we might have swung too far. XML had human readability as one of its goals, and we should remember that readability is not a binary attribute; there are degrees of readability, and readability also depends greatly on the familiarity of the reader with the notation.
Compared to JSON, FtanML's main contribution is that it adds support for mixed content. And element names, which are very handy when modelling document structures.
Compared to the XPath data model (XDM), the FtanML model has more capability and greater orthogonality. The core structuring constructs (elements and lists) are powerful enough for all computational requirements. XSLT and XQuery have found a need to extend the core XML-based model with other constructs such as maps and lists; the FtanML model does not have this awkward duality between constructs that can be directly serialized in the markup, and constructs used only for internal processing.
FtanGram learns from RelaxNG the importance of designing a schema language to do validation and nothing else. Unlike RelaxNG, it's able to take advantage of the simplification and orthogonality of the data model. The unification of facilities for "simple types" and "complex types" is particularly appealing, allowing a smaller number of constructs to be combined in more powerful ways to create richer functionality. The idea that element names as well as attributes and content are something that can be constrained by a type is also a useful simplification. FtanGram also attempts to show that by making the markup language itself more powerful and less verbose, the need for a "compact syntax" (that is, a syntax using constructs other than those available in the target language) is eliminated.
FtanSkrit is broadly equivalent in capability to XQuery, but with a stronger reliance on higher-order functions and operators in preference to custom syntax. It currently lacks any mechanism comparable to XSLT's template rules, but we have ideas for how that could be added.
There will always be debates about strong versus weak typing, static versus dynamic. I believe that FtanML's dynamic typing approach fits better with the philosophy that with markup, you can have as much or as little schema machinery as you want. The XPath ability to mix typed and untyped data is one solution to the problem of spanning the worlds of structured and unstructured data, but it is something of a camel.
Are there any downsides? Some may find the languages excessively terse; highly compact syntax is not easy on the eye. The absence of named end tags in FtanML can lead to long strings of closing angle-brackets which are hard to match up without the support of a syntax-driven editor. Generally, though, we feel that FtanML with its sister languages FtanGram and FtanSkrit together form a markup system that has more than the power of the equivalent XML stack, with much greater integrity of design, simplicity, orthogonality, efficiency, and usability.
Implementation
A Scala implementation is available as open source software. It can be downloaded from here:
https://github.com/organizations/FtanML-WG
The implementation is not 100% complete, and is intended as a proof of concept rather than as production quality software. It includes a complete parser for FtanML which constructs a tree represenation of the object model; an implementation of all the FtanGram types and facets (including the grammar facet), but not the schema language itself; and an implementation of most of the FtanSkrit processing language, though with some relatively unimportant functions and operators omitted.
Acknowledgements
FtanML was invented by a group of students from German universities taught by the author, with Stephanie Haupt as co-tutor, during a summer school in Ftan, Switzerland in August 2012, organised by the Max Weber Programm, Bayern. The students deserve much of the credit, if only for challenging things that I had assumed to be self-evident: they were Max Altgelt, Julien Bergner, Lukas Graf, Dominik Helm, Axel Kroschk, Uwe von Lüpke, My-Tien Nguyen, Sebastian Meßmer, Suhanyaa Nitkunanantharajah, Jan Landelin Pawellek, and Martin Schmitt. They were a most impressive team and a pleasure to work with: absorbing knowledge quickly, researching information thoroughly, generating ideas constantly, reaching consensus amicably, writing parsers correctly, making decisions wisely, and communicating bilingually. I am particularly indebted to Sebastian Meßmer for helping me climb the Scala learning curve.
References
[1] MicroXML. Ed. James Clark and John Cowan, 2012. W3C. https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html.
[2] XML Schema Part 0 Primer, Second Edition. Ed. David C. Fallside and Priscilla Walmsley. 28 Oct 2004. W3C. http://www.w3.org/TR/xmlschema-0/.
[3] Introducing JSON. http://www.json.org
[4] YAML: YAML ain't markup language. http://yaml.org
[5] LMNL: Layered Markup and Annotation Language. http://www.lmnl-markup.com
[6] ODDAG: A Data Structure for Overlapping Hierarchies. C. M. Sperberg-McQueen and C. Huitfeld. 2004. Springer.
[7] Information technology -- Generic applications of ASN.1: Fast Infoset. ISO/IEC 24824-1:2007
[8] Efficient XML Interchange (EXI) Format 1.0. 10 Mar 2011. W3C. http://www.w3.org/TR/2011/REC-exi-20110310/
[1] Ftan is a place name, not an acronym, and while words beginning "Ft" are uncommon in English, the pronunciation comes easily with practice.
[2] It's called a cell because escaping is not allowed.
[3] This decision means that JSON is not a pure subset of FtanML, because JSON distinguishes an absent entry in an object from an entry whose value is null. However, the decision makes programming simpler, and makes sense semantically.
[4] This is a change from the original FtanML design. Originally rich text was introduced by a vertical bar, and ended with the ">" delimiter marking the end of the element. This design prevented rich text appearing as the value of an attribute, or being used as a value in the scripting language. The revised design restores orthogonality by allowing rich text to appear where any value can appear.
[5] Modelling rich text as a list of strings and elements is convenient in some situations, especially because it's the only representation available using the data types of many programming languages. The main drawback is that it's not convenient when we want to treat the data as a simple string, and ignore the markup. So we make it a distinct data type, that can easily be converted either to a list or to a string for processing when required.
[6] I have previously [x] discussed the possibility of writing an XSLT optimizer in XSLT; I concluded that the only thing preventing this was the inefficiency of the XSLT processing model in cases where it is necessary to make many passes over a tree, with each pass effecting a small change. Allowing subtrees to be shared between the source and result of the transformation could eliminate this problem.
[7] It would be easy enough to add syntax for a more verbose function declaration with an explicit signature. But at this stage, it's important (a) to keep the language small, and (b) to provide a very concise syntax for functions, allowing them to be used as freely and easily as predicates and steps are used in XPath.