FtanML: The Markup Language
This section presents the syntax of FtanML. We'll present the "data-only" core of
the
language at this stage, but with some forwards references to how the language is
subsequently extended to enable active scripting of documents.
A document (the unit of input to the parser) is a sequence of Unicode characters
conforming to the grammar defined in this section. The encoding of characters as octets
(or as scratches on clay tablets) is out of scope — it belongs in a different layer
of
the protocol stack. But if in doubt, UTF-8 is recommended.
The document must match the value
production.
value ::= null | boolean | number | string | list | element | richText
As this production shows, there are seven kinds of value, which we will present in
turn, starting with the simplest. The term "rich text" means text with interspersed
markup: what the markup community traditionally calls "mixed content".
Later we will introduce an eighth kind of value, namely functions. But first, let's
start with an example.
FtanML Example: the Purchase Order
This is what the purchase order from the XML Schema Primer2 might look like in FtanML.
<purchaseOrder
orderDate="1999-10-20"
shipTo = <country="US" [
<name "Alice Smith">
<street "123 Maple Street">
<city "Mill Valley">
<state "CA">
<zip 90952>
]>
billTo = <country="US" [
<name "Robert Smith">
<street "8 Oak Avenue">
<city "Old Town">
<state "PA">
<zip 95819>
]>
comment = |<emph |Hurry|>, my lawn is going wild|
items = [
< partNum="872-AA"
productName="Lawnmower"
quantity=1
USPrice=148.95
comment=|Confirm this is <strong |electric|>|
>
< partNum="926-AA"
productName="Baby Monitor"
quantity=1
USPrice=39.98
shipDate="1999-05-21"
>
]
>
This example follows the example given in the XML Schema Primer very closely; I've
only made one change, which is to use rich text in the comment fields. Let's compare
it with the XML version:
-
End tags reduce to a simple ">".
-
The content of an element, and the content of an attribute, can be either
a string (in single or double quotes), a number, a boolean (not used in this
example), rich text (delimited with vertical bars), an element, or a list of
elements (inter alia). When elements have element content, the child
elements are enclosed in a list marked by square brackets.
-
Since an attribute
can contain anything an element can contain, it's possible to use structured attributes,
and I have
taken advantage of this.
I have chosen to use attributes rather than child elements in cases where ordering
does not
matter, and where there is only one child of the parent element with a given name:
specifically
for the top-level properties of a purchase order, and for the properties of each item.
Where
there is some significance in the ordering, as with the components of an address,
I chose
to use child elements.
-
In the list of items, the original XML has an element named
items
, whose children are all elements named
item
. Since the name of the child element is always the
same, it is redundant, so I chose to leave it out: the content of the
items
attribute is now a list of anonymous elements.
-
There's a difference between a singleton and a list of length one. Lists
are always explicitly marked with square brackets. That might be a little
inconvenient for authors, but it makes life a lot easier for the programmer
at the receiving end. (You could choose to allow the items
attribute to contain a single item rather than a list if only one item has
been ordered, but the program reading the data would then have to cater for
both possibilities.)
-
For the purpose of the example I have followed the XML Schema Primer in
defining the ZIP code as a number, though in reality it should be a string
of digits, which is not the same thing.
-
There's no ambiguity about where whitespace is and is not significant. It's only significant
if it
appears in a string, or in rich text.
If we compare this with how it might have been done in JSON, there are two main
differences. Firstly, JSON provides no satisfactory way to handle the mixed content
comments. Secondly, with JSON we would have to make a choice how to represent the
addresses: either use an object (i.e. a map), in which case ordering information is
lost, or use an array in which case the components have no names. A minor difference
with JSON, or at least with official JSON, is that we don't need quotes around the
element and attribute names.
Now let's look at the individual constructs of FtanML.
The null value
null ::= "null"
There is a single value in this class, denoted by the keyword "null". It is
borrowed directly from JSON, but plays a wider part in the data model.
Boolean values
boolean ::= "true" | "false"
There are two boolean values, denoted by the keywords "true" and "false".
Numeric values
number ::= "-"? digits ("." digits)? ([eE] [+-]? digits)?
digits ::= [0-9]+
The production rule for numbers is a little different from both the
DoubleLiteral
of XPath 2.0 (it requires digits both before and
after the decimal point), and the equivalent in JSON (it allows leading zeros). The
value space is not binary floating point, but decimal. Specifically, it is the set
of values that can be represented in the form N * 10^M
where
N
and M
are integers, and N
is not a
multiple of ten. Implementations may impose limits on this infinite set.
Why decimals? Because that's what most human beings on the planet use in their
everyday lives. Floating-point binary is designed for machines, not for humans.
Also, because it survives round-trip conversion to and from text without ambiguity
or loss of precision. However, the use of decimals gives a problem with the design
goal that it should be easy to program using conventional programming languages. We
take a hit here: in the case of programming languages with no decimal data type,
numbers may be converted to whatever number system that language uses. But the
native language for processing FtanML, namely FtanSkrit, treats the values as
decimals.
Strings
string ::= ('"' (charRep - '"')* '"') | ("'" (charRep - "'")* "'")
charRep ::= (char - "\") | escape
char ::= (any Unicode character)
escape ::= (see prose)
Strings are enclosed in either double or single quotes. The value space for
strings is the set of all sequences of Unicode characters. In the FtanML
representation of a string, these characters are represented as themselves, except
in the case of characters that have a special meaning, notably the string delimiter,
and the escape character "\".
Escape sequences fall into a number of categories:
-
Whitespace escapes: \n
, \r
, \t
,
\s
, and \S
represent newline, carriage
return, tab, space, and non-breaking space respectively.
-
Formatting escapes: \
followed by a sequence of
whitespace characters represents nothing. This means that a FtanML
editor can reformat the text for display purposes by inserting or
removing escaped newlines without changing the actual content.
-
Special character escapes: \\
for backslash,
\"
for quotation mark, \'
for apostrophe,
\|
for vertical bar, \`
for backtick,
\<
for a left angle bracket, \[
for a
left square bracket, \{
for a left curly brace.
-
Unicode codepoint escapes: \xHHHHH;
represents the
Unicode codepoint whose hexadecimal value is HHHHH
. This
may be any number of digits, followed by a semicolon. (Unlike JSON,
non-BMP characters are represented by the actual codepoint, not by a
surrogate pair.)
-
Cells: \[§....§]
where §
is any character
that does not appear in the string. This is analogous to XML's CDATA
section, except that it can also be used in attributes: it allows a
literal string to appear without escaping of special characters. For
example a sequence of four backslashes might be written
\[⟡\\\\⟡]
. Cells are handy for things such as regular
expressions and Windows filenames, and for authoring papers that
describe new markup languages.
The only characters that must be escaped in strings are \
,
{
, and the character used as the string delimiter ("
or '
). We'll come on to the significance of curly braces
later.
Lists
A list is a sequence of values. The values may be of any of the seven kinds (null,
boolean, number, string, list, element, or rich text).
The unabbreviated syntax is the same as for arrays in JSON:
list ::= "[" (value ("," value)* )? "]"
For example, [1, 3, "London", null]
Two abbreviations are allowed:
-
Commas may be omitted, so [1 2 3]
is equivalent to
[1,2,3]
and
[<first><last>]
is
equivalent to [<first>,<last>]
.
-
The value null
is implicit if there is nothing between
two commas, or before the first comma, or after the last. So
[,,]
is equivalent to
[null,null,null]
The effect of these two rules is that the abbreviated syntax for lists
becomes:
lists ::= "[" ( value | ",")* "]"
Whitespace is needed between two values only where necessary to terminate a token;
specifically, when
one value ends with an alphanumeric character and the next starts with an alphanumeric
character.
Elements
Elements serve the same purpose as objects (maps) in JSON and elements in
XML.
element ::= "<" name? (name "=" value)* content? ">"
content ::= value
Elements have three parts: an optional name, a set of name/value pairs called
attributes, and an optional value referred to as the element's content.
The values of attributes can be of any type: not just strings as in XML, but
numbers, booleans, lists, elements, rich text. An attribute with the value null is
deemed equivalent to omitting the attribute.
Attribute names within an element must be distinct.
Like attributes, the content value can be of any type.
As with lists, whitespace is needed only where necessary to terminate a
token.
We'll have more to say on element and attribute names later. For the moment,
suffice it to say that the name can be any non-empty string. If the name contains
special characters it can be written within backticks (a convention borrowed from
the SQL world).
Here are some examples of elements. (We haven't explained rich text yet, so we
won't use it in any of our examples):
Table I
Example |
Explanation |
<> |
An empty element (no name, attributes, or content) |
<br> |
An empty element named br |
<age 23> |
An element whose name is age and whose content
value is the number 23
|
<colors ["red", "green", "blue"]> |
An element whose name is colors and whose
content value is a list of three strings
|
<x=0.13 y=0.57> |
An unnamed element containing two attributes, both
numeric
|
<polygon coords=[[1,1], [1,3], [3,2]]> |
An element named polygon with an attribute named
coords whose content value is a list; the list
contains three sublists, and each sublist contains two
numbers.
|
<[<i><j><k>]> |
An unnamed element whose content value is a list of three
elements. Note the omission of the optional commas.
|
<`Graduate Trainee` `date of birth`="1995-01-01"> |
An element where both the element name and attribute name contain spaces. |
Rich Text
Rich Text (known in the XML world as mixed content) consists of characters and markup,
or more specifically a sequence whose members are either characters or elements.
richText ::= "|" (charRep | element) "|"
Rich text is written between vertical bars.
For example: |H<sub|2|>O|
represents text consisting of the
character "H", an element whose name is sub
and whose content is the
rich text "2", and the character "O".
Escapes can be used in rich text just as they can in strings. Any recognized
escape sequence may be used; the only characters that must be escaped are "\", "|",
"{", and "<".
Whitespace
FtanML (unlike XML) is explicit about the difference between significant and
insignificant whitespace.
Whitespace appearing directly within a string or within rich text is significant
and is retained in the data model — except that a sequence of whitespace characters
preceded by a backslash is ignored (this is formatting whitespace, used only to make
the text more easily readable on screen or paper). Whitespace between tokens in a
list or element is insignificant and is not retained. Whitespace is never required
between tokens unless necessary for disambiguation.
Note that because elements may be embedded in rich text, these rules apply
recursively. Whitespace characters appearing between the tokens of an element that
itself appears within rich text are not significant; it is the immediate container
that matters. Support for rich text means that unlike JSON, this is not a two-level
grammar where it makes sense to think of a tokenization phase followed by a syntax
analysis phase, with whitespace being discarded during tokenization.
Names and Namespaces
As stated earlier, the name of an element or attribute may be any string. Names without
special characters are called simple names; those containing special characters must
be
written with enclosing backticks (grave accent, x60), and are called quoted names.
The rule for a simple name is that it must begin with a letter or underscore, and
continue with letters, digits, or underscore. The terms "letter" and "digit" are
defined by reference to Unicode character categories.
A quoted name may use escaped characters in the same way as a string literal. The
only characters
that must be escaped are the backslash and backtick.
name ::= simpleName | quotedName
simpleName ::= [\p{L}_][\p{L}\p{D}_]*
quotedName ::= "`" ((charRep - "`") | Escape)+ "`"
A name written in a FtanML document, with or without backticks, cannot be zero
length; in the data model, however, the content value is modelled as an attribute
with a zero-length name.
There are no namespaces in FtanML.
As a matter of convention, it is recommended that an element or attribute intended
to be used in an alien context, that is, a context where the containing element is
part of a different vocabulary defined by a different specification,
should be made unique by use of a "reverse-DNS" qualified name along the lines of
org_w3c_xsl_transform
.
By contrast, in the normal case where an element or attribute
always has a containing element whose name is defined as part of the same vocabulary,
short names such as status
or
name
are perfectly adequate and cause no ambiguity.
For interoperability with XML, there may be cases where it is desirable to
use the same names for elements and attributes as defined in an XML vocabulary. There
are two ways
this might be done:
-
The XML expanded name can be used in Clark notation, enclosed in
backticks. For example:
[<`{http://www.w3.org/1999/XSL/Transform}stylesheet` version="2.0"...>
-
Prefixes and namespace declaration attributes may be used, following XML conventions:
<`xsl:stylesheet` `xmlns:xsl`="http://www.w3.org/1999/XSL/Transform" version="2.0"...>
.
The FtanML system will not attach any meaning to such namespace declaration
attributes, but it is capable of representing them if required.
Note that any name containing a colon (or various other characters such as ".") needs
to be
backtick-quoted.
The Schema Language: FtanGram
The schema language can be used to define constraints on values, including constraints
on entire documents. This is the only purpose of a schema; validation returns a true
or
false answer, perhaps with a stream of error messages as a side effect, but it does
not
change the data being validated in any way, except perhaps as an internal
optimization.
A type is thus a predicate; it distinguishes values that match the type from those
that do not.
A schema is a set of named types. The seven named types null
,
boolean
, number
, string
, list
,
element
, and text
are always available; other types are
user-defined.
Types have a representation as FtanML elements, and we will use this representation
in
discussing types. However, the element used to represent a type must not be confused
with the type itself.
The convention for type representations is to use elements such as
<number gt=0 le=1000>
, where number
is the
name of a base type, and attributes such as gt=0
and le=1000
define constraints. These attributes are referred to as facets. If there are multiple attributes, they define multiple
constraints, which are independent and orthogonal. In this example, the gt
facet defines a minimum value (exclusive), while the le
facet defines a
maximum value (inclusive). Specifying a base type is often unnecessary — in this example
every value that can be greater than zero is necessarily a number, so every value
that
satisfies the predicate will also satisfy the base type. However, including the base
type can still be useful to aid clarity.
Although we speak of "base type" here, there is no type hierarchy. One value can
belong to any number of types, and although it may be true that one type subsumes
another, the language makes no use of the fact. Naming a base type in a type
representation merely indicates that to satisfy the type, a value must satisfy all
the
constraints imposed by the base type in addition to the facets explicitly listed.
Before we get into a detailed exposition, we'll again start with an example.
FtanGram Example: the Purchase Order Schema
In this section we present the schema for the purchase order shown earlier.
This is based on the example schema in the XSD primer, modified to correspond
with the way we restructured the instance document to take advantage of FtanML.
<org_ftanml_schema [
<import "ftan_calendar.ftg">
<types
purchaseOrderType =
<element form=<purchaseOrder
shipTo=<addressType>
billTo=<addressType>
comment=<nullable<text elements=<inlineType>>>
items=<occurs=[1,] <itemType>>
>
addressType =
<element form=< country=<eq="US">
<seq [ <element form=<name <string>>>,
<element form=<street <string>>>,
<element form=<city <string>>>,
<element form=<state <string>>>,
<element form=<zip <number>>>]>
>
itemType =
<element form=< partNum=<SKUType>
productName=<string>
quantity=<number ge=1 lt=100 step=1>
USPrice=<number ge=0 step=0.01>
comment=<nullable<text elements=<inlineType>>>
shipDate=<nullable<org_ftanml_calendar_dateType>>
>
>
inlineType =
<element elemName=<enum=["ital", "bold"]>
form=<<inlineType>>
>
SKUType = <string pattern="\[#\d{3}-[A-Z]{2}#]">
>
]>
Looking at this in a little detail, we see:
-
A schema is a set of named types. Some of these types are defined inline, some (in
this case
org_ftanml_calendar_dateType
) are imported from an external type library.
-
Elements are defined using the form
attribute. The value of this attribute is a proforma element. The name
of the proforma element matches the name of the instance element; the
attributes of the proforma element define the types of the attributes of
the instance element; and the content value of the proforma element defines
the type of the content value of the instance element.
-
An optional attribute is given a type such as
<nullable<T>>
. This reflects the fact that
an absent attribute is equivalent to an attribute that has the explicit
value of null; so as well as the normal type of the attribute, the
schema must also allow it to take the value null.
-
Note the use of a "cell" for escaping the regular expression in the pattern facet
for SKUType
. This helps to avoid clutter in a string that makes generous use
of special characters, especially backslashes.
Constructing Types
The construct <value>
represents a type that matches
every value.
Given types T
, U
, V
, the construct
<anyOf [T, U, V]>
represents the union of these types,
while <allOf [T, U, V]>
represents their intersection.
For example, <anyOf [<number>, <string>]>
allows numbers
and strings, while <allOf [<positive>, <even>]>
allows
values provided they satisfy both the (user-defined) types positive
and
even
.
For convenience, the construct <nullable <T>>
is equivalent
to <anyOf [<T>, <null>]>
: that is, either T
or
null. Thus <nullable <number>>
matches either a number, or
null.
An enumeration type can be defined using the construct
<enum=[A,B,C,...]>
. For example,
<enum=["red", "green", "blue"]>
matches the three
specified strings and nothing else. A singleton enumeration can be defined with the
eq
facet: for example <eq="">
matches the
zero-length string only.
The construct <not <T>>
denotes a type that matches all
values that are not instances of T
. This can be useful in constructing
more complex types; for example <not<eq="">>
matches all
non-empty strings, while <allOf [<number>, <not <eq=0>>]>
matches values that are numbers and that are not equal to zero.
The most general way of defining a restriction is with an assertion facet, for
example: <assert={$.startsWith("abc")}>
. To understand
assertions, however, we need to look at the scripting language, which comes later
in
the paper. (The curly braces signal that the value is a function; this represents
an
extension to the base FtanML syntax which is used only in scripts.)
Restricting numbers
Numeric ranges may be defined using the four attributes ge
,
gt
, le
, and lt
, corresponding to the XML
Schema facets minInclusive
, minExclusive
,
maxInclusive
, and maxExclusive
, together with the
facets eq
and ne
which are applicable to all values. For
example, the type consisting of numbers in the range 0 to 100 inclusive may defined
as <number ge=0 le=100>
. (As mentioned earlier, the element
name number
is redundant, because only a number can satisfy the other
constraints.)
A step
facet constrains the number to be an integer multiple of the
given increment. The most common values (both found in our example schema) are 1,
which requires the value to be an integer, and 0.01, which is often suitable for
currency amounts. Specifying step=17.2
would be unusual, but is
perfectly legal. The facet does not constrain the way the value is written, for
example an integer can be validly written as 1.00000
.
Restricting strings
Strings may be restricted using a regular expression, for example
<string pattern="[A-Z]*">
. There are no special facets
for defining a minimum, fixed, or maximum length, since regular expressions are
sufficient for this purpose.
Restricting lists
A list can be constrained with a grammar. A grammar is a facet like any other:
just another way of defining a restriction on the content, and it is defined in the
same way: <list grammar=....>
. A simple grammar might allow a list
to consist of a sequence of zero or more numbers. This would be defined like
this:
<list grammar=<number occurs=[0,]>>
To take another example, a grammar might require a value to be a list comprising a
string, a number, and a boolean. Here is the definition:
<list grammar=<seq [<string>, <number>, <boolean>]>>
Unlike most schema languages in the XML world, grammars can constrain any
sequence of values, not only a sequence of elements. In principle, if there are
subtypes of string representing nouns, verbs, and so on, then a grammar could
constrain a list to contain a sequence of words making up an English
sentence.
The "alphabet" of the grammar — the set of tokens it recognizes — is the
set of types. The fact that a value might belong to more than one of these types does
not
matter. The grammar exists not to define an unambiguous parse tree of the input, but
only to determine whether the input is valid against the type definition or not.
A grammar can be represented as a tree of particles. Each particle consists of
a term (what does it match?), and a repetition indicator (how often does it match?).
For leaf particles, the term is a type. Non-leaf particles are either sequence
particles or choice particles, and in each case the term is the set of child
particles in the tree.
The value of the grammar
facet is an element representing
the root particle in this tree.
The three kinds of particle are represented as follows:
-
A sequence particle is represented by an element named seq
; an optional
occurs
attribute; and content which is a list containing
the child particles in the tree. For example:
<seq occurs=[0,] [<white>,<black>]>
, which
matches an alternating sequence of values of types <white>
and <black>
.
-
A choice particle is represented by an element named choice
; an optional
occurs
attribute; and content which is a list containing
the child particles in the tree. For example:
<choice occurs=[0,] [<white>,<black>]>
, which
matches sequence of values, each of which can be either of
<white>
or <black>
type.
-
A leaf particle is represented by the same element used to describe the type, augmented
if
necessary with an occurs
attribute. For example
<number>
, or <number occurs=10>
.
The occurs
attribute defaults to 1; it appears alongside the
attributes defining facets of the type, though it is not really a property
of the type, but rather of the particle referring to the type.
The value of the occurs
attribute is either an integer (indicating a
fixed number of occurrences), or a list of size two (indicating a range with a
minimum and maximum). The first item must be an integer, the second can be either
another integer, or null to indicate an unbounded range. For example
[0,1]
indicates an optional particle (zero or one occurrences),
[0,]
indicates zero or more, and [1,]
indicates one or
more. The default is occurs=1
.
Some further examples of grammars are shown in the table below:
Table II
Example |
Explanation |
<seq [<string>, <number>, <number>]> |
A string followed by two numbers |
<seq [<string>, <number occurs=2>]> |
A string followed by two numbers |
<occurs=[0,] <seq [<string>, <number>]>> |
An alternating sequence of strings and numbers |
<enum=["red", "green", "blue"] occurs=[1,]> |
A sequence of one or more strings each taken from a defined
set of colour values
|
<occurs=[0,100] <choice [<string>, <number>]>> |
A list of up to 100 items, each of which may be either a
string or a number. Note that when the sub-particles of a choice
are leaf particles, an alternative approach is to define a union
type using <anyOf> |
Many of these examples serve the purpose that in XML Schema would be achieved
using simple types of variety list or union. But of course, in the document markup
tradition, grammars are commonly used to define sequences of elements, and we will
see examples of this in the next section.
Restricting elements
The simplest way to place restrictions on elements is by use of the form
facet.
Its value is an element, known as a proforma, which works as follows:
-
The name of the proforma element constrains the name of the target element.
-
The attributes of the proforma element constrain the attributes of the target element.
-
The content value of the proforma element constrains the content value of the target
element.
For example, the proforma:
<img height=<number> width=<number> <null>>
represents an element whose name must be "img", whose height
and
width
attributes must be numbers, and whose content value must be
absent (null).
This proforma can be used to define an element type like this:
<element form=<img height=<number> width=<number> <null>>>
Like all facets, a proforma can only define restrictions. If the proforma includes
no element name, then it places no restrictions on the element name. If a particular
attribute is not present in the proforma, then it places no restrictions on the
presence or content of that attribute. If the proforma has no content value, then
the content value of the target element is unconstrained.
If an attribute is to be optional, this can be indicated by permitting null as the
value: for example writing height=<nullable<number>>
indicates that the height
attribute must either be a number, or null.
Recall that omitting an attribute is the same as giving it a value of null.
Some additional facets are available for elements for use where the proforma construct
is
insufficiently expressive:
-
The elemName
facet defines the type of the element name.
For example
<element elemName=<enum=["i", "b", "u"]>>
constrains the element
name to be one of the names listed.
-
The attName
facet defines the type that all attribute names must conform to (for
example, as an enumeration, or by means of a pattern). This is the easiest
way of prohibiting attributes from appearing (the other way is to constrain
the value to be null). For example, attName=<ne="xmlns">
would disallow the attribute name xmlns
; this constraint could
also be expressed in the proforma as xmlns=<null>
.
-
For convenience, as an alternative to using a proforma, the content of the element
can be
constrained using the content
facet. The value is a type. For
example, content=<boolean>
constrains the content to be the
boolean value true or false, while content=<null>
constrains
it to be null (which can be achieved either by omitting the content, or
using the FtanML keyword null
).
We can now see how to define an element type that participates in the content model
of another
element type. Suppose we have an element named items
whose children are
elements named item
with string content. We can define the type of items
like this:
<element form=
<items
grammar=<element form=<item <string>> occurs=[0,]>
>
>
(I find it useful when writing such constructs to ensure that every angle bracket
is aligned
either vertically or horizontally with its partner, and to limit the nesting of angle
brackets on a single
line to about 3.)
Content models like this would quickly become unwieldy if the whole structure had
to be defined inline. In addition, it would not be possible to reuse types in
different parts of the model. It is therefore possible for the definition of one
type to refer to other types by name. The above example could be expressed using named
types in a schema, thus:
<types
itemsType = <element form=<items <grammar=<itemType occurs=[0,]>>>>
itemType = <element form=<item <string>>>
>
Restricting Rich Text
Most of the time, the only restriction that needs to be placed on rich
text is to define what elements may appear within it. This is done with
an elements
facet, whose value is a type.
All elements appearing in the text must conform to this type.
We don't expect it to be used very often, but FtanML also allows rich text to
be constrained with a grammar. The rules for defining a grammar are exactly the same
as for lists, and they define the grammar when the text is considered as a list
containing strings and elements. For example, a grammar might define that the first
thing to appear is a headword
element, and after that there are no
constraints.
Uniqueness and Referential Constraints
As with XML Schema, definition of constraints takes advantage of the processing
language, so this section contains some forward references to facilities not yet
introduced.
Uniqueness is enforced by a function-valued facet. For example:
<list unique={$@id}>
expresses a contraint on a list of elements stating that among the elements in
this list, all attributes named id
must have distinct values. Null
values are excluded. This facet can be applied to lists and elements; in each case
the supplied function is used as a mapping function, and is applied to each item in
the list or each attribute of the element, as if by the !
operator; the
value is invalid if the resulting list contains any duplicates. So a simple
constraint that all the numbers in a list of numbers be unique can be expressed as
unique={$}
; a constraint that the names of the attributes in an
element should each start with a different letter can be written
unique={substring($, 0, 1)}
, and a constraint that all the non-null
attributes of an element should have distinct values can be expressed as
unique={$2}
(when a mapping function is applied to an element, the
first argument $
is the attribute name, and the second argument
$2
is the attribute value).
Referential constraints are enforced by a similar facet whose value is a pair of
functions, one of which selects the references (foreign keys) and one the target
identifiers (primary keys):
<ref=<from={$@ref} to={$@id}>>
The rule is that the set of values selected by the from
function
(again excluding any nulls) must be a subset of the set of values selected by the
to
function.
Queries and Transformations: the FtanSkrit Processing Language
FtanSkrit is a functional, weakly-typed, Turing-complete programming language for
manipulating instances of the FtanML data model. It is an expression language with
full orthogonality: any expression can be used as an operand of any other
expression, subject only to rules on operator precedence and type
constraints.
A program in FtanSkrit is written as a function (a function which typically takes
a source document as input and produces a result document as its result). The body
of a function is an expression, and this exposition of the language will focus on
the different kinds of expression that can be written.
The data model that can result from parsing a FtanML document, as we saw earlier,
can contain seven types of value: null, boolean, number, string, list, element, and
text. We also mentioned an eighth type of value, namely a function. Functions can
appear
in the data model anywhere that the other seven types of value can appear, for example
as the value of an attribute in an element, or as the value of an item in a list.
Because expressions can be nested arbitrarily, it's not easy to define the different
classes of expression without forward references to concepts that haven't been explained
yet, and it's also rather difficult to know where to begin. But because functions
are so
important and central, that's where I'll start.
Functions
There are two important kinds of expression associated with functions: function
declarations and function calls.
Function Declarations
A function is written as an expression enclosed in curly braces. Here's a
function that computes the sum of its two arguments: {$1 +
$2}
References to parameters are written $1
, $2
etc, where
$N
refers to the Nth supplied argument in the function call.
The expression $
can be used in place of $1
to refer
to the first argument, and is particularly useful for functions that expect a
single argument. It can be used in rather the same way as .
(the
context item) in XPath, and plays a similar role to _
in languages
such as Perl or Scala.
For example, a function that returns true if the supplied element has a name
might be written {name($) != null}
.
Functions have no name, but can be bound to named variables, in which case the
variable name serves effectively as a function name. Functions in the system library
are bound to predefined variables.
A function does not have a fixed arity. The example function {$1 +
$2}
expects two arguments, but it can be called with more than two
arguments (excess arguments are ignored), or with fewer than two (unsupplied
arguments default to null).
The expression $$
returns all the supplied arguments in the form
of a list. This makes it possible to write functions that take a variable number
of arguments: the actual number is accessible as count($$)
.
Functions can refer to variables defined outside the function body, which
become part of the closure of the function, to be used when it is
evaluated.
Within the body of a function, the variable self
is bound to the
function itself. This makes it easy to write anonymous recursive functions: for
example a function to compute the sum of its arguments can be written as
{if empty($$) then 0 else $ + self(tail($$))}
. We'll see later
how to write mutually-recursive functions.
Because a function is an expression, it can be used anywhere an expression can
appear; for example as the value of an attribute in an element. This allows an
element to be used as a map from strings to functions, which is very like
Javascript's notion of an object. This enables a kind of polymorphism.
Sometimes it is useful to design a function so that parameters are supplied by
name rather than positionally. The can be achieved by writing the function to
accept an element as its argument. The caller might supply the arguments like
this: f(<x=2 y=3>)
; and in the function body the supplied values
can then be referenced as $@x
or $@y
.
Functions do not declare their arguments explicitly. As a matter of convention,
when writing a public function it is good practice to bind the supplied parameters
to variables along with a type check. For example the following implementation
of the indexOf
function starts by giving names to its arguments and checking their type,
which simultaneously makes the function more robust and more readable.
let indexOf = {
let Array = $1.as(<occurs=[0,] <number>>);
let Search = $2.as(<number>);
0..count(Array)-1?{Array[$]=Search}
}
Because argument types are not declared, it's up to the implementor of a function
what to do when the caller supplies arguments of the wrong type. There are no implicit
conversions
defined as part of the call mechanism. The preferred approach is to throw an error,
which
can be readily achieved using the coding style in the above example.
Function Calls
If F
is an expression that evaluates to a function, then the
function may be called with arguments x
and y
using
the expression F(x, y)
.
If f
is a variable whose value is a function, and if the function
has at least one argument, then a function call can be written either as
f(x,y)
or as x.f(y)
.
As in XPath 3.0, partial function application (currying) is possible by
supplying ?
for one of the arguments: contains(?, ':')
returns a function whose effect is to test whether its first argument is a
string that contains a colon.
Some built-in functions can also be invoked using an infix operator. For
example the +
operator corresponds to the plus
function; a + b
has the same meaning as plus(a, b)
or
a.plus(b)
. All the operators in the language, including higher-order operators, are defined
in terms of functions, to allow them to be passed as arguments to higher-order
functions.
The names of built-in functions always use the ASCII alphabet; for
some operators we have allowed ourselves the luxury of reaching beyond ASCII,
but users can always avoid relying on such operators and can use the function
name instead.
List and Element Constructors
The syntax of lists and elements is extended so that expressions may appear
anywhere the FtanML syntax allows a value.
For example, the expression {[$, $+1, $+2]}(5)
returns the list [5,
6, 7].
Lists in FtanML are not automatically flattened, so the expression [1 to 5,
10]
produces the length-2 list [[1,2,3,4,5],10]
rather than
the length-6 list [1,2,3,4,5,10]
. The latter result can be achieved
either by applying the flatten()
function explicitly, or by using list
concatenation/append operators: for example (1..5).append(10)
.
In an element constructor, expressions can be used to compute the values of
attributes, but cannot be used to compute their names. The value can be expressed
either as a parenthesized expression, or using a string or text value containing
expressions embedded in curly braces: <img size=(x+1) caption="Figure
{n}">
. The same applies to the content value. Note that curly braces are
used only for inline expansion of strings and text (and for writing functions); to
compute general structured content, parenthesized expressions should be used. The
expression:
<job-titles (distinct-values(employee@job-title))>
might generate
<job-titles ["Manager", "Programmer", "Bottle-Washer"]>
A null value for an attribute indicates the effective absence of the attribute, so
the expression <size x=(a+1) y=(if a=2 then 3 else null)>
might
produce the output <size x=3>
.
More specifically, in an element constructor, the value of an attribute, or of the
content value, can take any of the following forms:
-
A literal value, for example a=3
or a="blue"
or
a=false
.
-
A string or text value with embedded expressions enclosed between curly braces, for
example
a="Chapter {n}"
The value of the attribute is obtained by
evaluating the embedded expressions as strings and inserting the resulting
strings into the text.
-
A list constructor, for example a=[n, n+1, n+2]
.
-
An element constructor, for example a=<x=(n+1) y=(n+2)>
-
A parenthesized expression, for example a=(n+1)
-
A function, for example a={$+1}
. In this case the value of
the attribute is the function, not the result of evaluating the
function.
Where element constructors cannot be used because the element or attribute names
are not known statically, functions can be used to construct an element. For
example:
element("img").add("x", 3).add("y", 5).add("", "An image")
}
Here the function call element("img")
constructs an element with a
given name, and the add()
function adds an attribute with a given name
and value (copying the element to create a new element). The last call adds the
element content, represented as an attribute with an empty name. It should be
remembered that although we use the term "element", FtanML elements will not only
be
used in the way that XML elements are traditionally used, but also in the way that
maps are used in other programming languages, where the keys (attribute names) are
highly dynamic: indeed, to satisfy the kind of use cases for which maps are being
added to XSLT 3.0.
Rich text (mixed content) is constructed as a list of strings and elements, which
is then converted to rich text by applying the toText()
function.
Conditional Expressions
The syntax for a conditional expression is:
"if" expression "then" expression "else" expression
There is no need for parentheses (though you can use them if you like, for old
time's sake). The "else" branch is mandatory, partly to avoid choosing an arbitrary
default (null?) and partly to prevent the dangling-else ambiguity when conditional
expressions are nested. For example:
if $ = 0 then x else y
A simple try/catch construct is provided:
"try" expression "catch" function
which returns the result of the expression unless an error occurs during its
evaluation, in which case the catch function is called, supplying error
information as its argument, in the form of an element with attributes
representing the error code and error description.
For example, the following catches a divide-by-zero error (we assume use of
the XPath error codes), and returns null if it occurs; otherwise the error is
re-thrown:
try (x.div(n)) catch {if $@code="FOAR0001" then null else error($)}
A function orElse
allows a default to be substituted when a value
is null. For example a.orElse(0)
returns the value of a
unless it is null, in which case it returns zero. This function could be defined as
{if $1=null then $2 else $1}
.
Variables
Variables have simple names (no "$" prefix, no backticks). The names
true
, false
, and null
are reserved: they
are used syntactically like variables, but have fixed predefined values. Language
keywords such as if
and let
are also reserved: unlike
XPath, this is possible because bare names are not used to refer to elements and
attributes in input documents.
Variables may be declared using the construct:
LetExpression ::= "let" name "=" expression; expression
which evaluates the second expression with the named variable bound to the
value of the first expression; for example let x=2; x+x
returns
4.
Variables declared in this way are available only after they have been
declared. An alternative style of declaration allows forwards references to
variables, which is necessary when writing recursive functions. This style uses
element notation:
let <
even = {if $=0 then true else odd(abs($)-1)}
odd = {if $=0 then false else even(abs($)-1)}
>;
even(32)
With this approach, all the variables declared as attributes of the same
element are in scope within each others' declarations, failing dynamically (or
in the worst case, failing to finish) if this results in non-terminating
recursion.
Equality and Other Comparisons
The eq
function (operator =
) is defined over all
values. To be equal, two values must have the same fundamental type (this means, for
example, that the string "John"
is not equal to the rich text
|John|
). Strings are compared codepoint-by-codepoint. Lists are
equal if they have the same size and their items are pairwise equal. Elements are
equal if they have the same name, and if there is a one-to-one correspondence of
attributes in which both the attribute names and the corresponding values are equal
(the content value is treated here as an attribute). Two texts are equal if the two
sequences of strings and elements making up the texts are pairwise equal.
Defining equality for functions requires further work. Some languages such as
ML and Haskell make equality of functions undefined, but this would mean that
equality of lists and elements containing functions also becomes undefined.
Currently my preference is to make equality of functions implementation-defined,
subject to the proviso that two functions can only be equal if all invocations are
guaranteed to return equal results. It would be useful to attempt a more careful
definition, for example one that guarantees that the result of the expression
let a=b; a=b
is always true, but formalizing this is not easy
without introducing some notion of identity.
The ne
function (operator !=
) is the inverse of
eq
.
Ordering (specifically, the functions le
, lt
,
ge
, gt
, and their corresponding operators
<=
, <
, >=
,
>=
) is defined over numbers and strings only. Strings are
sorted in Unicode codepoint sequence.
Testing whether a value V
is present in a list A
(the equivalent of the =
operator in XPath) is sufficiently common that
we provide a function, in(V, A)
with corresponding operator ∊ (x220A).
The function in(V, A)
can be defined as {let V=$1; let A=$2;
exists(A?{$=V})}
. (This uses a filter operator which we will introduce in
due course.)
A collation is modelled as a set of functions. Specifically, a collation for a
particular language, say Swedish, is obtained using the function call
collation(<lang="se">)
. This returns an element, whose
attributes are functions. One of these functions is a sort function, so to sort a
list of strings using Swedish collation, one can write
collation(<lang="se">)@sort(input)
. Other functions available as
attributes of a collation include comparison functions eq
,
le
, etc, and collation-sensitive versions of other functions that
involve string comparison such as in
, min
,
max
, indexOf
, contains
,
startsWith
, endsWith
, substringAfter
,
substringBefore
.
Comparing A = null
returns true if A
is null,
false otherwise. (This is not as obvious as it might seem, given the semantics in
some
other languages.)
Operations involving Types
The function isA
tests whether a value belongs to a given type.
Types are represented using the element representation introduced in an earlier
section. For example, x.isA(<number ge=0>)
returns true if the
value of x
is a number and satisfies the facet ge=0
.
Recall that a type is a predicate, not a label associated with a value, so this
tests whether the value meets all the constraints defined by the type, not
whether the value carries any particular type label.
For convenience, the functions isNull
, isBoolean
,
isNumber
, isString
, isList
,
isElement
, isText
, and isFunction
are
available to test for membership of the primitive types.
The function as
is similar to isA
, but instead of
returning a boolean indicating whether or not the value is a member of the type,
it returns the value unchanged if this is the case, and throws an error
otherwise. We saw this function used to check the arguments to a function call.
For convenience, the functions asNull
, asBoolean
,
asNumber
, asString
, asList
,
asElement
, asText
, and asFunction
are
available to test for membership of the primitive types. In each case, they return
the argument unchanged if it matches the corresponding type, or throw an error
otherwise.
Functions for conversion of values have names such as toString
,
toList
, and toText
. There are no general rules here;
as in other languages, the rules for what can be converted to what are inherently
ad-hoc.
The parse
function takes a FtanML lexical representation of a value
and returns the corresponding value; conversely, serialize
takes a
value and returns its FtanML lexical representation.
Boolean Functions and Operators
The functions and
, or
, and not
are
available. The first two have equivalent operators &&
and ||
.
The argument must either be a boolean or null; there is no implicit conversion
to boolean as in XPath. If an argument is null, the operators implement three-valued
logic as in SQL, for example (null||true)
is true
.
Order of evaluation is not defined; programmers should not assume that the
second argument will only be evaluated if it is required. (This rule might seem
unnecessary in the absence of side-effects, but it becomes important when defining
the terminating conditions of a recursive function. Like XPath, we choose to allow
optimizers the freedom to re-order the terms in a predicate, which can be important
when indexes are available on large data sets.)
Numeric Functions and Operators
The functions plus
, minus
, times
,
div
, idiv
, and mod
are defined; the first
four have corresponding operators +
, -
, *
,
and /
. Arithmetic is performed in decimal. Division by zero is an
error; the precision of the result of division is implementation-defined, as are
limits on the value space.
Additional functions abs
, round
, floor
,
ceiling
have the same effect as in XPath.
The function parse
may be used to convert a string to a number.
Writing parse(X).asA(<number>)
checks that the value was
indeed numeric.
Supplying null to an arithmetic function or operator results in the value
null. Supplying any other non-numeric value causes an error. There is no
implicit conversion of strings to numbers.
Aggregation functions sum
, avg
, min
,
max
work as in XPath.
The function to
or its equivalent operator ..
returns a list of consecutive integers, for example 1..5
returns the
list [1,2,3,4,5]
.
String Functions and Operators
The toString
function can be applied to any value without
exception to convert it to a string. If the input is already a string, it is
returned unchanged. If the input is a boolean, number, or null, the result is the
same as the FtanML literal representation of the value. If the input is a list, the
result is string-join($!toString, " ")
, that is, the space-separated
concatenation of the string values of the members of the array. If the input is an
element, the result is string(content($))
, that is, the string value of
the element's content. If the input is rich text, the result is
string-join($.toList()!toString)
, that is, the concatenation
(without space separation) of the string-values of the strings and elements making
up the text. If the value is a function, the resulting string begins with "{" and
ends with "}", and is otherwise implementation-defined; it is not necessarily a
string that can be parsed and executed as a function.
String concatenation can be achieved conveniently using string templates, for
example "See section {s} in chapter {n}"
. This mechanism can be used
wherever a string literal is permitted. The result of the enclosed expressions is
converted to a string if necessary, by using the toString
function.
Generally FtanSkrit avoids implicit conversion. For example, if rich text is
to be compared to a string, it must be converted to a string explicitly. When null
is supplied to a function that expects a string, it is generally treated as a
zero-length string (but this is a convention adopted by the functions in the
built-in function library; it is not a feature of the language).
Other functions are available as in XPath. Counting of characters in a string,
however, starts at zero. The basic built-in functions use codepoint collation;
equivalents using a different collation can be obtained as attributes of the
collation.
Functions and Operators on Lists
The number of items in a list A
is obtained using
count(A)
(or equivalently, A.count()
).
The construct A[n]
selects the n'th item in list A
.
This construct is never used for filtering, only for subscripting. If n
is out of range, the expression returns null. This operation is also available as
a
function itemAt(A, n)
.
The function cat(a, a)
(operator ++
) concatenates
two lists. The function append(a, i)
(operator +~
) appends
an item to a list, while prepend(i, a)
(operator ~+
)
prepends. Thus for example 0 ~+ [1,2,3] ++ [4,5,6] +~ 7
returns the
list [0, 1, 2, 3, 4, 5, 6, 7]
.
The function head(a)
is equivalent to a[0]
, while
tail(a)
equates to remove(a, 0)
.
The function flatten
flattens a list: it creates a new list in
which any non-list items in the argument list are copied to the new list, and any
list items are processed by copying their contents. This only works one level deep.
So flatten([[1,2],[3,[4,5]]])
returns
[1,2,3,[4,5]]
.
Functions index-of
, remove
, subsequence
work as in XPath, except that indexing starts at zero. The
insert-before
function inserts a single item at a specified
position; if the supplied item is a list, it becomes a nested list (there is no
flattening).
The function toList
works as follows: if the argument is a list,
it is returned unchanged. If the argument is rich text, it is converted to a list
whose members are (non-zero-length) strings and elements. In other cases, the
function creates and returns a singleton list in which the argument is the only
item. This function is useful because it makes it easier to process different types
of content in the same way: a single element looks the same as a list of elements
of
length one, which looks the same as mixed content comprising a single element; a
single string looks the same as mixed content containing no elements.
The function forEach
, or the equivalent operator !
,
applies a function to every item in a list. So forEach([1,2,3], {$+1})
returns [2,3,4]
; this can also be written [1,2,3] ! {$+1}
.
Similarly, [1,2,3]!toString
returns ["1", "2", "3"]
. Note
that this is a non-flattening mapping operation; the result list will contain
exactly the same number of items as the input.
Another example: (1..5)!{<br>}
returns
[<br>, <br>, <br>, <br>, <br>]
The function select
, or the equivalent operator ?
,
applies a function to every item in a list and returns a list containing those items
for which the function returns true. So select([1,2,3], {$>=2})
returns
[2,3]
; this can also be written [1,2,3]?{$>=2}
.
Similarly, [1,2,"London"]?isNumber
returns [1,2]
.
The result of the select
operator or function is always a list,
even if only one item is selected. If it is known that the predicate will select
exactly one item, it is necessary to extract that item from the result, typically
by
a call on head
, or by using the subscript operation
(A?P)[0]
. Because this is a common operation, the operator
??
is provided, equivalent to ?
followed by
head()
: it selects the first item found, or null if nothing was
matched. A query to find a singleton can now be written, for example
items??{$@id='xyz'}
.
The functions all
and some
can be used in
conjunction with the forEach
(!
) operator to perform
universal and existential quantification: they test whether a list consists entirely
of boolean true values (all), or contains at least one true value (some). So, for
example all([1,2,3]!{$>0})
returns true, while
some([1,2,3]!{$=0})
returns false.
Functions fold-left
and fold-right
are available as
in XPath 3.0.
Functions and Operators on Elements
The function name(E)
returns the name of an element
E
, or null if it is unnamed. The syntax E.name()
can also be used, of course.
The mapping and filtering operators (!
and ?
) apply to
elements as well as to lists. In this case they expect a two argument function to
be
supplied, and call this with the attribute name as the first argument and the
attribute value as the second. The mapping operator returns a list with as many
members as there are non-null attributes in the input; the filter operator returns
an element with the same name as original, and with a subset of its attributes.
These operators treat the content value as just another attribute. They provide the
most general and powerful means of processing elements, and other operations can be
defined in terms of these two. For example, the function content(E), which returns
the content of an element, could be defined as E?{$1=""}!{$2}.
For example E?{$.in("id", "code", "status"}
returns a copy of element
E, retaining only the three specified attributes.
If E
is an element and xyz
is the name of an attribute
(known at the time the program is written), then E@xyz
returns the
value of the attribute. It returns the value only, not an "attribute node" as in
XPath; if the attribute is not present, it returns null. If the name needs to be
dynamically computed, this can be achieved using an expression in parentheses, for
example E@(X@name)
returns the attribute of E
whose name
is given by the expression (X@name)
. The construct E^
is
an abbreviation for E@``
— it returns the value of the attribute whose
name is the zero-length string, that is, the content value.
Filtering a list of elements to select those with a given name is likely to be a
common operation. The syntax L?{name($)=N}
achieves this but is a
little cumbersome, and becomes more so if the list can also include values other
than elements. So we provide the construct :N
, where N
is
a name, which represents a function that returns true when its first argument is an
element with the name N
, and false otherwise. So given a list of
elements L
, we can now select those having the name N by writing
L?:N
. If we know there will be only one such element, we can select
it using L??:N
.
So if PO
is the purchase order presented in section 2.1, then
PO@shipTo^??:name
gives the value "Alice Smith"
, while
PO@items[0]@partNum
gives "872-AA"
.
The following example selects from a list of elements those having a particular
attribute value: PO@items?{$@USprice > 20.00}
.
The @
operator performs implicit mapping. Specifically: if the
left-hand operand is a list L
, then any lists contained in this list
are expanded recursively. Any values in the expanded list that are not elements are
ignored, so we end up with a list of elements; call this LL
. The value
of the expression L@name
is then defined as LL!{$@name}
.
Note that the result may be a list of lists; it is not flattened.
Returning again to the purchase order example, this means that
PO@items@partNum
returns the list of strings ["872-AA",
"926-AA"]
.
The postfix operator //
represents the deepContent()
function, which is the flattened transitive closure of the content()
function. Specifically, if the result of content()
is a list, then any
element in that list has its own descendants inserted immediately after it. So the
function descendants(E)
can be defined as content(E)!{$ ~+ (if
$.isA(<element>) then $.descendants() else [])}
. So if E
is an element, then E//?:status
will select all descendant elements
named status
, and E//?[$@id=12]
will select all descendant
elements having the id
attribute equal to 12.
The postfix operator @@
similarly gives the transitive closure of
attributes()
.
As already mentioned, the function forEach
and the equivalent
operator !
are overloaded for elements to process all the attributes of
an element (including the content). The second operand is a function which is called
with two arguments, the attribute name and the attribute value. For example, given
an element E the expression E!{$}
returns the names of its
attributes.
Similarly the function select
and the equivalent operator
?
are overloaded to process all the attributes, and return an
element in which only a subset of the original attributes are present.
The function element(name)
constructs an element with a given
name (which may be null). It can also be called with two arguments:
element(name, attributes)
. The second argument is a list of
attributes, each attribute being represented by a two-member list containing the
name and the value.
The function add(element, name, value)
takes an existing element,
and adds an attribute with the given name and value, replacing any existing
attribute with the same value. A new element is returned. Calls to this function can
conveniently be chained: element("rect").add("id", "a001").add("color",
"black")
.
For convenience, the function addContent
is defined as
add(?, "", ?)
.
So, for example, we can convert attributes to a list of child elements like
this:
let elementsToAttributes = {
let E = $.asElement();
element(E.name()).setContent(E!{element($1).setContent($2)})
}
The semantics of these constructions in FtanSkrit are different from the corresponding
operations in XPath, but hopefully they will have a familiar feel.
Conclusions and Summary
We have introduced three languages as replacements for the central pillars of the
markup edifice: FtanML as the markup language, FtanGram as its schema language, and
FtanSkrit as its query and transformation language. Let's try now to assess what we've
achieved.
Firstly, FtanML compared to XML. FtanML is considerably smaller as a specification,
but it's also more powerful. It gets rid of the same unwanted things that MicroXML
gets
rid of (namespaces, comments, processing instructions, entities, DTDs), but by allowing
attributes and element content to be any value, the data model is much richer, more
orthogonal, and more expressive. It also solves the whitespace issue (which whitespace
is significant?). By dropping end tags, the language is a lot less verbose, which
is
particularly noticeable when it is used for highly structured information, as in the
FtanGram syntax. There's a lot of general tidying-up in little areas like escaping
of
special characters.
Does verbosity matter? We think it does. The fact that XML is bulky and hard to read
is a significant factor leading to the adoption of alternative syntaxes for languages
such as RDF and RelaxNG, and is a big turn-off for people coming newly to XSLT. Even
if
specialist editors can reduce the burden of entering the markup, the amount of noise
on
the page affects the ability of a human reader to absorb information quickly. This
is
not to say that the most concise syntax is optimal, of course: we might have swung
too
far. XML had human
readability as one of its goals, and we should remember that readability is not a
binary
attribute; there are degrees of readability, and readability also depends greatly
on
the familiarity of the reader with the notation.
Compared to JSON, FtanML's main contribution is that it adds support for mixed
content. And element names, which are very handy when modelling document
structures.
Compared to the XPath data model (XDM), the FtanML model has more capability and
greater orthogonality. The core structuring constructs (elements and lists) are powerful
enough for all computational requirements. XSLT and XQuery have found a need to extend
the core XML-based model with other constructs such as maps and lists; the FtanML
model
does not have this awkward duality between constructs that can be directly serialized
in
the markup, and constructs used only for internal processing.
FtanGram learns from RelaxNG the importance of designing a schema language to do
validation and nothing else. Unlike RelaxNG, it's able to take advantage of the
simplification and orthogonality of the data model. The unification of facilities
for
"simple types" and "complex types" is particularly appealing, allowing a smaller number
of constructs to be combined in more powerful ways to create richer functionality.
The
idea that element names as well as attributes and content are something that can be
constrained by a type is also a useful simplification. FtanGram also attempts to show
that by making the markup language itself more powerful and less verbose, the need
for a
"compact syntax" (that is, a syntax using constructs other than those available in
the
target language) is eliminated.
FtanSkrit is broadly equivalent in capability to XQuery, but with a stronger reliance
on higher-order functions and operators in preference to custom syntax. It currently
lacks any mechanism comparable to XSLT's template rules, but we have ideas for how
that
could be added.
There will always be debates about strong versus weak typing, static versus dynamic.
I
believe that FtanML's dynamic typing approach fits better with the philosophy that
with
markup, you can have as much or as little schema machinery as you want. The XPath
ability to mix typed and untyped data is one solution to the problem of spanning the
worlds of structured and unstructured data, but it is something of a camel.
Are there any downsides? Some may find the languages excessively terse; highly compact
syntax is not easy on the eye. The absence of named end tags in FtanML can lead to
long
strings of closing angle-brackets which are hard to match up without the support of
a
syntax-driven editor. Generally, though, we feel that FtanML with its sister languages
FtanGram and FtanSkrit together form a markup system that has more than the power
of the
equivalent XML stack, with much greater integrity of design, simplicity, orthogonality,
efficiency, and usability.