Text And Markup Processing Languages

In the old days, most of the work required of computers was about processing numbers, for either commercial or scientific applications. That, together with manipulating the system hardware on which their programs ran, was what programmers required of computer programming languages. Even 50 years ago, in the late 1960's, there were many programming languages produced to help with all of that.

In the still early days of computer languages, in the 1960's and 1970's, text and markup processing were nonetheless important application areas. In fact, there were more text processing languages than there seem to be now. Most notable were Ralph Griswold's SNOBOL and Icon, and the less well known HUGO, a text processing and electronic publishing language that I designed and helped implement at the Canadian Publishing and Printing Office. HUGO was used in the publishing of Hansard, Canadian Laws and other government publications. In the last couple of decades, what computers do has greatly changed. A lot of what is done with computers these days is manipulating text and how it is marked up: formatting it for different platforms, converting it from one markup form to another, and enhancing it with new information. Yet there are very few computer languages that are designed to help with these tasks.

Perl more-or-less leads the field of commonly used text processing languages these days, and a lot of work has been put into it to keep it up-to-date. XSLT is very useful and is widely used in the markup language field, even though it only supports input in XML and is somewhat weak in its text processing features. OmniMark is strong in its text processing abilities, and it handles SGML, XML and, more recently, JSON. (OmniMark's unpopularity relative to XSLT, in spite of its wider range of applications, is due as much as anything to the high cost of acquiring it, not to its comparative utility.)

The total history of programming languages of all sorts is a little over 60 years -- less than the age of some of us still in this business. When I first got interested in programming languages, in the late 1960's, the people who had pioneered programming languages in the 1950's were still active in the field. In this context, both XSLT and OmniMark are getting rather long in the tooth. XSLT is 20 years old and OmniMark's basic design and most useful features date from 30 years ago. In fact, OmniMark's syntax, processing model and text processing capabilities are based on the HUGO language, making its style more like 40 years old. There have been a few recent advances in both languages, including serial markup processing in XSLT and JSON support in both OmniMark and XSLT[1], but development in both cases has definitely slowed down.

Why Programming Languages Advance

For programming languages, the memory and processor speed available at the time of their initial design and implementation are a major factor in a language's design and in the cost of its implementation. Early design decisions in a language greatly impact what can be added to it later. This is why new programming languages come along every decade or so: one has to go back to the beginning to really make a change in a language, and increased computer power makes that progressively easier to do over time.

As an example of how the world changes, when OmniMark was first designed and implemented, the largest machine in the office was a 2 megabyte memory desktop Macintosh. Now I'm typing this paper on a much faster 8 gigabyte memory laptop, which is no more than average for these days. It is about seven years old, so it is getting old, and may already be considered slow and small.

More memory and faster processors make it easier to implement a new programming language. Easier yes, but not easy -- it is still a big job. And more is now expected of a new programming language than was the case in the past. It is getting less and less clear what hasn't already been done, and what new could be done. However, even though it is over 60 years since the first widely-used programming languages like FORTRAN and COBOL were created, computer hardware, system software and other software tools have continued to develop, what they are used for has changed and expanded, and one would think that programming languages should participate in this continued development. As well, a number of useful programming features from the 1960's seem to have disappeared, and really need reviving (in my opinion, of course).

So if nothing else, there is an opportunity for a new programming language, combining desirable features both from current languages and past languages, together with a few minor innovations.

An Attempt At A Future Programming Language

I've been working on a new programming language, on and off, for the best part of a decade now, and it is coming close to completion. It is an example of what I think should be on the horizon. It is a bit too large a topic for this talk, but I'll outline a few features, and give a few examples, which I hope will demonstrate what I'm talking about.[2]

The new language is currently called Bobbee.

Language Features

The language has a number of interesting features:

  • It supports classes, interfaces, enums and the features one expects to find in an object-oriented programming language.

  • Although the language is object-oriented, a user's program need not define classes.

  • It supports procedural, functional, and operator styles of programming.

  • It has a 21st-century syntax style, unlike the 1970's syntax style of OmniMark, for example.

  • It supports "true" Unicode strings, for which, amongst other things, the "length" of the string is the number of characters, unlike the case for native Java strings.

  • The language supports user definition of operators. Operators can be prefix (before their one argument), infix (between their two arguments), and postfix (after their one argument). All operators are implemented using this mechanism: there are no "built-in" operators.

  • Text pattern matching can be done in both rule based and procedural text processing styles. Its text processing features include support for both SNOBOL/OmniMark expression style text matching, and regular expression style text matching as currently used in XSLT. (The regular expression style uses the libraries provided with Java. All other text matching facilities are implemented in Bobbee itself: they are not "built-in" in the language.)

  • It supports multi-phase processing, where the output of one phase can be processed by another phase. What is passed between phases can either be a stream of text, or a sequence of objects of a specified common type. This multi-phase processing is all implemented using the language itself, and is not a feature of the language implementation.

  • Its compiler creates code that runs on the Java virtual machine, so it should work on a variety of platforms, and does not need reimplementing for different operating systems. It runs on all Java virtual machines of version 1.6 and later. (It creates Java code that works on the latest version of Java, but does not use any features that do not work with version 1.6.)

Markup Language Features

The most interesting aspects of Bobbee's markup language support are that:

  • All the markup languages are implemented in Bobbee. None of the markup languages is a feature of the Bobbee language: they are not "built-in". This means that it is easy to add a new markup language to what is supported, and it is easy to experiment with new markup languages and new markup language features.

  • All the markup languages are separately implemented. One (like JSON) is not supported by translating it to another markup language (like XML), as is the case for OmniMark and XSLT.

  • A user program can use any number of different markup languages.

  • Markup language processing can be done in either or both a serial style or tree style, or a mixture of the two.

  • Markup processing can be done both at a top-level program and within classes.

  • In addition to rule-based processing of markup languages, procedural processing is supported: you can ask for the markup language components either as a tree structure, or as an iterated sequence of items.

  • One can, while doing serial processing, choose that the children of any particular node or element be processed in tree style. This is useful for processing tables, for example.

Currently, the libraries available for use with Bobbee support the markup languages SGML, XML, MicroXML, JSON and BML.[3]

Examples

Following are examples of some of the interesting features of the language. Given that the language description is hundreds of pages long, this is of necessity just a taste of what the language does and how it works. As well, many features do not work well for small examples: they need much larger examples to be convincing.

Hello World

We start with the basic "Hello World" program, which is required for all programming language examples:

println ("Hello World");

People with Java experience will notice that println doesn't need prefixing with "System.out". This is a user selected option that applies to anything the user chooses. There are a few defaults in the "standard" profile (which can be overridden). Here is the declaration that makes the above happen:

use System.out.(print, println, printf);

The user can declare their own short-hand forms. Where the short-hand is not appropriate, fully qualified names can be used, as in:

System.err.println ("Goodbye Cruel World");

XML Processing

Here is an XML example that converts an XML document to a JSON form:

program xmltree;

parseXml (System.in);

choose xmlElementNode {
  print (" [\"%s\" {" (*.qName));
  for a = *.attributes, comma = "" then "," do
    print ("%s\"%s\":\"%s\"" (comma, a.qName, a.allValues));
  print ("}");
  processChildren;
  print ("]");
}

So:

<section id="X"><title>Title Text</title><p>Para 1.</p></section>

is converted to:

["section", {"id":"X"}, ["title", {}, "Title Text"], ["p", {}, "Para 1."]]

An XML processing program is prefixed by a program directive that indicates what kind of XML processing is to be done, in this case, XML tree processing, as in the original release of XSLT.

A markup processing rule is heralded by a two-word prefix: choose, indicating it is a markup processing rule and, in the above case, xmlElementNode indicating what is to be recognized by the rule.
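For readers more familiar with Java than with Bobbee, the same element-to-array conversion can be sketched with the standard DOM API. This is plain Java, not Bobbee, and it is not how the Bobbee libraries are implemented; the class name XmlToJsonArrays and its methods are hypothetical, and the string quoting is deliberately minimal (only backslashes and double quotes are escaped).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: converts an XML document to nested JSON arrays.
class XmlToJsonArrays {
    static String convert(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        StringBuilder out = new StringBuilder();
        element(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Writes one element as ["name", {attributes}, child, child, ...].
    private static void element(Element e, StringBuilder out) {
        out.append("[").append(quote(e.getTagName())).append(", {");
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            if (i > 0) out.append(",");
            out.append(quote(a.getNodeName())).append(":").append(quote(a.getNodeValue()));
        }
        out.append("}");
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                out.append(", ");
                element((Element) c, out);
            } else if (c instanceof Text) {
                out.append(", ").append(quote(c.getNodeValue()));
            }
        }
        out.append("]");
    }

    // Minimal JSON quoting: escapes backslash and double quote only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

The recursion in "element" plays the role that "processChildren" plays in the rule above: each child is handed back to the same conversion logic.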

More XML Processing

Here is another XML processing program, using serial parsing this time:

program xmlserial;

sendToMatch (System.in);

choose xmlElement ("section") {
  processChildren;
}

choose xmlElement ("title") when *.parent..is ("section") {
  print (".section ");
  processChildren;
  println ();
}

choose xmlElement ("p") {
  print (".para ");
  processChildren;
  println ();
}

About markup processing:

  • Serial processing uses different names (xmlElement for a serially processed XML element rather than xmlElementNode for a tree processed XML element) to recognize what is being processed. Among other things, this separates the two types of processing, and allows both to be used within a program.

  • Markup processing rules can be defined to have a "name" on which they can be chosen, as well as a when (or unless) condition based on other markup component and contextual properties.

  • The "conditional selection" operator ".." (two dots instead of one, as for the usual field or method selection), instead of being an error when the base value is null, returns false (or null or 0 as appropriate). For example, in this case, the when returns false if there is no "parent" or if the parent is not named "section".
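The behaviour of ".." can be approximated in Java with java.util.Optional. The sketch below is only an analogy: the Elem type and the parentIs method are hypothetical stand-ins, not part of any Bobbee library.

```java
import java.util.Optional;

class ConditionalSelectionDemo {
    /** Hypothetical minimal element type: a name plus an optional parent. */
    static final class Elem {
        final String name;
        final Elem parent; // null for the root element

        Elem(String name, Elem parent) {
            this.name = name;
            this.parent = parent;
        }

        boolean is(String candidate) {
            return name.equals(candidate);
        }
    }

    /** Rough analog of "*.parent..is (name)": false, not an error, when there is no parent. */
    static boolean parentIs(Elem e, String name) {
        return Optional.ofNullable(e.parent).map(p -> p.is(name)).orElse(false);
    }
}
```

With a plain "e.parent.is (name)" the root element would raise a NullPointerException in Java; the Optional chain returns false instead, which is the effect the two-dot operator gives directly.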

Even More XML Processing

Serial and tree markup processing can be combined. For example, when processing a potentially large document that is best done with serial processing, there may be subcomponents, like tables, that require the added functionality of tree parsing. As a simple example:

program xmlserial, xmltree, textpatterns;

parseXmlSerially ();

choose xmlElement ("doc") {
  println (".startdoc"); processChildren; println (".enddoc");
}

choose xmlElement ("p") :- processChildren;

choose xmlElement ("table") :- xmlElementNode (*.captureTree ());

choose xmlElementNode ("table") {
  println (".starttable"); processChildren; println (".endtable");
}

choose xmlElementNode ("tr") {
  println (".startrow"); processChildren; println (".endrow");
}

choose xmlElementNode ("td") {
  println (".tableitem "); processChildren;
}

choose xmlText :-
  println (*.data) unless *.data matches ([" \t\n"]* & -|);

choose xmlTextNode :-
  println (*.data) unless *.data matches ([" \t\n"]* & -|);

The "Node" rules do tree processing, and the non-"Node" rules, serial processing.

"captureTree" captures the current element (in this case "<table>") as a tree data structure, and invoking "xmlElementNode" passes the tree to the tree processing rules.

XML Processing Without Rules: Serial Processing

A markup parser can be invoked without using rule-based processing. It is an iterator of the serial processing item type returned by the parser, in this case XmlItem:

import bj.xml.*;
for item = XmlSerialParser.parse () do
  println ("%s: %s" (name of item, item));

With the following input:

<doc><p>The text.</p></doc>

this program outputs the following:

bj.xml.XmlStartDocumentItem: <?xml version="1.0"?>
bj.xml.XmlStartTagItem: <doc>
bj.xml.XmlStartTagItem: <p>
bj.xml.XmlTextItem: The text.
bj.xml.XmlEndTagItem: </p>
bj.xml.XmlEndTagItem: </doc>
bj.xml.XmlEndDocumentItem: <!-- end document -->

The above sample program illustrates two minor but useful features of Bobbee:

  • A string can be used as if it were a method, in which case the string is treated as a "format string", with the arguments as the arguments of the formatting.

  • The prefix operator "name of" returns the name of the type of its argument.
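For comparison, a roughly analogous loop can be written in plain Java with the standard StAX API (javax.xml.stream). This is a sketch of the same serial, item-at-a-time idea, not a description of how XmlSerialParser works; the class name StaxEventLister and the labels it produces are hypothetical, and only start tags, end tags and text are reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventLister {
    /** Lists the start tags, end tags and text of a document in serial order. */
    static List<String> listEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as a single event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("StartTag: <" + reader.getLocalName() + ">");
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("EndTag: </" + reader.getLocalName() + ">");
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        events.add("Text: " + reader.getText());
                        break;
                    default:
                        break; // other item kinds are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return events;
    }
}
```

The Java version needs a switch over integer event codes where the Bobbee loop simply prints each item, since Bobbee's items are objects that carry their own type name and text.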

XML Processing Without Rules: Tree Processing

As well as direct access to a markup parser's serial parser as an iterator, its tree parser can be used to return a tree-structured data structure:

import bj.xml.*;
println (XmlTreeParser.parse ());

With input as:

<?xml version="1.0"?><doc><p>The text.</p></doc>

this program produces the same output as its input.

However, because "XmlTreeParser.parse" returns a data structure, working with it can be very useful. For example, if one wants to sort a table, based on some of its fields, prior to outputting information from it, one can do that with the data structure.
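To make the table-sorting case concrete, here is a sketch in plain Java using the standard DOM API in place of Bobbee's tree parser. The element names "tr" and "td", the class name TableSorter, and the choice to sort on the first cell are all assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableSorter {
    /** Returns each row as space-joined cell text, ordered by the first cell. */
    static List<String> sortedRows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        // Because the whole table is in memory, rows can be reordered freely.
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            rows.add(cells((Element) trs.item(i)));
        }
        rows.sort(Comparator.comparing((List<String> row) -> row.isEmpty() ? "" : row.get(0)));
        List<String> result = new ArrayList<>();
        for (List<String> row : rows) {
            result.add(String.join(" ", row));
        }
        return result;
    }

    // Collects the text content of every <td> in one row.
    private static List<String> cells(Element row) {
        List<String> texts = new ArrayList<>();
        NodeList tds = row.getElementsByTagName("td");
        for (int i = 0; i < tds.getLength(); i++) {
            texts.add(tds.item(i).getTextContent());
        }
        return texts;
    }
}
```

Sorting of this kind is awkward in a purely serial pass, because the first row to be output may be the last one read; that is the case for capturing a subtree as a tree.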

Text Matching Rules

For a taste of text pattern matching rules, here is a program that converts all words by capitalizing the first letter and putting all the other letters in lower case. It has two rules: the first recognizes a word (a string of letters, including apostrophes and dashes) and does the conversion, and the second copies out all other things:

program textpatterns;

sendToMatch (System.in);

match uLetter => "first" & (uLetter | ["'-"])* => "rest" {
  print (upper => "first" + lower => "rest");
}

match \uLetter+ => "other" {
  print (=> "other");
}

The "=>" operator is used in two ways (in the same way that "+" can be used in various ways):

  • in the match rules, as an infix operator assigning what was matched by the first argument pattern to the pattern variable named by its second argument, and

  • in the body statements, as a prefix operator retrieving the value named by its argument.

"upper" and "lower" are prefix operators that respectively upper-case and lower-case their argument, which in this example is captured text from a text pattern match. And "+" joins two string values.

One major feature in the above example is starting the program with a program declaration. In this example, it names the text pattern features that are used in the program. (More about program later.)

Using Regular Expressions

Here is the previous text matching rules example using Regular Expressions instead of text pattern matching expressions:

program textpatterns;

sendToMatch (System.in);

match `\=(\p{L})([\p{L}\p{N}]*)` {
  print (upper * [1] + lower * [2]);
}

match `\=([^\p{L}])` {
  print (* [1]);
}

When using regular expressions:

  • ` (backquote or grave) is used to quote regular expressions rather than the string quoting ",

  • "\=" suppresses the string parser from looking at following "\" characters, so that they can be left in the form expected in regular expression syntax,

  • "*" in indexed or unindexed form identifies the captured parts of the most recent regular expression pattern match.

"\=" helps make strings a bit easier to read. The alternative to using "\=" is to double "\" characters as is used in Java text patterns:

match `(\\p{L})([\\p{L}\\p{N}]*)` {...}

"\=" is like Java's "\Q" except that:

  • there is no equivalent to "\E" to turn it off, and

  • any "\" following "\=" does not need doubling as is the case for "\Q". It is this latter provision that is the primary advantage of using "\=".

A string can be split into parts, and the effect of "\=" and "\Q" is limited to one part:

print ("\=\\" "\\");

prints three backslash characters: two because "\=" means that neither of the following "\"'s is considered an escape, and one more because "\=" doesn't affect the second part of the string, where a first "\" is needed to escape the second one. The two-part string in this example is a single string, not two strings "joined" together.
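For comparison, here is the word-capitalizing transformation written directly against java.util.regex, the library the Bobbee regular expression support is said to use. It is a plain Java sketch, not Bobbee; the class name WordCapitalizer is hypothetical, and the pattern follows the earlier text-pattern version in allowing apostrophes and dashes inside a word. Note the doubled backslashes in the Java string literal, which is the noise that "\=" removes.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCapitalizer {
    // In a Java string literal every regex backslash has to be doubled.
    private static final Pattern WORD = Pattern.compile("(\\p{L})([\\p{L}'-]*)");

    /** Upper-cases the first letter of each word and lower-cases the rest. */
    static String capitalizeWords(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group(1).toUpperCase(Locale.ROOT)
                    + m.group(2).toLowerCase(Locale.ROOT);
            // quoteReplacement stops "$" and "\" in the word being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(word));
        }
        m.appendTail(out); // copy whatever follows the last word
        return out.toString();
    }
}
```

The explicit find loop, appendReplacement and appendTail do the work that the two match rules express declaratively: one rule for words, one for everything else.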

Defining Operators

In the previous example, a variety of operators are used. Operators can be defined in a Bobbee program or library. For example, these operators are defined with their name, their arguments names and types, the result of the operator and finally the code that produces the result:

operator length (arg1 : String) : integer :- arg1.length ();
operator lower (arg1 : String) : String :- arg1.toLowerCase ();
operator upper (arg1 : String) : String :- arg1.toUpperCase ();
operator (arg1 : String) + (arg2 : String) : String :- arg1.concat (arg2);

These declarations define the "length", "lower", "upper" and "+" operators as they apply to "String" values. The operators are defined in a "functional" style, with ":-" meaning that what follows is the returned value of the operator. Operators and methods can be defined in this functional style, or have a "body" containing "return" statements.

Operators need a "precedence" defined for them, which is what determines how tightly they bind. For example, one wants "*" (for multiplication) to bind more tightly than "+" (for addition). For these operators, again defined in a Bobbee program or library:

operator length 311;
operator lower 311;
operator upper 311;
operator 240 + 241;
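The paper does not say how the two numbers given for an infix operator are used. One plausible reading, and it is only an assumption, is that they are left and right binding strengths, as in precedence-climbing (Pratt) parsers. The plain Java sketch below shows how such a pair yields both precedence and left-associativity. The values for "+" are taken from the declaration above; the "-" and "*" entries and the class name BindingPowerDemo are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BindingPowerDemo {
    // {left, right} binding strengths; only the "+" pair comes from the text.
    private static final Map<Character, int[]> INFIX = new HashMap<>();
    static {
        INFIX.put('+', new int[] {240, 241});
        INFIX.put('-', new int[] {240, 241}); // hypothetical
        INFIX.put('*', new int[] {250, 251}); // hypothetical: binds tighter than "+"
    }

    private final String src;
    private int pos = 0;

    private BindingPowerDemo(String src) {
        this.src = src.replace(" ", "");
    }

    /** Evaluates an expression of non-negative integers with +, - and *. */
    static long eval(String expression) {
        return new BindingPowerDemo(expression).parse(0);
    }

    // Consumes operators only while their left strength is at least minPower.
    private long parse(int minPower) {
        long left = number();
        while (pos < src.length()) {
            char op = src.charAt(pos);
            int[] power = INFIX.get(op);
            if (power == null || power[0] < minPower) break;
            pos++;
            long right = parse(power[1]); // right strength limits the right operand
            left = apply(op, left, right);
        }
        return left;
    }

    private long number() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("number expected at " + start);
        return Long.parseLong(src.substring(start, pos));
    }

    private static long apply(char op, long a, long b) {
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            default:  return a * b; // '*'
        }
    }
}
```

Because the right strength (241) is greater than the left strength (240), a second "-" cannot be absorbed into the right operand of the first, so "10 - 3 - 2" groups as "(10 - 3) - 2".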

To improve program performance, Bobbee provides a mechanism for eliding the call to the "length" operator definition, by telling the compiler what to call instead, by use of a "@Builtin" annotation:

@Builtin ("VCALL1:public:java.lang.String.length:()I")
operator length (arg1 : String) : javaInt :- arg1.length ();
@Builtin ("VCALL1:public:java.lang.String.toLowerCase:()Ljava/lang/String;")
operator lower (arg1 : String) : String :- arg1.toLowerCase ();
@Builtin ("VCALL1:public:java.lang.String.toUpperCase:()Ljava/lang/String;")
operator upper (arg1 : String) : String :- arg1.toUpperCase ();
@Builtin ("VCALL2:public:java.lang.String.concat:(Ljava/lang/String;)Ljava/lang/String;")
operator (arg1 : String) + (arg2 : String) : String :- arg1.concat (arg2);

(The "length" operator actually returns a "javaInt" value rather than whatever Bobbee's "integer" type may be implemented as. This reflects what the underlying Java library method returns.)

The "length", "lower", "upper" and "+" operators are defined in a class (bj.lang.Operators). To allow them to be used in a user's program without class qualification, Bobbee provides a means for declaring which values, methods and operators can be used unqualified. For these operators this is:

use bj.lang.Operators.{length, lower, upper, +};

(Different overloadings of an operator or method can have a "use" from different defining classes.)

All operators, even plain old arithmetic "+" and "*", are defined by these mechanisms.

The primary motivation for including operator definitions in Bobbee is to allow libraries to be implemented in Bobbee itself, and not requiring them to be "built in". All of Bobbee's markup language and text pattern matching support, and much of its other functionality, is implemented in libraries written in Bobbee using these mechanisms.

Procedural And Functional Methods

A classic method example is the "factorial" function. In Bobbee there are a number of ways of defining it. First, here's the traditional recursive functional form:

def factorial (n : integer) : integer :-
  if n == 0 then 1 else n * factorial (n - 1);

Here's the same thing in an iterative procedural form:

def factorial (n : integer) : integer {
  var f : integer = 1;
  for i = 2 to n do
    f *= i;
  return f;
}

":-" indicates that the (returned) value of the method or other form is given as an expression. return provides a method result in a procedural manner. Both methods and operators can be defined in either a functional or procedural form. For some things one form is best. For others the other.

Numbers and Strings

Bobbee uses two kinds of numbers: integer (64-bit integer) and real (64-bit floating-point number). The idea is that given that computer memories are way bigger than they were not long ago, there's no real need to be careful with storage sizes the way there used to be. This makes programming a bit easier.

var n : integer = 0;
# declare "n" to be an integer, and initialize it to zero.

One can access Java type numbers using special names: javaByte, javaShort, javaInt, javaLong, javaFloat and javaDouble. The next section has an example of using this feature.

Bobbee supports two kinds of strings (and characters):

  • String (or java.lang.String, spelled with upper-case "S") uses the Java 16-bit character encoding, with two characters for encoding larger character values, and

  • string (spelled lower-case) uses 21-bit character encoding, meaning that every Unicode character is represented by a single value.

Most importantly, for string, the length of a string is the actual number of characters in the string, unlike the case for Java strings, where the length may be more than the number of characters.
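The difference is easy to see in plain Java. The following sketch contrasts the length reported by java.lang.String with the number of Unicode characters (code points); the class and method names are hypothetical, and only standard String methods are used.

```java
class UnicodeLengthDemo {
    /** What java.lang.String reports: the number of 16-bit code units. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** The number of Unicode characters (code points) in the string. */
    static int characters(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For a string consisting of the letter "a" followed by one character outside the 16-bit range (written in Java source as the surrogate pair "\uD83D\uDE00"), utf16Units returns 3 while characters returns 2. The second number is what "length" gives for a Bobbee string.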

By default, all types, including numbers, characters and boolean values, are implemented as objects. There's a lot to be said for making everything an object: it makes the language more uniform in its use. And there's not as much cost as one might think: for example, calls to the various "print" methods convert all their arguments to objects. On the other hand, the language can handle non-object forms of these types.

Classes

Here is a simple example of a class that implements a subclass of OutputStream, that just puts what is written to it in a buffer, which then becomes the class's "toString" value:

class BufferedOutput : OutputStream {
  def this () {}
  def this (this.buffer) {}
  def write (b : javaInt) : void :-
    write (new javaByte [] {b}, 0, 1);
  def write (b : javaByte []) : void :- write (b, 0, length b);
  def write (b : javaByte [], off, len : javaInt) : void :-
    buffer += new String (b, off, len);
  def close () : void {}
  def flush () : void {}
  def toString () : String :- buffer;
  private:
  var buffer : String = "";
}

This class illustrates a number of features of the language:

  • Initializers are defined using def this, rather than using the class's name.

  • The types javaInt and javaByte are used for compatibility of the extended class's redefined methods.

  • String is used as the return type of "toString", because it extends the String-returning method of type Object.

  • The default scope of a class's properties is public, and a group of properties (only "buffer" in this example) can be prefixed by a single statement of scope.

Parameterized Classes

The "generic type" or "parameterized type" feature of many current languages is very useful in the definition and use of classes whose subcomponents can be of many different types. For example, the following is a useful way of defining a variable-sized list of string values, that can have values added to it or removed from it:

val stringList : ArrayList<string>;

As well, the very useful Iterable and Iterator types can be used with a specified type that is returned when using those types. There's an iterator example later that illustrates this feature.

Parameterized classes (interfaces and enums) are defined using type parameters following the name of the class, interface or enum, as in this class that defines a very general implementation of value pairs:

class Pair<H,T> {
  val head : H;
  val tail : T;
  def this (this.head, this.tail);
  def toString () : String :- "Pair(%s,%s)" (head, tail);
}

Used like:

var myPair : Pair<string,Integer>;
myPair = new Pair<string,Integer> ("third", 3);

This simple, class-level parameterized type syntax is supported, but there's no corresponding support for parameterizing methods and the "super-type" relation.

Synchronous Pipes and Streams

Bobbee supports synchronous threads -- which are called "coroutines" in other languages. Synchronous threads run in parallel as do other threads, but are implemented so that only one is running at any one time. This means that no synchronization is required for one thread to use properties of another such thread. There are two kinds of synchronous threads: object-passing pipes and text streams.

Synchronous pipes and text streams are useful when doing context-sensitive data processing, where one's position in the original input is significant for down-stream processing. Object-passing pipes are useful for things such as parsing and processing markup languages and other data encodings in parallel. Text streams are useful where what is passed is a text stream.

Pipes

An object-passing synchronous pipe passes objects of a particular type from one thread to another, pausing the sending pipe until the receiving pipe has used the passed value and requires another value, and pausing the receiving pipe when it requires another value until the sending pipe has a value to send to it. A single SynchronousPipe is created to communicate between two processes:

val pipe = new SynchronousPipe<string> ();

The pipe is initialized to wait for something to be written ("put") to it by one process:

pipe.put ("Start");

Once a value is written, the writing process is suspended until another process retrieves the value. That other process can wait for something to be written to the pipe, and get the passed value when it is written:

val nextValue : string = pipe.get ();

This reading process continues until it does another "get", at which time the reading process is suspended and the writing process is resumed so that it can produce another value.
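The closest standard Java analog to this hand-off is java.util.concurrent.SynchronousQueue, where each put waits for a matching take. The sketch below is only an analogy: Java's threads are preemptive, so it does not give Bobbee's guarantee that only one side runs at a time, and the class name HandoffDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

class HandoffDemo {
    /** Passes each value from a writer thread to the calling (reader) thread. */
    static List<String> transfer(List<String> values) {
        SynchronousQueue<String> pipe = new SynchronousQueue<>();
        Thread writer = new Thread(() -> {
            try {
                for (String v : values) {
                    pipe.put(v); // blocks until the reader takes v
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < values.size(); i++) {
                received.add(pipe.take()); // blocks until the writer has put a value
            }
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            writer.interrupt();
        }
        return received;
    }
}
```

The values arrive in the order they were put, one hand-off at a time, which is the ordering the synchronous pipe provides without any need for real parallelism.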

Text Streams

A variation on the synchronous pipe is the SynchronousStream, which instead of passing objects between synchronous threads, passes a stream of text. Text-communicating coroutines are created by calling one of two static methods of the SynchronousStream type, as in:

local System.out = outputTo (theOtherRegime);

or as in:

local System.in = inputFrom (theOtherRegime);

where "theOtherRegime" is a java.lang.Runnable class which is started out by setting its "standard output" to return text to the current thread (for "outputTo") or which is started out by setting its "standard in" to send text to the current thread (for "inputFrom"). In both cases, synchronization is kept by only allowing one of the two threads to be active at any one time.

Locally Scoped Names

A couple of features have been revived from 1960's languages.

The local prefix says that the value of the given name is to be restored on exit from the current local scope, no matter how it is exited:

local depth += 1;

means save away the current value of "depth", increment its value for use in the local scope, and restore its saved value at the end of the scope. Even if the local scope exits with a throw or an error, the restoring will happen. This reduces the corruption of data when scopes are exited in strange ways.

local can be used with qualified names as well. The following temporarily rebinds "System.out" but ensures it is restored for later use:

local System.out = new PrintStream ("myoutputfile.txt");
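In Java the same guarantee has to be written out by hand with try and finally. The following plain Java sketch shows the save, rebind and restore steps that "local" performs; the class name ScopedOutput and the method capture are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class ScopedOutput {
    /** Runs body with System.out rebound to a buffer, then restores System.out. */
    static String capture(Runnable body) {
        PrintStream saved = System.out; // save the current binding
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true)); // rebind for the local scope
        try {
            body.run();
        } finally {
            System.out.flush();
            System.setOut(saved); // restored even if body throws
        }
        return buffer.toString();
    }
}
```

Everything except the call to body.run () is bookkeeping that the single word "local" replaces.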

Selecting

select is the Bobbee language version of "switch" in other languages. It has a number of features beyond the selecting of numeric and string values. Two forms of select are of special interest. Pattern matching can be done using select with match parts rather than case parts:

def upperize (x : string) : string {
  var result : string = "";
  select x {
    match uLetter => "first" & (uLetter | ["'-"])* => "rest":
      result += upper => "first" + lower => "rest";
    match \uLetter+ => "other":
      result += => "other";
  }
  return result;
}

A value can be selected based on its type:

def show (x : Object) : string {
  select x {
    case (y : String) : return "String: \"%s\"" (y);
    case (y : Long) : return "Number: %d" (y);
    default: return "Other: %s" (x);
  }
}

The type selecting statement does three things of use:

  • it identifies the type of the argument,

  • it binds the argument value to a new name with the identified type, and

  • it supports identifying and binding to multiple types.
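For comparison, here is the "show" method written in plain Java using instanceof tests and casts, which work on older Java versions. The class name TypeSelectDemo is hypothetical. Each branch repeats the type twice, once to test and once to bind, which the type-selecting case does in a single step.

```java
import java.util.Locale;

class TypeSelectDemo {
    /** Describes a value according to its run-time type. */
    static String show(Object x) {
        if (x instanceof String) {
            String y = (String) x; // test, then bind with a cast
            return String.format(Locale.ROOT, "String: \"%s\"", y);
        } else if (x instanceof Long) {
            Long y = (Long) x;
            return String.format(Locale.ROOT, "Number: %d", y);
        }
        return String.format(Locale.ROOT, "Other: %s", x);
    }
}
```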

Iterators

One can define iterators in Bobbee as methods, rather than as classes as in Java. For example, the following method creates an Iterator that returns all the space separated words in a passed-in string:

def splitWords (sentence : string) : Iterator<string> {
  selectAll sentence {
    match [" \t\n"]* & \[" \t\n"]+ => "word":
      yield => "word";
  }
}

This example illustrates the use of the selectAll statement, which extends the select statement by looping over a string, array, Iterator or other collection, by selecting parts of the string or components of the collection on each iteration of the selectAll.

Iterators are used extensively in the implementation of the language's markup libraries, where the serial parsers are all iterators of their item types.
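To show what the method form saves, here is a sketch of the same word splitter written as a Java class implementing java.util.Iterator. It is plain Java, not Bobbee, and the class name WordIterator is hypothetical. The position that "yield" keeps implicitly has to be held in a field and advanced by hand.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class WordIterator implements Iterator<String> {
    private final String sentence;
    private int pos = 0; // explicit state that "yield" would keep for us

    WordIterator(String sentence) {
        this.sentence = sentence;
        skipSeparators();
    }

    private static boolean isSeparator(char c) {
        return c == ' ' || c == '\t' || c == '\n';
    }

    private void skipSeparators() {
        while (pos < sentence.length() && isSeparator(sentence.charAt(pos))) pos++;
    }

    @Override
    public boolean hasNext() {
        return pos < sentence.length();
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        int start = pos;
        while (pos < sentence.length() && !isSeparator(sentence.charAt(pos))) pos++;
        String word = sentence.substring(start, pos);
        skipSeparators(); // leave pos at the next word, or at the end
        return word;
    }
}
```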

Patchable Print Streams

Patchable print streams allow later-found information, such as chapter numbers, to be used earlier in an output document, even when using serial processing.

with pps = new PatchablePrintStream () do {
  with local System.out = pps do processDocument ();
  pps.emit ();
}

Within whatever "processDocument" does, the current print output is bound to the "patchable" stream. This means "marks" can be written and defined. The "pps.emit ()" writes out the result to the current output, which in the example is the "System.out" outside of its binding to the patchable print stream. "Marks" are written to the patchable stream using "writePrintMark". In the following example, a "<ref>" element's "id" attribute value is used as a mark:

choose xmlElement ("ref") {
  writePrintMark (* ["id"].value);
}

"Marks" are defined by assigning values to items in the PatchablePrintStream value. In the following example, a copy of the title text is bound to the mark value given by the chapter title's "id" attribute value, if it has one:

choose xmlElement ("title") when *.parent.is ("chapter") {
  with title = new ByteArrayOutputStream () do {
    print ("<H2>");
    with local System.out = new PrintStream (title) do
      processChildren;
    with id = *.parent ["id"] do
      if id != null then {
        print ("<A NAME=\"%s\"></A>" (id.value));
        pps [id] = "<A HREF=\"#%s\">%s</A>" (id.value, title);
      }
    println ("%s</H2>" (title));
  }
}
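The idea behind a patchable stream can be sketched in a few lines of plain Java: output is kept as a list of segments, some of which are placeholders, and the placeholders are filled in only when the result is emitted. This is an illustration of the technique, not the implementation of PatchablePrintStream. The class PatchableBuffer and its methods are hypothetical, as is the choice to emit nothing for a mark that was never defined.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PatchableBuffer {
    private final List<String> segments = new ArrayList<>(); // text or mark ids, in output order
    private final List<Boolean> isMark = new ArrayList<>();  // parallel flags for segments
    private final Map<String, String> definitions = new HashMap<>();

    /** Appends ordinary text. */
    void print(String text) {
        segments.add(text);
        isMark.add(false);
    }

    /** Appends a placeholder whose value may be defined later. */
    void writeMark(String id) {
        segments.add(id);
        isMark.add(true);
    }

    /** Supplies the text for a mark; may be called after the mark was written. */
    void define(String id, String value) {
        definitions.put(id, value);
    }

    /** Produces the final text with every mark replaced by its definition. */
    String emit() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            String s = segments.get(i);
            out.append(isMark.get(i) ? definitions.getOrDefault(s, "") : s);
        }
        return out.toString();
    }
}
```

A reference written early in the document, for example writeMark ("ch2"), picks up text defined later with define ("ch2", "Chapter 2"), because substitution is deferred until emit is called.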

Program Profiles

Program features are defined by one or more "profile" files. One of them, the "standard profile", defines the basic features of the language and is always used. Others are imported at the start of a program using the program directive, as in:

program xmlserial;

Profiles define defaults for:

  • what classes and interfaces are used,

  • what short-hand names are available,

  • the meaning of rules,

  • what operators are used, where they are defined, and what their precedences are,

  • what methods are available, and

  • the types of differently quoted strings.

The user can define their own profile files, and override the "standard profile". More than one profile can be declared for a program. For example, the following says that the program can use text patterns, serial XML processing, tree XML processing and tree JSON processing:

program textpatterns, xmlserial, xmltree, jsontree;

Defining XML Processing

As an example of a profile file, here is how XML serial processing is defined:

"Use XML serial profile.";

import bj.xml.*;

def choose xmlComment (XmlCommentItem) default {}

def choose xmlDataEntity (XmlDataEntityItem)
  choose (*.entity.name)
  default {System.err.println
             ("ERROR: No rule for entity \"%s\"!" (*.entity.toRef));}

def choose xmlDocument (XmlStartDocumentItem ... XmlEndDocumentItem)
  default {processChildren;}

def choose xmlDtd (XmlStartDtdItem ... XmlEndDtdItem) default {processChildren;}

def choose xmlElement (XmlStartTagItem ... XmlEndTagItem)
  choose (*.uri default null) : (*.localName)
  default {System.err.println
             ("ERROR: No rule for element \"%s\"!" (*.element.name));
           processChildren;}

def choose xmlError (XmlErrorItem)
  default {System.err.println ("ERROR: %s" (*.message));}

def choose xmlProcessingInstruction (XmlProcessingInstructionItem)
  choose (*.target) default {}

def choose xmlText (XmlTextItem) default {print (*.data);}

def choose xmlTextEntity (XmlStartTextEntityItem ... XmlEndTextEntityItem)
  choose (*.entity.name) default {processChildren;}

def parseXmlFileSerially (systemId : String, options : String = "") : void :-
  parseXmlSerially (XmlSerialParser.parseFile (systemId, options));

def parseXmlSerially (data : String, options : String = "") : void :-
  parseXmlSerially (XmlSerialParser.parse (data, options));

def parseXmlSerially (in : InputStream = System.in,
                      options : String = "") : void :-
  parseXmlSerially (XmlSerialParser.parse (in, options));

def parseXmlSerially (buffer : CharSequence, options : String = "") : void :-
  parseXmlSerially (XmlSerialParser.parse (buffer, options));

def parseXmlSerially (source : Readable, options : String = "") : void :-
  parseXmlSerially (XmlSerialParser.parse (source, options));

def parseXmlSerially (parser : XmlSerialParser) {
  val iterator : Iterator<XmlItem> = parser.iterator ();
  for item = iterator do
    $processAllChildren (item, iterator);
}

This definition for a serial XML parser consists of:

  • a text string value to be displayed by the compiler: "Use XML serial profile.",

  • the appropriate import directives, identifying where the markup language parser and other used facilities can be found (in this case the contents of the "bj.xml" library),

  • definitions of the various choose rules that are appropriate for processing the components of the markup language (xmlComment for comments found during serial XML parsing, etc.), and

  • definitions of the various methods that initiate parsing (parseXmlSerially and parseXmlFileSerially).

The choose definitions consist of:

  • the name of the defined rule (xmlComment etc.),

  • the name of the class or classes that this rule recognizes (XmlStartTagItem and XmlEndTagItem for elements, for example),

  • the property of that class that is the "name" recognized by the rule, using choose (the local name of an element, for example), and

  • the default behaviour for the rule, which can be to do nothing (for comments), to issue an error message (for an unhandled element), or to do something sensible (such as copying text to the output, for text).
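
The relationship between a rule's "name" property and its default behaviour can be modelled in a few lines of Java. This is only a sketch of the dispatch idea, not how the compiler implements choose rules; the class ChooseRuleSketch and its methods are hypothetical names of my own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical model of one choose rule family (for example, xmlElement).
final class ChooseRuleSketch {
  private final Map<String, Consumer<String>> rules = new HashMap<>();
  private final Consumer<String> defaultRule;

  // The default rule plays the part of the profile's "default {...}" clause.
  ChooseRuleSketch(Consumer<String> defaultRule) {
    this.defaultRule = defaultRule;
  }

  // Registers the action for items whose "name" property equals the given name.
  void choose(String name, Consumer<String> action) {
    rules.put(name, action);
  }

  // Runs the matching action, or the default when no rule was registered.
  void dispatch(String name) {
    rules.getOrDefault(name, defaultRule).accept(name);
  }

  static List<String> demo() {
    List<String> log = new ArrayList<>();
    ChooseRuleSketch xmlElement = new ChooseRuleSketch(
        name -> log.add("ERROR: No rule for element \"" + name + "\"!"));
    xmlElement.choose("title", name -> log.add("title rule"));
    for (String name : new String[] {"title", "para"})
      xmlElement.dispatch(name);
    return log;
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

In this model a profile amounts to a set of such rule families with their defaults already filled in, which a program then extends with its own rules.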

Conclusion

Programming languages for processing text and markup have developed slowly over the last six decades, and I expect that development to continue. I'm hoping that the Bobbee programming language will participate in that continued development, either being the next step forward, or providing ideas for the future.

As to where the Bobbee language is at the moment, I'm still in the process of debugging the implementation, refining its user documentation, and documenting its code, so that others can help with or take over maintaining it.

How it'll be distributed, and whether it'll be for sale or freely distributed, I don't yet know. But once the implementation and its documentation are completed, I'll be working on figuring that out.



[1] In both cases, the JSON support is provided by the conversion of JSON to XML, rather than direct support of JSON itself.

[2] I call it an "attempt", based on what English change-ringing performances are called prior to their completion: it is an "attempt" before it is successfully finished, and only called a "touch", "quarter-peal" or "peal" once ringing has been finished and what has been rung has conformed to what was intended. This language is an "attempt" prior to its completion and prior to being accepted by a significant number of users.

[3] BML is a new minimal markup language of my own design. It was developed as an aid in debugging the language's markup language support, but it is interesting with respect to how small a fully functional markup language can be.

Author's keywords for this paper:
Markup Language Implementation; XML; SGML; MicroXML; JSON; BML

Sam Wilmott

Sam Wilmott designed his first programming language in the winter of 1967-1968 and was using early non-standardized markup languages in the late 1960's. Since then he has led the development of typesetting/text-formatting systems for the Canadian Government Printing Office (in the 1970's) and for a major real-estate company (in the 1980's), implemented one of the first SGML parsers (which was also the first pull-model markup parser), and is the originator of the OmniMark programming language (in the early 1990's), with its strong support of SGML, XML, and text transformation.

After leaving OmniMark, Sam worked in the XSLT world: he contributed to the implementation of an XSLT compiler and worked as an XSLT programmer and analyst (in the early 2000's). Currently he is largely retired, happily married, does voluntary work locally and walks a little dog every day, but in spite of his advancing age, he is nonetheless working on new programming language ideas for markup language and text processing.