How to cite this paper
Kalvesmaki, Joel. “A New \u: Extending XPath Regular Expressions for Unicode.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kalvesmaki01.
Balisage Paper: A New \u: Extending XPath Regular Expressions for Unicode
Joel Kalvesmaki
Founder and director of the Text Alignment Network (TAN), Joel
Kalvesmaki is an XML developer for the Government Publishing Office and a
scholar in early Christian studies. Those two worlds intersect in TAN and the
Guide to Evagrius
Ponticus, an XML-driven online reference work on the
fourth-century monk-theologian.
Copyright © Joel Kalvesmaki, Creative Commons Attribution 4.0 International
Abstract
Regular expressions differ in their details from one programming language or environment to the next. The XPath flavor of regular expressions has unrivaled access to Unicode code blocks and character classes. But why stop there? In this paper I present a small XSLT function library that extends the XPath functions fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string() to permit new ways to build classes of Unicode characters, by means of their names and decomposition relations.
Table of Contents
- XPath Functions, Regular Expressions, and Unicode
- Reimagining Regular Expressions for Unicode
- Bringing \u Back
- Testing \u
- Caveats
- What To Do with \u
XPath Functions, Regular Expressions, and Unicode
In XPath and XQuery Functions 3.1, four functions depend upon regular expressions: fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string(). Their regular expressions are defined on the basis of XML Schema Part 2: Datatypes Second Edition (herein XS2), which has been extended to include start- and end-of-string matches, reluctant quantifiers, and back-references. To build classes of Unicode characters one uses \p{} or its converse \P{}, explained in XS2's Appendix F, Character Classes. The curly brackets for \p take two types of construction, both illustrated in the sketch after the list below:
- Categories: A capital letter ([LMNPZSC]) specifying a general Unicode category, perhaps followed by a lowercase letter to specify a subcategory. This feature is very handy for finding letters (\p{L}), private use area characters (\p{Co}), digits from any system of numeration (\p{N}), or the inverse of these categories (by replacing \p with \P).
- Blocks: "Is" followed by a string ([a-zA-Z0-9-]+) that corresponds to the name of a block of Unicode characters. This feature is very useful for finding all Arabic characters (\p{IsArabic}), all arrows (\p{IsArrows}), general punctuation (\p{IsGeneralPunctuation}), or the inverse of these categories (by replacing \p with \P).
In most other programming languages, regular expressions do not support \p{}, or if they do, they are based on relatively simple POSIX character classes, which are restricted to a limited set of key terms (e.g., Lower, ASCII, Alnum, XDigit).
Some flavors of regular expressions access Unicode characters via \u. JavaScript and Python, for example, allow \uFFFF, where FFFF is a single hexadecimal number identifying a codepoint. Perl uses a slightly different syntax: \x{FFFF}.
XPath does not have \u, but it doesn't need it, at least not as used in other programming languages. The entity, e.g., &#x7b;, is a sufficient replacement for \u. And it's better. The entity need not be padded with zeros, and can be more than four digits. That is, it can access characters beyond the Basic Multilingual Plane, U+10000 and up. Entities can be marshalled to define a range using the hyphen (e.g., [&#x6a;-&#xb1;]).
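For example, this test uses entities to build a class covering the Greek and Coptic block, U+0370..U+03FF (a minimal sketch; any range could be substituted):

<xsl:if test="matches(., '[&#x370;-&#x3FF;]')">
   <!-- matches any character in the Greek and Coptic block; the entities
        play the role that \u0370-\u03FF plays in other languages -->
</xsl:if>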
In sum, as XML developers, we have unparalleled access to Unicode characters in our
regular expressions. And we can expand on that excellence. In this article I introduce
TAN-regex, an XSLT-based
library of XPath functions that extend regular expressions to capture Unicode characters
based upon their name and their relationships to composite and base characters.
Reimagining Regular Expressions for Unicode
The characters that make up the Unicode standard are a motley bunch. We who delve
into
its darker corners probably have our favorite bêtes noires. As a scholar who works
with
ancient Greek, I find the Greek and Coptic blocks to be the most visible witness to
Unicode's choppy progress. Sets of characters from both languages spawned new, dedicated
blocks (e.g., Coptic U+2C80..2CFF, Greek Extended U+1F00..1FFF), and Greek characters
as
individuals or small groups have popped up here and there, in assorted blocks. Although
the general idea has been to keep blocks consistent and complete, that ideal is not
often realized. It would be nice to access characters that naturally group with one
another but straddle Unicode blocks. Desideratum one.
The Supplemental Punctuation block (U+2E00..U+2E7F) has a number of characters I must
access regularly, to process ancient and medieval inscriptions, e.g., ⹄ U+2E44 DOUBLE
SUSPENSION MARK (in Unicode Notational
Conventions, ranges are expressed with two dots and the name is rendered in
small capitals). Although that character is in Unicode because of a proposal I wrote,
I
regularly forget the hexadecimal number, and must look it up. When working with
characters outside the ASCII block we customarily have at hand supplementary tools.
Within oXygen XML editor, the Character Map
is quite valuable. For general use, I personally prefer BabelMap
(Windows only) and Richard Ishida's Unicode code converter,
the former to find and copy characters and the latter to analyze them. The recently
redesigned home page for Unicode is also quite
useful. You will no doubt think of other tools you like. Those tools are essential,
but
they can also be an inconvenient departure from the algorithm being constructed. It
would be nice to get those characters in a human-friendly way while staying mentally
within my XSLT code. Desideratum two.
Sometimes I am looking not for a single character but for all permutations of a letter. That is, if I am searching a text for every variation of b, I would like to
build a character class for any character that according to the Unicode database has
a b
as a component (in addition to b itself, there are 20 such characters, from U+1D47
MODIFIER LETTER SMALL B to U+1D68B MATHEMATICAL MONOSPACE SMALL B). In this case,
auxiliary tools are of limited use, requiring ad hoc browsing and patchwork results.
Unicode decomposition (see Unicode Standard Annex 15) via fn:normalize-unicode(*, 'NFKD') is no help here, because that only gets you from a precomposed character to its components. I am interested in the reverse of the process. Desideratum three.
So, despite XPath's deep engagement with Unicode, there remain three key obstacles to building classes of Unicode characters. Many Unicode character classes I wish to build do not map onto either a code block or a Unicode property—the two types of access provided by \p{}. Those constructors cover either too much or too little. Writing a regular expression based on hexadecimal entities can be cumbersome and haphazard, requiring correct use of external tools. Reading it can be equally challenging. And going from a character to its composites can be tricky.
Most of the Unicode character classes I build are united by some logic. In some cases,
I know I could build a character class based upon words in the names of the individual
Unicode characters. So I began to wonder, couldn't I simply use the Unicode name DOUBLE
SUSPENSION MARK, and not worry about remembering the hexadecimal value of the codepoint?
Or if I wanted all suspension marks, not just mine, couldn't I just write "SUSPENSION
MARK"? Doing so would make reading and writing a regular expression much easier. And
it
seems consistent with current conventions. After all, I can already invoke the name
of a
Unicode block in my regular expression. Why not also the name of a character, equally
immutable?
The proposition might sound risky. Yes, Unicode names are unique and stable, but there
are characters that for all intents and purposes are misnamed, so to use a name runs
the
risk of getting characters you did not want and failing to get those you did.
We already run that risk. We face it each time we use \p{}
or even
\w
(word characters), whose results can sometimes surprise or annoy.
Unicode nomenclature and classification can run against our druthers. If
\p{}
were extended to Unicode character names, we would need simply to
extend the caution we already must exercise. For example, if I am looking for the
medieval/late antique Greek numeral 6, the ϛ (U+03DB), I cannot use "episemon," the
oldest name for this character (ἐπίσημον, attested 2nd c. CE by Clement of Alexandria).
Instead, I need to familiarize myself with, and use, the official name, GREEK SMALL
LETTER STIGMA, regardless of history (the earliest appearance I have found for "stigma"
dates to an 18th century manuscript).
I also realized that we regularly remember keywords, but not necessarily their order within a name. If I wish to cite the name for ỗ (U+1ED7), is it LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE or ...TILDE AND CIRCUMFLEX? It would be nice if one did not have to know. A Unicode name starts with the centermost components, but that principle helps only slightly, because there's no reason why I should presume to know which component is drawn closer to the center, or that Unicode decisions have always been consistent. Why not build a class constructor simply through select keywords in the name?
That is, I propose to break any character's name into individual words, treating each one like a property, much like space-delimited values of @class in HTML elements. If you are familiar with HTML conventions, you might immediately see the upside to tagging Unicode characters like this:
. . . . . . .
<div id="x1ed5" class="above and circumflex hook latin letter o small with">ổ</div>
<div id="x1ed6" class="and capital circumflex latin letter o tilde with">Ỗ</div>
<div id="x1ed7" class="and circumflex latin letter o small tilde with">ỗ</div>
<div id="x1ed8" class="and below capital circumflex dot latin letter o with">Ộ</div>
. . . . . . .
In each @class, words in the character name have been intentionally set lowercase and alphabetized, to show that, for our purposes, order and capitalization may be treated arbitrarily. This name signature, i.e., the character's name parts alphabetized and space-joined, is not necessarily unique, and should not be treated as an identifier. See section “Caveats”.
If we wanted to select the ỗ in the example above, the third <div>, to style it in a certain way, in our CSS stylesheet we could simply write .o.circumflex.tilde.small. Because only one codepoint has those four words in its name, we do not need to cite all eight words (but we could if we wanted). From there we can expand the class as needed. If we wanted to include the uppercase version, we could simply drop the word "small": .o.circumflex.tilde, which matches exactly two characters (as of Unicode version 13.0). Dropping other words increases the size of the set.
The dot-notation approach used in CSS + HTML classes can then be leveraged to build a wide variety of regular expression classes based on Unicode character names. Pure dot notation might create a class that is too large for some purposes, so the syntax should provide a way to exclude classes. For example, we might want all letter U's with diaereses, but not those with a caron (ˇ), i.e., drop Ǚ and ǚ. The exclamation mark to mean "not" has precedent (albeit not in CSS selectors), and seems intuitive as a mark of exclusion; for the previous example, we would write something like .u.diaeresis!caron.
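Anticipating the \u{} escape introduced below, that constructor would be used like this (a sketch; the expected results follow from the Unicode names of the two characters):

rgx:matches('Ü', '\u{.u.diaeresis!caron}')   (: true: DIAERESIS is in the name of U+00DC, CARON is not :)
rgx:matches('Ǚ', '\u{.u.diaeresis!caron}')   (: false: the name of U+01D9 also contains CARON :)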
A name-based approach to classes of Unicode characters opens up interesting, new possibilities. One can use .combining to find combining characters. One can use .latin to find a close approximation to all Latin characters, or .greek to find all Greek ones. Using .with gets all Unicode characters that have a "withness" property, i.e., characters that are composed of more than one element (whether or not Unicode decomposition is defined). Similarly, .with.and points to characters that have at least three components (e.g., ᵳ U+1D73 LATIN SMALL LETTER R WITH FISHHOOK AND MIDDLE TILDE), whereas .with!and points to those that have only two components (e.g., À U+00C0 LATIN CAPITAL LETTER A WITH GRAVE).
Dot and exclamation-mark selectors have quite a bit of potential, but they are not useful for an important desideratum I had set out at the beginning of this section, namely, the creation of character classes based upon the relationship of composite and component characters. Let us suppose, for example, I want to build the Unicode class of variants on the Latin letter b. If I use .b as described above I capture 290 characters, including many that are not directly related to the Latin letter. Perhaps that's fine for some situations, but in others, I am looking for a much smaller class, namely the twenty decomposable variations of b, according to the Unicode database.
For such cases, we can adopt a different type of notation, with a + signifying that the string that follows should be expanded to all composites. That is, +b would expand to bᵇḃḅḇ⒝ⓑ㍴㏔㏝b𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋 (in Unicode version 13.0). +bB would expand to include both upper- and lowercase results.
A kind of reversal could be implemented with a similar syntax, i.e., a minus instead of a plus, so that, for example, -ḃãäḅẫậ would return simply baabaa. Such a transformation is not as pressing a need as the other cases, but if we are going to the trouble of building composites, one might as well provide a similar way to reverse course.
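A sketch of how the two constructors look in the function library introduced in the next section (the expansions follow the examples just given; the bracketed output is my expectation, not verified library output):

rgx:matches('ᵇ', '\u{+b}')   (: true: U+1D47 MODIFIER LETTER SMALL B has b as a component :)
rgx:regex('\u{-ḃãäḅẫậ}')     (: should yield the character class [baabaa] :)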
Bringing \u Back
Much of this re-imagination took place in the course of developing the function
library of the Text Alignment Network (TAN, http://textalign.net), a suite of XML formats intended to make Text Encoding
Initiative (TEI) files more semantically and syntactically interoperable. I soon
realized that my tinkering with regular expressions could have very broad, practical
applications, relevant to those who might not care much about TEI or TAN. So I isolated
this part of the TAN function library as a separate package or module, TAN-regex, to support
quick, easy imports or includes by projects that did not want to fetch the entire
TAN
function library.
The namespace of TAN-regex is identical to the TAN namespace, tag:textalign.net,2015:ns (a tag URN), but tethered to the prefix rgx:. (You can adopt whatever prefix you like in your host application.)
I had considered the idea of incorporating the new syntax directly into the escape class \p{}. Although this idea had merits, I decided against it, mainly because I wanted to compel anyone writing or reading the code to understand that this was a clear departure from the core specifications. I also did not want to try to support the negated class builder, \P{}. So I opted for \u{}. It was nice to have \u back.
The primary goal of the small XSLT library TAN-regex was to write versions of fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string() that supported \u. The challenge could be reduced to ensuring that any instance of \u{} in the standard parameter $pattern was replaced with a string for the intended character class, padded by [ and ] if not embedded as part of a character class.
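For example (a sketch; the exact membership of the class depends on the Unicode version in play):

rgx:replace('a⹄b', '\u{.suspension.mark}', '|')
(: the escape expands to a bracketed class of every character whose name
   contains both SUSPENSION and MARK, and the result is handed to the
   standard fn:replace(); I would expect 'a|b' :)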
The master data for Unicode characters, including their names, is the Unicode Character Database, a set of tables in plain text, e.g., https://unicode.org/Public/13.0.0/ucd/, upon which code charts and related resources (e.g., Common Locale Data Repository) depend. This master data is also converted to an XML format, e.g., https://www.unicode.org/Public/13.0.0/ucdxml/. For name-word constructors, I opted to use the version that excludes the Unihan characters, since their names (all numbered) would not be useful objects of query. The TAN-regex stylesheet ucd/ucd-names.xsl converts a given version of the XML version of the Unicode Character Database to a simple catalog of <char>s with name words tokenized, lowercased, and placed in <n>s, with results saved in the subdirectory ucd at, e.g., ucd-names.13.0.xml. Creating such a file is quite fast, a couple of seconds.
The decomposition process cycles through the XML database that includes the Unihan characters, to ensure complete decomposition. The TAN-regex stylesheet ucd/ucd-decomp.xsl converts the UCD database to two different forms. One type of output, e.g., ucd-decomp-simple.13.0.xml, is slim, and features a pair of elements, <mapString> and <transString>, with text nodes of identical length. They provide a simple one-for-one translation for those precomposed characters that can be resolved to a single base character. The other output file, e.g., ucd-decomp.13.0.xml, is a collection of <char>s with a child <b> for each base component. For both types of output, decomposition must be performed against the Unicode database recursively, because some characters are defined as decomposing to characters that themselves admit decomposition. The iterative function requires at least four passes through the UCD database to ensure a complete inventory of atomic components. Therefore, running ucd-decomp.xsl takes a couple of minutes.
In the end the TAN-regex subdirectory ucd is about fifty megabytes, populated as it is with optimized data from Unicode versions 5.1 through 13.0 (at present). Supporting each Unicode version allows users to create regular expressions based upon a particular Unicode version, should that be desired.
To access the function library simply include or import TAN-regex.xsl, the only XSLT file of note at the root of the project. (But don't forget to also get a copy of the subdirectory ucd.) The functions do not depend upon templates, so the library can be used via <xsl:import> or <xsl:include> equally, your choice.
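A minimal host stylesheet might look like this (a sketch; the relative href assumes TAN-regex.xsl sits next to your stylesheet):

<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:rgx="tag:textalign.net,2015:ns">
   <xsl:import href="TAN-regex.xsl"/>
   <!-- rgx:matches(), rgx:replace(), rgx:tokenize(), and
        rgx:analyze-string() are now available -->
</xsl:stylesheet>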
Most users will care only about the functions rgx:matches(), rgx:replace(), rgx:tokenize(), and rgx:analyze-string(). But those shadow functions rely upon component functions that will be helpful for developers.
Each one relies directly upon rgx:regex(). If that function detects the new escape class, \u{}, it will invoke rgx:parse-regex(), which takes as parameters a regular expression and a Unicode version number and returns an XML tree fragment whose string value is a suitable substitution for $pattern.
The value within the curly brackets of any \u{} is interpreted by rgx:process-regex-escape-u(), which also requires a Unicode version. The curly brackets allow multiple items, space-delimited. Each item is checked. If the item matches a hexadecimal number (perhaps two of them separated by a hyphen), it is converted to the corresponding codepoint.
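For example (a sketch of the two hexadecimal forms; the expansions are what I would expect, not verified library output):

rgx:regex('\u{2E44}')                (: a class holding the single character U+2E44 :)
rgx:regex('\u{370-3FF 1F00-1FFF}')   (: two space-delimited ranges: the Greek and Coptic block plus Greek Extended :)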
If an item starts with +, the output of rgx:string-to-composites() is returned. That function takes a string, breaks it into characters, and for each character returns a string that concatenates all characters that use the input character as a component.
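Used on its own, per the +b expansion quoted earlier (Unicode version 13.0):

rgx:string-to-composites('b')   (: returns 'bᵇḃḅḇ⒝ⓑ㍴㏔㏝b𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋' :)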
If an item starts with -, the process invokes rgx:string-base(), a function that performs limited decomposition of Unicode characters. The input is passed along with a Unicode version through fn:translate(), which takes the relevant version of ucd-decomp-simple.*.*.xml to convert decomposable characters that can be reduced to one major base character. If there is no such one-to-one correspondence, the original character is returned. rgx:string-base() is similar to fn:normalize-unicode(., 'NFKD'), except that all component parts that are not the sole base letter are discarded. It is actually closer in spirit to fn:lower-case() and fn:upper-case() in that the length of the input string is always preserved, keeping intact any characters that cannot be so reduced.
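For example (a sketch, following the -ḃãäḅẫậ example above; the ¼ behavior follows from the one-to-one rule just described):

rgx:string-base('ḃãäḅẫậ')   (: returns 'baabaa': each composite reduces to a single base letter :)
rgx:string-base('¼x')       (: returns '¼x': ¼ has more than one base character, so it is kept intact :)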
If an item starts with . or !, it is treated as a name query, and rgx:get-chars-by-name() returns matching characters, treating a string prefixed by . as a word that must appear in a character name, and one prefixed by ! as a word that must not appear. Name matching is not case-sensitive. This function returns fragments from the Unicode names database, for example:
<char cp="0029" val=")">
<na>
<n>right</n>
<n>parenthesis</n>
</na>
</char>
Each <n> can be capitalized and string-joined to render the character name in the customary fashion. Perhaps an even more convenient way to get such fragments is with the key get-chars-by-name, e.g., key('get-chars-by-name', ('parenthesis'), $default-ucd-names-db). You may then filter and sort the results as you like.
rgx:parse-regex() takes the results from rgx:process-regex-escape-u() and pads the output string in square brackets if the original \u{} is not within the context of a character class; if it is, the string is returned unchanged.
TAN-regex comes with a few other related functions that could be useful in certain contexts. The functions that convert hexadecimal numbers to decimal and vice versa are generalized, to allow conversions to and from bases 2 through 16 and 64 (rgx:dec-to-n() and rgx:n-to-dec()).
The function rgx:string-to-components(), the inverse of rgx:string-to-composites(), takes an input string and returns a sequence of strings. It chops the input into characters, and for each character returns its component characters. If the character does not decompose, the character itself is returned.
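For example (a sketch; the ¼ expansion matches the database fragment shown below):

rgx:string-to-components('¼b')   (: returns the sequence ('1⁄4', 'b') :)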
rgx:string-base() and rgx:string-to-components() are two quick ways to handle decomposition. They rely upon a decomposition database provided by rgx:get-ucd-decomp-db(), whose tree can be used to build your own functions. For example, you could apply to the decomposition database the XPath expression /*/char[b[1]/@gc eq 'Nd'][b[2]/@gc eq 'Sm'], which matches the twenty characters that decompose into first a numeral and second a symbol, such as ¼. A sample tree fragment:
<char cp="00BC" val="¼">
<b gc="Nd">1</b>
<b gc="Sm">⁄</b>
<b gc="Nd">4</b>
</char>
rgx:string-to-components() is for all intents and purposes the same as for $i in fn:string-to-codepoints($string) return fn:normalize-unicode(fn:codepoints-to-string($i), 'NFKD'), i.e., a sequence of strings that correspond one-to-one to each character in the input string. When concatenated, the output of rgx:string-to-components() should be identical to fn:normalize-unicode($string, 'NFKD'). The sequence form of output in rgx:string-to-components() might be useful in cases where a developer wishes to intercept the decomposing normalization process.
But rgx:string-base() is importantly different. The length of the output always matches the length of the input string, and substitutions are made only if a composite can be replaced by a single distinct base character. It would be comparable to fn:substring(fn:normalize-unicode(., 'NFKD'), 1, 1) if every composite Unicode character were made of one base character followed by zero or more non-base characters. But many composite Unicode characters do not fit this model. Some have more than one base character (e.g., ⅐ U+2150 VULGAR FRACTION ONE SEVENTH) and others begin with a non-base character (e.g., ำ U+0E33 THAI CHARACTER SARA AM, ⒜ U+249C PARENTHESIZED LATIN SMALL LETTER A). The purpose of rgx:string-base() is not to imitate the decomposition process, but to provide a type of normalization comparable to fn:lower-case() and fn:upper-case(), for relaxed string comparisons. The escape class \u{-} is but one beneficiary; the function is also useful in contexts where two strings need to be relaxed before being compared.
All the above functions can be run against any version of Unicode from 5.1 onward. If no version is supplied, the most recent version of Unicode will be used (currently 13.0). If you are writing a regular expression that requires a specific version of Unicode, put the version number in the $flags parameter, along with any other flags, e.g., rgx:tokenize($my-string, '\u{+b}', '13.0i').
Testing \u
TAN-regex includes a subdirectory, tests, which has a stylesheet test.xsl to produce ad hoc results from the functions. The subdirectory also includes a battery of XSpec tests, tan-regex.spec. All XSpec tests are currently successful.
Experiments run with TAN-regex based on Unicode version 13.0 produced some surprising results. The observations below are documented in test.xsl.output.xml.
As might be expected, none of the 43,026 characters that matched !combining (i.e., characters that do not have the word COMBINING in their name) also match the category for combining marks, \p{M}. You might expect the reverse to be true, that the inverse category .combining and \p{M} would result in coterminous sets. But only 330 of the 462 characters that matched .combining also matched \p{M}. After some diagnosis, it turned out that the processor, likely because of the underlying Java version (1.8, build 25.261), did not recognize the other 132 characters, and classified them as not assigned, \p{Cn}.
Of the 1,157 characters matching .symbol, 213 do not match the symbol category, \p{S}. This is not a bug or anomaly. It simply shows that there are many Unicode characters that have "SYMBOL" in their name but are not classified as symbols. For example, ϕ U+03D5 GREEK PHI SYMBOL is classified as a lowercase letter, \p{Ll}. So the constructor \u{.symbol} usefully allows us to construct a class of Unicode characters that people might treat as symbols, irrespective of their Unicode general category.
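The phi example can be checked directly (a sketch):

fn:matches('ϕ', '\p{S}')          (: false: U+03D5 carries the general category Ll, lowercase letter :)
rgx:matches('ϕ', '\u{.symbol}')   (: true: SYMBOL appears in the character's name :)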
There are 946 characters matching .digit and .numeral. Of these, 297 do not have the number property, \p{N}. After weeding out those characters that were not classified by Saxon, 36 remain, such as ݳ U+0773 ARABIC LETTER ALEF WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE and ꣧ U+A8E7 COMBINING DEVANAGARI DIGIT SEVEN. Those examples show that some Unicode characters have secondary qualities that are communicated only through the name. TAN-regex's \u provides a unique way to query and fetch such secondary characteristics. This quality should be seen as complementing (and not replacing) the already powerful method of accessing Unicode characters through their general category properties (e.g., \p{L} for letters).
Caveats
TAN-regex functions based on \u{} may penalize some applications if not properly deployed.
Consider an XML file with 125,000 leaf nodes (an XML document with fifty elements on level 2, each with fifty elements, each with another fifty), each with some text, and an XSLT stylesheet that checks for a match on each leaf. When the leaf template uses something simple such as fn:matches(., 'A'), the process on a Dell Inspiron 5570 (Intel Core i5-8250U, 1.6 GHz, with 4 physical and 8 logical cores) takes 0.5 seconds. Using rgx:matches(., 'A') takes 1.1 seconds, perhaps an acceptable increase. When \u is introduced, this increases somewhat: rgx:matches(., '\u{.circle}') takes 2.7 seconds, provided the processor supports @cache on XSLT 3.0 functions (Saxon PE and EE do so, but Saxon HE does not).
When working with a processor that does not support cached functions, rgx:matches(., '\u{.circle}') takes 6,359.3 seconds (one hour forty-six minutes), because the value of .circle is calculated time and again. The solution in such a situation is to tether \u{.circle} to a global variable, so that it is calculated only once. To do this, first define a global variable:
<xsl:variable name="regex-circle" select="rgx:regex('\u{.circle}')"/>
Then invoke that global variable as needed. For example, rgx:matches(., concat($regex-circle, '\s+', $regex-circle)) finds any two characters with "circle" in their Unicode name, separated by one or more space characters. (For further efficiency, you might bind the composite value of the second parameter to a global variable.)
When the process on the 125K-leaf-node file is shifted to the global variable approach, fn:matches(., $regex-circle) takes 2.0 seconds. The corresponding rgx:matches(., $regex-circle) takes 2.8 seconds.
Even if you are using a processor that handles cached XSLT 3.0 functions, you will find it useful to build global variables with rgx:regex(), to be invoked mnemonically where you like. For example:
<xsl:variable name="class-of-chars-with-symbol-in-name"
select="rgx:regex('\u{.symbol}')"/>
<xsl:function name="my:strip-symbols" as="xs:string?">
. . . . .
<xsl:sequence select="rgx:replace($input, $class-of-chars-with-symbol-in-name, '')"/>
. . . . .
</xsl:function>
Keep in mind that rgx:regex() will convert \u{} to a character class, framed by square brackets. If you want only a string of characters, use, e.g., rgx:process-regex-escape-u(), whose results permit further processing as desired.
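For example (a sketch; I show the call with an explicit Unicode version, as described above, and the bracketed output is abbreviated):

rgx:regex('\u{+b}')                        (: returns a bracketed class, e.g. '[bᵇḃ…𝚋]' :)
rgx:process-regex-escape-u('+b', '13.0')   (: returns only the bare string of characters :)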
It might be objected that composition and decomposition via + and - are unnecessary. Rather than building a class of composites, one could simply first pass the input through fn:normalize-unicode($input, 'NFKD'), to convert it to component parts, then search accordingly. But that approach works in only some cases. If you are looking for a sequence of characters, you must anticipate an unknown number of extra characters, many but not all of them combining characters. Take for example the input "ẵbcẚ⒝c". When filtered through fn:normalize-unicode() the string expands to length eleven (U+0061 U+0306 U+0303 U+0062 U+0063 U+0061 U+02BE U+0028 U+0062 U+0029 U+0063). If you are searching for "abc," which you expect to match twice, you cannot use as your regular expression 'abc'; you must use something like fn:matches(fn:normalize-unicode('ẵbcẚ⒝c', 'NFKD'), 'a[\p{M}\p{Pe}\p{Lm}]*\p{Ps}?b[\p{M}\p{Pe}]*c'), and hope that you have correctly built the classes of ignorable characters that might follow an a or b. The preceding regular expression anticipates the possibility of encountering ẚ U+1E9A, ⒜ U+249C, or ⒝ U+249D, but it might result in false positives, such as a match on this input string: "a]{b)c". Constructing an airtight regular expression under this technique might be impossible. For any two strings that have identical NFKD normalization forms, e.g., "⒜⒝⒞" and "(a)(b)(c)", your regular expression will match either both or neither, which you might not want. Even if you are not so picky, writing a strong regular expression under this method can become quite time-consuming and result in unreadable code. The TAN-regex equivalent, rgx:matches(., '\u{+a}\u{+b}\u{+c}'), is faster to write, easier to read, and probably more accurate.
A close approximation of decomposition (-) is already available to us via XPath expressions. For example, \u{-ḃ} is merely another way of saying concat('[', fn:substring(fn:normalize-unicode('ḃ', 'NFKD'), 1, 1), ']'). That works for this simple example, but many times, as explained above in the discussion of rgx:string-base(), the normalized string might bring unwanted surprises.
Not every character has a unique name signature (i.e., the words in its name alphabetized and joined by spaces). About 0.8% of Unicode characters have name signatures that duplicate the name signature of at least one other character (394 characters in 182 groups, as of Unicode version 13.0), e.g., ⫓ U+2AD3 SUBSET ABOVE SUPERSET and ⫔ U+2AD4 SUPERSET ABOVE SUBSET. A future version of TAN-regex may support name component order.
One other hazard to watch for is ambiguous name words. For example, "a" can mean either the letter a or the indefinite article. So .a!Latin captures not only А U+0410 CYRILLIC CAPITAL LETTER A but also ⊅ U+2285 NOT A SUPERSET OF. If you use \u{} you must still study the Unicode standard, particularly Character Properties: Name, section 4.8 of The Unicode Standard Core Specification.
What To Do with \u
To this point I have depicted TAN-regex and its component functions in broad strokes.
These are building blocks for other applications. I conclude with an example relevant
to
those of us who work with texts with numerous accents. I illustrate with polytonic
Greek, but the principle could be applied to other languages.
When processing ancient Greek texts, we frequently need to normalize the accents.
Greek has a number of accentuation rules, and it is common for context to demand that
an
acute ΄ accent be switched to grave `. But sometimes we need to switch back. If we
wish
to look a word up in a dictionary, the grave accent must be converted to its normal
acute version, e.g., ἀδελφὸς → ἀδελφός or ἂν → ἄν. (Note how the ΄ can be one of several
combining marks.) The problem is a challenge because there are dozens of Greek Unicode
characters with the acute and grave, in various precomposed configurations. Conversions
are possible and straightforward, but the most obvious solutions are verbose, and
time-consuming to build.
To accommodate the need to switch accents on a complex character, TAN-regex includes the function rgx:replace-by-char-name(), which shows how to combine and use the lower-level TAN-regex functions. The function rgx:replace-by-char-name() takes as input a string that should be changed (parameter 1), three sequences of strings (parameters 2-4), and an indication whether a replacement should be strict (parameter 5). A 6-arity version of the function also permits a Unicode version (parameter 6). The string sequences in the second through fourth parameters ($words-in-name-to-drop, $words-in-replacement-char-name, $words-not-in-replacement-char-name) are supposed to be keywords in Unicode character names. Changes are made to only those characters in the input string whose names have a word that matches the list in $words-in-name-to-drop. Those keywords are dropped from the input character's name and the search for names is conducted again, using the other two keyword parameters to filter the results. If any substitute characters are found, they are returned; otherwise the original character is returned.
In the case of the problem above, changing the grave to an acute, one can write rgx:replace-by-char-name('ἀδελφὸς ἂν ᾖ.', 'varia', 'oxia', (), true()). The input string ἀδελφὸς ἂν ᾖ. ("He should be a brother.") is processed letter by letter. Nothing happens unless a letter has a Unicode name with the word VARIA (= grave accent). So the only two characters that are affected are ὸ U+1F78 GREEK SMALL LETTER OMICRON WITH VARIA and ἂ U+1F02 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA. In each case "VARIA" is dropped and a search is made for Unicode characters with the rest of the name words, as long as they also include the word "OXIA" (= acute). Each of the two letters has a single replacement, i.e., U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA and U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA. The output is the desired change of the grave accents to acute: 'ἀδελφός ἄν ᾖ.'
Another type of normalization we often need to perform on ancient Greek is to drop from words any accents that result from enclitics. (An enclitic is a word whose accent shifts back to the previous word, similar to the way, when pronouncing the phrase "Codify it," the "it" prompts us to slightly emphasize "fy.") The result is that some Greek words have two accents instead of the customary one, e.g., ἄνθρωπός τις ("a certain human being"). Tokens need to be adjusted before looking them up in a lexicon or database, so we normally want to drop only the second accent and keep the first. This task can be cumbersome to do in XSLT because of the many codepoints that represent permutations of Greek vowels and their combining marks. Building a regular expression to capture double-accented Greek words is quite a chore. And changing the second accent requires a choose-when-test operation with a minimum of fourteen branches; probably more, depending upon the kinds of decisions being made.
Fortunately, such normalization can be applied in a relatively straightforward manner by using both rgx:regex() and rgx:replace-by-char-name():
<xsl:variable name="greek-pattern-for-accented-vowels"
select="rgx:regex('\u{.greek.tonos .greek.oxia .greek.varia .greek.perispomeni}')"/>
<xsl:variable name="greek-pattern-for-acute-vowels"
select="rgx:regex('\u{.greek.tonos .greek.oxia}')"/>
<!-- In the variable below the first word uses U+03CC GREEK SMALL LETTER OMICRON WITH TONOS,
the second word U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA. They look identical, but
the first is the preferred (normalized) form. -->
<xsl:variable name="greek-words-with-two-accents" select="'σῶσόν σῶσόν'"/>
<xsl:variable name="second-accent-dropped-from-greek" as="xs:string*">
<xsl:analyze-string select="$greek-words-with-two-accents"
regex="({$greek-pattern-for-accented-vowels}\S*)({$greek-pattern-for-acute-vowels})">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
<xsl:value-of
select="rgx:replace-by-char-name(regex-group(2),
('oxia', 'tonos', 'with'), (), (), true())"
/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:value-of select="string-join($second-accent-dropped-from-greek, '')"/>
<!-- The above results in 'σῶσον σῶσον' -->
In the code above, σῶσόν (save) returns σῶσον. ἐλέησόν με (have mercy on me) returns
ἐλέησον με.
The processes described above could be replicated, more efficiently, with traditional decomposition, replacement on the parts, and then normalization. But that is only because my examples have involved decomposable Greek letters. It gets much trickier when there are no components, such as switching something that is left-oriented into its equivalent right-oriented character. For example, to switch upward-pointing arrows into downward-pointing arrows:
<xsl:variable name="upwards-arrows" as="xs:string" select="'↑↥'"/>
<xsl:value-of select="rgx:replace-by-char-name($upwards-arrows, 'upwards', 'downwards', (), true())"/>
The result is ↓↧. This method could be used to develop applications that programmatically change chess pieces (U+2654..U+265F), recycling labels (U+2673..U+267A), domino tiles (U+1F030..U+1F09F), or playing cards (U+1F0A0..U+1F0FF). The potential for variations is endless.
TAN-regex is released under a GNU General Public License, to encourage others to change, adapt, and improve the code. Updates to the library will be made at https://github.com/textalign/TAN-regex.
Enjoy the new \u!