How to cite this paper
Kalvesmaki, Joel. “A New \u: Extending XPath Regular Expressions for Unicode.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kalvesmaki01.
Balisage Paper: A New \u: Extending XPath Regular Expressions for Unicode
Joel Kalvesmaki
Founder and director of the Text Alignment Network (TAN), Joel
Kalvesmaki is an XML developer for the Government Publishing Office and a
scholar in early Christian studies. Those two worlds intersect in TAN and the
Guide to Evagrius
Ponticus, an XML-driven online reference work on the
fourth-century monk-theologian.
Copyright © Joel Kalvesmaki, Creative Commons Attribution 4.0 International
Abstract
Regular expressions differ in their details from one programming language or environment to the next. The XPath flavor of regular expressions has unrivaled access to Unicode code blocks and character classes. But why stop there? In this paper I present a small XSLT function library that extends the XPath functions fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string() to permit new ways to build classes of Unicode characters, by means of their names and decomposition relations.
Table of Contents
- XPath Functions, Regular Expressions, and Unicode
- Reimagining Regular Expressions for Unicode
- Bringing \u Back
- Testing \u
- Caveats
- What To Do with \u
XPath Functions, Regular Expressions, and Unicode
In XPath and XQuery Functions 3.1, four functions depend upon regular expressions: fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string(). Their regular expressions are defined on the basis of XML Schema Part 2: Datatypes Second Edition (herein XS2), which has been extended to include start- and end-of-string matches, reluctant quantifiers, and back-references. To build classes of Unicode characters one uses \p{} or its converse \P{}, explained in XS2's Appendix F, Character Classes. The curly brackets for \p take two types of construction, both illustrated in the sketch after the list below:
- Categories: A capital letter ([LMNPZSC]) specifying a general Unicode category, perhaps followed by a lowercase letter to specify a subcategory. This feature is very handy for finding letters (\p{L}), private use area characters (\p{Co}), digits from any system of numeration (\p{N}), or the inverse of these categories (by replacing \p with \P).
- Blocks: "Is" followed by a string ([a-zA-Z0-9-]+) that corresponds to the name of a block of Unicode characters. This feature is very useful for finding all Arabic characters (\p{IsArabic}), all arrows (\p{IsArrows}), general punctuation (\p{IsGeneralPunctuation}), or the inverse of these categories (by replacing \p with \P).
In most other programming languages, regular expressions do not support \p{}, or if they do, they are based on relatively simple POSIX character classes, which are restricted to a limited set of key terms (e.g., Lower, ASCII, Alnum, XDigit).
Some flavors of regular expressions access Unicode characters via \u. JavaScript and Python, for example, allow \uFFFF, where FFFF is a single hexadecimal number identifying a codepoint. Perl uses a slightly different syntax: \x{FFFF}.
XPath does not have \u, but it doesn't need it, at least not as used in other programming languages. The entity, e.g., &#x7b;, is a sufficient replacement for \u. And it's better. The entity need not be padded with zeros, and can be more than four digits. That is, it can access characters beyond the Basic Multilingual Plane, U+10000 and up. Entities can be marshalled to define a range using the hyphen (e.g., [&#x6a;-&#xb1;]).
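For example, this test uses entities to build a class covering the Greek and Coptic block, U+0370..U+03FF (a minimal sketch; any range could be substituted):

<xsl:if test="matches(., '[&#x370;-&#x3FF;]')">
   <!-- matches any character in the Greek and Coptic block; the entities
        play the role that \u0370-\u03FF plays in other languages -->
</xsl:if>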
In sum, as XML developers, we have unparalleled access to Unicode characters in our
regular expressions. And we can expand on that excellence. In this article I introduce
TAN-regex, an XSLT-based
library of XPath functions that extend regular expressions to capture Unicode characters
based upon their name and their relationships to composite and base characters.
Reimagining Regular Expressions for Unicode
The characters that make up the Unicode standard are a motley bunch. We who delve
into
its darker corners probably have our favorite bêtes noires. As a scholar who works
with
ancient Greek, I find the Greek and Coptic blocks to be the most visible witness to
Unicode's choppy progress. Sets of characters from both languages spawned new, dedicated
blocks (e.g., Coptic U+2C80..2CFF, Greek Extended U+1F00..1FFF), and Greek characters
as
individuals or small groups have popped up here and there, in assorted blocks. Although
the general idea has been to keep blocks consistent and complete, that ideal is not
often realized. It would be nice to access characters that naturally group with one
another but straddle Unicode blocks. Desideratum one.
The Supplemental Punctuation block (U+2E00..U+2E7F) has a number of characters I must
access regularly, to process ancient and medieval inscriptions, e.g., ⹄ U+2E44 DOUBLE
SUSPENSION MARK (in Unicode Notational
Conventions, ranges are expressed with two dots and the name is rendered in
small capitals). Although that character is in Unicode because of a proposal I wrote,
I
regularly forget the hexadecimal number, and must look it up. When working with
characters outside the ASCII block we customarily have at hand supplementary tools.
Within oXygen XML editor, the Character Map
is quite valuable. For general use, I personally prefer BabelMap
(Windows only) and Richard Ishida's Unicode code converter,
the former to find and copy characters and the latter to analyze them. The recently
redesigned home page for Unicode is also quite
useful. You will no doubt think of other tools you like. Those tools are essential,
but
they can also be an inconvenient departure from the algorithm being constructed. It
would be nice to get those characters in a human-friendly way while staying mentally
within my XSLT code. Desideratum two.
Sometimes I am looking not for a single character but for all permutations of a letter. That is, if I am searching a text for every variation of b, I would like to
build a character class for any character that according to the Unicode database has
a b
as a component (in addition to b itself, there are 20 such characters, from U+1D47
MODIFIER LETTER SMALL B to U+1D68B MATHEMATICAL MONOSPACE SMALL B). In this case,
auxiliary tools are of limited use, requiring ad hoc browsing and patchwork results.
Unicode decomposition (see Unicode Standard Annex 15) via fn:normalize-unicode(*, 'NFKD') is no help here, because that only gets you from a precomposed character to its components. I am interested in the reverse of the process. Desideratum three.
So, despite XPath's deep engagement with Unicode, there remain three key obstacles to building classes of Unicode characters. Many Unicode character classes I wish to build do not map onto either a code block or a Unicode property—the two types of access provided by \p{}. Those constructors cover either too much or too little. Writing a regular expression based on hexadecimal entities can be cumbersome and haphazard, requiring correct use of external tools. Reading it can be equally challenging. And going from a character to its composites can be tricky.
Most of the Unicode character classes I build are united by some logic. In some cases,
I know I could build a character class based upon words in the names of the individual
Unicode characters. So I began to wonder, couldn't I simply use the Unicode name DOUBLE
SUSPENSION MARK, and not worry about remembering the hexadecimal value of the codepoint?
Or if I wanted all suspension marks, not just mine, couldn't I just write "SUSPENSION
MARK"? Doing so would make reading and writing a regular expression much easier. And
it
seems consistent with current conventions. After all, I can already invoke the name
of a
Unicode block in my regular expression. Why not also the name of a character, equally
immutable?
The proposition might sound risky. Yes, Unicode names are unique and stable, but there
are characters that for all intents and purposes are misnamed, so to use a name runs
the
risk of getting characters you did not want and failing to get those you did.
We already run that risk. We face it each time we use \p{}
or even
\w
(word characters), whose results can sometimes surprise or annoy.
Unicode nomenclature and classification can run against our druthers. If
\p{}
were extended to Unicode character names, we would need simply to
extend the caution we already must exercise. For example, if I am looking for the
medieval/late antique Greek numeral 6, the ϛ (U+03DB), I cannot use "episemon," the
oldest name for this character (ἐπίσημον, attested 2nd c. CE by Clement of Alexandria).
Instead, I need to familiarize myself with, and use, the official name, GREEK SMALL
LETTER STIGMA, regardless of history (the earliest appearance I have found for "stigma"
dates to an 18th century manuscript).
I also realized that we regularly remember keywords, but not necessarily their order within a name. If I wish to cite the name for ỗ (U+1ED7), is it LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE or ...TILDE AND CIRCUMFLEX? It would be nice if one did not have to know. A Unicode name starts with the centermost components, but that principle helps only slightly, because there's no reason why I should presume to know which component is drawn closer to the center, or that Unicode decisions have always been consistent. Why not build a class constructor simply through select keywords in the name?
That is, I propose to break any character's name into individual words, treating each one like a property, much like space-delimited values of @class in HTML elements. If you are familiar with HTML conventions, you might immediately see the upside to tagging Unicode characters like this:
. . . . . . .
<div id="x1ed5" class="above and circumflex hook latin letter o small with">ổ</div>
<div id="x1ed6" class="and capital circumflex latin letter o tilde with">Ỗ</div>
<div id="x1ed7" class="and circumflex latin letter o small tilde with">ỗ</div>
<div id="x1ed8" class="and below capital circumflex dot latin letter o with">Ộ</div>
. . . . . . .
In each @class, words in the character name have been intentionally set lowercase and alphabetized, to show that, for our purposes, order and capitalization may be treated arbitrarily. This name signature, i.e., the character's name parts alphabetized and space-joined, is not necessarily unique, and should not be treated as an identifier. See section “Caveats”.
If we wanted to select the ỗ in the example above, the third <div>, to style it in a certain way, in our CSS stylesheet we could simply write .o.circumflex.tilde.small. Because only one codepoint has those four words in its name, we do not need to cite all eight words (but we could if we wanted). From there we can expand the class as needed. If we wanted to include the uppercase version, we could simply drop the word "small": .o.circumflex.tilde, which matches exactly two characters (as of Unicode version 13.0). Dropping other words increases the size of the set.
The dot-notation approach used in CSS + HTML classes can then be leveraged to build a wide variety of regular expression classes based on Unicode character names. Pure dot notation might create a class that is too large for some purposes, so the syntax should provide a way to exclude classes. For example, we might want all letter U's with diaereses, but not those with a caron (ˇ), i.e., drop Ǚ and ǚ. The exclamation mark to mean "not" has precedent (albeit not in CSS selectors), and seems intuitive as a mark of exclusion; for the previous example, we would write something like .u.diaeresis!caron.
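Anticipating the \u{} escape introduced below, that constructor would be used like this (a sketch; the expected results follow from the Unicode names of the two characters):

rgx:matches('Ü', '\u{.u.diaeresis!caron}')   (: true: DIAERESIS is in the name of U+00DC, CARON is not :)
rgx:matches('Ǚ', '\u{.u.diaeresis!caron}')   (: false: the name of U+01D9 also contains CARON :)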
A name-based approach to classes of Unicode characters opens up interesting, new possibilities. One can use .combining to find combining characters. One can use .latin to find a close approximation to all Latin characters, or .greek to find all Greek ones. Using .with gets all Unicode characters that have a "withness" property, i.e., characters that are composed of more than one element (whether or not Unicode decomposition is defined). Similarly, .with.and points to characters that have at least three components (e.g., ᵳ U+1D73 LATIN SMALL LETTER R WITH FISHHOOK AND MIDDLE TILDE), whereas .with!and points to those that have only two components (e.g., À U+00C0 LATIN CAPITAL LETTER A WITH GRAVE).
Dot and exclamation-mark selectors have quite a bit of potential, but they are not useful for an important desideratum I had set out at the beginning of this section, namely, the creation of character classes based upon the relationship of composite and component characters. Let us suppose, for example, I want to build the Unicode class of variants on the Latin letter b. If I use .b as described above I capture 290 characters, including many that are not directly related to the Latin letter. Perhaps that's fine for some situations, but in others, I am looking for a much smaller class, namely the twenty decomposable variations of b, according to the Unicode database.
For such cases, we can adopt a different type of notation, with a + signifying that the string that follows should be expanded to all composites. That is, +b would expand to bᵇḃḅḇ⒝ⓑ㍴㏔㏝b𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋 (in Unicode version 13.0). +bB would expand to include both upper- and lowercase results.
A kind of reversal could be implemented with a similar syntax, i.e., a minus instead of a plus, so that, for example, -ḃãäḅẫậ would return simply baabaa. Such a transformation is not as pressing a need as the other cases, but if we are going to the trouble of building composites, one might as well provide a similar way to reverse course.
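A sketch of how the two constructors look in the function library introduced in the next section (the expansions follow the examples just given; the bracketed output is my expectation, not verified library output):

rgx:matches('ᵇ', '\u{+b}')   (: true: U+1D47 MODIFIER LETTER SMALL B has b as a component :)
rgx:regex('\u{-ḃãäḅẫậ}')     (: should yield the character class [baabaa] :)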
Bringing \u Back
Much of this re-imagination took place in the course of developing the function
library of the Text Alignment Network (TAN, http://textalign.net), a suite of XML formats intended to make Text Encoding
Initiative (TEI) files more semantically and syntactically interoperable. I soon
realized that my tinkering with regular expressions could have very broad, practical
applications, relevant to those who might not care much about TEI or TAN. So I isolated
this part of the TAN function library as a separate package or module, TAN-regex, to support
quick, easy imports or includes by projects that did not want to fetch the entire
TAN
function library.
The namespace of TAN-regex is identical to the TAN namespace, tag:textalign.net,2015:ns (a tag URN), but tethered to the prefix rgx:. (You can adopt whatever prefix you like in your host application.)
I had considered the idea of incorporating the new syntax directly into the escape class \p{}. Although this idea had merits, I decided against it, mainly because I wanted to compel anyone writing or reading the code to understand that this was a clear departure from the core specifications. I also did not want to try to support the negated class builder, \P{}. So I opted for \u{}. It was nice to have \u back.
The primary goal of the small XSLT library TAN-regex was to write versions of fn:matches(), fn:replace(), fn:tokenize(), and fn:analyze-string() that supported \u. The challenge could be reduced to ensuring that any instance of \u{} in the standard parameter $pattern was replaced with a string for the intended character class, padded by [ and ] if not embedded as part of a character class.
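For example (a sketch; the exact membership of the class depends on the Unicode version in play):

rgx:replace('a⹄b', '\u{.suspension.mark}', '|')
(: the escape expands to a bracketed class of every character whose name
   contains both SUSPENSION and MARK, and the result is handed to the
   standard fn:replace(); I would expect 'a|b' :)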
The master data for Unicode characters, including their names, is the Unicode Character Database, a set of tables in plain text, e.g., https://unicode.org/Public/13.0.0/ucd/, upon which code charts and related resources (e.g., Common Locale Data Repository) depend. This master data is also converted to an XML format, e.g., https://www.unicode.org/Public/13.0.0/ucdxml/. For name-word constructors, I opted to use the version that excludes the Unihan characters, since their names (all numbered) would not be useful objects of query. The TAN-regex stylesheet ucd/ucd-names.xsl converts a given version of the XML version of the Unicode Character Database to a simple catalog of <char>s with name words tokenized, lowercased, and placed in <n>s, with results saved in the subdirectory ucd at, e.g., ucd-names.13.0.xml. Creating such a file is quite fast, a couple of seconds.
The decomposition process cycles through the XML database that includes the Unihan characters, to ensure complete decomposition. The TAN-regex stylesheet ucd/ucd-decomp.xsl converts the UCD database to two different forms. One type of output, e.g., ucd-decomp-simple.13.0.xml, is slim, and features a pair of elements, <mapString> and <transString>, with text nodes of identical length. They provide a simple one-for-one translation for those precomposed characters that can be resolved to a single base character. The other output file, e.g., ucd-decomp.13.0.xml, is a collection of <char>s with a child <b> for each base component. For both types of output, decomposition must be performed against the Unicode database recursively, because some characters are defined as decomposing to characters that themselves admit decomposition. The iterative function requires at least four passes through the UCD database to ensure a complete inventory of atomic components. Therefore, running ucd-decomp.xsl takes a couple of minutes.
In the end the TAN-regex subdirectory ucd is about fifty megabytes, populated as it is with optimized data from Unicode versions 5.1 through 13.0 (at present). Supporting each Unicode version allows users to create regular expressions based upon a particular Unicode version, should that be desired.
To access the function library simply include or import TAN-regex.xsl, the only XSLT file of note at the root of the project. (But don't forget to also get a copy of the subdirectory ucd.) The functions do not depend upon templates, so the library can be used via <xsl:import> or <xsl:include> equally, your choice.
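A minimal host stylesheet might look like this (a sketch; the relative href assumes TAN-regex.xsl sits next to your stylesheet):

<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:rgx="tag:textalign.net,2015:ns">
   <xsl:import href="TAN-regex.xsl"/>
   <!-- rgx:matches(), rgx:replace(), rgx:tokenize(), and
        rgx:analyze-string() are now available -->
</xsl:stylesheet>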
Most users will care only about the functions rgx:matches(), rgx:replace(), rgx:tokenize(), and rgx:analyze-string(). But those shadow functions rely upon component functions that will be helpful for developers.
Each one relies directly upon rgx:regex(). If that function detects the new escape class, \u{}, it will invoke rgx:parse-regex(), which takes as parameters a regular expression and a Unicode version number and returns an XML tree fragment whose string value is a suitable substitution for $pattern.
The value within the curly brackets of any \u{} is interpreted by rgx:process-regex-escape-u(), which also requires a Unicode version. The curly brackets allow multiple items, space-delimited. Each item is checked. If the item matches a hexadecimal number (perhaps two of them separated by a hyphen), it is converted to the corresponding codepoint.
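For example (a sketch of the two hexadecimal forms; the expansions are what I would expect, not verified library output):

rgx:regex('\u{2E44}')                (: a class holding the single character U+2E44 :)
rgx:regex('\u{370-3FF 1F00-1FFF}')   (: two space-delimited ranges: the Greek and Coptic block plus Greek Extended :)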
If an item starts with +, the output of rgx:string-to-composites() is returned. That function takes a string, breaks it into characters, and for each character returns a string that concatenates all characters that use the input character as a component.
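Used on its own, per the +b expansion quoted earlier (Unicode version 13.0):

rgx:string-to-composites('b')   (: returns 'bᵇḃḅḇ⒝ⓑ㍴㏔㏝b𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋' :)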
If an item starts with -, the process invokes rgx:string-base(), a function that performs limited decomposition of Unicode characters. The input is passed along with a Unicode version through fn:translate(), which takes the relevant version of ucd-decomp-simple.*.*.xml to convert decomposable characters that can be reduced to one major base character. If there is no such one-to-one correspondence, the original character is returned. rgx:string-base() is similar to fn:normalize-unicode(., 'NFKD'), except that all component parts that are not the sole base letter are discarded. It is actually closer in spirit to fn:lower-case() and fn:upper-case() in that the length of the input string is always preserved, keeping intact any characters that cannot be so reduced.
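For example (a sketch, following the -ḃãäḅẫậ example above; the ¼ behavior follows from the one-to-one rule just described):

rgx:string-base('ḃãäḅẫậ')   (: returns 'baabaa': each composite reduces to a single base letter :)
rgx:string-base('¼x')       (: returns '¼x': ¼ has more than one base character, so it is kept intact :)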
If an item starts with . or !, it is treated as a name query, and rgx:get-chars-by-name() returns matching characters, treating a string prefixed by . as a word that must appear in a character name, and one prefixed by ! as a word that must not appear. Name matching is not case-sensitive. This function returns fragments from the Unicode names database, for example:
<char cp="0029" val=")">
<na>
<n>right</n>
<n>parenthesis</n>
</na>
</char>
Each <n> can be capitalized and string-joined to render the character name in the customary fashion. Perhaps an even more convenient way to get such fragments is with the key get-chars-by-name, e.g., key('get-chars-by-name', ('parenthesis'), $default-ucd-names-db). You may then filter and sort the results as you like.
rgx:parse-regex() takes the results from rgx:process-regex-escape-u() and pads the output string in square brackets if the original \u{} is not within the context of a character class; if it is, the string is returned unchanged.
TAN-regex comes with a few other related functions that could be useful in certain contexts. The functions that convert hexadecimal numbers to decimal and vice versa are generalized, to allow conversions to and from bases 2 through 16 and 64 (rgx:dec-to-n() and rgx:n-to-dec()).
The function rgx:string-to-components(), the inverse of rgx:string-to-composites(), takes an input string and returns a sequence of strings. It chops the input into characters, and for each character returns its component characters. If the character does not decompose, the character itself is returned.
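For example (a sketch; the ¼ expansion matches the database fragment shown below):

rgx:string-to-components('¼b')   (: returns the sequence ('1⁄4', 'b') :)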
rgx:string-base() and rgx:string-to-components() are two quick ways to handle decomposition. They rely upon a decomposition database provided by rgx:get-ucd-decomp-db(), whose tree can be used to build your own functions. For example, you could apply to the decomposition database the XPath expression /*/char[b[1]/@gc eq 'Nd'][b[2]/@gc eq 'Sm'], which matches the twenty characters that decompose into first a numeral and second a symbol, such as ¼. A sample tree fragment:
<char cp="00BC" val="¼">
<b gc="Nd">1</b>
<b gc="Sm">⁄</b>
<b gc="Nd">4</b>
</char>
rgx:string-to-components() is for all intents and purposes the same as for $i in fn:string-to-codepoints($string) return fn:normalize-unicode(fn:codepoints-to-string($i), 'NFKD'), i.e., a sequence of strings that correspond one-to-one to each character in the input string. When concatenated, the output of rgx:string-to-components() should be identical to fn:normalize-unicode($string, 'NFKD'). The sequence form of output in rgx:string-to-components() might be useful in cases where a developer wishes to intercept the decomposing normalization process.
But rgx:string-base() is importantly different. The length of the output always matches the length of the input string, and substitutions are made only if a composite can be replaced by a single distinct base character. It would be comparable to fn:substring(fn:normalize-unicode(., 'NFKD'), 1, 1) if every composite Unicode character were made of one base character followed by zero or more non-base characters. But many composite Unicode characters do not fit this model. Some have more than one base character (e.g., ⅐ U+2150 VULGAR FRACTION ONE SEVENTH) and others begin with a non-base character (e.g., ำ U+0E33 THAI CHARACTER SARA AM, ⒜ U+249C PARENTHESIZED LATIN SMALL LETTER A). The purpose of rgx:string-base() is not to imitate the decomposition process, but to provide a type of normalization comparable to fn:lower-case() and fn:upper-case(), for relaxed string comparisons. The escape class \u{-} is but one beneficiary; the function is also useful in contexts where two strings need to be relaxed before being compared.
All the above functions can be run against any version of Unicode from 5.1 onward. If no version is supplied, the most recent version of Unicode will be used (currently 13.0). If you are writing a regular expression that requires a specific version of Unicode, put the version number in the $flags parameter, along with any other flags, e.g., rgx:tokenize($my-string, '\u{+b}', '13.0i').
Testing \u
TAN-regex includes a subdirectory, tests, which has a stylesheet test.xsl to produce ad hoc results from the functions. The subdirectory also includes a battery of XSpec tests, tan-regex.spec. All XSpec tests are currently successful.
Experiments run with TAN-regex based on Unicode version 13.0 produced some surprising results. The observations below are documented in test.xsl.output.xml.
As might be expected, none of the 43,026 characters that matched !combining (i.e., characters that do not have the word COMBINING in their name) also match the category for combining marks, \p{M}. You might expect the reverse to be true, that the inverse category .combining and \p{M} would result in coterminous sets. But only 330 of the 462 characters that matched .combining also matched \p{M}. After some diagnosis, it turned out that the processor, likely because of the underlying Java version (1.8, build 25.261), did not recognize the other 132 characters, and classified them as not assigned, \p{Cn}.
Of the 1,157 characters matching .symbol, 213 do not match the symbol category, \p{S}. This is not a bug or anomaly. It simply shows that there are many Unicode characters that have "SYMBOL" in their name but are not classified as symbols. For example, ϕ U+03D5 GREEK PHI SYMBOL is classified as a lowercase letter, \p{Ll}. So the constructor \u{.symbol} usefully allows us to construct a class of Unicode characters that people might treat as symbols, irrespective of their Unicode general category.
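The phi example can be checked directly (a sketch):

fn:matches('ϕ', '\p{S}')          (: false: U+03D5 carries the general category Ll, lowercase letter :)
rgx:matches('ϕ', '\u{.symbol}')   (: true: SYMBOL appears in the character's name :)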
There are 946 characters matching .digit and .numeral. Of these, 297 do not have the number property, \p{N}. After weeding out those characters that were not classified by Saxon, 36 remain, such as ݳ U+0773 ARABIC LETTER ALEF WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE and ꣧ U+A8E7 COMBINING DEVANAGARI DIGIT SEVEN. Those examples show that some Unicode characters have secondary qualities that are communicated only through the name. TAN-regex's \u provides a unique way to query and fetch such secondary characteristics. This quality should be seen as complementing (and not replacing) the already powerful method of accessing Unicode characters through their general category properties (e.g., \p{L} for letters).
Caveats
TAN-regex functions based on \u{} may penalize some applications if not properly deployed.
Consider an XML file with 125,000 leaf nodes (an XML document with fifty elements on level 2, each with fifty elements, each with another fifty), each with some text, and an XSLT stylesheet that checks for a match on each leaf. When the leaf template uses something simple such as fn:matches(., 'A'), the process on a Dell Inspiron 5570 (Intel Core i5-8250U, 1.6 GHz, with 4 physical and 8 logical cores) takes 0.5 seconds. Using rgx:matches(., 'A') takes 1.1 seconds, perhaps an acceptable increase. When \u is introduced, this increases somewhat: rgx:matches(., '\u{.circle}') takes 2.7 seconds, provided the processor supports @cache on XSLT 3.0 functions (Saxon PE and EE do so, but Saxon HE does not).
When working with a processor that does not support cached functions, rgx:matches(., '\u{.circle}') takes 6,359.3 seconds (one hour forty-six minutes), because the value of .circle is calculated time and again. The solution in such a situation is to tether \u{.circle} to a global variable, so that it is calculated only once. To do this, first define a global variable:
<xsl:variable name="regex-circle" select="rgx:regex('\u{.circle}')"/>
Then invoke that global variable as needed. For example, rgx:matches(., concat($regex-circle, '\s+', $regex-circle)) finds any two characters with "circle" in their Unicode name, separated by one or more space characters. (For further efficiency, you might bind the composite value of the second parameter to a global variable.)
When the process on the 125K-leaf-node file is shifted to the global variable approach, fn:matches(., $regex-circle) takes 2.0 seconds. The corresponding rgx:matches(., $regex-circle) takes 2.8 seconds.
Even if you are using a processor that handles cached XSLT 3.0 functions, you will find it useful to build global variables with rgx:regex(), to be invoked mnemonically where you like. For example:
<xsl:variable name="class-of-chars-with-symbol-in-name"
select="rgx:regex('\u{.symbol}')"/>
<xsl:function name="my:strip-symbols" as="xs:string?">
. . . . .
<xsl:sequence select="rgx:replace($input, $class-of-chars-with-symbol-in-name, '')"/>
. . . . .
</xsl:function>
Keep in mind that rgx:regex() will convert \u{} to a character class, framed by square brackets. If you want only a string of characters, use, e.g., rgx:process-regex-escape-u(), whose results permit further processing as desired.
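For example (a sketch; I show the call with an explicit Unicode version, as described above, and the bracketed output is abbreviated):

rgx:regex('\u{+b}')                        (: returns a bracketed class, e.g. '[bᵇḃ…𝚋]' :)
rgx:process-regex-escape-u('+b', '13.0')   (: returns only the bare string of characters :)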
It might be objected that composition and decomposition via + and - are unnecessary. Rather than building a class of composites, one could simply first pass the input through fn:normalize-unicode($input, 'NFKD'), to convert it to component parts, then search accordingly. But that approach works in only some cases. If you are looking for a sequence of characters, you must anticipate an unknown number of extra characters, many but not all of them combining characters. Take for example the input "ẵbcẚ⒝c". When filtered through fn:normalize-unicode() the string expands to length eleven (U+0061 U+0306 U+0303 U+0062 U+0063 U+0061 U+02BE U+0028 U+0062 U+0029 U+0063). If you are searching for "abc," which you expect to match twice, you cannot use as your regular expression 'abc'; you must use something like fn:matches(fn:normalize-unicode('ẵbcẚ⒝c', 'NFKD'), 'a[\p{M}\p{Pe}\p{Lm}]*\p{Ps}?b[\p{M}\p{Pe}]*c'), and hope that you have correctly built the classes of ignorable characters that might follow an a or b. The preceding regular expression anticipates the possibility of encountering ẚ U+1E9A, ⒜ U+249C, or ⒝ U+249D, but it might result in false positives, such as a match on this input string: "a]{b)c". Constructing an airtight regular expression under this technique might be impossible. For any two strings that have identical NFKD normalization forms, e.g., "⒜⒝⒞" and "(a)(b)(c)", your regular expression will match either both or neither, which you might not want. Even if you are not so picky, writing a strong regular expression under this method can become quite time-consuming and result in unreadable code. The TAN-regex equivalent, rgx:matches(., '\u{+a}\u{+b}\u{+c}'), is faster to write, easier to read, and probably more accurate.
A close approximation of decomposition (-) is already available to us via XPath expressions. For example, \u{-ḃ} is merely another way of saying concat('[', fn:substring(fn:normalize-unicode('ḃ', 'NFKD'), 1, 1), ']'). That works for this simple example, but many times, as explained above in the discussion of rgx:string-base(), the normalized string might bring unwanted surprises.
Not every character has a unique name signature (i.e., the words in its name alphabetized and joined by spaces). About 0.8% of Unicode characters have name signatures that duplicate the name signature of at least one other character (394 characters in 182 groups, as of Unicode version 13.0), e.g., ⫓ U+2AD3 SUBSET ABOVE SUPERSET and ⫔ U+2AD4 SUPERSET ABOVE SUBSET. A future version of TAN-regex may support name component order.
One other hazard to watch for is ambiguous name words. For example, "a" can mean either the letter a or the indefinite article. So .a!Latin captures not only А U+0410 CYRILLIC CAPITAL LETTER A but also ⊅ U+2285 NOT A SUPERSET OF. If you use \u{} you must still study the Unicode standard, particularly Character Properties: Name, section 4.8 of The Unicode Standard Core Specification.
What To Do with \u
To this point I have depicted TAN-regex and its component functions in broad strokes.
These are building blocks for other applications. I conclude with an example relevant
to
those of us who work with texts with numerous accents. I illustrate with polytonic
Greek, but the principle could be applied to other languages.
When processing ancient Greek texts, we frequently need to normalize the accents.
Greek has a number of accentuation rules, and it is common for context to demand that
an
acute ΄ accent be switched to grave `. But sometimes we need to switch back. If we
wish
to look a word up in a dictionary, the grave accent must be converted to its normal
acute version, e.g., ἀδελφὸς → ἀδελφός or ἂν → ἄν. (Note how the ΄ can be one of several
combining marks.) The problem is a challenge because there are dozens of Greek Unicode
characters with the acute and grave, in various precomposed configurations. Conversions
are possible and straightforward, but the most obvious solutions are verbose, and
time-consuming to build.
To accommodate the need to switch accents on a complex character, TAN-regex includes the function rgx:replace-by-char-name(), which shows how to combine and use the lower-level TAN-regex functions. The function rgx:replace-by-char-name() takes as input a string that should be changed (parameter 1), three sequences of strings (parameters 2-4), and an indication whether a replacement should be strict (parameter 5). A 6-arity version of the function also permits a Unicode version (parameter 6). The string sequences in the second through fourth parameters ($words-in-name-to-drop, $words-in-replacement-char-name, $words-not-in-replacement-char-name) are supposed to be keywords in Unicode character names. Changes are made to only those characters in the input string whose names have a word that matches the list in $words-in-name-to-drop. Those keywords are dropped from the input character's name and the search for names is conducted again, using the other two keyword parameters to filter the results. If any substitute characters are found, they are returned; otherwise the original character is returned.
In the case of the problem above, changing the grave to an acute, one can write rgx:replace-by-char-name('ἀδελφὸς ἂν ᾖ.', 'varia', 'oxia', (), true()). The input string ἀδελφὸς ἂν ᾖ. ("He should be a brother.") is processed letter by letter. Nothing happens unless a letter has a Unicode name with the word VARIA (= grave accent). So the only two characters that are affected are ὸ U+1F78 GREEK SMALL LETTER OMICRON WITH VARIA and ἂ U+1F02 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA. In each case "VARIA" is dropped and a search is made for Unicode characters with the rest of the name words, as long as they also include the word "OXIA" (= acute). Each of the two letters has a single replacement, i.e., U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA and U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA. The output is the desired change of the grave accents to acute: 'ἀδελφός ἄν ᾖ.'
Another type of normalization we often need to perform on ancient Greek is to drop from words any accents that result from enclitics. (An enclitic is a word whose accent shifts back to the previous word, similar to the way, when pronouncing the phrase "Codify it," the "it" prompts us to slightly emphasize "fy.") The result is that some Greek words have two accents instead of the customary one, e.g., ἄνθρωπός τις ("a certain human being"). Tokens need to be adjusted before looking them up in a lexicon or database, so we normally want to drop only the second accent and keep the first. This task can be cumbersome to do in XSLT because of the many codepoints that represent permutations of Greek vowels and their combining marks. Building a regular expression to capture double-accented Greek words is quite a chore. And changing the second accent requires a choose-when-test operation with a minimum of fourteen branches; probably more, depending upon the kinds of decisions being made.
Fortunately, such normalization can be applied in a relatively straightforward manner by using both rgx:regex() and rgx:replace-by-char-name():
<xsl:variable name="greek-pattern-for-accented-vowels"
select="rgx:regex('\u{.greek.tonos .greek.oxia .greek.varia .greek.perispomeni}')"/>
<xsl:variable name="greek-pattern-for-acute-vowels"
select="rgx:regex('\u{.greek.tonos .greek.oxia}')"/>
<!-- In the variable below the first word uses U+03CC GREEK SMALL LETTER OMICRON WITH TONOS,
the second word U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA. They look identical, but
the first is the preferred (normalized) form. -->
<xsl:variable name="greek-words-with-two-accents" select="'σῶσόν σῶσόν'"/>
<xsl:variable name="second-accent-dropped-from-greek" as="xs:string*">
<xsl:analyze-string select="$greek-words-with-two-accents"
regex="({$greek-pattern-for-accented-vowels}\S*)({$greek-pattern-for-acute-vowels})">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
<xsl:value-of
select="rgx:replace-by-char-name(regex-group(2),
('oxia', 'tonos', 'with'), (), (), true())"
/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:value-of select="string-join($second-accent-dropped-from-greek, '')"/>
<!-- The above results in 'σῶσον σῶσον' -->
In the code above, σῶσόν (save) returns σῶσον. ἐλέησόν με (have mercy on me) returns
ἐλέησον με.
The processes described above could be replicated, more efficiently, with traditional decomposition, replacement on the parts, and then normalization. But that is only because my examples have involved decomposable Greek letters. It gets much trickier when there are no components, such as switching something that is left-oriented into its equivalent right-oriented character. For example, to switch upward-pointing arrows into downward-pointing arrows:
<xsl:variable name="upwards-arrows" as="xs:string" select="'↑↥'"/>
<xsl:value-of select="rgx:replace-by-char-name($upwards-arrows, 'upwards', 'downwards', (), true())"/>
The result is ↓↧. This method could be used to develop applications that programmatically change chess pieces (U+2654..U+265F), recycling labels (U+2673..U+267A), domino tiles (U+1F030..U+1F09F), or playing cards (U+1F0A0..U+1F0FF). The potential for variations is endless.
TAN-regex is released under a GNU General Public License, to encourage others to change, adapt, and improve the code. Updates to the library will be made at https://github.com/textalign/TAN-regex.
Enjoy the new \u!