Flynn, Peter. “Cooking up something new: An XML and XSLT experiment with recipe data.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Flynn01.
Balisage: The Markup Conference 2020 July 27 - 31, 2020
Balisage Paper: Cooking up something new
An XML and XSLT experiment with recipe data
Peter Flynn
Peter Flynn managed the Academic and Collaborative
Technologies Group in IT Services at University College
Cork, Ireland until his retirement in 2018. He trained at
the London College of Printing and did his MA in
computerized planning at Central London Poly (now the
University of Westminster). He worked in the UK for the
Printing and Publishing Industry Training Board, first as
researcher and then as DP Manager; and for United
Information Services of Kansas as IT consultant before
joining UCC as Project Manager for academic and research
computing. In 1990 he installed Ireland’s first Web server,
and expanded the university’s academic and research
publishing support. He has been Secretary of the TeX Users
Group, Deputy Director for Ireland of EARN, and a member
both of the IETF Working Group on HTML and of the W3C XML
SIG; and he has published books on HTML, SGML/XML, and
LaTeX. Peter also runs the markup and typesetting
consultancy Silmaril, and is editor of the XML FAQ as well
as an irregular contributor to conferences and journals in
electronic publishing, markup, and Humanities computing, and
a regular speaker and session chair at the XML SummerSchool
in Oxford. He completed his PhD in User
Interfaces to Structured Documents with the Human
Factors Research Group in Applied Psychology in UCC in 2014.
He maintains a fairly random semi-technical blog at http://blogs.silmaril.ie/peter
This is a report on an experiment to see if
XML and the disaggregation of ingredient
metadata could be used to reduce errors in recipes. Errors in
web pages, PDFs, and print have been an
irritant to authors, cooks, editors, and publishers for many
decades, and occasionally the cause of an expensive recall.
This research aims to see if markup could help.
Modern recipe structure is a well-established convention
of a list of ingredients and a list of instructions
(method). The writing about mistakes is sparse
but highlights errors of omission and commission,
inconsistency, sequence, and mismatch between the lists.
Attributes for ID and classification were added to
the ingredients list in a nonce recipe schema, along with an
IDREFS attribute for use in the references to
ingredients in the instructions; and code was written for
cross-checking existence, usage, and consistent reproduction
of names, and tested on a small collection of recipes.
We demonstrated that five of the seven classes of error
identified could be straightforwardly remedied, but that the
requirements for disaggregated data input needed to deal with
the consistency issues may be too detailed for non-expert use
unless assisted by a semantic filter.
My thanks go to all those friends in cooking and markup who
contributed with suggestions and food.
Background
There is a conventional formality to the way in which
recipes are presented in western cultures which has been common
since the middle of the nineteenth century. Before that,
‘receipts’ (as they were then known, from
the Latin for ‘Take…’) were largely
narratives, so you had to read them all the
way through and note down what ingredients you would
need;[1] this was true from the earliest clay tablets
[Anon 2016] through the Greek and Latin
recipes of the Classical period [Vehling 1936] to the end of the manuscript era with the
first large-scale cookbook, The Forme of
Cury; and from the subsequent rise of the printed
cookbook from the 1470s [Sitwell 2012],
including the extensive body of household manuscript cookery
books and ephemera (see, for example, Figure 2) that continued to flourish until the
end of the 18th century [Masters 2013], to the conventional modern style
which was pioneered by Eliza Acton (1845) and
popularised by Isabella Beeton (1861) in the UK and Fannie Farmer (1896) in the USA.
This style has a structure
something like this:
Title and/or Description, sometimes with a
picture
Number of portions (sometimes)
List of Ingredients (quantities, materials,
treatments)
Method of preparation (steps)
Comments or serving suggestions
It is nowadays also common to provide an extended narration
after the Description, perhaps explaining where the recipe
originated, or what changes have been made, but this is a matter
of taste and style, and not an essential component. The key
components remain the List of Ingredients and the steps of the
Method.
There has been some interesting work on encoding recipes,
particularly in the historical field (and therefore by default
using TEI) [Knauf 2017][Klug 2017], but these are
typically done to enable the recipe[s] to be identified within a
much larger corpus, not for the purposes of analysing the
ingredients or method, so they do not tend to use markup down to
the level proposed here.
Ingredients
Ingredients are usually given in the order in which they
get used in the steps, but sometimes they may be in order of
importance (for example, a recipe for a beef stew could start
with the beef, even though the onions may be the first thing
you begin the cooking with); and sometimes they may be
grouped, especially in complex recipes (all spices together,
or all ingredients for a sauce together).
The convention for ingredients is to give the quantity,
units, and item in that order (eg 3 Kg onions),
but some authors or editors give the item first
(Onions, 3 Kg). It is not important, except
that from a publishing point of view it needs to be done the
same way in each recipe to avoid confusing the reader.
Other material about quality, size, shape, and treatment
may be interspersed: the example above might appear in full as
3 Kg small red onions, peeled and chopped fine.
The borderline between the preparation or treatment being
attached to the ingredient, or being mentioned in the steps of
the method is sometimes hard to determine: both are common,
and the discussion above about style and consistency applies
here also.
Measurement
Measurements have their own cultural conventions. In
modern western cultures there are three common
‘standards’:
In most European-influenced cultures, metric units are
standard (grams, kilos, liters).
In the UK and some of its former
spheres of influence, metric units are the norm for
published recipes, but Imperial units are still often
used domestically (ounces, pounds, pints, with 20 fluid
ounces to the pint).
In the USA, measurements are given by volume (cups,
pints, with 16 fluid ounces to the pint) but also (in
larger quantities) in pounds and quarts; the word
‘ounce’ is used to mean
‘fluid ounce’, as an ounce weight is rarely
used. Canada and Australia officially use the metric
system but many people still habitually use Imperial or US
measurements. New Zealand uses metric measures but has a
standard metric cup (250ml).
However, all these cultures tend to use similar measures
for very small quantities, subject to some minor differences
related to eating and drinking habits (see Figure 3):
tea-spoon or coffee-spoon (tsp, cuillère
à café or càc, etc: about 5ml),
although in cultures where tea-drinking predominates, a
coffee-spoon is smaller than a tea-spoon, about 3ml
dessert-spoon (dsp, cuillère à
dessert or càd, etc: about 10ml), common
in UK and French-speaking
cultures only, so far as I have been able to
determine
table-spoon or soup-spoon (tbsp,
cuillère à soupe or càs,
etc: about 15ml, but a tbsp is 20ml in Australia); in the
UK, a soup-spoon is the same size as a
dessert-spoon (although a different shape) — [e]veryone knows how big a
table-spoon is: it will just go into your mouth, though
not if you have nice manners. [Freeling 1972]
Other spoons exist, of course: mustard-spoons and
salt-spoons, for example, but I am not aware of any
standard capacities. Extensive internationalisation would
be needed for more widespread applicability: while the
sizes appear not to vary much, the names and abbreviations
are of course different.
Errors
It is not uncommon for recipes published in books and
magazines and on the web to contain mistakes that can confuse
even experienced cooks. This may be caused by many factors,
including writing or typing up the recipe in a hurry; changing
it while you experiment, and forgetting to update it; failing
to get it edited professionally before publishing; working
from illegible or out-of-date sources; misunderstanding a
translation; or not testing the recipe — and doubtless many
others including plain ordinary typographic errors.
Errors in recipes are an annoyance to readers when a dish
fails; they are an embarrassment to their authors; they are
damaging to the reputation of the publishers; and occasionally
they can be the cause of serious financial loss, if a book has
to be withdrawn because of them. It is therefore in everyone’s
interests that recipes be as correct as possible. This
research is an attempt to see if markup can contribute to a
solution.
Complete omission of an ingredient (both from the list
and from the method) is an editorial and
testing problem, easily fixed online but not in print
[Cloake 2011]. This class of error is not
susceptible to treatment in software as the relevant data is
by definition entirely absent in the first place, so there is
nothing for a program to do anything with.
Cloake (2011) also quotes an example of the much more
common problem of omitting the ingredient in one place but not
the other:
Nigella’s Feast […] contains a
recipe for a chocolate orange cake that includes a direction
to ‘cream together the butter and sugar’ —
which would come as a nasty surprise to the prospective
baker, given no butter is mentioned in the ingredients.
(When chocolatier Paul A Young tried both versions, he
concluded the butter was a red herring — the cake turns out
much better without it.)
Mismatched quantities can also confuse the cook.
Jacob (2016) describes an
error where the list of ingredients specified four cups (of
shredded sharp cheddar cheese), but the method only used half
a cup.
In an earlier article, Jacob (2010) identified seven classes of error (14 if
we include a later list of seven more). Most of these are
editorial problems which are important but out of scope for
this research. The key concerns here are (using [Jacob 2010]’s original
numbering for the first and second lists):
Ingredients out of order (1/1)
Missing ingredient (1/2)
Wrong amounts (1/3)
Making every step a separate number (2/6)
In item 2, [Jacob 2010] groups
together the errors of omission and of commission (a listed
ingredient which does not get used; and a step referring to an
ingredient that is not listed), but we would argue that these
are technically two separate classes of error.
Testing is probably the most essential part of recipe
development, but for this very reason, each cycle of testing
means changes to the recipe. Hart, in an article on writing cookery
books, emphasises that while there are things a good editor
will catch, it’s up to the cook/author to get it right to
start with [Hart 2012].
An additional class of error is the inconsistent use of
names, that is, using a different name for an ingredient in
the List of Ingredients to the one used in the Method. This
can occur where different cultures name things differently,
and either lack of editorial oversight or authorial
absent-mindedness results in both names being used for the
same things in different places (‘spring
onions’ and ‘scallions’
is one example that might need explaining out of its cultural
context).
These problems are not new. Burros (1997) was blunter about it:
The prevalence of errors in cookbooks is the publishing
world’s dirty little secret. The problem is likely to get
worse as an industry mired in economic doldrums resorts to
cost-cutting, practically guaranteeing less editing and
testing before publication.
The publishing industry has indeed continued to get worse,
and it is now a rare publisher who can offer to copyedit and
proofread a manuscript, and the online publishing business has
regrettably mirrored the worst practices of its print
forebears. Burros (1997) goes on to explain the division of
blame between publisher and author, both of whom feel the
other could do more, and concludes that
[i]t is a haphazard system — further complicated by
typesetting errors and editing that too often fails to
eliminate confusion.
Elsewhere she refers to human error or computer
gremlins, which is where the present research comes
in.
Scope
With rare exceptions, published recipes nowadays are either
on the web, which means HTML in one form or
another; or in print, which means typesetting to
PDF. The source format for new recipes is
likely to be a Microsoft Word file,
or a blog entry (perhaps Markdown), or an email message, or
possibly still a typescript or manuscript. They may be original
to an author (even though something similar may have existed for
centuries elsewhere, unknown to the author), or they may have
been copied or converted in many ways from recipes passed
between friends and family, and they may of course also have
been pirated: copyright notwithstanding, photocopies of recipes
from magazines and books are legion, and it is not hard to do an
OCR from a scan.
The scope for errors is enormous: the author’s own
experience includes an edit of a typescript which listed half a
pint of milk, originally typed (on a typewriter, from a
handwritten recipe) as 1/2 pt milk.
This became 1 or 2 pints milk in the editing (by
someone unfamiliar with the lack of a ½ sign on an old
typewriter), but it was corrected at proofing stage to
½ pt milk — and then the defective software
used by the typesetter could only manage □ pt
milk.
Recipe management software is available industrially, but
tends to focus on very large volume production in the food
industry and the automation of mixing and cooking equipment.
However, a Belgian software company,
youmeal.io, produces kitchen-oriented
food analysis products for the catering and restaurant industry,
and emphasises that using correct food data is of primary
importance. They quote a study of their own claiming that
50% of technical sheets for compound products were
incomplete or incorrect.
A software solution to at least some of the problems above
was considered to be potentially of use to the cookery author,
editor, or publisher, as well as to cooks who want to write up
their own recipes in a way that will pass the test of time — but
many other problems will continue to rely on humans for a
solution. (Historical recipes are interesting for the lack of
detail as well as for the actual food: some of them read like
recipes from a professional cook’s manual such as
Le Répertoire de la Cuisine [Saulnier 1982] where for brevity the reader is assumed
already to know everything from experience; others are virtually
unusable because not all the relevant ingredients are
mentioned, so expert guesswork is needed.)
From the errors discussed earlier, a candidate list of
topics emerged, based on susceptibility to solution by
software:
Ingredient referred to in method was never listed
Ingredient listed was never referred to in method
Ingredients out of order
Bogus quantities (eg too big or too small)
Mismatched quantities (different between the list of
ingredients and the step of the method)
Inconsistent naming of ingredient between list and
step
Steps too small (ie too many of them)
Of these, the control on bogus quantities was seen as
unimplementable without a data history and suitable limits,
which places it outside the scope of this experiment. The step
size problem is also not easily susceptible to machine
judgement. Both these classes were therefore dropped at this
stage.
Schematron was suggested by two reviewers, and could be used
to calculate ‘reasonable’ measurements
and highlight deviations, as well as to identify ingredient item
conflicts, but in the time available this was not
possible.
The objective, therefore, was to see if adding markup to the
ingredients and steps could be used at or before the rendering
stage to limit the remaining classes of error without creating
too much work for the author or editor.
It was seen as important for potential solutions that they
could be implemented in any programming language, and the data
could be stored in a number of different ways, so while this
implementation is in XML and
XSLT, the data structure (50 lines) and the
code (600 lines) are both small and should be easy to
reimplement. The choice of XML was based on a
number of considerations: a) many publishers already use XML as
part of their workflow; b) it is commonplace in web systems; and c) a recipe is essentially narrative text (still), even
if it is presented in the form of two lists, and
XML was designed for dealing with mixed
content (plain text mixed with special meanings). XML editing software also has
controls which can be used on elements and attributes and
references to them, early in the workflow, as well as at the
point where output is created.
Taking this as a starting-point, some common
XML markup features could readily be seen as
having potential use: for example the built-in
ID/IDREF checks could be used to test for
the presence or absence of ingredients in the steps and
vice versa; and enumerated (token
list) attributes could be used to represent the options for
different categories of ingredients. This would improve
the accuracy of reproducing the textual form of the ingredients;
allow for finer-grained checking; and enable indexing for book
publication and for online searching.
During initial development it became apparent that a
sufficiently accurate categorisation of the ingredient metadata
could provide a solution to error class 6 by [re]generating the
textual form of each ingredient programmatically from the
categorised data.
Implementation
The implementation proceeded in two phases: developing and
testing the ID/IDREF mechanism, used for
error classes 1, 2, 3, and 5 in the list in §5, and
developing the categorisation for ingredients, used in
class 6.
Identity checks
Some initial tests showed that detecting the use of an
ingredient was trivial. Given a schema that makes
xml:id a REQUIRED attribute on an
ingredient element, a conditional using an XPath statement
such as count(idref(@xml:id))=0 is sufficient to
determine if the ingredient is not referenced anywhere else in
the recipe. Note that at this level, it does not control for
where such a reference ought to occur,
nor whether it would be meaningful in context: those are still
tasks for a human editor or proofreader.
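As an illustration of how little logic this test needs, the following is a minimal sketch of the same check outside XSLT, written in Python with the standard library. It is not the code used in the experiment: the recipe root element, the function name, and the sample data are assumptions made for the example, while the ingredient and ing elements and the i attribute follow the nonce schema described in this section.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under its expanded namespace name
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unused_ingredients(recipe):
    """Return the IDs of listed ingredients that no <ing> element refers to."""
    referenced = set()
    for ing in recipe.iter("ing"):
        # an IDREFS value is a space-separated list of IDs
        referenced.update(ing.get("i", "").split())
    return [item.get(XML_ID) for item in recipe.iter("ingredient")
            if item.get(XML_ID) not in referenced]


# Hypothetical sample: the butter is listed but never used
SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
    <ingredient xml:id="butter">butter</ingredient>
  </ingredients>
  <method>
    <step>Add the <ing i="flour sugar"/> and mix well.</step>
  </method>
</recipe>"""

print(unused_ingredients(ET.fromstring(SAMPLE)))  # ['butter']
```

Like the XPath test, this only reports that an ingredient is never mentioned; it says nothing about where the mention ought to go.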
The reverse is simpler and even less controlled: if the
references from the steps to ingredients are done using an
element with an IDREF attribute, then standard
validation techniques will throw an error on any such
references that have no matching ID, even before
regular processing starts.
As a first stage, therefore, we can use two declarations,
one for the ingredients and one for references to them:
<!ELEMENT ingredient (#PCDATA)>
<!ATTLIST ingredient xml:id ID #REQUIRED>
...
<!ELEMENT ing EMPTY>
<!ATTLIST ing i IDREFS #REQUIRED>
The first element would occur as part of the content model
for the list of ingredients, and the second element would be
valid in mixed content in the steps of the method, as the
reference to the ingredient[s] being used. In fact, if this
system is to be implemented in an existing
schema/DTD (as opposed to the nonce schema
used for testing), only the attributes are required: the names
of the element types could be anything.
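Where no validating parser is available, the reverse check can be approximated in a few lines. The sketch below is an assumption-laden stand-in for IDREFS validation, not a replacement for it: the recipe root element, the function name, and the sample (loosely modelled on the missing-butter case quoted earlier, not the actual published recipe) are all hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_references(recipe):
    """Return (step number, token) pairs for references with no matching ID."""
    known = {item.get(XML_ID) for item in recipe.iter("ingredient")}
    problems = []
    for number, step in enumerate(recipe.iter("step"), start=1):
        for ing in step.iter("ing"):
            for token in ing.get("i", "").split():
                if token not in known:
                    problems.append((number, token))
    return problems


# Hypothetical sample: the method uses butter that was never listed
SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="sugar">caster sugar</ingredient>
  </ingredients>
  <method>
    <step>Cream together the <ing i="butter sugar"/>.</step>
  </method>
</recipe>"""

print(dangling_references(ET.fromstring(SAMPLE)))  # [(1, 'butter')]
```

A validating parser would reject such a document outright; reporting the step number instead is a design choice for this sketch, intended to make the message more useful to an editor.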
Consistency
The ID/IDREF link used in section “Identity checks” can also be used to reproduce the name of
the ingredient at the point of reference, instead of requiring
it to be entered manually during composition, unless some special
wording is required. In effect, if we write
<ingredients>
<ingredient xml:id="flour">brown flour</ingredient>
<ingredient xml:id="sugar">muscovado sugar</ingredient>
</ingredients>
...
<method>
...
<step>Add the <ing i="flour sugar"/> and mix well.</step>
</method>
it is straightforward to write code which will
produce
3. Add the brown flour and muscovado sugar and mix well.
This makes use of the binding between ingredient and
mention which addressed the missing ingredients problem.
However, merely reproducing the name of the linked ingredient
does not solve the problem of the wrong ingredient being
accidentally referenced, and in many cases the full name is not
required (eg just ‘flour’ and
‘sugar’ are enough). Proofreading and
recipe-testing are still important to prevent this.
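The expansion of references described in this section could be sketched as follows. This is a Python illustration rather than the XSLT used in the experiment; the function names and the recipe root element are hypothetical, and the joining rule for three or more names (commas, then "and") is an assumption, since only the two-name case is shown above. It also assumes that every reference has already passed the identity checks, so an unknown ID raises a KeyError.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ingredient_names(recipe):
    """Map each ingredient ID to the name given in the list of ingredients."""
    return {item.get(XML_ID): (item.text or "").strip()
            for item in recipe.iter("ingredient")}


def join_names(names):
    """Join names as 'a', 'a and b', or 'a, b and c' (assumed convention)."""
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def render_step(step, names):
    """Flatten one <step>, replacing each <ing> with the listed names."""
    parts = [step.text or ""]
    for child in step:
        if child.tag == "ing":
            tokens = child.get("i", "").split()
            parts.append(join_names([names[t] for t in tokens]))
        else:
            # any other inline markup is simply flattened to its text
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
  </ingredients>
  <method>
    <step>Add the <ing i="flour sugar"/> and mix well.</step>
  </method>
</recipe>"""

recipe = ET.fromstring(SAMPLE)
for number, step in enumerate(recipe.iter("step"), start=1):
    print(f"{number}. {render_step(step, ingredient_names(recipe))}")
```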
Order
A test for the order or sequence of ingredients could be
encoded into the handling of the mixed-content element type
(ing in the example in section “Identity checks”),
but in order to take account of potential previous references
to the same ingredient, which encumbers the coding, it is
preferable to do this at another stage, for example in the
handling of the container of the steps of the Method.
For each distinct ingredient ID referenced from the steps of
the Method (repeated references being grouped together),
the position within the Method is compared with the position
of the matching ingredient in the List of
Ingredients.[2] For example,
<step>Add the <ing i="sugar flour"/> and mix well</step>
would throw an error because the flour is listed as an
earlier ingredient than the sugar.
While it is conventional to list the ingredients in order
of their mention, it is by no means universal; but where
ingredients are grouped (for example into component parts of
the recipe), then there are usually also multiple matching
Method steps, and within them the rule of order-of-mention
appears to be observed.
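One way of reading the order test is sketched below in Python. It is an interpretation, not the code used in the experiment: it takes the first mention of each ingredient in document order and flags any ingredient first mentioned after one that is listed later. The function name and the recipe root element are hypothetical, and references to unknown IDs are skipped on the assumption that the identity checks report them separately.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def out_of_order(recipe):
    """Return IDs first mentioned after an ingredient that is listed later."""
    listed = [item.get(XML_ID) for item in recipe.iter("ingredient")]
    position = {name: n for n, name in enumerate(listed)}

    # first mention of each ingredient, in document order
    mentions = []
    for ing in recipe.iter("ing"):
        for token in ing.get("i", "").split():
            if token in position and token not in mentions:
                mentions.append(token)

    problems, highest = [], -1
    for token in mentions:
        if position[token] < highest:
            problems.append(token)   # listed earlier, but mentioned later
        else:
            highest = position[token]
    return problems


TEMPLATE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
  </ingredients>
  <method>
    <step>Add the <ing i="{refs}"/> and mix well</step>
  </method>
</recipe>"""

print(out_of_order(ET.fromstring(TEMPLATE.format(refs="sugar flour"))))  # ['flour']
print(out_of_order(ET.fromstring(TEMPLATE.format(refs="flour sugar"))))  # []
```

Where ingredients are grouped into component parts, the same function could presumably be applied to each group and its matching steps separately.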
Categorisation
It became apparent that the disaggregation of the
ingredient data could lead to the generation of the
human-readable ingredient items both in the List of
Ingredients and in the mentions in the Method. There is a
formality here too, in the way in which ingredients are
expressed, and there are conventions which vary by culture. It
is possible to say 100 g walnuts, chopped fine
as well as 100 g finely-chopped walnuts: both
mean the same thing, although in English there is an implicit
presumption in the first form that you take whole walnuts and
chop them fine yourself; and in the second, that you buy the
walnuts ready-chopped. While these variants are largely
stylistic, published collections of recipes try to standardise
on one way of saying things in order not to confuse the
readers, especially if they are likely to be beginners and
unfamiliar with the conventions.
It therefore became an additional task to equip the system
with the ability to store the ingredient data as separate
identities for units, quantities, different classes of
foodstuffs, qualities, treatments, etc, so that the
ingredients list could be generated in an acceptable format,
especially across many recipes following a pattern. A
side-benefit is that it could also result in the consistent
use of names between ingredients and method. The
categorisation of the ingredients required considerably more
work, and remains open to much discussion.
Many categorisations or classifications are based on
nutrition or source, both of which would require specialist
knowledge to enter as data. Wikipedia suggests Dairy, Fruits,
Grains/Beans/Legumes, Meat, Confections, Vegetables, and Water
[Northamerica1000 2020], based largely on work
by Nestlé (2013), which
is closer to how a cook would think of ingredients. Bearing in
mind that a categorisation for this purpose needs to be useful
for decision-making (Is this recipe vegetarian?
Is there alcohol in this recipe? Does it
contain nuts?), a few changes were made to this
scheme:
the Meat category was split into Meat and Fish
(to cover seafood)
Nuts were separated out from other Vegetable
materials, as was Pasta
Confections was ignored as a separate category (sugar
is subsumed under Spices)
store-cupboard ingredients were given their own
category of Basic (although there could be much dispute
over what one person has in this category compared with
another person)
Five additional categories were Herbs; Spices;
Alcohol; Toppings, which covers edible decoration; and Prep,
intended for ready-prepared ingredients usually bought
pre-packaged.
This leaves unsolved some problems of categorisation which
are not dealt with elsewhere because traditional food
classifications omit items such as chocolate (technically a
ready-prepared item, although humorists would have it a food
group in its own right). In the current settings, chocolate is
a store-cupboard item but chocolate-chips are a
topping.
Markup
The current system provides for the following attributes
on the ingredient element:
@xml:id, unique
ID for the ingredient
@quantity, a number,
possibly including a decimal fraction (but restricted to
the half, quarters, eighths, thirds, and fifths, as
these can be represented in text with existing Unicode
fractions)
@unit, a list of
standardised abbreviations (dl, dsp,
fl.oz, g, Kg, lb, l, ml, oz, pt, tbsp, tsp) plus
common measures such as cup, can,
dash, drop, handful, etc
@unit-weight, text for
describing a standard size of one of the common
measures, like a 400 g can
@container, text for the
name of the container of the @unit-weight
@size, a list of
adjectives, eg large, medium, small, etc
@colour, a colour
name used for description, like red apple
@quality, any adjective
describing a pre-existing
condition, eg dry, smooth, unsalted, etc (not a @treatment, see below)
Items (the material ingredients) — these are
mutually exclusive (with the exception of
@part):
@meat, a list of
meats, eg beef, chicken, lamb, pork, etc
@fish, a list of
seafood, eg salmon, hake, prawn, lobster, etc
@part, a list of
body parts or products, eg breast, kidney, wing,
egg, seed, etc
@dairy, a list of
dairy products, eg milk, cheese, cream, yoghurt,
etc
@fruit, a list of
fruits
@alcohol, a list of
drinks
@herb, a list of
herbs
@vegetable, a list
of vegetables
@nuts, a list of
nuts
@pasta, a list of
types of pasta, noodles, etc
@spice, a list of
spices
@basic, a list of
common store-cupboard ingredients, eg flour, oil,
yeast, etc
@toppings, a list
of edible decorative items, eg Streusel
@prep, text for
any class of ready-prepared ingredient
@treatment, an adjective
such as chopped, ground, melted, etc (something done to
the foodstuff)
@note, a digit, for use
in referring to footnotes (deprecated)
@comment, any
text
@symbol, a symbol or
emoji, provision for bullet labelling
@alt, text describing an
alternative for substitution if the exact foodstuff is
not available
@status, an enumerated
list optional or required,
so that optional
ingredients can be identified
These are used to describe the foodstuff in a way that
avoids the need for extensive typing in most
cases, as the enumerated list values can be
selected from a menu. It was regarded as important that the
actual names of items should not be subject to typing errors
on each occasion of entry.
<ingredients>
<ingredient xml:id="avo" quantity="4" size="large" quality="very ripe"
treatment="chopped fine" vegetable="avocado"/>
<ingredient xml:id="toms" quantity="2" size="medium"
treatment="chopped just as fine" vegetable="tomato"/>
<ingredient xml:id="oil" quantity="1" size="hefty" unit="dash" note="1"
quality="pimento" basic="oil"/>
<ingredient xml:id="lj" quantity="2" unit="tsp" fruit="lemon" part="juice"/>
<ingredient xml:id="garlic" quantity="1" unit="clove" size="fat"
vegetable="garlic"/>
<ingredient xml:id="ff" quantity="2–4" unit="fl.oz" dairy="fromage-frais"
comment="or double [heavy] cream if not on a diet"
alt="Sour cream is also good here"/>
<ingredient xml:id="salt" spice="salt"/>
<ingredient xml:id="pep" spice="pepper"/>
</ingredients>
A set of rules was developed in XSLT
which implements the grammatical precedence of the attribute
descriptive values (described below). This results in a list
such as:
4 large very ripe avocados, chopped fine
2 medium tomatoes, chopped just as fine
1 hefty dash pimento oil¹
2 tsp lemon juice
1 fat clove garlic
2–4 fl.oz fromage frais (or double [heavy] cream if not on a diet). Sour cream is also good here.
Footnotes in ingredient lists are extremely rare and
largely inadvisable, so they are not provided for; the one
in this example was implemented manually.
In tests, all the classes of ingredient could be
represented without the need for character data content.
However, much more extensive testing would be needed to
ensure the coverage of the enumerated lists, and to tighten
up the rules on how the wording is generated.
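Because each item category is a separate attribute, the decision-making questions raised under Categorisation reduce to tests for the presence of attributes. The following Python sketch shows the idea; the function names and the recipe root element are hypothetical, the sample is a cut-down version of the markup above, and the vegetarian test is deliberately naive (it ignores @part and anything hidden in the basic or prep categories), so it should be read as an illustration only.

```python
import xml.etree.ElementTree as ET

# the item attributes listed under Markup
ITEM_CATEGORIES = {"meat", "fish", "dairy", "fruit", "alcohol", "herb",
                   "vegetable", "nuts", "pasta", "spice", "basic",
                   "toppings", "prep"}


def categories_used(recipe):
    """Return the set of item categories appearing on any ingredient."""
    return {name for item in recipe.iter("ingredient")
            for name in item.attrib if name in ITEM_CATEGORIES}


def contains(recipe, category):
    """True if any ingredient carries the given category attribute."""
    return category in categories_used(recipe)


def is_vegetarian(recipe):
    """Naive test: no ingredient is classified as meat or fish."""
    return not (categories_used(recipe) & {"meat", "fish"})


SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="avo" quantity="4" size="large" vegetable="avocado"/>
    <ingredient xml:id="oil" quantity="1" unit="dash" basic="oil"/>
    <ingredient xml:id="lj" quantity="2" unit="tsp" fruit="lemon" part="juice"/>
    <ingredient xml:id="ff" quantity="2–4" unit="fl.oz" dairy="fromage-frais"/>
    <ingredient xml:id="salt" spice="salt"/>
  </ingredients>
</recipe>"""

recipe = ET.fromstring(SAMPLE)
print(sorted(categories_used(recipe)))
print(is_vegetarian(recipe), contains(recipe, "nuts"), contains(recipe, "alcohol"))
```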
The lists mentioned in the attributes are plain text
files, one value per line, ending in a vertical bar (the
standard delimiter for enumerated attributes), so for
example the test file meat.list
currently says:
beef|
chicken|
duck|
ham|
lamb|
pork|
turkey|
As they are plain text files, they can be customised to
the author’s desire, and can be as long or as short as
needed provided they follow the rules for enumerated list
items (compounds need a hyphen, not a space, like
fromage-frais; this is removed in the
XSLT on output), so there is no limit on the
number of items or their order (alphabetic order was used
purely for convenience) and they don’t need to be one per
line: any additional spacing is entirely optional.
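A minimal Python sketch of reading such a list file is given below. The function names are hypothetical and the way the experiment actually incorporates the files into the schema is not shown here; the sketch only illustrates the rules just stated: values are delimited by vertical bars, surrounding whitespace is optional, compounds must use a hyphen rather than a space, and the hyphen becomes a space on output.

```python
def read_list(text):
    """Split a .list file into values, ignoring spacing and the final bar."""
    return [value.strip() for value in text.split("|") if value.strip()]


def invalid_values(values):
    """Return values containing whitespace, which break the token rule."""
    return [value for value in values if any(c.isspace() for c in value)]


def display_name(value):
    """Turn a hyphenated token into its printed form."""
    return value.replace("-", " ")


MEAT_LIST = "beef|\nchicken|\nduck|\nham|\nlamb|\npork|\nturkey|\n"

print(read_list(MEAT_LIST))
print(invalid_values(read_list("soy sauce|baking-powder|")))  # ['soy sauce']
print(display_name("fromage-frais"))                          # fromage frais
```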
Rules using categorisation
From inspection of existing recipes, it was possible to
come up with a first conjecture on the order and precedence
for expressing the ingredients in natural language, using
the data in the attributes. Such a mechanism would require
a much larger amount of data than was available for the
rigorous regression testing needed before it could be widely
used, but the current rules appear to work acceptably in
many circumstances.
Quantity
This always comes first, except where it is
implicit (knob butter) or where it is
left to the cook (salt). Non-numeric
quantities such as ranges (10–12
apples) or judgments (a few
apples) are reproduced as-is, otherwise the
integer portion of the quantity is used, and any
(decimal) fractional part converted to the nearest
vulgar fraction.
Size
Size is used as a prefix to the unit when the unit
is common (eg large handful)
Unit weight
This is used when the quantity refers to an
ingredient that comes supplied in a measured
container, like a 400 g can of tomatoes. If it follows
a numeric quantity, it gets a multiplication delimiter
(×)
Container
This is only meaningful when @unit-weight is used, and gets
output immediately after it
Unit
Unit follows quantity (but may have been prefixed
by size and unit weight). Common units are pluralised
if the quantity is more than one or is non-numeric
(intervention: ‘dash’ requires
an ‘e’)
Size
When the unit is standardised or absent, it is
applied to the ingredient, not the unit (eg medium
eggs)
Quality
This is a predetermined feature of the ingredient
like best or home-grown,
being one that the cook selects before use (see
Treatment below)
Colour
Any colour; accepted as-is
Treatment
The actions ground,
grated, and shredded are
applied before the ingredient (see more below)
Ingredient
There are currently ten groups as described
earlier. These are based on observation, and are
largely pragmatic or conjectural:
a) alcohol; b) basic (ie store-cupboard items); c) dairy; d) fruit; e) herb; f) meat; g) pasta; h) spice; i) toppings (decorative sprinkles); and j) vegetable. Order is not significant, as they must
be mutually exclusive for any given ingredient. The
lists can be tailored ad
infinitum. If a value contains a
hyphen, replace it with a space. This enables the use
of hyphenated compounds like baking-powder, and
two-word names like soy-sauce (the case where
retention of the hyphen is needed is unresolved).
Pluralisation of ingredients is a little more tricky
than for quantities: if the quantity is more than one,
or it is non-numeric, or the unit is a standardised
unit (excluding tsp, tbsp, and dsp),
and the ingredient is not among
the values for meat, dairy, spice, pasta, basic, or
herb (excluding spinach, seed, rice, and garlic), then
pluralise it, adding an e to potato and
tomato.
Part
If the ingredient is a part of a greater whole,
like a flower, seedpod, kidney, skin, or egg, use it as-is,
and pluralise it if the quantity is more than one or
the unit is lb or Kg.
Treatment
The remaining actions (ie not
ground, grated, and
shredded handled above, and also
excluding powder,
butter, and to taste)
are prefixed with a comma.
Alternative ingredients, if any, are added verbatim in
parentheses; footnote marks are added if given; the
[optional] indicator is added if required, and any comments
are added in another set of parentheses.
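The quantity and unit rules in the list above can be illustrated outside XSLT. The following Python sketch is not the author's code: the table of vulgar fractions, the `COMMON_UNITS` set (drawn only from units named in this paper), and both function names are assumptions made for illustration.

```python
from fractions import Fraction

# Assumed table of vulgar fractions; the paper does not say which are supported.
VULGAR = {
    Fraction(1, 8): "⅛", Fraction(1, 4): "¼", Fraction(1, 3): "⅓",
    Fraction(3, 8): "⅜", Fraction(1, 2): "½", Fraction(5, 8): "⅝",
    Fraction(2, 3): "⅔", Fraction(3, 4): "¾", Fraction(7, 8): "⅞",
}

# Hypothetical set of "common" (non-standardised) units.
COMMON_UNITS = {"handful", "dash", "knob"}


def format_quantity(quantity: str) -> str:
    """Return the integer part plus the nearest vulgar fraction."""
    try:
        value = float(quantity)
    except ValueError:
        return quantity  # ranges ("10–12") and judgments ("a few") pass through as-is
    whole = int(value)
    remainder = value - whole
    # 0 and 1 are candidates too, so .02 rounds down and .98 rounds up.
    nearest = min([Fraction(0), *VULGAR, Fraction(1)],
                  key=lambda f: abs(float(f) - remainder))
    if nearest == 1:
        whole, nearest = whole + 1, Fraction(0)
    fraction = VULGAR.get(nearest, "")
    if fraction and not whole:
        return fraction
    return f"{whole}{fraction}"


def pluralise_unit(unit: str, quantity: str) -> str:
    """Pluralise a common unit when the quantity is > 1 or non-numeric."""
    if unit not in COMMON_UNITS:
        return unit  # standardised units are left alone in this sketch
    try:
        plural = float(quantity) > 1
    except ValueError:
        plural = True  # non-numeric quantity
    if not plural:
        return unit
    # The paper names only 'dash' as needing an 'e'.
    return unit + ("es" if unit.endswith("sh") else "s")


print(format_quantity("0.333"))     # ⅓
print(format_quantity("1.5"))       # 1½
print(pluralise_unit("dash", "2"))  # dashes
```

Snapping to a fixed table keeps the output within the fractions a cook would expect to see in print, at the cost of silently rounding unusual decimals.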
At the time of writing, smaller, experimental, changes
are being made, principally to accommodate syntactic needs
revealed as more recipes are encoded. Two of the more common
are the selective elision of adjectival @part and @colour values in references, where
only the substantive is required; and the need for grouping,
as in ‘add the spices’, which at the moment
will cause omission of the order and reference tests.
Handling of conflicts
In examining the syntax of ingredient description compared
with those of references in the method, it was clear that
there were places where additional information was needed in
the references, for example to distinguish between two or more
sugars, to group them together, or to highlight the fact that
an ingredient needed to be referred to by more than just its name
at this stage.
As a palliative measure, a @mod attribute was added to the
ing element type. This is an enumerated attribute
whose values are the names of all the control attributes on
the ingredient element type; that is, all the
descriptive ones but not the actual food-item attributes: @quantity, @unit, @unit-weight, @container, @size, @colour, @quality, @treatment.
Using this on the example in Appendix A, we
could write
<ing i="sugar" mod="quality"/>
which would result in dark
brown sugar. This does not solve the problem of
(hopefully edge) cases where identifying an ingredient
accurately would need more than one such qualifier.
A related requirement is to disambiguate multiple related
ingredients, such as all-purpose flour and whole-wheat flour.
Currently, the XSLT code checks for the
existence of one or more other ingredients with the same item
name, and checks if they all have at least one of the control
attributes in common (set to different values, like @quality). If so, the attribute value
is used as a prefix on the items to make the reference.
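The two reference mechanisms just described (an explicit @mod, and the automatic prefix for same-named ingredients) can be sketched in Python as follows. This is an illustration, not the XSLT used in the experiment. The function names are hypothetical, and the `PREFIX_ATTRS` preference order is an assumption, because the text does not say which control attribute wins when several differ (in the example the two flours also differ in @quantity).

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Food-item attributes as they appear in the Appendix A example.
ITEM_ATTRS = ("alcohol", "basic", "dairy", "fruit", "herb", "meat",
              "pasta", "spice", "topping", "vegetable", "part")
# Assumed preference order for an automatic prefix (not stated in the paper).
PREFIX_ATTRS = ("quality", "colour", "size", "treatment")

SAMPLE = """<ingredients>
  <ingredient xml:id="plainflour" quantity="1.5" unit="cup"
              quality="all-purpose" basic="flour"/>
  <ingredient xml:id="wwflour" quantity=".5" unit="cup"
              quality="whole-wheat" basic="flour"/>
  <ingredient xml:id="sugar" quantity="0.333" unit="cup"
              quality="dark brown" treatment="packed" spice="sugar"/>
</ingredients>"""


def item_name(ingredient):
    """Return the food-item name, with hyphens replaced by spaces."""
    for attr in ITEM_ATTRS:
        if attr in ingredient.attrib:
            return ingredient.get(attr).replace("-", " ")
    return ""


def reference_text(ingredients, ref_id, mod=None):
    """Build the text generated for <ing i="ref_id" mod="..."/>."""
    target = next(e for e in ingredients if e.get(XML_ID) == ref_id)
    name = item_name(target)
    if mod and target.get(mod):
        return f"{target.get(mod)} {name}"  # explicit @mod qualifier
    group = [e for e in ingredients if item_name(e) == name]
    if len(group) > 1:
        for attr in PREFIX_ATTRS:
            values = [e.get(attr) for e in group]
            # All same-named ingredients carry the attribute, each with a different value.
            if None not in values and len(set(values)) == len(values):
                return f"{target.get(attr)} {name}"
    return name


ingredients = list(ET.fromstring(SAMPLE))
print(reference_text(ingredients, "sugar", mod="quality"))  # dark brown sugar
print(reference_text(ingredients, "plainflour"))            # all-purpose flour
print(reference_text(ingredients, "wwflour"))               # whole-wheat flour
```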
Results and conclusions
Testing
The testing of ingredient and reference co-presence was
shown to be trivial using the ID/IDREF
mechanism in XML, which covers error
classes 1 and 2.
The testing of ingredient order for error class 3 was not
as trivial, but relatively straightforward to implement in
XSLT. No attempt was made to implement any
other order, such as quantity or semantic relevance.
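As a rough illustration of the co-presence and order tests, the following Python sketch checks a cut-down recipe. It is not the XSLT used here. Python's ElementTree does not validate against a DTD, so plain membership tests stand in for ID/IDREF. The comparison of listing position with order of first mention is inferred from the log in Appendix A, and the function name, result keys, and sample recipe are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour" basic="flour"/>
    <ingredient xml:id="butter" dairy="butter"/>
    <ingredient xml:id="chips" topping="butterscotch-chips"/>
    <ingredient xml:id="cream" dairy="cream"/>
    <ingredient xml:id="egg" part="egg"/>
  </ingredients>
  <method>
    <step><para>Sift the <ing i="flour"/> and rub in the <ing i="butter"/>.</para></step>
    <step><para>Whisk the <ing i="milk"/> and the <ing i="egg"/>.</para></step>
    <step><para>Mix in the <ing i="chips"/>.</para></step>
  </method>
</recipe>"""


def check_recipe(xml_text):
    """Report missing references, unused ingredients, and order mismatches."""
    root = ET.fromstring(xml_text)
    listed = [e.get(XML_ID) for e in root.iter("ingredient")]
    missing = []    # (step number, ID) for references with no ingredient
    mentioned = []  # distinct IDs in order of first mention
    for step_no, step in enumerate(root.iter("step"), start=1):
        for ing in step.iter("ing"):
            for ref in ing.get("i", "").split():  # @i may hold several IDs
                if ref not in listed:
                    missing.append((step_no, ref))
                if ref not in mentioned:
                    mentioned.append(ref)
    unused = [i for i in listed if i not in mentioned]
    out_of_order = []  # (ID, position in list, position of first mention)
    for rank, ref in enumerate(mentioned, start=1):
        if ref in listed and listed.index(ref) + 1 != rank:
            out_of_order.append((ref, listed.index(ref) + 1, rank))
    return {"missing": missing, "unused": unused, "out_of_order": out_of_order}


for kind, found in check_recipe(SAMPLE).items():
    print(kind, found)
```

Comparing absolute positions means that a single unused or missing ingredient shifts every later position, which is why one mistake can produce several order messages.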
The potential mismatch in quantities between ingredient
list and step (error class 5) was not tested: in the sample
recipes used, there were no occurrences of partial quantities
being used in one step, with the remainder used in another.
There were indeed recipes using a single ingredient type in
two or more places, but in those cases the quantities were
given as separate ingredient items. An aggregate quantity test
is needed where an ingredient is divided (a practice decried
by Jacob (2010)).
The naming (and regeneration of names) was by far the most
complex matter. The reconstruction of ingredient listings from
the disaggregated data is non-trivial, and a comprehensive
solution would involve extension of the current system well
into the future in order to handle the infinite number of ways
that recipe authors will have of expressing themselves.
However, for practical purposes, it appears that most recipes
(although this remains unquantified) can be represented
accurately, in the sense that the need to add new ingredient
items to the lists diminished rapidly as testing proceeded.
The current system appears to handle correctly the generation
of items for the list of ingredients and their matching
references in the method (error class 6), but it is in no way
comprehensive and needs much more testing with a greater range
of ingredients.[3]
There was considerable conflict over the assignment of a
few items to lists: should garlic be under vegetables or
spices? Are beans a sufficiently large class to warrant their
own list? Are nuts? It is simple enough to edit the files and
change the classes, but some agreed standard would make it
more useful.
Benefits and drawbacks
The benefits of a system checking these errors would
include greater reliability, accuracy, and consistency; three
things that publishers insist on from their contributors,
whatever about the utility to personal web recipe sites.
Identifying the ingredient data in a form a computer can
manage also has a benefit separate from these quality control
aspects: it might make that hoary old chestnut ‘recipe
search’ actually work for once,
both in the sense of locating a recipe using specific
ingredients, distinct from whatever the title says, as well as
in the sense of letting cooks find out exactly what they can
make with the ingredients in the quantities on hand.
I leave to others the dubious usefulness of having your
recipe selection trigger your fridge into ordering the missing
ingredients. While it is perfectly possible, the effort in
maintaining the metadata after every midnight snack is
probably not worth the candle.
The most obvious drawback in the system as it currently
stands is that implementing it requires some form of
programming in a target system. Cooks, and cookery authors and
contributors, are not part of the target market for
XML systems: although implementation in an
XML editor should be straightforward, they
are not going to buy an editor for recipes, and they won’t be
using Emacs.
Commonplace editors like Microsoft
Word can certainly be coerced into
providing prompted or drop-down categorisation, although
embedding the error-checking logic currently implemented in
XSLT would require more effort. Web-based
systems running JavaScript are perhaps more likely targets, as
would be WordPress plugins. Unless
someone makes me an offer I can’t refuse, the current code
will be released under a suitable public licence later in the
year.
Conclusions
In general, this work satisfied the requirements and
demonstrated that a limited amount of data checking can
eliminate (or at least, signal) five of the seven classes of
errors described.
However, the need to have an authorial or editorial
interface written to handle data input (encoding) accurately
means that wider implementation would need to rely on demand,
unless there is sufficient interest in a collaborative,
possibly open-source, implementation.
Encoding would still remain a time-consuming operation,
even with sophisticated software, because of the need to apply
domain expertise, which in turn would require relatively
experienced users (cooks, collectors, publishers). Given the
fairly strict formatting of published recipes, however, it
might be possible to write a semantic and syntactic filter to
identify at least quantity, units, and name from published
recipes. This has not been investigated in the current
iteration.
The work on the category lists confirms the well-known
principle that data should be stored at the lowest practicable
level of disaggregation because it can always be aggregated
for implementation, whereas data stored aggregated can never
be broken back down into its components. It also confirms the
long-held, if anecdotal, belief in systems design that time
spent planning the data model shortens the overall development
time: if the data model is right (that is, it matches
reality), most requirements tend to click into place; if the
data model is wrong, the entire project may be irretrievably
damaged from the start.
However, the corollary is that if you
do get the data model right, you will
still need to front-load enough data for it to be workable as
a model before you start to develop it
into a full system. In the current circumstances, nowhere near
enough recipes have been tested, so the front-loading is a
potential point of failure, and for this reason the current
system remains experimental and open to more widespread
testing and updating.
Appendix A. Worked example
This is an example of a partly-edited recipe from the
author’s collection, with unresolved issues (at the time):
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe id="cashewscones">
<nav/>
<info>
<title>Butterscotch and Cashew Drop-scones</title>
<author>Anon</author>
<copyright year="2019" web="https://www.teatimemagazine.com/"
contrib="Ann Marie O’Connell">Tea Time Magazine</copyright>
</info>
<intro>
<para>Anna mentioned this online and I asked her for the recipe.
The original was from Tea Time Magazine (Jan/Feb 2019, but is
not in their archive¹). She notes that it works fine with all
white whole-wheat flour, and she also added large-crystal raw
sugar as a topping, instead of an egg glaze, because of the
additional caramel notes.</para>
</intro>
<ingredients>
<ingredient xml:id="plainflour" quantity="1.5" unit="cup"
quality="all-purpose" basic="flour"/>
<ingredient xml:id="wwflour" quantity=".5" unit="cup"
quality="whole-wheat" basic="flour"/>
<ingredient xml:id="sugar" quantity="0.333" unit="cup"
quality="dark brown" treatment="packed" spice="sugar"/>
<ingredient xml:id="bp" quantity="1" unit="tbsp"
basic="baking-powder"/>
<ingredient xml:id="salt" quantity=".5" unit="tsp" spice="salt"
comment="use ¼ tsp if the cashews are already salted"/>
<ingredient xml:id="butter" quantity=".5" unit="cup"
quality="unsalted" dairy="butter" treatment="chilled and
diced"/>
<ingredient xml:id="chips" quantity=".5" unit="cup"
treatment="slightly heaping" topping="butterscotch-chips"
alt="any preferred chips"/>
<ingredient xml:id="cashews" quantity=".5" unit="cup"
quality="toasted" treatment="slightly heaping"
vegetable="cashew"/>
<ingredient xml:id="cream" quantity=".5" unit="cup"
quality="heavy" dairy="cream"/>
<ingredient xml:id="egg" quantity="1" size="large"
treatment="beaten" part="egg"/>
</ingredients>
<method>
<step>
<para>Preheat oven to 400°F.</para>
</step>
<step>
<para>Combine together <ing i="plainflour wwflour sugar
bp salt"/> in medium bowl.</para>
</step>
<step>
<para>Add the <ing i="butter"/>; using fingertips,
rub to form coarse meal.</para>
</step>
<step>
<para>In separate bowl, whisk the <ing i="milk"/> and
the <ing i="egg"/>.</para>
</step>
<step>
<para>Gradually add the <ing i="milk egg"/> mix to the
flour mixture, keeping back 1 tsp of the egg mix to use
for glazing.</para>
</step>
<step>
<para>Toss or knead it to thoroughly moisten it and form a
clumpy dough (add more milk if too dry).</para>
</step>
<step>
<para>Mix in the <ing i="chips"/>.</para>
</step>
<step>
<para>Drop the dough by ¼ cupfuls onto a nonstick or lightly
greased baking sheet at least 1 inch apart, to give 8–10
drop-scones. (You can line a regular pan with aluminum foil
instead of greasing it.)</para>
</step>
<step>
<para>Brush the remaining <ing i="egg"/> on top as a
glaze.</para>
</step>
<step>
<para>Bake for about 20 minutes or until golden brown.</para>
</step>
</method>
<para>You can also use a mini-scone baking pan, like the Nordic Ware
cast-aluminum one, which gives you 16 triangular scones.</para>
<para>If you use the “freeze the portioned dough” technique, they
will need to bake 3–5 minutes longer.</para>
<para>¹ Possibly because the ingredients didn’t match the method in
several places.</para>
</recipe>
Running the current XSLT code produces
the following log:
Processing cashewscones.xml using xml2html.xsl to cashewscones.html
Using parameters
8. Unused ingredient "slightly heaping ½ cup toasted cashews"
9. Unused ingredient "½ cup heavy cream"
Checking 1. @plainflour
Checking 2. @wwflour
Checking 3. @sugar
Checking 4. @bp
Checking 5. @salt
Checking 6. @butter
Checking 7. @milk
Ingredient "" (milk) is listed 1st but mentioned 7th
Checking 8. @egg
Ingredient "1 large egg, beaten" (egg) is listed 10th but mentioned 8th
Checking 9. @chips
Ingredient "slightly heaping ½ cup butterscotch chips (or any preferred chips)"
(chips) is listed 7th but mentioned 9th
4. No ingredient matching ID "milk"
5. No ingredient matching ID "milk"
[Knauf 2017] Knauf, Torsten (2017) Definition der TEI-basierten culinary editions Markup Language (cueML), Bewertung von Verfahren für die automatische Extraktion von Zutatenlisten aus Rezepten und die Auszeichnung des Praktischen Kochbuchs für die gewöhnliche und feinere Küche von Henriette Davidis (1849). URI:https://shaman-apprentice.github.io/MyMasterThesis/ (retrieved 11 February 2020).
[Masters 2013] Masters, Kristin (2013) ‘The Incredible Treasures of Manuscript Cookbooks’. In ILAB July 2013, International League of Antiquarian Booksellers, Geneva.
[Nestlé 2013] Nestlé, Marion (2013) Food Politics. University of California Press, Berkeley, CA. ISBN:9780520275966.
[Saulnier 1982] Saulnier, Louis (1982) Le Répertoire de la Cuisine. Leon Jaeggi & Sons Ltd, Ashford, UK, 239pp. ASIN:B00I637XDK.
[1] Freeling’s
The Cook Book is
possibly one of the last from a modern author in Europe to
use the narrative style throughout [Freeling 1972].
[2] My thanks to Michael Kay for his suggestions on how to
achieve this most efficiently.
[3] As an edge case, the system was tested with a few
AI-generated recipes courtesy of [Shane 2020] where a neural net
created recipes without reference to feasibility or
edibility (and much else!). However, having coded them to
the above standard, they tested correctly, all the errors
being picked up.