How to cite this paper

Flynn, Peter. “Cooking up something new: An XML and XSLT experiment with recipe data.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Flynn01.

Balisage: The Markup Conference 2020
July 27 - 31, 2020

Balisage Paper: Cooking up something new

An XML and XSLT experiment with recipe data

Peter Flynn

Peter Flynn managed the Academic and Collaborative Technologies Group in IT Services at University College Cork, Ireland until his retirement in 2018. He trained at the London College of Printing and did his MA in computerized planning at Central London Poly (now the University of Westminster). He worked in the UK for the Printing and Publishing Industry Training Board, first as researcher and then as DP Manager; and for United Information Services of Kansas as IT consultant before joining UCC as Project Manager for academic and research computing. In 1990 he installed Ireland’s first Web server, and expanded the university’s academic and research publishing support. He has been Secretary of the TeX Users Group, Deputy Director for Ireland of EARN, and a member both of the IETF Working Group on HTML and of the W3C XML SIG; and he has published books on HTML, SGML/XML, and LaTeX. Peter also runs the markup and typesetting consultancy Silmaril, and is editor of the XML FAQ as well as an irregular contributor to conferences and journals in electronic publishing, markup, and Humanities computing, and a regular speaker and session chair at the XML SummerSchool in Oxford. He completed his PhD in User Interfaces to Structured Documents with the Human Factors Research Group in Applied Psychology in UCC in 2014. He maintains a fairly random semi-technical blog at http://blogs.silmaril.ie/peter

Abstract

This is a report on an experiment to see if XML and the disaggregation of ingredient metadata could be used to reduce errors in recipes. Errors in web pages, PDFs, and print have been an irritant to authors, cooks, editors, and publishers for many decades, and occasionally the cause of an expensive recall. This research aims to see if markup could help.

Modern recipe structure is a well-established convention of a list of ingredients and a list of instructions (method). The writing about mistakes is sparse but highlights errors of omission and commission, inconsistency, sequence, and mismatch between the lists. Attributes for ID and classification were added to the ingredients list in a nonce recipe schema, along with an IDREFS attribute for use in the references to ingredients in the instructions; and code was written for cross-checking existence, usage, and consistent reproduction of names, and tested on a small collection of recipes.

We demonstrated that five of the seven classes of error identified could be straightforwardly remedied, but that the requirements for disaggregated data input needed to deal with the consistency issues may be too detailed for non-expert use unless assisted by a semantic filter.

Background

Ingredients
Measurement
Errors

Markup
Rules using categorization

Handling of conflicts

Results and conclusions

Testing
Benefits and drawbacks
Conclusions

Appendix A. Worked example

Note: Acknowledgements

My thanks go to all those friends in cooking and markup who contributed with suggestions and food.

Background

There is a conventional formality to the way in which recipes are presented in western cultures which has been common since the middle of the nineteenth century. Before that, ‘receipts’ (as they were then known, from the Latin for ‘Take…’) were largely narratives, so you had to read them all the way through and note down what ingredients you would need^[1]; this was true from the earliest clay tablets [Anon 2016] through the Greek and Latin recipes of the Classical period [Vehling 1936] to the end of the manuscript era with the first large-scale cookbook, The Forme of Cury; and from the subsequent rise of the printed cookbook from the 1470s [Sitwell 2012], including the extensive body of household manuscript cookery books and ephemera (see, for example, Figure 2) that continued to flourish until the end of the 18th century [Masters 2013], to the conventional modern style which was pioneered by Eliza Acton (1845) and popularised by Isabella Beeton (1861) in the UK and Fannie Farmer (1896) in the USA. This style has a structure something like this:

Title and/or Description, sometimes with a picture
Number of portions (sometimes)
List of Ingredients (quantities, materials, treatments)
Method of preparation (steps)
Comments or serving suggestions

Figure 1: Simple recipe

<recipe>
  <name>Fudge</name>
  <ingredients>
    <item>1lb Sugar</item>
    <item>½pt Cream</item>
    <item>Chocolate</item>
    <item>2oz Butter</item>
  </ingredients>
  <method>
    <step>Mix ingredients</step>
    <step>Boil to 112°C</step>
    <step>Stir and cool</step>
    <step>Pour into dish</step>
  </method>
</recipe>

It is nowadays also common to provide an extended narration after the Description, perhaps explaining where the recipe originated, or what changes have been made, but this is a matter of taste and style, and not an essential component. The key components remain the List of Ingredients and the steps of the Method.

There has been some interesting work on encoding recipes, particularly in the historical field (and therefore by default using TEI) [Knauf 2017][Klug 2017], but these are typically done to enable the recipe[s] to be identified within a much larger corpus, not for the purposes of analysing the ingredients or method, so they do not tend to use markup down to the level proposed here.

Ingredients

Ingredients are usually given in the order in which they get used in the steps, but sometimes they may be in order of importance (for example, a recipe for a beef stew could start with the beef, even though the onions may be the first thing you begin the cooking with); and sometimes they may be grouped, especially in complex recipes (all spices together, or all ingredients for a sauce together).

The convention for ingredients is to give the quantity, units, and item in that order (eg 3 Kg onions), but some authors or editors give the item first (Onions, 3 Kg). It is not important, except that from a publishing point of view it needs to be done the same way in each recipe to avoid confusing the reader.

<ingredient xml:id="onions" quantity="3" unit="Kg" size="small"
  colour="red" vegetable="onion" treatment="peeled and chopped fine"/>

Other material about quality, size, shape, and treatment may be interspersed: the example above would be needed for 3 Kg small red onions, peeled and chopped fine. The borderline between the preparation or treatment being attached to the ingredient, or being mentioned in the steps of the method is sometimes hard to determine: both are common, and the discussion above about style and consistency applies here also.

Measurement

Measurements have their own cultural conventions. In modern western cultures there are three common ‘standards’:

In most European-influenced cultures, metric units are standard (grams, kilos, liters).
In the UK and some of its former spheres of influence, metric units are the norm for published recipes, but Imperial units are still often used domestically (ounces, pounds, pints, with 20 fluid ounces to the pint).
In the USA, measurements are given by volume (cups, pints, with 16 fluid ounces to the pint) but also (in larger quantities) in pounds and quarts; the word ‘ounce’ is used to mean ‘fluid ounce’, as an ounce weight is rarely used. Canada and Australia officially use the metric system but many people still habitually use Imperial or US measurements. New Zealand uses metric measures but has a standard metric cup (250ml).

However, all these cultures tend to use similar measures for very small quantities, subject to some minor differences related to eating and drinking habits (see Figure 3):

tea-spoon or coffee-spoon (tsp, cuillère à café or càc, etc: about 5ml), although in cultures where tea-drinking predominates, a coffee-spoon is smaller than a tea-spoon, about 3ml
dessert-spoon (dsp, cuillère à dessert or càd, etc: about 10ml), common in UK and French-speaking cultures only, so far as I have been able to determine
table-spoon or soup-spoon (tbsp, cuillère à soupe or càs, etc: about 15ml, but a tbsp is 20ml in Australia); in the UK, a soup-spoon is the same size as a dessert-spoon (although a different shape). As [Freeling 1972] puts it, ‘[e]veryone knows how big a table-spoon is: it will just go into your mouth, though not if you have nice manners.’
Other spoons exist, of course: mustard-spoons and salt-spoons, for example, but I am not aware of any standard capacities. Extensive internationalisation would be needed for more widespread applicability: while the sizes appear not to vary much, the names and abbreviations are of course different.

Errors

It is not uncommon for recipes published in books and magazines and on the web to contain mistakes that can confuse even experienced cooks. This may be caused by many factors, including writing or typing up the recipe in a hurry; changing it while you experiment, and forgetting to update it; failing to get it edited professionally before publishing; working from illegible or out-of-date sources; misunderstanding a translation; or not testing the recipe — and doubtless many others including plain ordinary typographic errors.

Errors in recipes are an annoyance to readers when a dish fails; they are an embarrassment to their authors; they are damaging to the reputation of the publishers; and occasionally they can be the cause of serious financial loss, if a book has to be withdrawn because of them. It is therefore in everyone’s interests that recipes be as correct as possible. This research is an attempt to see if markup can contribute to a solution.

Complete omission of an ingredient (both from the list and from the method) is an editorial and testing problem, easily fixed online but not in print [Cloake 2011]. This class of error is not susceptible to treatment in software as the relevant data is by definition entirely absent in the first place, so there is nothing for a program to do anything with.

Cloake (2011) also quotes an example of the much more common problem of omitting the ingredient in one place but not the other:

Nigella’s Feast […] contains a recipe for a chocolate orange cake that includes a direction to ‘cream together the butter and sugar’ — which would come as a nasty surprise to the prospective baker, given no butter is mentioned in the ingredients. (When chocolatier Paul A Young tried both versions, he concluded the butter was a red herring — the cake turns out much better without it.)

Mismatched quantities can also confuse the cook. Jacob (2016) describes an error where the list of ingredients specified four cups (of shredded sharp cheddar cheese), but the method only used half a cup.

In an earlier article, Jacob (2010) identified seven classes of error (14 if we include a later list of seven more). Most of these are editorial problems which are important but out of scope for this research. The key concerns here are (using [Jacob 2010]’s original numbering for the first and second lists):

Ingredients out of order (1/1)
Missing ingredient (1/2)
Wrong amounts (1/3)
Making every step a separate number (2/6)

In item 2, [Jacob 2010] groups together the errors of omission and of commission (a listed ingredient which does not get used; and a step referring to an ingredient that is not listed), but we would argue that these are technically two separate classes of error.

Testing is probably the most essential part of recipe development, but for this very reason, each cycle of testing means changes to the recipe. Hart, in an article on writing cookery books, emphasises that while there are things a good editor will catch, it’s up to the cook/author to get it right to start with [Hart 2012].

An additional class of error is the inconsistent use of names, that is, using a different name for an ingredient in the List of Ingredients to the one used in the Method. This can occur where different cultures name things differently, and either lack of editorial oversight or authorial absent-mindedness results in both names being used for the same things in different places (‘spring onions’ and ‘scallions’ is one example that might need explaining out of its cultural context).

These problems are not new. Burros (1997) was blunter about it:

The prevalence of errors in cookbooks is the publishing world’s dirty little secret. The problem is likely to get worse as an industry mired in economic doldrums resorts to cost-cutting, practically guaranteeing less editing and testing before publication.

The publishing industry has indeed continued to get worse, and it is now a rare publisher who can offer to copyedit and proofread a manuscript, and the online publishing business has regrettably mirrored the worst practices of its print forebears. Burros (1997) goes on to explain the division of blame between publisher and author, both of whom feel the other could do more, and concludes that

[i]t is a haphazard system — further complicated by typesetting errors and editing that too often fails to eliminate confusion.

Elsewhere she refers to human error or computer gremlins, which is where the present research comes in.

Scope

With rare exceptions, published recipes nowadays are either on the web, which means HTML in one form or another; or in print, which means typesetting to PDF. The source format for new recipes is likely to be a Microsoft Word file, or a blog entry (perhaps Markdown), or an email message, or possibly still a typescript or manuscript. They may be original to an author (even though something similar may have existed for centuries elsewhere, unknown to the author), or they may have been copied or converted in many ways from recipes passed between friends and family, and they may of course also have been pirated: copyright notwithstanding, photocopies of recipes from magazines and books are legion, and it is not hard to do an OCR from a scan.

The scope for errors is enormous: the author’s own experience includes an edit of a typescript which listed half a pint of milk, originally typed (on a typewriter, from a handwritten recipe) as 1/2 pt milk. This became 1 or 2 pints milk in the editing (by someone unfamiliar with the lack of a ½ sign on an old typewriter), but it was corrected at proofing stage to ½ pt milk, and then the defective software used by the typesetter could only manage □ pt milk.

Recipe management software is available industrially, but tends to focus on very large volume production in the food industry and the automation of mixing and cooking equipment. However, a Belgian software company, youmeal.io, produces kitchen-oriented food analysis products for the catering and restaurant industry, and emphasises that using correct food data is of primary importance. They quote a study of their own claiming that 50% of technical sheets for compound products were incomplete or incorrect.

A software solution to at least some of the problems above was considered to be potentially of use to the cookery author, editor, or publisher, as well as to cooks who want to write up their own recipes in a way that will pass the test of time, but many other problems will continue to rely on humans for a solution. (Historical recipes are interesting for the lack of detail as well as for the actual food: some of them read like recipes from a professional cook’s manual such as Le Répertoire de la Cuisine [Saulnier 1982] where for brevity the reader is assumed already to know everything from experience; others are virtually unusable because not all the relevant ingredients are mentioned, so expert guesswork is needed.)

From the errors discussed earlier, a candidate list of topics emerged, based on susceptibility to solution by software:

Ingredient referred to in method was never listed
Ingredient listed was never referred to in method
Ingredients out of order
Bogus quantities (eg too big or too small)
Mismatched quantities (different between the list of ingredients and the step of the method)
Inconsistent naming of ingredient between list and step
Steps too small (ie too many of them)

Of these, the control on bogus quantities was seen as unimplementable without a data history and suitable limits, which places it outside the scope of this experiment. The step size problem is also not easily susceptible to machine judgement. Both these classes were therefore dropped at this stage.

Schematron was suggested by two reviewers, and could be used to calculate ‘reasonable’ measurements and highlight deviations, as well as to identify ingredient item conflicts, but in the time available this was not possible.

The objective, therefore, was to see if adding markup to the ingredients and steps could be used at or before the rendering stage to limit the remaining classes of error without creating too much work for the author or editor.

It was seen as important for potential solutions that they could be implemented in any programming language, and the data could be stored in a number of different ways, so while this implementation is in XML and XSLT, the data structure (50 lines) and the code (600 lines) are both small and should be easy to reimplement. The choice of XML was based on a number of considerations: a) many publishers already use XML as part of their workflow; b) it is commonplace in web systems; and c) a recipe is essentially narrative text (still), even if it is presented in the form of two lists, and XML was designed for dealing with mixed content (plain text mixed with special meanings). XML editing software also has controls which can be used on elements and attributes and references to them, early in the workflow, as well as at the point where output is created.

Taking this as a starting-point, some common XML markup features could readily be seen as having potential use: for example the built-in ID/IDREF checks could be used to test for the presence or absence of ingredients in the steps and vice versa; and enumerated (token list) attributes could be used to represent the options for different categories of ingredients. This would improve the accuracy of reproducing the textual form of the ingredients; allow for finer-grained checking; and enable indexing for book publication and for online searching.

During initial development it became apparent that a sufficiently accurate categorisation of the ingredient metadata could provide a solution to error class 6 by [re]generating the textual form of each ingredient programmatically from the categorised data.

Implementation

The implementation proceeded in two phases: developing and testing the ID/IDREF mechanism, used for error classes 1, 2, 3, and 5 in the list in §5, and developing the categorisation for ingredients, used in class 6.

Identity checks

Some initial tests showed that detecting the use of an ingredient was trivial. Given a schema that makes xml:id a REQUIRED attribute on an ingredient element, a conditional using an XPath statement such as count(idref(@xml:id))=0 is sufficient to determine if the ingredient is not referenced anywhere else in the recipe. Note that at this level, it does not control for where such a reference ought to occur, nor whether it would be meaningful in context: those are still tasks for a human editor or proofreader.

The reverse is simpler and even less controlled: if the references from the steps to ingredients are done using an element with an IDREF attribute, then standard validation techniques will throw an error on any such references that have no matching ID, even before regular processing starts.

As a first stage, therefore, we can use two declarations, one for the ingredients and one for references to them:

        <!ELEMENT ingredient (#PCDATA)>
        <!ATTLIST ingredient xml:id ID #REQUIRED>
        ...
        <!ELEMENT ing EMPTY>
        <!ATTLIST ing i IDREFS #REQUIRED>

The first element would occur as part of the content model for the list of ingredients, and the second element would be valid in mixed content in the steps of the method, as the reference to the ingredient[s] being used. In fact, if this system is to be implemented in an existing schema/DTD (as opposed to the nonce schema used for testing), only the attributes are required: the names of the element types could be anything.
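The two checks described above can be sketched outside a validating XML toolchain. The following is a minimal illustration in Python rather than the XSLT used for the experiment; the function name check_identity and the SAMPLE recipe are hypothetical, and the element and attribute names are those of the declarations above. A validating parser would already reject the dangling reference, so the second list is only needed where validation is not available.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_identity(recipe_xml):
    """Return (unused, unlisted) ingredient IDs for one recipe.

    unused:   listed in <ingredients> but never referenced by any <ing>
    unlisted: referenced by an <ing> but absent from <ingredients>
    """
    root = ET.fromstring(recipe_xml)
    listed = [item.get(XML_ID) for item in root.iter("ingredient")]
    referenced = []
    for ref in root.iter("ing"):
        # The i attribute is IDREFS: a space-separated list of IDs.
        referenced.extend(ref.get("i", "").split())
    unused = [ident for ident in listed if ident not in referenced]
    unlisted = [ident for ident in referenced if ident not in listed]
    return unused, unlisted


# Hypothetical test recipe: salt is never used, butter is never listed.
SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
    <ingredient xml:id="salt">salt</ingredient>
  </ingredients>
  <method>
    <step>Cream the <ing i="butter sugar"/>.</step>
    <step>Fold in the <ing i="flour"/>.</step>
  </method>
</recipe>"""

unused, unlisted = check_identity(SAMPLE)
print("listed but never used:", unused)    # ['salt']
print("used but never listed:", unlisted)  # ['butter']
```

As with the XPath test, this says nothing about whether a reference occurs in the right place or makes sense in context.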

Consistency

The ID/IDREF link used in section “Identity checks” can also be used to reproduce the name of the ingredient at the point of reference, instead of requiring it to be entered manually during composition, unless some special wording is required. In effect, if we write

        <ingredients>
          <ingredient xml:id="flour">brown flour</ingredient>
          <ingredient xml:id="sugar">muscovado sugar</ingredient>
        </ingredients>
        ...
        <method>
        ...
          <step>Add the <ing i="flour sugar"/> and mix well.</step>
        </method>

it is straightforward to write code which will produce

3. Add the brown flour and muscovado sugar and mix well.

This makes use of the binding between ingredient and mention which addressed the missing ingredients problem. However, merely reproducing the name of the linked ingredient does not solve the problem of the wrong ingredient being accidentally referenced, and in many cases the full name is not required (eg just ‘flour’ and ‘sugar’ are enough). Proofreading and recipe-testing are still important to prevent this.
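As an illustration of how little code the name reproduction needs, here is a sketch in Python (not the author's XSLT). The helper names join_names, render_step, and render_method are hypothetical, and the way three or more names are joined is an assumption, since the paper only shows the two-name case.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def join_names(names):
    """Join names in running text; the style for 3+ names is an assumption."""
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def render_step(step, names):
    """Flatten the mixed content of a <step>, expanding each <ing> reference."""
    parts = [step.text or ""]
    for child in step:
        if child.tag == "ing":
            # A KeyError here means a reference to an unlisted ingredient.
            parts.append(join_names([names[i] for i in child.get("i").split()]))
        parts.append(child.tail or "")
    return "".join(parts)


def render_method(recipe_xml):
    """Return the numbered steps of the Method as plain text lines."""
    root = ET.fromstring(recipe_xml)
    names = {item.get(XML_ID): (item.text or "").strip()
             for item in root.iter("ingredient")}
    return [f"{n}. {render_step(step, names)}"
            for n, step in enumerate(root.iter("step"), start=1)]


SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
  </ingredients>
  <method>
    <step>Add the <ing i="flour sugar"/> and mix well.</step>
  </method>
</recipe>"""

for line in render_method(SAMPLE):
    print(line)  # 1. Add the brown flour and muscovado sugar and mix well.
```

The step is numbered 1 here only because the sample Method contains a single step.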

Order

A test for the order or sequence of ingredients could be encoded into the handling of the mixed-content element type (ing in the example in section “Identity checks”), but in order to take account of potential previous references to the same ingredient, which encumbers the coding, it is preferable to do this at another stage, for example in the handling of the container of the steps of the Method.

For each grouped unique occurrence of descendant ID values (that is, in the steps of the Method), the position within the Method is compared with the position of the matching ingredient in the List of Ingredients.^[2]

This means (using the example in section “Consistency”) that

          <step>Add the <ing i="sugar flour"/> and mix well</step>

would throw an error because the flour is listed as an earlier ingredient than the sugar.

While it is conventional to list the ingredients in order of their mention, it is by no means universal; but where ingredients are grouped (for example into component parts of the recipe), then there are usually also multiple matching Method steps, and within them the rule of order-of-mention appears to be observed.
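The comparison can be sketched as follows, again in Python rather than XSLT. The function name check_order and the TEMPLATE recipe are hypothetical. The sketch looks only at the first mention of each ingredient and treats the whole recipe as a single group, so it does not model grouped ingredient lists.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_order(recipe_xml):
    """Return (ingredient, mentioned_after) pairs that break listing order.

    Each pair names an ingredient that is listed earlier than an ingredient
    whose first mention in the Method comes before its own first mention.
    """
    root = ET.fromstring(recipe_xml)
    listed = [item.get(XML_ID) for item in root.iter("ingredient")]

    # Unique IDs in order of first mention; iter() walks in document order.
    first_mentions = []
    for ref in root.iter("ing"):
        for ident in ref.get("i", "").split():
            if ident in listed and ident not in first_mentions:
                first_mentions.append(ident)

    problems = []
    latest_pos, latest_id = -1, None
    for ident in first_mentions:
        pos = listed.index(ident)
        if pos < latest_pos:
            problems.append((ident, latest_id))
        else:
            latest_pos, latest_id = pos, ident
    return problems


TEMPLATE = """<recipe>
  <ingredients>
    <ingredient xml:id="flour">brown flour</ingredient>
    <ingredient xml:id="sugar">muscovado sugar</ingredient>
  </ingredients>
  <method>
    <step>Add the <ing i="{refs}"/> and mix well</step>
  </method>
</recipe>"""

print(check_order(TEMPLATE.format(refs="flour sugar")))  # []
print(check_order(TEMPLATE.format(refs="sugar flour")))  # [('flour', 'sugar')]
```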

Categorisation

It became apparent that the disaggregation of the ingredient data could lead to the generation of the human-readable ingredient items both in the List of Ingredients and in the mentions in the Method. There is a formality here too, in the way in which ingredients are expressed, and there are conventions which vary by culture. It is possible to say 100 g walnuts, chopped fine as well as 100 g finely-chopped walnuts: both mean the same thing, although in English there is an implicit presumption in the first form that you take whole walnuts and chop them fine yourself; and in the second, that you buy the walnuts ready-chopped. While these variants are largely stylistic, published collections of recipes try to standardise on one way of saying things in order not to confuse the readers, especially if they are likely to be beginners and unfamiliar with the conventions.

It therefore became an additional task to equip the system with the ability to store the ingredient data as separate identities for units, quantities, different classes of foodstuffs, qualities, treatments, etc, so that the ingredients list could be generated in an acceptable format, especially across many recipes following a pattern. A side-benefit is that it could also result in the consistent use of names between ingredients and method. The categorisation of the ingredients required considerably more work, and remains open to much discussion.

Many categorisations or classifications are based on nutrition or source, both of which would require specialist knowledge to enter as data. Wikipedia suggests Dairy, Fruits, Grains/Beans/Legumes, Meat, Confections, Vegetables, and Water [Northamerica1000 2020], based largely on work by Nestlé (2013), which is closer to how a cook would think of ingredients. Bearing in mind that a categorisation for this purpose needs to be useful for decision-making (Is this recipe vegetarian? Is there alcohol in this recipe? Does it contain nuts?), a few changes were made to this scheme:

the Meat category was split into Meat and Fish (to cover seafood)
Nuts were separated out from other Vegetable materials, as was Pasta
Confections was ignored as a separate category (sugar is subsumed under Spices)
store-cupboard ingredients were given their own category of Basic (although there could be much dispute over what one person has in this category compared with another person)

Five additional categories were Herbs; Spices; Alcohol; Toppings, which covers edible decoration; and Prep, intended for ready-prepared ingredients usually bought pre-packaged.

This leaves unsolved some problems of categorisation which are not dealt with elsewhere because traditional food classifications omit items such as chocolate (technically a ready-prepared item, although humorists would have it a food group in its own right). In the current settings, chocolate is a store-cupboard item but chocolate-chips are a topping.

Markup

The current system provides for the following attributes on the ingredient element:

@xml:id, unique ID for the ingredient
@quantity, a number, possibly including a decimal fraction (but restricted to the half, quarters, eighths, thirds, and fifths, as these can be represented in text with existing Unicode fractions)
@unit, a list of standardised abbreviations (dl, dsp, fl.oz, g, Kg, lb, l, ml, oz, pt, tbsp, tsp) plus common measures such as cup, can, dash, drop, handful, etc
@unit-weight, text for describing a standard size of one of the common measures, like a 400 g can
@container, text for the name of the container of the @unit-weight
@size, a list of adjectives, eg large, medium, small, etc
@colour, a colour name used for description, like red apple
@quality, any adjective describing a pre-existing condition, eg dry, smooth, unsalted, etc (not a @treatment, see below)
Items (the material ingredients) — these are mutually exclusive (with the exception of @part):
- @meat, a list of meats, eg beef, chicken, lamb, pork, etc
- @fish, a list of seafood, eg salmon, hake, prawn, lobster, etc
- @part, a list of body parts or products, eg breast, kidney, wing, egg, seed, etc
- @dairy, a list of dairy products, eg milk, cheese, cream, yoghurt, etc
- @fruit, a list of fruits
- @alcohol, a list of drinks
- @herb, a list of herbs
- @vegetable, a list of vegetables
- @nuts, a list of nuts
- @pasta, a list of types of pasta, noodles, etc
- @spice, a list of spices
- @basic, a list of common store-cupboard ingredients, eg flour, oil, yeast, etc
- @toppings, a list of edible decorative items, eg Streusel
- @prep, text for any class of ready-prepared ingredient
@treatment, an adjective such as chopped, ground, melted, etc (something done to the foodstuff)
@note, a digit, for use in referring to footnotes (deprecated)
@comment, any text
@symbol, a symbol or emoji, provision for bullet labelling
@alt, text describing an alternative for substitution if the exact foodstuff is not available
@status, an enumerated list (optional or required), so that optional ingredients can be identified

These are used to describe the foodstuff in a way that avoids the need for extensive typing in most cases, as the enumerated list values can be selected from a menu. It was regarded as important that the actual names of items should not be subject to typing errors on each occasion of entry.

<ingredients>
  <ingredient xml:id="avo" quantity="4" size="large" quality="very ripe"
    treatment="chopped fine" vegetable="avocado"/>
  <ingredient xml:id="toms" quantity="2" size="medium"
    treatment="chopped just as fine" vegetable="tomato"/>
  <ingredient xml:id="oil" quantity="1" size="hefty" unit="dash" note="1"
    quality="pimento" basic="oil"/>
  <ingredient xml:id="lj" quantity="2" unit="tsp" fruit="lemon" part="juice"/>
  <ingredient xml:id="garlic" quantity="1" unit="clove" size="fat"
    vegetable="garlic"/> 
  <ingredient xml:id="ff" quantity="2–4" unit="fl.oz" dairy="fromage-frais"
    comment="or double [heavy] cream if not on a diet"
    alt="Sour cream is also good here"/> 
  <ingredient xml:id="salt" spice="salt"/>
  <ingredient xml:id="pep" spice="pepper"/>
</ingredients>

A set of rules was developed in XSLT which implements the grammatical precedence of the attribute descriptive values (described below). This results in a list such as:

4 large very ripe avocados, chopped fine

2 medium tomatoes, chopped just as fine

1 hefty dash pimento oil¹

2 tsp lemon juice

1 fat clove garlic

2–4 fl.oz fromage frais (or double [heavy] cream if not on a diet). Sour cream is also good here.

Footnotes in ingredient lists are extremely rare and largely inadvisable, so they are not provided for; the one in this example was implemented manually.

In tests, all the classes of ingredient could be represented without the need for character data content. However, much more extensive testing would be needed to ensure the coverage of the enumerated lists, and to tighten up the rules on how the wording is generated.

The lists mentioned in the attributes are plain text files, one value per line, ending in a vertical bar (the standard delimiter for enumerated attributes), so for example the test file meat.list currently says:

        beef|
        chicken|
        duck|
        ham|
        lamb|
        pork|
        turkey|

As they are plain text files, they can be customised to the author’s desire, and can be as long or as short as needed provided they follow the rules for enumerated list items (compounds need a hyphen, not a space, like fromage-frais; this is removed in the XSLT on output), so there is no limit on the number of items or their order (alphabetic order was used purely for convenience) and they don’t need to be one per line: any additional spacing is entirely optional.
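For tools outside the schema that need the same vocabularies (a menu in an editor, for example), the list files are easy to read. This is a small sketch in Python under the format just described; the function names parse_list and display_name are hypothetical.

```python
def parse_list(text):
    """Split the contents of a .list file into its values.

    Values are delimited by vertical bars; line breaks and other spacing
    around them are optional and are discarded.
    """
    return [token.strip() for token in text.split("|") if token.strip()]


def display_name(value):
    """Turn an enumerated value into its output form (hyphen to space)."""
    return value.replace("-", " ")


# Contents of the test file meat.list as shown above.
MEAT_LIST = """beef|
chicken|
duck|
ham|
lamb|
pork|
turkey|
"""

print(parse_list(MEAT_LIST))
print(display_name("fromage-frais"))  # fromage frais
```

Because the values are split on the bar alone, a file written on a single line, such as beef|chicken|duck|, gives the same result as one value per line.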

Rules using categorization

From inspection of existing recipes, it was possible to come up with a first conjecture on the order and precedence for expressing the ingredients in natural language, using the data in the attributes. Such a mechanism would require a much larger amount of data than was available for the rigorous regression testing needed before it could be widely used, but the current rules appear to work acceptably in many circumstances.

Quantity	This always comes first, except where it is implicit (knob butter) or where it is left to the cook (salt). Non-numeric quantities such as ranges (10–12 apples) or judgments (a few apples) are reproduced as-is, otherwise the integer portion of the quantity is used, and any (decimal) fractional part converted to the nearest vulgar fraction.
Size	Size is used as a prefix to the unit when the unit is common (eg large handful)
Unit weight	This is used when the quantity refers to an ingredient that comes supplied in a measured container, like a 400 g can of tomatoes. If it follows a numeric quantity, it gets a multiplication delimiter (×)
Container	This is only meaningful when @unit-weight is used, and gets output immediately after it
Unit	Unit follows quantity (but may have been prefixed by size and unit weight). Common units are pluralised if the quantity is more than one or is non-numeric (intervention: ‘dash’ requires an ‘e’)
Size	When the unit is standardised or absent, it is applied to the ingredient, not the unit (eg medium eggs)
Quality	This is a predetermined feature of the ingredient like best or home-grown, being one that the cook selects before use (see Treatment below)
Colour	Any colour; accepted as-is
Treatment	The actions ground, grated, and shredded are applied before the ingredient (see more below)
Ingredient	There are currently ten groups as described earlier. These are based on observation, and are largely pragmatic or conjectural: a) alcohol; b) basic (ie store-cupboard items); c) dairy; d) fruit; e) herb; f) meat; g) pasta; h) spice; i) toppings (decorative sprinkles); and j) vegetable. Order is not significant, as they must be mutually exclusive for any given ingredient. The lists can be tailored ad infinitum. If a value contains a hyphen, replace it with a space. This enables the use of hyphenated compounds like baking-powder, and two-word names like soy-sauce (the case where retention of the hyphen is needed is unresolved). Pluralisation of ingredients is a little more tricky than for quantities: if the quantity is more than one, or it is non-numeric, or the unit is a standardised unit (excluding tsp, tbsp, and dsp), and the ingredient is not among the values for meat, dairy, spice, pasta, basic, or herb (excluding spinach, seed, rice, and garlic), then pluralise it, adding an e to potato and tomato.
Part	If the ingredient is a part of a greater whole, like a flower, seedpod, kidney, skin, or egg, use it as-is, and pluralise it if the quantity is more than one or the unit is lb or kg.
Treatment	The remaining actions (ie excluding ground, grated, and shredded, which are handled above, and also excluding powder, butter, and to taste) are prefixed with a comma.

Alternative ingredients, if any, are added verbatim in parentheses; footnote marks are added if given; the [optional] indicator is added if required, and any comments are added in another set of parentheses.
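The Quantity rule above can be sketched as follows. This is a minimal illustration in Python, not the XSLT used in this work; the function name is hypothetical, and the set of vulgar fractions considered (quarters, thirds, and halves) is an assumption, as the rule does not say which fractions are candidates.

```python
# ASSUMPTION: only these five fractions are candidates for "nearest".
VULGAR = {0.25: "¼", 1 / 3: "⅓", 0.5: "½", 2 / 3: "⅔", 0.75: "¾"}


def format_quantity(quantity: str) -> str:
    """Render a @quantity value for the ingredient list."""
    try:
        value = float(quantity)
    except ValueError:
        # Ranges and judgments ("10–12", "a few") are reproduced as-is.
        return quantity
    whole = int(value)
    fraction = value - whole
    # Pick the closest candidate, allowing rounding down to 0 or up to 1.
    nearest = min([0.0, *VULGAR, 1.0], key=lambda c: abs(c - fraction))
    if nearest == 1.0:
        whole, nearest = whole + 1, 0.0
    glyph = VULGAR.get(nearest, "")
    # Suppress a leading zero when there is only a fraction.
    return glyph if whole == 0 and glyph else f"{whole}{glyph}"


for q in ("1.5", ".5", "0.333", "1", "10–12", "a few"):
    print(q, "->", format_quantity(q))
```

With the quantities used in Appendix A, this gives 1½ for 1.5, ½ for .5, and ⅓ for 0.333. Negative or otherwise unusual numeric values are not handled in this sketch.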

At the time of writing, smaller, experimental changes are being made, principally to accommodate syntactic needs revealed as more recipes are encoded. Two of the more common are the selective elision of adjectival @part and @colour values in references, where only the substantive is required; and the need for grouping, as in ‘add the spices’, which at the moment will cause omission of the order and reference tests.

Handling of conflicts

In examining the syntax of ingredient descriptions compared with that of references in the method, it was clear that there were places where additional information was needed in the references, for example to distinguish between two or more sugars, to group them together, or to highlight the fact that an ingredient needed to be referred to by more than just its name at this stage.

As a palliative measure, a @mod attribute was added to the ing element type. This is an enumerated attribute whose values are the names of all the control attributes on the ingredient element type; that is, all the descriptive ones but not the actual food-item attributes: @quantity, @unit, @unit-weight, @container, @size, @colour, @quality, @treatment.

Using this on the example in Appendix A, we could write

	<ing i="sugar" mod="quality"/>

which would result in dark brown sugar. This does not solve the problem of (hopefully edge) cases where identifying an ingredient accurately would need more than one such qualifier.
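The effect of @mod can be sketched as follows. This is a minimal illustration in Python, not the XSLT used in this work. The function names are hypothetical, and the food-item attribute names are assumed to be the singular forms seen in Appendix A (basic, dairy, spice, topping, vegetable) extended by analogy to the other groups.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# ASSUMPTION: one food-item attribute per group, named in the singular.
GROUPS = ("alcohol", "basic", "dairy", "fruit", "herb",
          "meat", "pasta", "spice", "topping", "vegetable")


def item_name(ingredient: ET.Element) -> str:
    """Return the food-item value, with hyphens replaced by spaces."""
    for group in GROUPS:
        if group in ingredient.attrib:
            return ingredient.attrib[group].replace("-", " ")
    return ingredient.get("part", "")


def reference_text(by_id: dict, ref_id: str, mod: str = "") -> str:
    """Build the text of a reference, prefixed by the @mod attribute value."""
    ingredient = by_id[ref_id]
    qualifier = ingredient.get(mod) if mod else None
    name = item_name(ingredient)
    return f"{qualifier} {name}" if qualifier else name


SAMPLE = """<ingredients>
  <ingredient xml:id="sugar" quantity="0.333" unit="cup"
    quality="dark brown" treatment="packed" spice="sugar"/>
</ingredients>"""

by_id = {el.get(XML_ID): el for el in ET.fromstring(SAMPLE).iter("ingredient")}
print(reference_text(by_id, "sugar", "quality"))  # dark brown sugar
print(reference_text(by_id, "sugar"))             # sugar
```

Because @mod names a single attribute, a reference needing two qualifiers (say, quality and treatment) cannot be expressed in this form, which is the limitation noted above.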

A related requirement is to disambiguate multiple related ingredients, such as all-purpose flour and whole-wheat flour. Currently, the XSLT code checks for the existence of one or more other ingredients with the same item name, and checks if they all have at least one of the control attributes in common (set to different values, like @quality). If so, the attribute value is used as a prefix on the items to make the reference.
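The disambiguation check just described might be sketched like this. It is a minimal illustration in Python, not the XSLT used in this work; the ingredients are represented as plain dictionaries with a hypothetical `item` key for the food-item value. The order in which control attributes are tried is an assumption: only the descriptive ones are considered, since in the flour example @quantity also differs but would not make a readable prefix.

```python
# ASSUMPTION: attributes are tried in this order; the first usable one wins.
PREFERRED = ("quality", "colour", "size", "treatment")

INGREDIENTS = {
    "plainflour": {"item": "flour", "quantity": "1.5", "unit": "cup",
                   "quality": "all-purpose"},
    "wwflour": {"item": "flour", "quantity": ".5", "unit": "cup",
                "quality": "whole-wheat"},
    "butter": {"item": "butter", "quantity": ".5", "unit": "cup",
               "quality": "unsalted"},
}


def disambiguated_name(ingredients: dict, ref_id: str) -> str:
    """Prefix the item name only when another ingredient shares it."""
    target = ingredients[ref_id]
    group = [ing for ing in ingredients.values()
             if ing["item"] == target["item"]]
    if len(group) < 2:
        return target["item"]
    for attr in PREFERRED:
        values = [ing.get(attr) for ing in group]
        # Usable only if every rival has the attribute and all values differ.
        if all(values) and len(set(values)) == len(values):
            return f"{target[attr]} {target['item']}"
    return target["item"]


for ref in INGREDIENTS:
    print(ref, "->", disambiguated_name(INGREDIENTS, ref))
```

Here the two flours are referred to as all-purpose flour and whole-wheat flour, while butter, having no rival, is referred to by name alone.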

Results and conclusions

Testing

The testing of ingredient and reference co-presence was shown to be trivial using the ID/IDREF mechanism in XML, which covers error classes 1 and 2.
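For readers without a validating parser to hand, the co-presence test can be approximated in a few lines. The following is a minimal sketch in Python using the non-validating standard-library parser, not the XSLT used in this work; the function name is hypothetical, and the sample is an abridged, invented fragment in the style of Appendix A. It assumes that @i holds one or more space-separated references, as in the worked example.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<recipe>
  <ingredients>
    <ingredient xml:id="butter" dairy="butter"/>
    <ingredient xml:id="cream" dairy="cream"/>
    <ingredient xml:id="egg" part="egg"/>
  </ingredients>
  <method>
    <step><para>Add the <ing i="butter"/>.</para></step>
    <step><para>Whisk the <ing i="milk egg"/>.</para></step>
  </method>
</recipe>"""


def check_copresence(recipe: ET.Element) -> tuple[list[str], list[str]]:
    """Return (unused ingredient IDs, referenced IDs with no ingredient)."""
    listed = [el.get(XML_ID) for el in recipe.iter("ingredient")]
    mentioned = []
    for ref in recipe.iter("ing"):
        mentioned.extend(ref.get("i", "").split())
    unused = [i for i in listed if i not in mentioned]
    missing = sorted(set(mentioned) - set(listed))
    return unused, missing


unused, missing = check_copresence(ET.fromstring(SAMPLE))
print("Unused:", unused)    # Unused: ['cream']
print("Missing:", missing)  # Missing: ['milk']
```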

The testing of ingredient order for error class 3 was not as trivial, but relatively straightforward to implement in XSLT. No attempt was made to implement any other order, such as quantity or semantic relevance.
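The order test amounts to comparing each ingredient's position in the list with the position of its first mention in the method. The following is a minimal sketch in Python, not the XSLT used in this work; the function name is hypothetical, and the two lists are transcribed by hand from the recipe in Appendix A.

```python
LISTED = ["plainflour", "wwflour", "sugar", "bp", "salt", "butter",
          "chips", "cashews", "cream", "egg"]
MENTIONS = ["plainflour", "wwflour", "sugar", "bp", "salt", "butter",
            "milk", "egg", "milk", "egg", "chips", "egg"]


def order_mismatches(listed: list[str], mentions: list[str]):
    """Return (id, listed position, first-mention position) for each
    ingredient whose two positions differ; positions count from 1."""
    first_seen = []
    for ref in mentions:
        if ref not in first_seen:
            first_seen.append(ref)
    problems = []
    for mention_pos, ref in enumerate(first_seen, start=1):
        if ref not in listed:
            continue  # a missing ingredient is a co-presence error instead
        list_pos = listed.index(ref) + 1
        if list_pos != mention_pos:
            problems.append((ref, list_pos, mention_pos))
    return problems


for ref, list_pos, mention_pos in order_mismatches(LISTED, MENTIONS):
    print(f"{ref} is listed at {list_pos} but mentioned at {mention_pos}")
```

For this recipe the sketch reports egg (listed 10th, mentioned 8th) and chips (listed 7th, mentioned 9th). A comparison this simple is sensitive to unused and missing ingredients, which shift every later position, so those errors are best corrected first.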

The potential mismatch in quantities between ingredient list and step (error class 5) was not tested: in the sample recipes used, there were no occurrences of partial quantities being used in one step, with the remainder used in another. There were indeed recipes using a single ingredient type in two or more places, but in those cases the quantities were given as separate ingredient items. An aggregate quantity test is needed where an ingredient is divided (a practice decried in [Jacob 2010]).

The naming (and regeneration of names) was by far the most complex matter. The reconstruction of ingredient listings from the disaggregated data is non-trivial, and a comprehensive solution would involve extension of the current system well into the future in order to handle the infinite number of ways that recipe authors will have of expressing themselves. However, for practical purposes, it appears (although this has not been quantified) that most recipes can be represented accurately, in the sense that the need to add new ingredient items to the lists diminished rapidly as testing proceeded. The current system appears to handle correctly the generation of items for the list of ingredients and their matching references in the method (error class 6), but it is in no way comprehensive and needs much more testing with a greater range of ingredients.^[3]

There was considerable conflict over the assignment of a few items to lists: should garlic be under vegetables or spices? Are beans a sufficiently large class to warrant their own list? Are nuts? It is simple enough to edit the files and change the classes, but some agreed standard would make it more useful.

Benefits and drawbacks

The benefits of a system checking these errors would include greater reliability, accuracy, and consistency; three things that publishers insist on from their contributors, whatever about the utility to personal web recipe sites.

Identifying the ingredient data in a form a computer can manage also has a benefit separate from these quality control aspects: it might make that hoary old chestnut ‘recipe search’ actually work for once, both in the sense of locating a recipe using specific ingredients, distinct from whatever the title says, as well as in the sense of letting cooks find out exactly what they can make with the ingredients in the quantities on hand.

I leave to others the dubious usefulness of having your recipe selection trigger your fridge into ordering the missing ingredients. While it is perfectly possible, the effort in maintaining the metadata after every midnight snack is probably not worth the candle.

The most obvious drawback in the system as it currently stands is that implementing it requires some form of programming in a target system. Cooks, and cookery authors and contributors, are not part of the target market for XML systems: although implementation in an XML editor should be straightforward, they are not going to buy an editor for recipes, and they won’t be using Emacs.

Commonplace editors like Microsoft Word can certainly be coerced into providing prompted or drop-down categorisation, although embedding the error-checking logic currently implemented in XSLT would require more effort. Web-based systems running JavaScript are perhaps more likely targets, as would be WordPress plugins. Unless someone makes me an offer I can’t refuse, the current code will be released under a suitable public licence later in the year.

Conclusions

In general, this work satisfied the requirements and demonstrated that a limited amount of data checking can eliminate (or at least, signal) five of the seven classes of errors described.

However, the need to have an authorial or editorial interface written to handle data input (encoding) accurately means that wider implementation would need to rely on demand, unless there is sufficient interest in a collaborative, possibly open-source, implementation.

Encoding would still remain a time-consuming operation, even with sophisticated software, because of the need to apply domain expertise, which in turn would require relatively experienced users (cooks, collectors, publishers). Given the fairly strict formatting of published recipes, however, it might be possible to write a semantic and syntactic filter to identify at least quantity, units, and name from published recipes. This has not been investigated in the current iteration.

The work on the category lists confirms the well-known principle that data should be stored at the lowest practicable level of disaggregation because it can always be aggregated for implementation, whereas data stored aggregated can never be broken back down into its components. It also confirms the long-held, if anecdotal, belief in systems design that time spent planning the data model shortens the overall development time: if the data model is right (that is, it matches reality), most requirements tend to click into place; if the data model is wrong, the entire project may be irretrievably damaged from the start.

However, the corollary is that if you do get the data model right, you will still need to front-load enough data for it to be workable as a model before you start to develop it into a full system. In the current circumstances, nowhere near enough recipes have been tested, so the front-loading is a potential point of failure, and for this reason the current system remains experimental and open to more widespread testing and updating.

Appendix A. Worked example

This is an example of a partly-edited recipe from the author’s collection, with unresolved issues (at the time):

<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe id="cashewscones">
  <nav/>
  <info>
    <title>Butterscotch and Cashew Drop-scones</title>
    <author>Anon</author>
    <copyright year="2019" web="https://www.teatimemagazine.com/"
      contrib="Ann Marie O’Connell">Tea Time Magazine</copyright>
  </info>
  <intro>
    <para>Anna mentioned this online and I asked her for the recipe.
      The original was from Tea Time Magazine (Jan/Feb 2019, but is
      not in their archive¹). She notes that it works fine with all
      white whole-wheat flour, and she also added large-crystal raw
      sugar as a topping, instead of an egg glaze, because of the
      additional caramel notes.</para>
  </intro>
  <ingredients>
    <ingredient xml:id="plainflour" quantity="1.5" unit="cup"
      quality="all-purpose" basic="flour"/>
    <ingredient xml:id="wwflour" quantity=".5" unit="cup"
      quality="whole-wheat" basic="flour"/>
    <ingredient xml:id="sugar" quantity="0.333" unit="cup"
      quality="dark brown" treatment="packed" spice="sugar"/>
    <ingredient xml:id="bp" quantity="1" unit="tbsp"
      basic="baking-powder"/>
    <ingredient xml:id="salt" quantity=".5" unit="tsp" spice="salt"
      comment="use ¼ tsp if the cashews are already salted"/>
    <ingredient xml:id="butter" quantity=".5" unit="cup"
      quality="unsalted" dairy="butter" treatment="chilled and
      diced"/>
    <ingredient xml:id="chips" quantity=".5" unit="cup"
      treatment="slightly heaping" topping="butterscotch-chips"
      alt="any preferred chips"/>
    <ingredient xml:id="cashews" quantity=".5" unit="cup"
      quality="toasted" treatment="slightly heaping"
      vegetable="cashew"/>
    <ingredient xml:id="cream" quantity=".5" unit="cup"
      quality="heavy" dairy="cream"/>
    <ingredient xml:id="egg" quantity="1" size="large"
      treatment="beaten" part="egg"/>
  </ingredients>
  <method>
    <step>
      <para>Preheat oven to 400°F.</para>
    </step>
    <step>
      <para>Combine together <ing i="plainflour wwflour sugar 
          bp salt"/> in medium bowl.</para>
    </step>
    <step>
      <para>Add the <ing i="butter"/>; using fingertips,
        rub to form coarse meal.</para>
    </step>
    <step>
      <para>In separate bowl, whisk the <ing i="milk"/> and
        the <ing i="egg"/>.</para>
    </step>
    <step>
      <para>Gradually add the <ing i="milk egg"/> mix to the 
        flour mixture, keeping back 1 tsp of the egg mix to use
        for glazing.</para>
    </step>
    <step>
      <para>Toss or knead it to thoroughly moisten it and form a
        clumpy dough (add more milk if too dry).</para>
    </step>
    <step>
      <para>Mix in the <ing i="chips"/>.</para>
    </step>
    <step>
      <para>Drop the dough by ¼ cupfuls onto a nonstick or lightly
        greased baking sheet at least 1 inch apart, to give 8–10
        drop-scones. (You can line a regular pan with aluminum foil
        instead of greasing it.)</para>
    </step>
    <step>
      <para>Brush the remaining <ing i="egg"/> on top as a
        glaze.</para>
    </step>
    <step>
      <para>Bake for about 20 minutes or until golden brown.</para>
    </step>
  </method>
  <para>You can also use a mini-scone baking pan, like the Nordic Ware
    cast-aluminum one, which gives you 16 triangular scones.</para>
  <para>If you use the “freeze the portioned dough” technique, they
    will need to bake 3–5 minutes longer.</para>
  <para>¹ Possibly because the ingredients didn’t match the method in
    several places.</para>
</recipe>

Running the current XSLT code produces the following log:

Processing cashewscones.xml using xml2html.xsl to cashewscones.html
Using parameters
8. Unused ingredient "slightly heaping ½ cup toasted cashews"
9. Unused ingredient "½ cup heavy cream"
Checking 1. @plainflour
Checking 2. @wwflour
Checking 3. @sugar
Checking 4. @bp
Checking 5. @salt
Checking 6. @butter
Checking 7. @milk
Ingredient "" (milk) is listed 1st but mentioned 7th
Checking 8. @egg
Ingredient "1 large egg, beaten" (egg) is listed 10th but mentioned 8th
Checking 9. @chips
Ingredient "slightly heaping ½ cup butterscotch chips (or any preferred chips)" 
            (chips) is listed 7th but mentioned 9th
4. No ingredient matching ID "milk"
5. No ingredient matching ID "milk"

The amended and functional recipe is available on the author’s web site at http://xml.silmaril.ie/recipes/cashewscones.html.

References

[Acton 1845] Acton, Eliza (1845) Modern Cookery for Private Families. Longman, London, 644pp.

[Anon 2016] Anon (2016) ‘Recipes’. In Archaeology May/June 2016, Archaeological Institute of America, Palm Coast, FL.

[Vehling 1936] Vehling, Joseph Dommers (1936) Cookery and Dining in Imperial Rome. Walter M Hill, Chicago, IL, 301pp.

[Beeton 1861] Beeton, Isabella (1861) [Mrs] Beeton’s Book of Household Management. S.O. Beeton Publishing, London, 1112pp.

[Burros 1997] Burros, Marian (1997) ‘Cookbook Follies’. In New York Times September 1997.

[Cloake 2011] Cloake, Felicity (2011) ‘Cookbook errors’. In The Guardian September 2011.

[Farmer 1896] Farmer, Fannie Merritt (1896) The Boston cooking-school cook book. Little, Brown, & Co, Boston, MA, 620pp. URI:https://d.lib.msu.edu/fa/8#page/2/mode/2up (retrieved 7 February 2020).

[Freeling 1972] Freeling, Nicolas (1972) The Cook Book. Hamish Hamilton, London, 154pp. ISBN:0879238623.

[Hart 2012] Hart, Alice (2012) ‘How to write your first cookbook’. In The Guardian July 2012.

[Sitwell 2012] Sitwell, William (2012) ‘A history of cookbooks’. In The Bookseller June 2012, Bookseller Media Ltd, London.

[Jacob 2010] Jacob, Dianne (2010) 7 Most Common Recipe Writing Errors. Author’s web site, Oakland, CA. URI:https://diannej.com/2010/7-most-common-recipe-writing-errors/ (retrieved 14 December 2019).

[Jacob 2016] Jacob, Dianne (2016) When a Reader Found a Cookbook Error. Author’s web site, Oakland, CA. URI:https://diannej.com/2016/reader-finds-cookbook-recipe-error/ (retrieved 18 December 2019).

[Knauf 2017] Knauf, Torsten (2017) Definition der TEI-basierten culinary editions Markup Language (cueML), Bewertung von Verfahren für die automatische Extraktion von Zutatenlisten aus Rezepten und die Auszeichnung des Praktischen Kochbuchs für die gewöhnliche und feinere Küche von Henriette Davidis (1849). URI:https://shaman-apprentice.github.io/MyMasterThesis/ (retrieved 11 February 2020).

[Klug 2017] Klug, Helmut (2017) ‘Cooking Recipes of the Middle Ages’. URI:https://static.uni-graz.at/fileadmin/gewi-zentren/Informationsmodellierung/PDF/Laurioux__Klug_-_Scientific_Proposal_ANR-FWF_-_full.pdf (retrieved 11 February 2020).

[Masters 2013] Masters, Kristin (2013) ‘The Incredible Treasures of Manuscript Cookbooks’. In ILAB July 2013, International League of Antiquarian Booksellers, Geneva.

[Nestlé 2013] Nestlé, Marion (2013) Food Politics. University of California Press, Berkeley, CA. ISBN:9780520275966.

[Saulnier 1982] Saulnier, Louis (1982) Le Répertoire de la Cuisine. Leon Jaeggi & Sons Ltd, Ashford, UK, 239pp. ASIN:B00I637XDK.

[Shane 2020] Shane, Janelle C (2020) AI recipes are bad (and a proposal for making them worse). AI Weirdness, Lafayette, CO. URI:https://aiweirdness.com/post/190569291992/ai-recipes-are-bad-and-a-proposal-for-making-them (retrieved 9 February 2020).

[Shane 2020] Shane, Janelle C (2020) AI + Vintage American cooking: a combination that cannot be unseen. AI Weirdness, Lafayette, CO. URI:https://aiweirdness.com/post/190721709472/ai-vintage-american-cooking-a-combination-that (retrieved 8 February 2020).

[Northamerica1000 2020] Northamerica1000 (2020) Food group. Wikipedia, The Free Encyclopedia, San Francisco, CA. URI:https://en.wikipedia.org/w/index.php?title=Food_group&oldid=939771878 (retrieved 19 February 2020).

^[1] Freeling’s The Cook Book is possibly one of the last from a modern author in Europe to use the narrative style throughout [Freeling 1972].

^[2] My thanks to Michael Kay for his suggestions on how to achieve this most efficiently.

^[3] As an edge case, the system was tested with a few AI-generated recipes courtesy of [Shane 2020][Shane 2020] where a neural net created recipes without reference to feasibility or edibility (and much else!). However, having coded them to the above standard, they tested correctly, all the errors being picked up.

