Methodology for the Construction of Multi-Structured Documents

Authors: Pierre-Édouard Portier & Sylvie Calabretto
Date: 2009-08-13
LIRIS, CNRS

Multi-Structured Documents (1)

Structures as points of view

Multiple uses of the same document

Multiplication of documentary structures

SGML, XML: overlapping hierarchies

Last Minute (1)

painting

"The Structure Of Appearance"

The Appearance Of Structure

Last Minute (2)

painting

Aporia:

A document has no structure

=> The "String-In-A-Role" strategy

Multi-Structured Documents (2)

Desanti picture

Working environment

Institut Jean-Toussaint Desanti

Multi-Structured Documents (3)

two pages from Desanti

ZI: meaningful fragment of textual content spanning two pages

S1: pages

S2: "regions of interest"

E1, E2: equations

2 or 3 structures ...

Representation of M-S.D. (1)

4 categories

Representation of M-S.D. (2)

5 dimensions

Representation of M-S.D. (3)

CONCUR

MuLaX: adaptation of SGML CONCUR to XML

Representation of M-S.D. (4)

Ad-Hoc solutions

Representation of M-S.D. (5)

TexMECS and LMNL

Representation of M-S.D. (6)

Annotation graphs

Developed to model linguistic phenomena.

annotation graph

RDF graphs: annotation graphs in a well-known formalism

Last Minute (3)

painting

EARMARK: Extreme Annotations RDF Markup

Representation of M-S.D. (7)

GODDAG

Representation of M-S.D. (8)

multi colored trees

MCT (Multi-Colored Trees)

Representation of M-S.D. (9)

MSXD (Multi-Structured XML Documents)

Representation of M-S.D. (10)

multi colored trees

Delay Nodes

Representation of M-S.D. (11)

MonetDB/XQuery

Representation of M-S.D. (12)

MSDM

MSDM: a document is a graph D composed of:

Representation of M-S.D. (13)

benchmark

MultiX, XML instance of MSDM (1)

First Structure:

<s1>
 <page>Autrement dit la distinction signe-signifie ...
   Remarque,
 </page>
 <page>ce discours, ...
   par ex le discours 3+2=0-1 est-il un texte ? ...
 </page>
</s1>

MultiX, XML instance of MSDM (2)

Second Structure:

<s2>
<p>Autrement dit la distinction signe-signifie...</p>
<p>Remarque, ce discours, ...</p>
<p>par ex le discours
 <eq>3 + 2 = 0 - 1</eq> est-il un texte ? ...</p>
</s2>

MultiX, XML instance of MSDM (3)

Base Structure:

<seg xml:id="F1">Autrement dit la distinction
signe-signifie ...</seg>
<seg xml:id="F2">Remarque, </seg>
<seg xml:id="F3">ce discours, ...</seg>
<seg xml:id="F4">par ex le discours </seg>
<seg xml:id="F5">3 + 2 = 0 - 1</seg>
<seg xml:id="F6"> est-il un texte ? ...</seg>

MultiX, XML instance of MSDM (4)

First Structure:

<s1>
<page>
 <xi:include href="b.xml" xpointer="element(F1/1)"/>
 <xi:include href="b.xml" xpointer="element(F2/1)"/>
</page>
<page>
 <xi:include href="b.xml" xpointer="element(F3/1)"/>
 <xi:include href="b.xml" xpointer="element(F4/1)"/>
 <xi:include href="b.xml" xpointer="element(F5/1)"/>
 <xi:include href="b.xml" xpointer="element(F6/1)"/>
</page>
</s1>

MultiX, XML instance of MSDM (5)

Second Structure:

<s2>
<p>
 <xi:include href="b.xml" xpointer="element(F1/1)"/>
</p>
<p>
 <xi:include href="b.xml" xpointer="element(F2/1)"/>
 <xi:include href="b.xml" xpointer="element(F3/1)"/>
</p>
<p>
 <xi:include href="b.xml" xpointer="element(F4/1)"/>
 <eq>
  <xi:include href="b.xml" xpointer="element(F5/1)"/>
 </eq>
 <xi:include href="b.xml" xpointer="element(F6/1)"/>
</p>
</s2>
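
To make the role of the base structure concrete, here is a minimal Python sketch that rebuilds the text of each paragraph of the second structure from the shared fragments. It is an illustration under assumptions, not the MultiX implementation: the `<base>` root element, the `xmlns:xi` declaration and the helper names `fragment_table`, `fragment_ids` and `text_of` are hypothetical, and the `xpointer` value is reduced to the fragment identifier it names rather than evaluated as a real XPointer.

```python
import re
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Base structure. The <base> root is an assumption added so that the
# fragment list parses as one XML document.
BASE = """<base>
<seg xml:id="F1">Autrement dit la distinction signe-signifie ...</seg>
<seg xml:id="F2">Remarque, </seg>
<seg xml:id="F3">ce discours, ...</seg>
<seg xml:id="F4">par ex le discours </seg>
<seg xml:id="F5">3 + 2 = 0 - 1</seg>
<seg xml:id="F6"> est-il un texte ? ...</seg>
</base>"""

# Second structure. The xmlns:xi declaration is added for parsing.
S2 = """<s2 xmlns:xi="http://www.w3.org/2001/XInclude">
<p><xi:include href="b.xml" xpointer="element(F1/1)"/></p>
<p><xi:include href="b.xml" xpointer="element(F2/1)"/>
<xi:include href="b.xml" xpointer="element(F3/1)"/></p>
<p><xi:include href="b.xml" xpointer="element(F4/1)"/>
<eq><xi:include href="b.xml" xpointer="element(F5/1)"/></eq>
<xi:include href="b.xml" xpointer="element(F6/1)"/></p>
</s2>"""


def fragment_table(base_xml):
    """Map each fragment identifier to its text content."""
    root = ET.fromstring(base_xml)
    return {seg.get(XML_ID): seg.text or "" for seg in root.iter("seg")}


def fragment_ids(element):
    """Fragment identifiers referenced under `element`, in document order."""
    ids = []
    for include in element.iter(XI_INCLUDE):
        # Simplification: keep only the identifier named in the pointer.
        match = re.match(r"element\((\w+)", include.get("xpointer", ""))
        if match:
            ids.append(match.group(1))
    return ids


def text_of(element, fragments):
    """Concatenate the text of every fragment referenced under `element`."""
    return "".join(fragments[i] for i in fragment_ids(element))


fragments = fragment_table(BASE)
s2 = ET.fromstring(S2)
paragraphs = [text_of(p, fragments) for p in s2.iter("p")]
```

Because both structures point into the same fragments, the same helpers applied to the first structure would yield the page texts from identical content.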

MultiX, XML instance of MSDM (6)

XQuery functions:

let $physique := doc("physique.xml")
let $logique := doc("logique.xml")
for $page in $physique//page,
    $para in $logique//p
where multix:share-fragments($page,$para) and
      not(multix:include-content-of($page,$para))
return $para

Finds the regions of interest that overlap two pages.
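
The logic of the query can be sketched in Python over sets of fragment identifiers. The semantics given here to `share_fragments` and `include_content_of` are assumptions inferred from the function names and from the description of the query; they are not taken from the MultiX definition. The fragment lists mirror the two structures shown earlier.

```python
# Fragments referenced by each element of the two structures.
PAGES = {
    "page1": ["F1", "F2"],
    "page2": ["F3", "F4", "F5", "F6"],
}
PARAGRAPHS = {
    "p1": ["F1"],
    "p2": ["F2", "F3"],
    "p3": ["F4", "F5", "F6"],
}


def share_fragments(a, b):
    """Assumed semantics: a and b reference at least one common fragment."""
    return bool(set(a) & set(b))


def include_content_of(container, content):
    """Assumed semantics: every fragment of `content` is in `container`."""
    return set(content) <= set(container)


def paragraphs_spanning_pages(pages, paragraphs):
    """Paragraphs that touch a page without being contained in it."""
    spanning = []
    for name, para in paragraphs.items():
        if any(
            share_fragments(page, para) and not include_content_of(page, para)
            for page in pages.values()
        ):
            spanning.append(name)
    return spanning


result = paragraphs_spanning_pages(PAGES, PARAGRAPHS)
```

Unlike the XQuery expression, which returns a paragraph once for every page that satisfies the condition, this sketch reports each paragraph at most once.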

Construction of M-S.D. (1)

Three categories of methods:

Construction of M-S.D. (2)

restructuring

Restructuring stage

Pages and regions of interest are tagged until a region overlaps two pages.

Construction of M-S.D. (3)

automatic restructuring

Automatic restructuring

Creating a new structure is a purely formal operation: a graph is transformed into two trees.
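
One possible reading of this formal operation is sketched below in Python. It is a first-fit partition, not the authors' algorithm: each tagged element is an interval over fragment positions, and an element joins the first structure in which it overlaps nothing, so every resulting structure is properly nested and can be serialized as a tree. The `Element` type, the function names and the interval values are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    tag: str
    start: int  # index of the first fragment covered
    end: int    # index one past the last fragment covered


def overlaps(a, b):
    """True when a and b intersect and neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def split_into_trees(elements):
    """Place each element in the first structure where it nests cleanly."""
    structures = []
    for element in elements:
        for structure in structures:
            if not any(overlaps(element, other) for other in structure):
                structure.append(element)
                break
        else:
            # The element conflicts with every existing structure.
            structures.append([element])
    return structures


# Tagging order for the two-page example: the second paragraph
# (fragments 1 to 2) is the first element to straddle a page boundary.
TAGGED = [
    Element("page", 0, 2),
    Element("p", 0, 1),
    Element("page", 2, 6),
    Element("p", 1, 3),
    Element("p", 3, 6),
    Element("eq", 4, 5),
]
trees = split_into_trees(TAGGED)
```

The split is valid but arbitrary: here only the straddling paragraph moves to the second structure, while the other paragraphs stay with the pages.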

Construction of M-S.D. (4)

user integration

Integration of the user

Automatic restructuring is a good opportunity for the user to make modeling choices.

Construction of M-S.D. (5)

Recommendation system for document authors

With respect to the documents they are editing, two users are close if the tag trees implied by their structures are close.

Users 1, 2 User 3

theorem

statement

proof

lemma

statement

proof

cocycle

proposition

proof

operators

cohomology

cocycle
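
As an illustration of how closeness between tag trees might be measured, the Python sketch below computes the Jaccard similarity of their parent-child tag pairs. The choice of measure is an assumption, since no particular distance is specified here. The three trees are also hypothetical: they reuse tag names from the example above, but their shapes are invented for the sketch.

```python
def edges(tree, parent=None):
    """Set of (parent tag, child tag) pairs in a nested-dict tag tree."""
    pairs = set()
    for tag, children in tree.items():
        pairs.add((parent, tag))
        pairs |= edges(children, tag)
    return pairs


def similarity(tree_a, tree_b):
    """Jaccard similarity of the two edge sets, between 0.0 and 1.0."""
    a, b = edges(tree_a), edges(tree_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical tag trees, written as nested dictionaries.
USER_1 = {
    "theorem": {"statement": {}, "proof": {}},
    "lemma": {"statement": {}, "proof": {}},
}
USER_2 = {
    "theorem": {"statement": {}, "proof": {}},
}
USER_3 = {
    "proposition": {"proof": {}},
    "cohomology": {"cocycle": {}},
}

close = similarity(USER_1, USER_2)
far = similarity(USER_1, USER_3)
```

With these trees, the first two users share half of their tag pairs and the third shares none with the first, so a recommender built on this measure would suggest the tags of one of the first two users to the other.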

Prototype implementation (1)

REST interface

Example: POST to http://desanti.org/collections/148/structures/math/taggees with the following content:

<taggee>
  <tag name="equation" />
  <interval start="14" end="26" />
</taggee>
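
A client for this call could be sketched as follows in Python, using only the standard library. The helper names `build_taggee` and `build_request` and the `Content-Type` header are assumptions; the media type the prototype expects is not stated. The request is built but deliberately not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_taggee(tag_name, start, end):
    """Serialize a taggee element as bytes, ready to be used as a body."""
    taggee = ET.Element("taggee")
    ET.SubElement(taggee, "tag", name=tag_name)
    ET.SubElement(taggee, "interval", start=str(start), end=str(end))
    return ET.tostring(taggee)


def build_request(url, body):
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Assumed media type; adjust to what the service accepts.
        headers={"Content-Type": "application/xml"},
    )


URL = "http://desanti.org/collections/148/structures/math/taggees"
request = build_request(URL, build_taggee("equation", 14, 26))
# Passing `request` to urllib.request.urlopen would perform the call.
```

Building the body with ElementTree rather than string concatenation keeps attribute values correctly escaped.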

Prototype implementation (2)

JavaScript user interface


Conclusion

Multi-Structured documents: from their representation to their construction

Enforcing tree structures, long considered the crux of the M-S.D. problem, triggers the creation of new structures that the user must validate.

We propose an open methodology that can be used for the incremental and collective emergence of documentary structures.

Questions
