Sunday, March 17, 2013

CTS is complete under OHCO2


My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text:  trees (formatted in XML), tables, and graphs (expressed in RDF).    Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model."  More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2:  hence the title of the present post.

The CTS protocol uses CTS URNs to cite passages of texts.  The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies both where a passage of text sits in a citation hierarchy and where a particular version sits in a hierarchy of versions.  A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119, for example, refers to a passage in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and within that version it refers to a citable line (119) contained within a citable book (9).
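To make the two hierarchies concrete, here is a minimal Python sketch that splits a URN of the simple form shown above into its parts. The names CtsUrn and parse_cts_urn are my own for illustration, not part of any CTS library, and the sketch assumes a URN with exactly one work component and one passage component; it does not attempt to cover everything the CTS URN notation allows.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the parts of a simple CTS URN."""
    namespace: str
    version_hierarchy: Tuple[str, ...]   # e.g. ("tlg0012", "tlg001", "msA")
    citation_hierarchy: Tuple[str, ...]  # e.g. ("9", "119")


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN shaped like urn:cts:NAMESPACE:WORK:PASSAGE.

    Assumption: exactly five colon-delimited fields, as in the example
    in the text. Other forms are rejected rather than guessed at.
    """
    fields = urn.split(":")
    if len(fields) != 5 or fields[0] != "urn" or fields[1] != "cts":
        raise ValueError(f"not a simple CTS URN: {urn!r}")
    _, _, namespace, work, passage = fields
    return CtsUrn(
        namespace=namespace,
        version_hierarchy=tuple(work.split(".")),
        citation_hierarchy=tuple(passage.split(".")),
    )


parsed = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:9.119")
print(parsed.version_hierarchy)   # position in the hierarchy of versions
print(parsed.citation_hierarchy)  # position in the citation hierarchy
```

The point of the sketch is only that both hierarchies can be read directly off the identifier, without consulting the text itself.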

The remaining two OHCO2 properties are provided by a pair of CTS requests.  The GetPrevNext request places a passage within an ordered sequence;  the GetPassage request, by returning the contents of the passage, supports a mixed content model.

After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request.  We introduced the CTS GetPassagePlus request for just this purpose.  His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
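The relationship among the three requests can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CTS protocol itself: the function names, the in-memory list of nodes, and the dictionary reply are all hypothetical, and the content strings are placeholders rather than text of the Iliad.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    urn: str
    content: str  # placeholder for mixed content, treated as an opaque string


VERSION = "urn:cts:greekLit:tlg0012.tlg001.msA:"

# Hypothetical nodes of one version, already in document order.
NODES: List[Node] = [
    Node(VERSION + "9.118", "[content of 9.118]"),
    Node(VERSION + "9.119", "[content of 9.119]"),
    Node(VERSION + "9.120", "[content of 9.120]"),
]


def get_prev_next(nodes: List[Node], urn: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the URNs of the preceding and following nodes, or None at an edge."""
    idx = [n.urn for n in nodes].index(urn)
    prev_urn = nodes[idx - 1].urn if idx > 0 else None
    next_urn = nodes[idx + 1].urn if idx + 1 < len(nodes) else None
    return prev_urn, next_urn


def get_passage(nodes: List[Node], urn: str) -> str:
    """Return the content of the cited node."""
    return next(n.content for n in nodes if n.urn == urn)


def get_passage_plus(nodes: List[Node], urn: str) -> Dict[str, Optional[str]]:
    """Bundle the results of the two requests above into one reply."""
    prev_urn, next_urn = get_prev_next(nodes, urn)
    return {
        "urn": urn,  # carries the citation and version hierarchies
        "content": get_passage(nodes, urn),
        "prev": prev_urn,
        "next": next_urn,
    }


reply = get_passage_plus(NODES, VERSION + "9.119")
print(reply)
```

In this sketch the single reply holds the URN (two properties), the neighbors in sequence (ordering), and the content (mixed content model), which is the sense in which one request covers all four OHCO2 properties.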





Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)
We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

  • every node belongs to a citation hierarchy
  • every node belongs to a FRBR-like version hierarchy
  • nodes belonging to the same version are ordered
  • nodes support a mixed content model
Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.
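As a rough illustration of what equivalence means here, the following Python sketch reduces a tabular representation and a tree representation to the same ordered list of (URN, content) pairs and compares them. Everything in it is an assumption made for the example: the row layout, the nested-dictionary tree, the function names, and the simplification of mixed content to an opaque string. It relies on the URN to carry both the citation hierarchy and the version hierarchy, and on list order to carry sequence.

```python
from typing import Dict, List, Tuple, Union

Pair = Tuple[str, str]            # (urn, content)
Tree = Dict[str, Union[str, "Tree"]]


def from_table(rows: List[Tuple[int, str, str]]) -> List[Pair]:
    """Rows are (sequence number, urn, content); order comes from the number."""
    ordered = sorted(rows, key=lambda row: row[0])
    return [(urn, content) for _, urn, content in ordered]


def from_tree(version_urn: str, tree: Tree) -> List[Pair]:
    """Nested dicts keyed by citation label; order comes from document order
    (dict insertion order), and leaves hold the content."""
    pairs: List[Pair] = []

    def walk(node: Tree, path: List[str]) -> None:
        for label, child in node.items():
            if isinstance(child, dict):
                walk(child, path + [label])
            else:
                pairs.append((version_urn + ".".join(path + [label]), child))

    walk(tree, [])
    return pairs


def ohco2_equivalent(a: List[Pair], b: List[Pair]) -> bool:
    """Same nodes, same order, same content, same identifiers."""
    return a == b


V = "urn:cts:greekLit:tlg0012.tlg001.msA:"
table = [
    (2, V + "9.119", "[content of 9.119]"),
    (1, V + "9.118", "[content of 9.118]"),
]
tree = {"9": {"118": "[content of 9.118]", "119": "[content of 9.119]"}}

print(ohco2_equivalent(from_table(table), from_tree(V, tree)))
```

The two inputs look nothing alike, but once each is reduced to its citable nodes they cannot be told apart, which is the property the comparison of formats in the following posts depends on.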
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol — the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2 equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:
  1. XML
  2. tabular structures
  3. RDF triples
I’ll tag this series with the label "text data structures".