Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)
We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

  • every node belongs to a citation hierarchy
  • every node belongs to a FRBR-like version hierarchy
  • nodes belonging to the same version are ordered
  • nodes support a mixed content model
Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.
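
To make the four properties and the equivalence test concrete, here is a minimal Python sketch. It is not code from any of the CTS implementations. The class name `CitableNode`, its field names, the helper functions, and the sample identifiers are all assumptions made up for illustration, and mixed content is simplified to an opaque string of markup.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CitableNode:
    # Hypothetical field names; each one stands for one OHCO2 property.
    version: Tuple[str, ...]   # place in the FRBR-like version hierarchy
    passage: Tuple[str, ...]   # place in the citation hierarchy, e.g. ("1", "2")
    sequence: int              # relative order among nodes of the same version
    content: str               # mixed content, kept here as an opaque string


def ordered_by_version(nodes: Iterable[CitableNode]) -> Dict[tuple, List[tuple]]:
    """Group nodes by version, keeping only their relative order."""
    versions = defaultdict(list)
    for node in sorted(nodes, key=lambda n: n.sequence):
        versions[node.version].append((node.passage, node.content))
    return dict(versions)


def equivalent(a: Iterable[CitableNode], b: Iterable[CitableNode]) -> bool:
    """Two representations match if every node keeps its version,
    citation, relative order, and content."""
    return ordered_by_version(a) == ordered_by_version(b)


# Invented identifiers, for illustration only.
V = ("groupA", "workB", "editionC")

# The same two nodes as they might be read from two different storage formats:
# the absolute sequence numbers differ, but the relative order is the same.
from_format_1 = [
    CitableNode(V, ("1", "1"), 1, "<l>first line</l>"),
    CitableNode(V, ("1", "2"), 2, "<l>second line</l>"),
]
from_format_2 = [
    CitableNode(V, ("1", "2"), 20, "<l>second line</l>"),
    CitableNode(V, ("1", "1"), 10, "<l>first line</l>"),
]

print(equivalent(from_format_1, from_format_2))
```

The comparison deliberately ignores the absolute sequence numbers and how the nodes happen to be stored. Under the assumptions of this sketch, only citation, version, relative order, and content matter.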
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol: the third of three implementations I have written, each using a different technology to work with a completely different representation of text. Since all three representations are OHCO2-equivalent, we know that they preserve the semantics of citable text, and we can turn to other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a series of posts to follow, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:
  1. XML
  2. tabular structures
  3. RDF triples
I’ll tag this series with the label "text data structures".

1 comment:

Unknown said...

If you're planning on coding this up from scratch, I'd recommend starting with a "gap buffer". It's not the best solution, but it is the simplest. Once you have that implemented, you will have something working and you'll be familiar with some of the issues involved in coding up your own text editor. Here you can find more info on this theme.