Wednesday, November 26, 2008

The vocabulary of ancient Greek

What is the vocabulary of ancient Greek? That is, what set of words, or lexical entities, actually occur in our extant texts?

The First Thousand Years of Greek project (announced here) aims to simplify posing such straightforward questions, but we need more than online texts to talk unambiguously about words. One essential piece of infrastructure is an inventory of uniquely identified lexical entities in Greek. In print publications, lexical entities have traditionally been identified by a word's lemma form. While lemmata are valuable labels, they are potentially ambiguous. Instead, basic principles of information design dictate that arbitrary identifiers guaranteed to be unique should be associated with lemma strings, so that references to a lexical entity can be unambiguously machine processed (using the identifier), and remain intelligible to human readers (using the labelling lemma string).

The Perseus project has given classicists two monumental resources that must be coordinated with an inventory of lexical entities: the digital LSJ lexicon of Greek, and the Morpheus morphological parsing system, which can associate surface forms of words with a lemma. Taken together with the invaluable list Peter Heslin has created by running Perseus' morphological parser over the word list of the TLG project's E disk, they suggest an obvious starting point for an inventory of Greek lexical entities: compare these two resources.

The digital LSJ has already been provided with unique identifiers for each entry, and each entry includes a lemma string. Perseus' morphological analyses identify entities by lemma. Where there is a one-to-one mapping between the parser's lemma and the LSJ lemma (normalized so that LSJ's markings of long and short vowels are removed), we can fairly assume that they represent the same entity, and could simply adopt the LSJ identifier to refer to the more general notion of the lexical entity — an unambiguous reference that could be associated with an entry in the lexicon, with morphological analyses, or with any other information.
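That first automated pass is simple enough to sketch. In the sketch below, the identifiers ("n100" and so on), the shape of the input data, and the exact normalization rule are illustrative assumptions, not the actual structure of the Perseus data:

```python
import unicodedata

def normalize(lemma):
    """Remove macron and breve (vowel-length marks) from a lemma string."""
    decomposed = unicodedata.normalize("NFD", lemma)
    stripped = "".join(c for c in decomposed
                       if c not in ("\u0304", "\u0306"))  # macron, breve
    return unicodedata.normalize("NFC", stripped)

def match_entities(lsj_entries, parser_lemmata):
    """lsj_entries: (identifier, lemma) pairs from the digital LSJ.
    parser_lemmata: set of lemma strings used by the parser.
    Returns a mapping from lemma to LSJ identifier for every lemma that
    matches exactly one LSJ entry after normalization."""
    by_lemma = {}
    for lsj_id, lemma in lsj_entries:
        by_lemma.setdefault(normalize(lemma), []).append(lsj_id)
    return {lemma: ids[0]
            for lemma, ids in by_lemma.items()
            if len(ids) == 1 and lemma in parser_lemmata}
```

Everything that falls outside this one-to-one mapping lands in the problem categories discussed next.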

While this simple (and easily automated) task takes care of the vast majority of the vocabulary in both the LSJ and in the parser's output, there are several categories of problematic cases. They include:

  • entities where LSJ's orthography differs from the parser's orthography. This is actually a large group with several subcategories, some of which can probably be reliably resolved automatically. For example, LSJ and Morpheus sometimes disagree on whether the lemma form of a verb should be active or middle/passive voice: a careful script could accommodate that kind of variation, but human intervention would be necessary when LSJ and Morpheus use alternate forms of the lemma.

  • entities that appear in the parser's list of lemmata, but not in LSJ. This occurs frequently with compound verbs that are not given separate articles in LSJ. In these cases, since there is no LSJ identifier to reuse, we would need to create new identifiers for these entities.

  • "ghost entities." For reasons that are not clear to me, LSJ routinely lists verbal adjectives in -τέον as distinct entities, unconnected to the verb from which they are formed. (E.g., the adjective λυτέον is a distinct entry, unrelated to the verb λύω.) Whatever the reasoning, in a digital environment, this is the wrong taxonomy: the morphological analysis should allow applications to distinguish verbal adjectives from other forms deriving from the same verbal root, while the identifier for the lexical entity should recognize verbal adjectives and conjugated forms of a verb alike as forms of the same entity. Mapping these LSJ and Morpheus lemmata to the correct verbal lemmata will be a relatively straightforward task, but again will need human supervision for some common cases (e.g., δοτέον < δίδωμι).

  • entities in LSJ but not in the list of lemmata generated by running the parser over the TLG E word list. Presumably, these result from the contributors to LSJ covering texts that are beyond the scope of the TLG E disk's corpus. As a basic principle, we should make absolutely explicit what digital corpus of texts an inventory of lexical entities is based on. Since our first pass is working from Heslin's analysis of the TLG E corpus, we should not enter these LSJ IDs into our inventory — at least, not yet. As the inventory is checked against further texts, new vocabulary may appear, and at that time new candidates for addition to the inventory will need to be checked in both LSJ and Morpheus.
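A first-pass triage of these categories could also be scripted. The category labels and the -τέον heuristic below are my own illustrative choices, and the hard cases would still go to a human reviewer:

```python
def classify(lemma, lsj_lemmata, parser_lemmata):
    """Assign a lemma to one of the problem categories sketched above."""
    if lemma.endswith("τέον"):
        return "ghost: verbal adjective; map to its verb (review by hand)"
    if lemma in lsj_lemmata and lemma in parser_lemmata:
        return "matched"
    if lemma in parser_lemmata:
        return "parser only: mint a new identifier"
    if lemma in lsj_lemmata:
        return "LSJ only: defer until the corpus expands"
    return "unknown"
```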

That is a substantial, but I think manageable, list of tasks. One easy way to begin would be to limit the scope of coverage further, and rather than beginning from the entire TLG E word list, start with a word list created from a specified corpus of texts. As lemmatized word indices for the First Thousand Years of Greek are released, we will guarantee that all surface forms of a word are resolved to a uniquely identified lexical entity.

Thursday, October 30, 2008

Beyond text

If you are interested in the architecture of scholarly resources, run, don't walk, to Gabe Weaver's new sourceforge site, episteme. The nascent site (opened to coincide with the public release of "digital product" from the Archimedes Palimpsest project) documents his work representing and manipulating information encoded as mathematical diagrams.

There's already a lot to think about here, but one intriguing aspect is that entities in figures are referred to with identifiers that can be coordinated with canonical references to passages of textual content from the same document. (Short-term consequence for me personally — urgent need to re-think my presentation for the "Text and Graphics" panel at next week's TEI meeting in London. Ouch.)

Oh, and if you just want to enjoy some beautiful drawings, there's an Easter egg with a larger display of the image above — a collage of figures from book 1 of Archimedes' treatise On Floating Bodies. You can see it here.

(Updated Oct. 31: Episteme now includes interactive eye candy, too.)

Wednesday, August 6, 2008

Epidoc transcoding transformer bats 1.000

Hugh Cayless's transcoding transformer library (available from the Epidoc project's sourceforge site here) is indispensable for anyone working with ancient Greek texts in java or groovy. How reliable is it?

I decided to test it against two significant lists of unique Greek strings. For each list, I converted each word from the TLG's beta code to UTF-8, then converted the resulting UTF-8 back to beta code, and compared the result to the original. (For an overview of the TLG's beta code conventions, see this guide.)
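The harness for a round-trip test of this kind is trivial; here is a sketch, with the two conversion functions passed in as parameters, since the transcoder itself is a Java library and any Python binding for it here is hypothetical:

```python
def round_trip_failures(beta_words, to_unicode, to_beta):
    """Return the words that do not survive beta code -> UTF-8 -> beta code
    unchanged. to_unicode and to_beta stand in for the transcoder calls."""
    return [w for w in beta_words if to_beta(to_unicode(w)) != w]
```

With a pair of conversions that are true inverses of each other, the failure list comes back empty.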

The first list was composed of 858715 words excluding proper names. The transcoder round tripped to its starting point in 858709 cases. Six failures doesn't sound bad (99.999% success rate). But look more closely: in five of the six failures, the TLG entry in fact breaks the TLG's encoding rules about order of accents, breathings and iota subscripts, while the transcoder correctly follows the rules with the consequence that its conversion back to beta code actually corrects a data entry error in the TLG! The sixth case is a sequence found only in a papyrus fragment. The beta code series o(= should represent an omicron with rough breathing and circumflex – an accentuation that is not possible in Greek.
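Data-entry errors of this kind can also be hunted for directly. In beta code, a breathing should precede any accent on the same vowel, and an iota subscript should come last; the regular expression below encodes my reading of those ordering rules and may need refinement to cover the full convention:

```python
import re

# An accent (/ \ =) followed by a breathing, or an iota subscript (|)
# followed by any other diacritic, is out of order.
_BAD_ORDER = re.compile(r"[/\\=][()]|\|[/\\=()]")

def violates_diacritic_order(beta_word):
    """True if the word's diacritics appear in an illegal order."""
    return bool(_BAD_ORDER.search(beta_word))
```

Note that the problematic o(= passes this check: its ordering is legal beta code, even though the accentuation it represents is impossible Greek.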

The second word list I tried was composed of proper names, including the tricky sequences beta code introduces in its conventions for capitalization. Out of 53167 capitalized words, the transcoder round tripped perfectly in all but one – again, an error in the TLG data entry that the transcoder corrected!

That's a total of 911882 unique strings. (That's going way beyond carefully chosen unit tests!) Remarkably, the transcoder had a 100% success rate on correctly formed words.

Thursday, July 10, 2008

Half empty or half full?

I frequently assert that classicists, along with biblical scholars, share the distinction of using logical citation schemes to refer to the works they study. This practice is important, since it means that references can apply to any version of a work, in print or digital form. (Briefly, in an earlier post.)

I have made this claim so often that I decided it would be a good idea to find out if it were true.

The TLG offers the largest corpus of ancient Greek, so one way to evaluate how classicists cite their works would be simply to count and summarize the citation schemes used in the TLG. Sadly, although this would have been possible until 2000, when the TLG distributed data to its licensees, there is in 2008 no way around the preconceived query interface of the TLG web site. (The fact that such a simple question as "what citation schemes are used?" is now out of reach illustrates the catastrophic consequences for classical studies of the TLG's decision to reverse its decades-old policy of distributing data, in favor of selling access to predetermined user interfaces.)

We can still, however, use the 2000 version of the TLG Canon distributed on the TLG E disk to get an impression of classicists' citation practice, as in an earlier post estimating the size of the surviving Greek corpus by period.

As in that post, we'll want to limit ourselves to works transmitted by manuscript copying. I'll take the simplest approach possible: count the number of "works" that use each citation scheme. I won't attempt to normalize in any way the definition of a work: the five-line Homeric Hymn to the Dioscuri is one work, as is the entire Iliad. With that caveat in mind, let's look at the results.
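Once each work's citation scheme has been pulled out of the Canon (a field extraction I assume has been done separately), the tally itself is a few lines:

```python
from collections import Counter

def scheme_census(works):
    """works: iterable of (work_id, citation_scheme) pairs.
    Returns the schemes ranked by number of works, plus a count of
    schemes used for only a single work."""
    counts = Counter(scheme for _, scheme in works)
    singletons = sum(1 for n in counts.values() if n == 1)
    return counts.most_common(), singletons
```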

The TLG E canon includes 3810 works transmitted by manuscript and having defined citation schemes. (Note that the Canon includes works not in the E disk; 584 of these works did not yet have a defined citation scheme at the time of the E disk's publication, so I exclude them from our results.) These 3810 works are represented by an astonishing 194 distinct citation schemes!

As we might expect, however, the distribution of these schemes is very uneven: 104 citation schemes are used for a single work; only 16 citation schemes are used for more than 13 works. Let's look more closely at these top 16 citation schemes, which cover 3426 (90%) of the works surveyed.

Citation scheme                Number of works
stephanus page/section/line    114
jebb page/line                 54
bekker page/line               44
kuehn volume/page/line         39
harduin page/section/line      32
Total physical schemes         1814 (53%)
Total logical schemes          1612 (47%)
Grand total                    3426

The overall results are not encouraging. Logical schemes account for only 47% of the 3426 works; the remaining 53% are cited by reference to physical artifacts like book pages. It's small consolation that the numbers are a worst-case scenario: some works may be cited by both logical and physical reference. Where the TLG uses a logical reference, we can be sure that a logical scheme exists, but where the TLG uses a physical reference system, we can't always exclude the possibility that an alternative logical scheme is available. For example, the 44 works cited by Bekker page are, of course, the Aristotelian corpus: many of these have alternative citation schemes by chapter or section.

If we break the numbers down further by the chronological period of the original text, however, the picture changes. With the notable exception of Plato, where Stephanus' great edition became the standard for citation, citation by logical scheme is much more prevalent in works of the classical period. The following table breaks out from the previous listing works dating before about 300 BC.

Citation schemes in works of classical date
section/line 229
line 98
bekker page/line 43
stephanus page/section/line 38
chapter/section/line 20
volume/page/line 18
page/line 16
book/chapter/section/line 11
fable/line 9
book/line 5
ode/line 4
book/section/line 4
tetralogy/section/line 3
demonstratio/line 3
epistle/section/line 3
book/demonstratio/line 2
thevenot page/line 2
epistle/line 2
idyll/line 1
page+column/line 1
sententia/line 1
lexical entry/line 1
proverb/line 1
folio/line 1
fable/version/line 1
exordium/section/line 1
usener page/line 1
Total physical schemes    120 (23%)
Total logical schemes     399 (77%)
Grand total               519

The 519 works are cited in 27 different citation schemes. We could think of that as an "average density" of about 19-20 works per citation scheme, essentially the same as for the overall corpus (194 schemes for 3810 works is also a density of about 19-20 works per citation scheme). But in this listing, only 23% (120) of the classical works use physical reference systems. The corpora of Plato and Aristotle constitute the bulk of this material (81 works); apart from the two great philosophical corpora, only 39 works of the classical period are cited in the TLG by physical reference system – about 8%.

It's probably the height of political incorrectness to suggest that the most traditional canon of works has been the object of better-quality scholarly study (although it's plausible enough that more scholarship should produce better results). But by the single, one-dimensional yardstick of how a work is cited, editors of classical texts have done a far better job of capturing the logical structure of their texts than have editors of ancient Greek overall.

So for classicists interested in creating a digital corpus of Greek, the "news" is mixed. Roughly half the works in the TLG E Canon already depend on logical reference systems, so we already have a good standard in place for many of our texts. The classical period is in markedly better shape.

Friday, April 11, 2008

Citation schemes: empty content elements considered harmful

Classicists have, by and large, relied on standard, logical citation schemes to cite works of ancient literature. In the scheme of the Functional Requirements for Bibliographic Records (or FRBR), we could say that classicists have cited notional works using references that could then be applied to any manifestation or expression of that work.

In the print world, this practice has made it possible for scholars to apply a reference to different printed editions or translations of a work. As the internet becomes our library, this practice can turn references into machine-actionable entry points to the library (whether the reference is automatically discovered, or manually cited by a scholar). It is therefore a vital prerequisite that digital editions encode standard, logical citation data such as the book/chapter/section divisions of Thucydides, or the book/line divisions of the Iliad.

The TEI Guidelines (as so often) offer more than one way to approach the problem. It is valid TEI to encode citation values as attributes on containing elements that define the logical structure of a document. Book/chapter/section in Thucydides might be represented by a successive hierarchy of TEI div elements, for example, or book/line in the Iliad by div elements containing l elements; the citation values could be placed in the @n attribute of each container.

Alternatively, since the earliest work of the TEI in the 1980s, the Guidelines have included empty elements (such as the milestone) that could be used to mark transitional points in a document. It is easy to find examples of scholarly texts using such empty elements to mark the beginning of a new unit like a chapter or section.

Arguably, there was little difference between these two approaches in SGML. In XML, however, scholars should avoid using empty elements to encode citation data.

A host of supporting and related technologies have developed around XML in its first decade. One of the most important is XPath, a notation for referring to parts of an XML document by the document's structure. Higher-level technologies such as XSLT or implementations of the DOM model in many programming languages in turn support XPath expressions. The result is that programmers working in many environments can succinctly retrieve a unit like "book 2, chapter 5" of Thucydides with a simple XPath expression like

/TEI.2/text/body/div[@type='book' and @n='2']/div[@type='chapter' and @n='5']
Content between empty elements, on the other hand, cannot be addressed directly with XPath expressions.
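The contrast is easy to demonstrate even with the limited XPath subset in Python's standard library; the little document below is a made-up fragment, not a real edition:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<body>
  <div type="book" n="2">
    <div type="chapter" n="5"><p>text of 2.5</p></div>
  </div>
</body>""")

# Citation data on containing elements: one path expression
# retrieves the cited passage directly.
chapter = doc.find("./div[@n='2']/div[@n='5']")
```

Had the book and chapter boundaries been marked with empty milestone elements instead, retrieving the same passage would mean walking the document node by node and tracking state between milestones.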

Placing citation data on empty elements cuts programmers off from a galaxy of technologies they can use when citation data is kept on containing elements. Empty citation elements should never be necessary if the citation scheme is in fact a logical hierarchy: if it is not, consider whether there is a problem either with your choice of citation scheme or with your design of the rest of the document's structure.

Separation of concerns applies to document content, too

Twenty years ago, before the internet was open to the public, the print publishing industry was a leader in SGML document markup, and scholarly markup projects tended to think of "documents" as the content bound between a pair of covers. This heritage is clearly reflected in the TEI Guidelines' thorough inventory of elements to identify "front" and "back" material of documents, or a variety of groupings or collections of texts.

The major syntactic differences between XML and SGML — insistence on a single hierarchy of elements, each with explicitly marked end — were introduced in part to adapt markup to the needs of a very different environment: a network of computers exchanging information dynamically. The already well-understood distinction between semantic markup and presentational markup certainly contributed to the articulation of "separation of concerns" in the design of network applications. Individuals with different skills could apply appropriate technologies to the different parts of a network application, so in creating an application to run in a web browser, programmers might write the controlling code in javascript, and design specialists define its appearance with CSS. In a network of semantically structured content, XML plays the vital roles of defining the data structure (explicitly via a schema or DTD, or implicitly in the case of well-formed XML), and of providing a format for data exchange. The question of what this XML should look like — the kind of question the TEI has considered since the 1980s — had to be rethought. Humanists might rephrase Sun Microsystems' famous slogan, "The network is the computer," as "The network is the library."

When applications can exchange structured content, it is straightforward to create compound documents. Asymmetrically, it can be more difficult to disaggregate a complex document into component parts, since an application then needs a more detailed knowledge of the internal structure of a necessarily more complex document. An application could easily juxtapose a document in original language with a document in translation, or weave together a commentary with a text associated through a common citation system, for example, but disentangling interleaved translation or commentary from a complex document is more problematic.

I've been thinking about this in designing a set of TEI documents to represent the multiple texts of the famous Venetus A manuscript of the Iliad. There are four distinct sets of scholia, in addition to the manuscript's text of the Iliad. I chose to treat each set as an independent document, and as I am now reaching the stage of putting together applications drawing on those documents, I am glad that I did: cleanly separated, discrete documents are making that job much easier than it otherwise would be.

I expect that I will never use the elaborate TEI mechanisms to document the relation of a transcribed document to graphic images. In keeping with the guiding principle of separate, discrete documents, I'm associating images of the manuscript with ranges of text through external indices: here, too, the standoff markup of a separate, simple (non-TEI) document is easy to marshal together with the TEI document of the transcribed text.

In many ways, TEI P5, with its support for XML namespaces, is nudging scholars towards this kind of document organization. But we need to push harder: it's time to move away from monolithic TEI replicas of print or even manuscript sources. In editing scholarly texts for use on the internet, let each logical component stand alone.

Coordinating separate documents in a networked library requires a common understanding of how to cite them. I'll follow up with a note on how editors of TEI texts should think about that part of their markup.

Wednesday, March 5, 2008

The first thousand years of Greek

How much Greek survives from the classical period? From the Hellenistic period?

Those questions were impossible to quantify when I was an undergraduate. It still might be difficult to get a very precise answer if we wanted to consider inscriptions and papyri, but if we limit ourselves to ancient Greek transmitted to us by manuscript copying, we can get a pretty satisfactory answer for the first thousand years or so of ancient Greek very quickly using the Canon from the TLG E disk.

The data in the Canon can be systematically manipulated using the Diogenes perl library. For each work in the TLG, the Canon contains three fields of information that are of special interest for this question: one indicates the method of transmission; another contains the word count of the TLG's on line text; and a third field contains a date description. The method of transmission is important, because the TLG includes "works" that are known only through testimonia or citation — "fragments," as classicists misleadingly call them — where we instead want to estimate how much Greek actually exists. (We don't care about geographic "fragments" of Hipparchus that are really passages of Strabo. To get an idea of how much of the TLG is made up of this doubling of content, the TLG E disk contains roughly 75-76 million words; over 4 million words — roughly 5% of the whole TLG E disk — are quoted "fragments" or testimonia!)

While it would be possible to write perl code to query the TLG Canon directly via the Diogenes API, most people would probably find it easier to transform the contents of the Canon into some format where they can use standard technologies. (I have created both a hierarchical XML version of the Canon, and a normalized relational database version; possible topics for another blog entry perhaps.)

The word counts are integer values; the methods of transmission are indicated by a controlled vocabulary: manuscript transmission is either 'Cod' or 'cod'. The only challenge is parsing the Canon's quasi-regular strings describing dates, but there are only a little over 100 unique strings, so scripting a little text munging in your favorite language that supports regular expressions is pretty straightforward.

The Canon's dates are to a precision of a century, so I interpret all dates as ranges. A date of "first century AD" can be interpreted as the range 1-100 AD, and a date of "third or second century BC" as the range 299-100 BC, for example.
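That interpretation is easy to code up. The sketch below assumes a simplified numeric notation ("3/2 B.C." for "third or second century BC") rather than the Canon's actual quasi-regular strings, and represents BC years as negative integers:

```python
def century_range(spec):
    """Interpret a century spec like '1 A.D.' or '3/2 B.C.' as a year range.
    E.g. '3/2 B.C.' -> (-299, -100); '1 A.D.' -> (1, 100)."""
    centuries, era = spec.split()
    first, last = (int(n) for n in (centuries.split("/") * 2)[:2])
    if era.startswith("B"):
        return (-(100 * first - 1), -(100 * (last - 1)) or -1)
    return (100 * (first - 1) + 1, 100 * last)
```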

At this point, it's time to let the computer do the counting. Here are some results to consider: through 300 AD, the TLG contains over 20 million words, but their chronological distribution is very uneven:

For works dated after or equal to...  ...but before  Number of words  Running total
Earliest Greek writing                500 BC         384528           384528
500 BC                                400 BC         2251766          2636294
400 BC                                300 BC         1762944          4399238
300 BC                                200 BC         921255           5320493
200 BC                                100 BC         178655           5499148
100 BC                                1 AD           1745320          7244468
1 AD                                  200 AD         7583759          14828227
200 AD                                300 AD         5373095          20201322


Roughly 10% of the contents of the TLG E corpus (7680878 words) have dates given as "INCERTUM" or "VARIA": these are completely omitted from the counts. We can't really know how Greek is distributed beyond the period of the TLG E Canon's coverage, because the TLG project no longer makes the Canon available, except through its "one-size-fits-all" interface (or to answer the questions raised here, "one size fits none"). This is the more troubling since the TLG's online corpus is now a third again as large as it was in 2000 when the E disk was prepared (by the estimate of the TLG website, 99 million words vs. 76 million words for the E disk).

Friday, February 22, 2008

Prediction confirmed: ubiquitous Diogenes

In January, when I blogged a short note about running Diogenes on Linux PPC, I called Diogenes "ultra portable," and ended by asking, "Diogenes on your XO laptop, iPhone, or other device, anyone?"

Bruce Robertson has now answered. He emails, "I thought you'd like to know that I've been running diogenes quite happily on a Nokia N800 palmtop computer running OS2008 for the past two days."

Another great example of what happens when scholarship and software are open: other people can (in this instance, quite literally!) take it places the original author probably never imagined.

Tuesday, February 19, 2008

Scholarly markup in XML's second decade

XML is now ten years old. (For those interested in an insider's view of how that all happened, Tim Bray has republished XML People.) For scholarly projects involving semantically structured texts, it is practically a given that they will rely on XML.

But in actual practice, texts produced by current projects often don't look very different from scholarship based on SGML in the 1980s. In the next postings on this blog, I want to discuss three suggestions based on my experience with XML over the last decade, and how it contrasts with my experience of SGML in the preceding decade. In each case, I'll focus on how to follow these suggestions using the Text Encoding Initiative's guidelines.

  1. Separation of concerns applies to document content, too. (Now here.)

  2. Citation schemes: empty content elements considered harmful. (Now here.)

  3. What's the diff? Rethinking the critical apparatus.

Stay tuned.

Wednesday, January 9, 2008

Looking for an honest man — on Linux PPC

Peter Heslin's Diogenes 3.1 is extremely cleanly designed, and ultra portable. The server functionality is written in perl, and the new user interface is a XUL application. One result is that Heslin can provide simple binary installations for Mac OS X, various Windows operating systems, and Linux on x86 architecture.

This design also makes it easy to install and run Diogenes on any operating system with perl and a XUL application environment. Using Ubuntu Linux 7.04 on a PPC system, for example, after you download and install the Linux version of Diogenes, you can run Diogenes at least three different ways:

1) use xulrunner to run the graphic interface

If xulrunner is not already installed on your system, use Synaptic or apt-get to install it. (xulrunner is in the Development section of the Ubuntu universe repository.) You can now start Diogenes from a terminal with the command
   xulrunner /usr/local/diogenes/application.ini

Better still, edit the properties for the Diogenes menu item that was created by the Diogenes installer. In the Launcher Properties, enter the command to start xulrunner as illustrated here. Now you can run diogenes from the menu selection.

2) use Firefox 3 to run the graphic interface

Version 3 of Firefox includes a full XUL environment that can run external XUL programs like Diogenes. Beta version 2 of FF3 was released in December; when a stable release version appears, look for it to show up as an upgrade to Firefox in your Ubuntu repository. When Firefox 3 is installed on your system, you may alternatively start Diogenes with the command
   firefox -app /usr/local/diogenes/application.ini

As with option 1, you can edit the Diogenes menu item to run this command.
Technically inclined users who are eager to play with the beta version can download source code for the beta release, and follow the very clear instructions here to install it. All the prerequisites are standard libraries that are available in Ubuntu repositories.

3) Browse and search texts from the command line

The command line user program (named dio) works just as it does on any other Linux. Run dio with no arguments to see its various options.

The importance of this flexibility is not that it opens up Diogenes to a vast number of Greek scholars using Linux PPC, Solaris, or some other particular operating system. Its importance is rather that it keeps Diogenes open to any platform meeting its simple requirements — including future platforms.

Diogenes on your XO laptop, iPhone, or other device, anyone?