Wednesday, November 26, 2008

The vocabulary of ancient Greek

What is the vocabulary of ancient Greek? That is, what set of words, or lexical entities, actually occurs in our extant texts?

The First Thousand Years of Greek project (announced here) aims to simplify posing such straightforward questions, but we need more than online texts to talk unambiguously about words. One essential piece of infrastructure is an inventory of uniquely identified lexical entities in Greek. In print publications, lexical entities have traditionally been identified by a word's lemma form. While lemmata are valuable labels, they are potentially ambiguous. Instead, basic principles of information design dictate that arbitrary identifiers guaranteed to be unique should be associated with lemma strings, so that references to a lexical entity can be unambiguously machine processed (using the identifier), and remain intelligible to human readers (using the labelling lemma string).

The Perseus project has given classicists two monumental resources that must be coordinated with an inventory of lexical entities: the digital LSJ lexicon of Greek, and the Morpheus morphological parsing system, which can associate surface forms of words with a lemma. An obvious starting point for an inventory of Greek lexical entities would be to compare these two resources, drawing on the invaluable list Peter Heslin has created by running Perseus' morphological parser over the word list of the TLG project's E disk.

The digital LSJ has already been provided with unique identifiers for each entry, and each entry includes a lemma string. Perseus' morphological analyses identify entities by lemma. Where there is a one-to-one mapping between the parser's lemma and the LSJ lemma (normalized so that LSJ's markings of long and short vowels are removed), we can fairly assume that they represent the same entity, and could simply adopt the LSJ identifier to refer to the more general notion of the lexical entity — an unambiguous reference that could be associated with an entry in the lexicon, with morphological analyses, or with any other information.
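
The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes that the LSJ lemmata are available as Unicode strings in which vowel length is marked with a macron or breve (the actual encoding of the digital LSJ may differ), and the function names and the (identifier, lemma) input format are hypothetical, not part of Perseus or Morpheus.

```python
import unicodedata
from collections import defaultdict

# Assumption: vowel length is marked with these Unicode combining characters.
MACRON = "\u0304"
BREVE = "\u0306"


def normalize_lemma(lemma):
    """Strip long/short vowel marks, leaving accents and breathings intact."""
    decomposed = unicodedata.normalize("NFD", lemma)
    stripped = "".join(ch for ch in decomposed if ch not in (MACRON, BREVE))
    return unicodedata.normalize("NFC", stripped)


def one_to_one_matches(lsj_entries, parser_lemmata):
    """Map each parser lemma to an LSJ identifier when the match is unambiguous.

    lsj_entries: iterable of (identifier, lemma) pairs (hypothetical format).
    parser_lemmata: iterable of lemma strings from the parser's output.
    """
    lsj_ids = defaultdict(list)
    for entry_id, lemma in lsj_entries:
        lsj_ids[normalize_lemma(lemma)].append(entry_id)

    parser_forms = defaultdict(list)
    for lemma in parser_lemmata:
        parser_forms[normalize_lemma(lemma)].append(lemma)

    # Keep only the keys with exactly one lemma on each side.
    return {
        forms[0]: lsj_ids[key][0]
        for key, forms in parser_forms.items()
        if len(forms) == 1 and len(lsj_ids.get(key, [])) == 1
    }
```

Anything this function leaves out (homographs in LSJ, lemmata with no counterpart on the other side) is exactly the residue that needs closer attention.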

While this simple (and easily automated) task takes care of the vast majority of the vocabulary in both LSJ and the parser's output, there are several categories of problematic cases. They include:


  • entities where LSJ's orthography differs from the parser's orthography. This is actually a large group with several subcategories, some of which can probably be reliably resolved automatically. For example, LSJ and Morpheus sometimes disagree on whether the lemma form of a verb should be active or middle/passive voice: a careful script could accommodate that kind of variation, but human intervention would be necessary when LSJ and Morpheus use alternate forms of the lemma.

  • entities that appear in the parser's list of lemmata, but not in LSJ. This occurs frequently with compound verbs that are not given separate articles in LSJ. In these cases there is no LSJ identifier to reuse, so we would need to create new identifiers for those entities.

  • "ghost entities." For reasons that are not clear to me, LSJ routinely lists verbal adjectives in -τέον as distinct entities, unconnected to the verb from which they are formed. (E.g., the adjective λυτέον is a distinct entry, unrelated to the verb λύω.) Whatever the reasoning, in a digital environment, this is the wrong taxonomy: the morphological analysis should allow applications to distinguish verbal adjectives from other forms deriving from the same verbal root, while the identifier for the lexical entity should recognize verbal adjectives and conjugated forms of a verb alike as forms of the same entity. Mapping these LSJ and Morpheus lemmata to the correct verbal lemmata will be a relatively straightforward task, but again will need human supervision for some common cases (e.g., δοτέον < δίδωμι).

  • entities in LSJ but not in the list of lemmata generated by running the parser over the TLG E word list. Presumably, these result from the contributors to LSJ covering texts that are beyond the scope of the TLG E disk's corpus. As a basic principle, we should make absolutely explicit what digital corpus of texts an inventory of lexical entities is based on. Since our first pass is working from Heslin's analysis of the TLG E corpus, we should not enter these LSJ IDs into our inventory — at least, not yet. As the inventory is checked against further texts, new vocabulary may appear, and at that time new candidates for addition to the inventory will need to be checked in both LSJ and Morpheus.
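
A first pass at sorting lemmata into these categories could be done with plain set operations. The sketch below is hypothetical throughout: it assumes both lemma lists are Unicode strings already reduced to a common spelling, and it flags verbal adjectives only by the ending -τέον. Orthographic mismatches between LSJ and Morpheus will simply show up in the "LSJ only" and "parser only" groups, where they still need the scripted or manual review described above.

```python
import unicodedata
from typing import Dict, Set

# The ending -τέον, written with escapes so the accent's code point is explicit.
TEON = "\u03c4\u03ad\u03bf\u03bd"


def nfc(text):
    return unicodedata.normalize("NFC", text)


def classify(lsj_lemmata, parser_lemmata):
    # type: (Set[str], Set[str]) -> Dict[str, Set[str]]
    """Group lemmata by where they occur; the groups may overlap."""
    lsj = {nfc(lemma) for lemma in lsj_lemmata}
    parser = {nfc(lemma) for lemma in parser_lemmata}
    return {
        # Candidates for simply reusing the LSJ identifier.
        "shared": lsj & parser,
        # No LSJ identifier exists: new identifiers would be needed.
        "parser_only": parser - lsj,
        # Outside the current corpus, or spelled differently by the parser.
        "lsj_only": lsj - parser,
        # Possible "ghost entities" to be remapped to their verbs.
        "verbal_adjectives": {
            lemma for lemma in lsj | parser if lemma.endswith(TEON)
        },
    }
```

The output is a work list rather than a result: only the "shared" group can be accepted without further checking.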


That is a substantial, but I think manageable, list of tasks. One easy way to begin would be to limit the scope of coverage further, and rather than beginning from the entire TLG E word list, start with a word list created from a specified corpus of texts. As lemmatized word indices for the First Thousand Years of Greek are released, we will guarantee that all surface forms of a word are resolved to a uniquely identified lexical entity.