Formats of Annotation

Currently, there are no widely agreed standards for representing information in texts, and in the past many different approaches have been adopted, some more lasting than others. One long-standing annotation practice is known as COCOA references. COCOA was an early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs, notably the OCP (Oxford Concordance Program). The Longman-Lancaster corpus and the Helsinki corpus have also used COCOA references.

Very simply, a COCOA reference consists of a balanced pair of angle brackets (< >) containing two entities:

A code which stands for a particular variable name.
A string or set of strings, which are the instantiations of that variable.

For example, the code "A" could be used to refer to the variable "author" and the string would stand for the author's name. Thus COCOA references which indicate the author of a passage of text would look like the following:

<A CHARLES DICKENS>
<A WOLFGANG VON GOETHE>
<A HOMER>
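
As a minimal sketch of how a program might read such references, the Python below pulls (code, value) pairs out of a text using a regular expression. It assumes the simple form described above: an opening angle bracket, a code with no spaces, whitespace, a value, and a closing angle bracket. The function name parse_cocoa, the sample text and the code "T" for "title" are hypothetical illustrations, not part of any COCOA specification.

```python
import re

# One COCOA-style reference: "<CODE value>", e.g. "<A CHARLES DICKENS>".
# Assumption: the code contains no whitespace and the value contains
# no angle brackets.
COCOA_REF = re.compile(r"<([^\s<>]+)\s+([^<>]*?)\s*>")


def parse_cocoa(text):
    """Return a list of (code, value) pairs found in text, in order."""
    return [(m.group(1), m.group(2)) for m in COCOA_REF.finditer(text)]


# Hypothetical sample: "A" for author as in the text, "T" for title
# is an invented code used only for illustration.
sample = "<A CHARLES DICKENS>\n<T BLEAK HOUSE>\nLondon. Michaelmas term lately over."
refs = parse_cocoa(sample)
print(refs)
```

A concordance program would typically go one step further and attach the most recent value of each code to every word that follows it, so that each index entry can report, for instance, its author.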

COCOA references represent only an informal trend for encoding specific types of textual information, e.g. authors, dates and titles. Current practice is moving towards more formalised international standards of encoding. The flagship of this trend is the Text Encoding Initiative (TEI), a project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Its aim is to provide standardised implementations for machine-readable text interchange.

The TEI uses a form of document markup known as SGML (Standard Generalised Markup Language). SGML has the following advantages:

Clarity
Simplicity
Formal rigour
Already recognised as an international standard

The TEI's contribution is a detailed set of guidelines as to how this standard is to be used in text encoding (Sperberg-McQueen and Burnard, 1994).

In the TEI, each text (or document) consists of two parts - a header and the text itself. The header contains information such as the following:

author, title and date
the edition or publisher used in creating the machine-readable text
information about the encoding practices adopted.
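
To make the header idea concrete, the following Python sketch builds a very small header-like structure holding the three kinds of information listed above. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc, encodingDesc) are taken from the TEI guidelines, but the layout is heavily simplified, the output is XML rather than SGML, and the result is not claimed to be a valid TEI header. The function build_header and all the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_header(author, title, date, source, encoding_note):
    """Build a simplified, TEI-style header element (illustrative only)."""
    header = ET.Element("teiHeader")

    # File description: author, title and date.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "date").text = date

    # The edition or publisher used to create the machine-readable text.
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "bibl").text = source

    # Information about the encoding practices adopted.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = encoding_note
    return header


# Hypothetical sample values, invented for this sketch.
header = build_header(
    author="Charles Dickens",
    title="Bleak House",
    date="1994",
    source="Sample printed edition",
    encoding_note="Paragraph boundaries marked; no linguistic annotation.",
)
print(ET.tostring(header, encoding="unicode"))
```

Keeping the header separate from the text means that a program can read the descriptive information without having to process the text itself.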

You might also want to read about the EAGLES advisory body in chapter 2 of Corpus Linguistics (page 29).