The machine-readable corpus

The term corpus is almost synonymous with the term machine-readable corpus. The corpus linguist's interest in the computer comes from its ability to carry out various processes which, when they had to be performed by humans, could only be described as pseudo-techniques. The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.


Given this marriage of machine and corpus, it is worth considering in slightly more detail which processes allow the machine to aid the linguist. The computer can search a text for a particular word, a sequence of words, or perhaps even a part of speech. So if we are interested, say, in the usages of the word however in a text, we can simply ask the machine to search for it. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist.
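
As a rough illustration of searching for a word and retrieving each example in context, here is a minimal Python sketch. The function name concordance, the width parameter and the sample text are all invented for this example; real concordance programs work on much larger texts and usually offer more options.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word match.

    Hypothetical helper for illustration: `width` is the number of
    characters of context kept on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits

# Invented sample text, standing in for a corpus.
sample = ("However, the results were clear. We found, however, that "
          "the method was slow. However we look at it, the corpus helps.")

hits = concordance(sample, "however")
for left, key, right in hits:
    # Right-align the left context so the search word lines up in a column.
    print(f"{left:>30} | {key} | {right}")
```

Each printed line shows one example of the search word with the text on either side of it, which is the keyword-in-context layout that concordance displays typically use.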

The machine can find the relevant text and display it to the user. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from it another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark.
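
The counting, sorting and extraction steps just described might be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names tokenize and kwic, the sample text, and the reading of "followed closely" as "within the next two tokens" are all choices made for this example rather than features of any particular concordance program.

```python
import re

def tokenize(text):
    # Split into words and single punctuation marks (a deliberately crude tokenizer).
    return re.findall(r"\w+|[^\w\s]", text)

def kwic(tokens, keyword, span=3):
    """Return (left tokens, match, right tokens) for each occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            rows.append((tokens[max(0, i - span):i], tok, tokens[i + 1:i + 1 + span]))
    return rows

# Invented sample text, standing in for a corpus.
sample = ("However, the results were clear. We found, however, that "
          "the method was slow. However we look at it, the corpus helps.")

rows = kwic(tokenize(sample), "however")

# Frequency: the number of occurrences found.
frequency = len(rows)

# Sort alphabetically on the tokens appearing to the right of the word.
by_right = sorted(rows, key=lambda row: [t.lower() for t in row[2]])

# Extract sub-lists: "however" followed closely by "we" (assumed here to mean
# within the next two tokens), or followed immediately by a punctuation mark.
followed_by_we = [r for r in rows if "we" in [t.lower() for t in r[2][:2]]]
followed_by_punct = [r for r in rows
                     if r[2] and re.fullmatch(r"[^\w\s]", r[2][0])]

print("frequency:", frequency)
for left, key, right in by_right:
    print(" ".join(left), "|", key, "|", " ".join(right))
```

Sorting on the left context instead would only require changing the sort key to use the first element of each row, for example reversed so that the word nearest the search term is compared first.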

The processes described above are often included in a concordance program. This is the tool most often used in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.