Corpus Representativeness

As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, arguing that it would therefore be skewed and hence unrepresentative of the population as a whole. This is a valid criticism, and it applies not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as bleak as it first appears, as there are many safeguards which may be applied in sampling to maximise representativeness.

First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis were long and painstaking tasks, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it is a significant factor to consider in producing a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true of those early corpora. Today, however, we have powerful computers which can store and manipulate many millions of words, so the issue of size is no longer the problem that it used to be.

Random sampling techniques are standard in many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats of which the corpus builder must be aware.

Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study before we can define sampling procedures for it. This means that we must rigorously define our sampling frame, that is, the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index: this was the approach taken by the builders of the Lancaster-Oslo/Bergen corpus, who used the British National Bibliography and Willing's Press Guide as their indices. Another approach is to define the sampling frame as all the books and periodicals in a particular library which fall within your particular area of interest, for example all the German-language books in Lancaster University library that were published in 1993. A library-based approach of this kind was used in building the Brown corpus.
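To make the idea of sampling from a defined frame concrete, here is a minimal sketch in Python. It assumes the sampling frame is simply a list of text identifiers; the identifiers, the frame size of 500, the sample size of 20 and the function name draw_simple_random_sample are all hypothetical and are not taken from any real index or corpus project.

```python
import random


def draw_simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample, without replacement, from a sampling frame."""
    if sample_size > len(sampling_frame):
        raise ValueError("sample size exceeds the size of the sampling frame")
    # A seeded generator makes the selection reproducible and documentable.
    rng = random.Random(seed)
    return rng.sample(sampling_frame, sample_size)


# Hypothetical sampling frame: placeholder identifiers standing in for the
# entries of a bibliographical index or a library catalogue.
sampling_frame = [f"title_{i:04d}" for i in range(1, 501)]

# Every text in the frame has the same chance of being selected.
sample = draw_simple_random_sample(sampling_frame, 20, seed=1993)
```

Fixing the seed is a design choice rather than a requirement: it allows another researcher to reproduce exactly the same selection from the same frame.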

You can read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65.

Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This means defining the different genres, channels, etc. of which it is made up. For example, written German could be made up of genres such as:

Stratified sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification.
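The following Python sketch illustrates one common way of sampling within strata, namely proportional allocation, where each stratum contributes to the sample in proportion to its share of the population. This allocation rule is an assumption made for illustration and is not prescribed by Biber (1993); the stratum names, their sizes and the function name stratified_sample are likewise hypothetical.

```python
import random


def stratified_sample(strata, total_size, seed=None):
    """Sample each stratum at random, in proportion to its share of the population.

    strata maps a stratum name (e.g. a genre) to a list of text identifiers.
    Because each quota is rounded, the overall sample size may differ
    slightly from total_size.
    """
    rng = random.Random(seed)
    population = sum(len(texts) for texts in strata.values())
    sample = {}
    for name, texts in strata.items():
        if not texts:
            sample[name] = []
            continue
        # Proportional allocation, with at least one text per non-empty stratum.
        quota = max(1, round(total_size * len(texts) / population))
        quota = min(quota, len(texts))
        # Random sampling without replacement inside the stratum.
        sample[name] = rng.sample(texts, quota)
    return sample


# Hypothetical strata: three placeholder genres of unequal size.
strata = {
    "genre_a": [f"a_{i}" for i in range(600)],
    "genre_b": [f"b_{i}" for i in range(300)],
    "genre_c": [f"c_{i}" for i in range(100)],
}

selected = stratified_sample(strata, total_size=50, seed=1993)
sizes = {name: len(texts) for name, texts in selected.items()}
```

With these figures the quotas are 30, 15 and 5 texts, so even the smallest stratum is guaranteed a place in the sample, which a single random draw over the whole population would not ensure.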

You can read about optimal sample lengths and numbers of samples, and the problems of using standard statistical equations to determine these figures, in Corpus Linguistics, chapter 3, page 66.