Working with Proportions

Frequency counts are useful, but they have certain disadvantages. When one wishes to compare one data set with another, for example a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurences of each type, they do not indicate the prevalence of a type in terms of a proportion of the total number of tokens in the text. This is not a problem when the two corpora that are being compared are of the same size, but when they are of different sizes frequency counts are little more than useless. The following example compares two such corpora, looking at the frequency of the word boot

Type of corpusNumber of wordsNumber of instances of boot
English Spoken50,00050
English Written500,000500

A brief look at the table seems to show that boot is more frequent in written rather than spoken English. However, if we calulate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:

spoken English: 50/50,000 X 100 = 0.1%
written English: 500/500,000 X 100 = 0.1%

Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora.

Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing fractions of unusual numbers like 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurences of the type under investigation is:

ratio = number of occurrences of the type / number of tokens in the entire sample

This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy looking small number (in the above example it would be 0.0001) the ratio can then be multiplied by 100 and represented as a percentage.