Type of corpus | Number of words | Number of instances of boot |
English Spoken | 50,000 | 50 |
English Written | 500,000 | 500 |
A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus), we get:
spoken English: 50/50,000 × 100 = 0.1%
written English: 500/500,000 × 100 = 0.1%
These figures show that the frequency of boot in our made-up example is the same (0.1%) in both the written and the spoken corpus.
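The same comparison can be sketched in a few lines of Python. This is only an illustration using the made-up counts from the table; the names `corpora` and `percentage` are our own and do not come from any corpus tool.

```python
# Made-up counts from the table: (total tokens, occurrences of "boot").
corpora = {
    "spoken": (50_000, 50),
    "written": (500_000, 500),
}

def percentage(occurrences, tokens):
    """Return occurrences as a percentage of all tokens in the corpus."""
    return occurrences / tokens * 100

# Normalise each raw count by the size of its own corpus.
percentages = {
    name: percentage(hits, tokens)
    for name, (tokens, hits) in corpora.items()
}

for name, pct in percentages.items():
    print(f"{name} English: {pct:.1f}%")
```

Once each count is divided by the size of its own corpus, the tenfold difference in raw counts disappears.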
Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than raw counts set against an unusual corpus size such as 53,000. The most basic way to calculate the ratio between the number of occurrences of the type under investigation and the size of the sample is:
ratio = number of occurrences of the type / number of tokens in the entire sample
This result can be expressed as a fraction or, more commonly, as a decimal. However, if that results in an unwieldy-looking small number (in the above example it would be 0.001), the ratio can be multiplied by 100 and expressed as a percentage.
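As a rough sketch, the three ways of presenting the ratio (fraction, decimal, percentage) can be produced with Python's standard `fractions` module. The function name `relative_frequency` is hypothetical, and the figures are those of the spoken corpus in the example.

```python
from fractions import Fraction

def relative_frequency(occurrences, tokens):
    """Ratio of occurrences of a type to tokens in the sample, kept exact."""
    if tokens <= 0:
        raise ValueError("the sample must contain at least one token")
    return Fraction(occurrences, tokens)

ratio = relative_frequency(50, 50_000)  # spoken-corpus figures from the example

as_fraction = str(ratio)            # reduced fraction
as_decimal = float(ratio)           # the same ratio as a decimal
as_percentage = float(ratio * 100)  # multiplied by 100 for a percentage

print(as_fraction, as_decimal, f"{as_percentage}%")
# prints: 1/1000 0.001 0.1%
```

Keeping the ratio as an exact fraction until the last step avoids rounding surprises. For everyday reporting, plain division would serve just as well.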