Significance Testing

Significance tests allow us to determine whether a finding reflects a genuine difference between two (or more) items, or whether it is simply due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John and we are looking at how third person singular speech is represented. Specifically, we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

Text      no. of occurrences of "dicit"   no. of occurrences of "dixit"
Matthew   46                              107
John      118                             119

From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more confident that this is not just a coincidence, we need to perform a further calculation - a significance test.

There are several types of significance test available to the corpus linguist: the chi-squared test, the [Student's] t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test, as it is the most commonly used significance test in corpus linguistics. It is a non-parametric test which is easy to calculate, even without a computer statistics package, and it can be used with data in 2 x 2 tables, such as the example above.

The main disadvantage of the chi-squared test is that it is unreliable with very small frequencies, and that proportional data (percentages, etc.) cannot be used with it.

The test compares the difference between the actual frequencies in the data (the observed frequencies) with those which we would expect if no factor other than chance had been operating (the expected frequencies). The chi-squared value is the sum, over all the cells of the table, of (observed - expected)^2 / expected, where the expected frequency for a cell is its row total multiplied by its column total, divided by the grand total. The closer the observed frequencies are to the expected frequencies, the greater the probability that the observed frequencies are the result of chance alone.
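The calculation for the dicit/dixit counts above can be sketched in a few lines of Python, using only the standard library. The variable names are illustrative, not part of any particular statistics package:

```python
# Observed frequencies: rows = Matthew, John; columns = dicit, dixit.
observed = [[46, 107],
            [118, 119]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell: (row total x column total) / grand total.
expected = [[row_totals[i] * col_totals[j] / grand_total
             for j in range(2)]
            for i in range(2)]

# Chi-squared: sum over all cells of (observed - expected)^2 / expected.
chi_squared = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                  for i in range(2) for j in range(2))

print(round(chi_squared, 3))  # 14.843
```

Note that this version applies no continuity correction; statistics packages sometimes apply Yates's correction to 2 x 2 tables by default, which would give a slightly lower value.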

Having calculated the chi-squared value (here we will assume it has been done with a computer statistics package), we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom, which is simply:

(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)
In the example above this is equal to (2-1) x (2-1) = 1.
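The formula above is a one-liner in code (the function name here is just illustrative):

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom for a frequency table with the given dimensions."""
    return (columns - 1) * (rows - 1)

print(degrees_of_freedom(2, 2))  # 1, for the 2 x 2 dicit/dixit table
```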

We then look along the row of the chi-squared table for the relevant number of degrees of freedom until we find the largest critical value which our calculated chi-squared value exceeds, and read off the probability value for that column. The closer the probability value is to 0, the more significant the difference is - i.e. the less likely it is that the difference is due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to set a cut-off point which is taken to mark the difference between a significant result and an "insignificant" one. This is usually taken to be 0.05 (probability values of less than 0.05 are written as "p < 0.05" and are assumed to be significant).

In our example of the use of dicit and dixit above, we calculate a chi-squared value of 14.843. The table below shows the critical chi-squared values for the first three degrees of freedom:

Degrees of Freedom   p = 0.05   p = 0.01   p = 0.001
1                    3.84       6.63       10.83
2                    5.99       9.21       13.82
3                    7.81       11.34      16.27

The number of degrees of freedom in our example is 1, and our result is higher than 10.83 (see the final column of the table), so the probability that this difference arose by chance is less than 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001, and we can therefore say with a high degree of certainty that this difference is a true reflection of variation between the two texts and not due to chance.
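The table lookup itself can be sketched as a small Python function. The critical values are the df = 1 row of the table above; the function name and the convention of returning None for a non-significant result are assumptions for illustration:

```python
# Critical chi-squared values for 1 degree of freedom, from the table above.
CRITICAL_VALUES = {0.05: 3.84, 0.01: 6.63, 0.001: 10.83}

def significance_level(chi_squared):
    """Return the smallest tabulated p whose critical value chi_squared
    exceeds, or None if the result is not significant at p < 0.05."""
    best = None
    for p, critical in CRITICAL_VALUES.items():
        if chi_squared > critical and (best is None or p < best):
            best = p
    return best

print(significance_level(14.843))  # 0.001 -> report as p < 0.001
print(significance_level(2.5))    # None  -> not significant
```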

In depth: You can also read about Type I and Type II errors in the glossary.