Multiple Variables

The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features); they cannot provide a picture of the complex interrelationship of similarity and difference between a large number of samples and a large number of variables. To perform such comparisons we need to consider multivariate techniques. Several are commonly encountered in linguistic research; those described below are factor analysis, correspondence analysis, multidimensional scaling and cluster analysis. The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set on the basis of statistical similarities between the original variables, whilst at the same time losing as little information as possible about their differences.

Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work. All the techniques begin with a basic cross-tabulation of the variables and samples.
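As a rough illustration of this starting point, the sketch below builds a small cross-tabulation of word frequencies. The three toy texts and the names samples, variables and crosstab are invented for the example; a real study would count features in actual corpus files.

```python
from collections import Counter

# Hypothetical toy samples (texts); real input would come from a corpus.
samples = {
    "text_a": "god church prayer god tax",
    "text_b": "church prayer god prayer",
    "text_c": "tax law minister tax law",
}

# Variables: the linguistic features being counted (here, word types).
variables = sorted({w for text in samples.values() for w in text.split()})

# Cross-tabulation: one row per sample, one column per variable.
crosstab = {}
for name, text in samples.items():
    counts = Counter(text.split())
    crosstab[name] = [counts[v] for v in variables]

print(variables)
for name, row in crosstab.items():
    print(name, row)
```

Each row is a sample and each column a variable, which is the layout the techniques below take as their input.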

For factor analysis, an intercorrelation matrix is then calculated from the cross-tabulation. This matrix is used to attempt to "summarise" the similarities between the variables in terms of a smaller number of reference factors, which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of variables (the factors), which can help to explain better why the observed frequency differences occur.

Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts one might find that words in a certain conceptual field (e.g. religion) receive high loadings on one factor, whereas those in another field (e.g. government) load highly on another factor.
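To make the idea of an intercorrelation matrix and loadings more concrete, here is a minimal sketch. It is not a full factor analysis: it substitutes a simpler stand-in, extracting only the first principal component of the correlation matrix by power iteration. The word counts are invented, and the names counts, pearson, corr and loadings are assumptions for the example.

```python
import math

# Hypothetical frequencies of four words (variables) in five texts (samples).
counts = {
    "god":      [9, 7, 8, 1, 0],
    "church":   [8, 6, 9, 0, 1],
    "tax":      [1, 0, 2, 7, 9],
    "minister": [0, 1, 1, 8, 7],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

words = list(counts)

# Intercorrelation matrix: every variable correlated with every other.
corr = [[pearson(counts[a], counts[b]) for b in words] for a in words]

# Power iteration: repeatedly multiply a vector by the matrix and
# renormalise, which converges on the dominant eigenvector.
vec = [1.0] + [0.0] * (len(words) - 1)
for _ in range(200):
    nxt = [sum(r * v for r, v in zip(row, vec)) for row in corr]
    norm = math.sqrt(sum(x * x for x in nxt))
    vec = [x / norm for x in nxt]

# Eigenvalue of that component, then loadings = eigenvector * sqrt(eigenvalue).
eigenvalue = sum(v * sum(r * w for r, w in zip(row, vec))
                 for v, row in zip(vec, corr))
loadings = {w: v * math.sqrt(eigenvalue) for w, v in zip(words, vec)}

for word, loading in loadings.items():
    print(f"{word:10s} {loading:+.2f}")
```

With this toy data the religious and governmental words load at opposite ends of a single component. A genuine factor analysis would extract several factors and usually rotate them, so that each conceptual field could load highly on a factor of its own.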

Follow this link for an example of factor analysis.

Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.

Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values. For example, the highest correlation value receives a rank order of 1, the next highest receives a rank order of 2, and so on. MDS then attempts to plot and arrange these variables so that the more closely related items are plotted closer together than the less closely related items.
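The rank-order conversion described above can be sketched as follows. The correlation values are invented for illustration, ties are not handled, and the later plotting stage of MDS is not shown.

```python
# Hypothetical correlation coefficients for pairs of variables.
correlations = {
    ("god", "church"): 0.96,
    ("tax", "minister"): 0.94,
    ("god", "tax"): -0.95,
    ("church", "minister"): -0.90,
}

# Sort the pairs from highest to lowest correlation.
ordered = sorted(correlations, key=correlations.get, reverse=True)

# Replace each coefficient with its rank: 1 for the highest, 2 for the next.
ranks = {pair: rank for rank, pair in enumerate(ordered, start=1)}

for pair, rank in ranks.items():
    print(rank, pair, correlations[pair])
```

Only the ordering of the coefficients survives this step, and it is this ordering that the plotted distances are then arranged to respect.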

Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items. A matrix is created, in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.
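One way of carrying out this grouping can be sketched as below. The sketch assumes Euclidean distance between frequency profiles and single-linkage agglomerative clustering, which is only one of several possible methods. The counts and the choice to stop at two clusters are invented for the example.

```python
import math
from itertools import combinations

# Hypothetical frequencies of four words (variables) in five texts (samples).
counts = {
    "god":      [9, 7, 8, 1, 0],
    "church":   [8, 6, 9, 0, 1],
    "tax":      [1, 0, 2, 7, 9],
    "minister": [0, 1, 1, 8, 7],
}

def euclidean(x, y):
    """Straight-line distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Distance matrix: degree of difference between every pair of variables.
distance = {
    frozenset((a, b)): euclidean(counts[a], counts[b])
    for a, b in combinations(counts, 2)
}

def linkage(c1, c2):
    """Single linkage: distance between the two closest members."""
    return min(distance[frozenset((a, b))] for a in c1 for b in c2)

# Start with every variable in its own cluster, then repeatedly merge
# the two closest clusters until the target number remains.
clusters = [{w} for w in counts]
target = 2  # assumption: stop when two clusters are left
while len(clusters) > target:
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
    )
    clusters[i] |= clusters[j]
    del clusters[j]

print([sorted(c) for c in clusters])
```

With this toy data the religious words and the governmental words end up in separate clusters, since their frequency profiles across the five texts are closest to one another.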

Read more about cluster analysis in Corpus Linguistics, Chapter 3, pages 76, 78 and 79.