Orthography

It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical scanning task. Even with a basic machine-readable text, however, issues of encoding are vital, although their extent may not be apparent to English speakers at first.

In languages other than English, accents and non-Roman writing systems such as Greek, Russian and Japanese present problems. IBM-compatible computers are capable of handling accented characters, but many mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Native speakers of languages which contain accents have adopted various strategies when using computers or typewriters which lack these characters. For example, French speakers omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" after the relevant letter or place a double quote mark before it, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems: in the French case information is lost, while in the German case extraneous information is added.
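The three fallback strategies described above can be sketched in Python. This is only an illustration: the function names and the mapping tables are assumptions made for the example, not part of any standard, and the tables cover only the German umlauted vowels.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """French-style fallback: drop every accent, keeping only the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Illustrative tables (assumed for this sketch), umlauted vowels only.
UMLAUT_WITH_E = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
UMLAUT_BASE = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"}

def umlaut_with_e(text: str) -> str:
    """German fallback 1: write the base vowel followed by an extra 'e'."""
    text = unicodedata.normalize("NFC", text)  # make sure umlauts are single characters
    return "".join(UMLAUT_WITH_E.get(ch, ch) for ch in text)

def umlaut_with_quote(text: str) -> str:
    """German fallback 2: put a double quote mark before the base vowel."""
    text = unicodedata.normalize("NFC", text)
    return "".join('"' + UMLAUT_BASE[ch] if ch in UMLAUT_BASE else ch for ch in text)

print(strip_accents("Hélène"))        # Helene
print(umlaut_with_e("Frühling"))      # Fruehling
print(umlaut_with_quote("Frühling"))  # Fr"uhling
```

The loss of information in the French case shows up directly: strip_accents gives the same result for "Hélène" and for an unaccented "Helene", so the original spelling cannot be recovered from the output. In the German case the added "e" cannot be told apart from an "e" that was already in the word, as in "aktuell".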

In response to this, the TEI has suggested that these characters be encoded as TEI entities, delimited by the characters & and ;. Thus, ü would be encoded by the TEI as

&uumlaut;

You can also read about the handling of non-Roman alphabets and the transcription of spoken data in Corpus Linguistics, chapter 2, pages 34-36.
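As a rough sketch of how the entity scheme described above avoids the problems of the ad hoc strategies, the following Python fragment encodes and decodes text using & and ; as delimiters. The table ENTITY_NAMES and both function names are hypothetical; the only entity name taken from the text is the one given for ü, and any further entries would have to come from the relevant entity set.

```python
import re

# Hypothetical table for this sketch: only the entry for "ü" comes from the text.
ENTITY_NAMES = {"ü": "uumlaut"}
CHAR_FOR_NAME = {name: ch for ch, name in ENTITY_NAMES.items()}

def encode_entities(text: str) -> str:
    """Replace each character in the table with &name; and leave the rest alone."""
    return "".join(f"&{ENTITY_NAMES[ch]};" if ch in ENTITY_NAMES else ch for ch in text)

def decode_entities(text: str) -> str:
    """Turn known &name; entities back into characters; unknown ones are kept as-is."""
    def restore(match: re.Match) -> str:
        return CHAR_FOR_NAME.get(match.group(1), match.group(0))
    return re.sub(r"&(\w+);", restore, text)

encoded = encode_entities("Frühling")
print(encoded)                                 # Fr&uumlaut;hling
print(decode_entities(encoded) == "Frühling")  # True
```

Because the delimiters mark exactly where the entity starts and ends, the original character can be restored without guessing, which is not true of the Helene or Fruehling spellings. The sketch does not deal with text that already contains a literal & character.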