Publication:

Using electronic corpora in historical dialectology research: the problem of document length variation (2009)

Author(s): Moisl HL

    Abstract: The proliferation of computational technology has generated an explosive production of electronically encoded information of all kinds. In the face of this, traditional philological methods for search and interpretation of data have been overwhelmed by volume, and a variety of computational methods have been developed in an attempt to make the deluge tractable. These developments have clear implications for corpus-based linguistics in general, and for corpus-based study of historical dialectology in particular: as more and larger historical text corpora become available, effective analysis of them will increasingly be tractable only by adapting the interpretative methods developed by the statistical, information retrieval, pattern recognition, and related communities. To use such analytical methods effectively, however, issues that arise with respect to the abstraction of data from corpora have to be understood. This paper addresses an issue that has a fundamental bearing on the validity of analytical results based on such data: variation in document length. The discussion is in four main parts. The first part shows how a particular class of computational methods, exploratory multivariate analysis, can be used in historical dialectology research; the second explains why variation in document length can be a problem in such analysis; the third proposes document length normalization as a solution to that problem; and the fourth points out some difficulties associated with document length normalization.

      • Book Title: Studies in English and European Historical Dialectology
      • Pages: 67-90
      • Publisher: Peter Lang
      • Publication type: Book chapter
      • Bibliographic status: Published

Staff

Dr Hermann Moisl
Senior Lecturer