Publication:

Normalization for Variation in Document Length in Exploratory Multivariate Analysis of Text Corpora (2008)

Author(s): Moisl HL

Abstract: The advent of large electronic text corpora has generated a range of technologies for their search and interpretation. Variation in document length can be a problem for these technologies, and several normalization methods for mitigating its effects have been proposed. This paper assesses the effectiveness of such methods in specific relation to exploratory multivariate analysis. The discussion is in four main parts. The first part states the problem, the second describes some normalization methods, the third identifies poor estimation of the population probability of variables as a factor that compromises the effectiveness of the normalization methods for very short documents, and the fourth proposes elimination of data matrix rows representing documents which are too short to be reliably normalized and suggests ways of identifying the relevant documents.
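The general idea in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the abstract does not say which normalization methods are assessed or how the length threshold is chosen. The sketch assumes a document-by-variable frequency matrix held as a list of rows, scales each row to the mean document length (one common normalization, chosen here only as an example), and drops rows shorter than an arbitrary, hypothetical `min_length`. The function names are likewise illustrative.

```python
def drop_short_documents(matrix, min_length):
    """Keep only rows whose total frequency is at least min_length.

    min_length is a hypothetical cut-off; the abstract only says that
    the paper suggests ways of identifying documents that are too short.
    """
    return [row for row in matrix if sum(row) >= min_length]


def normalize_by_mean_length(matrix):
    """Rescale every row so that its total equals the mean row total.

    Assumes no row sums to zero; filter short documents first.
    """
    lengths = [sum(row) for row in matrix]
    mean_length = sum(lengths) / len(lengths)
    return [
        [value * mean_length / length for value in row]
        for row, length in zip(matrix, lengths)
    ]


# Toy frequency matrix: one row per document, one column per variable.
counts = [
    [40, 60, 100],  # 200 tokens
    [4, 6, 10],     # 20 tokens, same proportions as the first document
    [1, 0, 1],      # 2 tokens, too short for its proportions to be reliable
]

kept = drop_short_documents(counts, min_length=10)
normalized = normalize_by_mean_length(kept)
```

In this toy data the first two documents have identical relative frequencies, so after normalization their rows coincide; the two-token document is removed before normalization because its proportions rest on too few observations.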

• Date: 27-29 March 2008
• Conference Name: INFOS2008: 6th International Conference on Informatics and Systems
• Publication type: Conference Proceedings (inc. abstract)
• Bibliographic status: Unknown

Keywords: Exploratory multivariate analysis, document length normalization, cluster analysis

Staff

Dr Hermann Moisl
Senior Lecturer