Author(s): Moisl HL
Abstract: The advent of large electronic text corpora has generated a range of technologies for their search and interpretation. Variation in document length can be a problem for these technologies, and several normalization methods for mitigating its effects have been proposed. This paper assesses the effectiveness of such methods in specific relation to exploratory multivariate analysis. The discussion is in four main parts. The first part states the problem, the second describes some normalization methods, the third identifies poor estimation of the population probability of variables as a factor that compromises the effectiveness of the normalization methods for very short documents, and the fourth proposes elimination of data matrix rows representing documents which are too short to be reliably normalized and suggests ways of identifying the relevant documents.
Keywords: Exploratory multivariate analysis, document length normalization, cluster analysis
Dr Hermann Moisl