Published in

Association for Information Science and Technology (ASIS&T), Journal of the American Society for Information Science and Technology, 60(8), pp. 1635-1651, 2009

DOI: 10.1002/asi.21075

How to Normalize Cooccurrence Data? An Analysis of Some Well-Known Similarity Measures

Journal article published in 2009 by Nees Jan van Eck and Ludo Waltman
This paper is available in a repository.


Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

In scientometric research, the use of co-occurrence data is very common. In many cases, a similarity measure is employed to normalize the data. However, there is no consensus among researchers on which similarity measure is most appropriate for normalization purposes. In this paper, we theoretically analyze the properties of similarity measures for co-occurrence data, focusing in particular on four well-known measures: the association strength, the cosine, the inclusion index, and the Jaccard index. We also study the behavior of these measures empirically. Our analysis reveals that there exist two fundamentally different types of similarity measures, namely set-theoretic measures and probabilistic measures. The association strength is a probabilistic measure, while the cosine, the inclusion index, and the Jaccard index are set-theoretic measures. Both our theoretical and our empirical results indicate that co-occurrence data can best be normalized using a probabilistic measure. This provides strong support for the use of the association strength in scientometric research.
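For readers who want to experiment with the four measures named in the abstract, the following is a minimal Python sketch based on their commonly used formulations, where c_ij denotes the number of co-occurrences of items i and j and s_i and s_j denote the total occurrence counts of the two items. The variable names are ours, and the association strength is given here in the form that omits any proportionality constant (the paper notes that such constants do not affect relative comparisons); consult the paper itself for the exact definitions it analyzes.

import math

def association_strength(c_ij, s_i, s_j):
    # Probabilistic measure: observed co-occurrences relative to the
    # product of the occurrence counts, i.e. proportional to the ratio
    # of observed to expected co-occurrences under independence.
    return c_ij / (s_i * s_j)

def cosine(c_ij, s_i, s_j):
    # Set-theoretic measure: co-occurrences divided by the geometric
    # mean of the two occurrence counts.
    return c_ij / math.sqrt(s_i * s_j)

def inclusion_index(c_ij, s_i, s_j):
    # Set-theoretic measure: co-occurrences divided by the smaller of
    # the two occurrence counts.
    return c_ij / min(s_i, s_j)

def jaccard_index(c_ij, s_i, s_j):
    # Set-theoretic measure: co-occurrences divided by the size of the
    # union of the two occurrence sets.
    return c_ij / (s_i + s_j - c_ij)

# Hypothetical example: two terms occurring 100 and 400 times,
# co-occurring 50 times.
c, si, sj = 50, 100, 400
print(association_strength(c, si, sj))  # 0.00125
print(cosine(c, si, sj))                # 0.25
print(inclusion_index(c, si, sj))       # 0.5
print(jaccard_index(c, si, sj))         # 50/450, approx. 0.111

The example illustrates the distinction the paper draws: the three set-theoretic measures depend only on the overlap relative to the set sizes, whereas the association strength compares the observed co-occurrence frequency with what would be expected if the two items occurred independently.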