Published in

Cambridge University Press, Natural Language Engineering, 4(4), p. 325-344

DOI: 10.1017/s1351324998002071

Finding a domain-appropriate sense inventory for semantically tagging a corpus

Journal article published in 1998 by Alessandro Cucchiarelli, Paola Velardi
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed
Postprint: archiving restricted
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

Semantically tagging a corpus is useful for many intermediate NLP tasks, such as the acquisition of word argument structures in sublanguages, the acquisition of syntactic disambiguation cues, and terminology learning. The general idea is that semantic tags allow the generalization of observed word patterns, and facilitate the discovery of recurrent sublanguage phenomena and selectional rules of various types. Yet, as opposed to POS tags in morphology, there is no consensus in the literature about the type and granularity of the semantic tags to be used. In this paper, we argue that an appropriate selection of semantic tags should be domain-dependent. We propose a method by which we select from WordNet an inventory of semantic tags that are ‘optimal’ for a given corpus, according to a scoring function defined as a linear combination of general and corpus-dependent performance factors. We believe that an optimal selection of a category inventory is a necessary premise for obtaining better results in all lexical learning algorithms that are based on, or concerned with, semantic categorization of words. Furthermore, an adequate inventory (one which intuitively ‘fits’ with the semantics of a domain, e.g. phenomenon for Natural Science, or part, piece for a technical handbook) may facilitate the manual annotation of large corpora.
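The scoring idea in the abstract, a linear combination of performance factors used to rank candidate tag inventories, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the factor names (`coverage`, `discrimination`), the weights, the candidate inventories, and all numeric values are hypothetical and are not taken from the paper, which does not list its factors in the abstract.

```python
from typing import Dict

Factors = Dict[str, float]


def score_inventory(factors: Factors, weights: Dict[str, float]) -> float:
    """Linear combination: sum over factors of weight * factor value."""
    return sum(weights[name] * factors[name] for name in weights)


def best_inventory(candidates: Dict[str, Factors],
                   weights: Dict[str, float]) -> str:
    """Return the name of the candidate inventory with the highest score."""
    return max(candidates, key=lambda name: score_inventory(candidates[name], weights))


# Hypothetical candidate inventories with made-up factor values in [0, 1].
candidates = {
    "coarse": {"coverage": 0.9, "discrimination": 0.2},
    "medium": {"coverage": 0.8, "discrimination": 0.6},
    "fine": {"coverage": 0.4, "discrimination": 0.9},
}

# Hypothetical weights; in practice these would be tuned per task or corpus.
weights = {"coverage": 0.5, "discrimination": 0.5}

for name, factors in candidates.items():
    print(name, round(score_inventory(factors, weights), 3))
print("selected:", best_inventory(candidates, weights))
```

With these illustrative numbers, the mid-granularity inventory wins because it balances the two factors; changing the weights shifts the selection toward coarser or finer inventories.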