A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization

M. F. Caropreso, S. Matwin, and F. Sebastiani. Text Databases and Document Management: Theory and Practice, chapter A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, pages 78–102. Idea Group Publishing, Hershey, US, 2001. [url]

——————————

This paper presents an n-gram indexing technique for Text Classification (TC), together with evidence for its use. A phrase is a textual unit usually larger than a word but smaller than a full sentence. Phrases are less ambiguous than their constituent words, because the words in a phrase disambiguate one another.

An n-gram is an alphabetically ordered sequence of n unigrams. The authors’ learner-independent study shows that the feature evaluation functions routinely used in text categorization experiments tend to score many bigrams higher than unigrams that they would themselves select in a unigram-only feature selection task, sometimes giving rise to too high a bigram “penetration level”.
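
To make the two notions above concrete, here is a minimal Python sketch. It is not the authors' code: the function names, the toy scores, and the reading of "penetration level" as the share of bigrams among the top-k selected features are all assumptions made for illustration. The paper's own preprocessing is not reproduced here.

```python
def ordered_ngrams(tokens, n=2):
    """Build n-grams from consecutive tokens, sorting the words of each
    n-gram alphabetically so that word order inside the phrase is ignored."""
    return [" ".join(sorted(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)]


def penetration_level(scored_features, k):
    """Assumed definition: fraction of the k highest-scoring features
    that are bigrams (recognised here by the space they contain)."""
    top = sorted(scored_features.items(),
                 key=lambda item: item[1], reverse=True)[:k]
    if not top:
        return 0.0
    return sum(1 for feature, _ in top if " " in feature) / len(top)


tokens = ["text", "categorization", "machine", "learning"]
bigrams = ordered_ngrams(tokens)
# ["categorization text", "categorization machine", "learning machine"]

# Hypothetical feature scores, invented for this example only.
scores = {
    "text": 0.2,
    "machine": 0.5,
    "categorization text": 0.9,
    "learning machine": 0.7,
}
top2 = penetration_level(scores, k=2)  # both top features are bigrams
top4 = penetration_level(scores, k=4)  # half of the features are bigrams
```

With these toy scores the two best features are both bigrams, which is the kind of situation the authors describe: bigrams crowd out unigrams that a unigram-only selection would have kept.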
