Asymmetric Classification: Constructing Channels from Sources in Real-Time

Authors

  • Ron G. Katriel, Acquire Media Corporation
  • Lawrence C. Rafsky, Acquire Media Corporation

DOI:

https://doi.org/10.7152/acro.v14i1.14115

Abstract

In this note we describe a system, developed under contract for major newspaper and newswire publishers (and currently deployed commercially), that constructs "topic channels" in real-time by simultaneously assigning one or more category codes to newspaper stories immediately upon publication and to newswire stories "on the run". The sources are diverse, but the category code taxonomy (developed by human domain experts) is unified. The system, named "Cogent" (COdinG ENgine Technology), is asymmetric in the following two senses: (1) speed of classification is far more important than speed of training, and (2) precision is far more important than recall. Both statements must be taken in the extreme: a failure to classify in under a few milliseconds and the inclusion of an irrelevant story in a channel are each considered system failures of the first magnitude, not mere "glitches". The approach is statistical, and the threshold adjustment used to favor precision over recall has a direct interpretation as a likelihood ratio. Novel aspects include a new feature selection algorithm that drastically reduces dimensionality, and the use of publisher-assigned metadata as features. Comparison with published results indicates that Cogent performs as well as the best available text categorizers for newswires but uses substantially fewer features and computational resources during classification.
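
To make the likelihood-ratio reading of the threshold concrete, the following is a minimal sketch and not Cogent's actual model, which the abstract does not specify. It assumes a simple multinomial, naive-Bayes-style scorer with add-one smoothing; the function names (train, log_likelihood_ratio, assign), the toy stories, and the threshold values are all hypothetical. A story is assigned to a channel only when the log of the ratio P(story | in channel) / P(story | not in channel) reaches the threshold, so raising the threshold trades recall for precision.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Per-term log-likelihood ratios log P(t|in) - log P(t|out).

    Hypothetical helper: add-one smoothing over a fixed, small vocabulary.
    """
    pos, neg = Counter(), Counter()
    for tokens, in_channel in zip(docs, labels):
        target = pos if in_channel else neg
        target.update(t for t in tokens if t in vocab)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        t: math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
        for t in vocab
    }


def log_likelihood_ratio(tokens, weights):
    """Sum of per-term ratios; terms outside the vocabulary contribute 0."""
    return sum(weights.get(t, 0.0) for t in tokens)


def assign(tokens, weights, threshold):
    """Assign the story to the channel only if the ratio clears the threshold."""
    return log_likelihood_ratio(tokens, weights) >= threshold


# Toy training data (illustrative only): True means "belongs to the channel".
docs = [
    ["merger", "acquisition", "shares"],
    ["merger", "deal"],
    ["football", "match"],
    ["weather", "storm", "shares"],
]
labels = [True, True, False, False]
vocab = {"merger", "acquisition", "deal", "shares",
         "football", "match", "weather", "storm"}
weights = train(docs, labels, vocab)

weak_story = ["merger", "shares"]
strong_story = ["merger", "acquisition", "deal"]

# A neutral threshold of 0 accepts both stories; a stricter threshold of 2.0
# (a log-ratio, i.e. evidence of roughly 7.4 to 1) keeps only the strong one.
lenient = [assign(s, weights, 0.0) for s in (weak_story, strong_story)]
strict = [assign(s, weights, 2.0) for s in (weak_story, strong_story)]
```

Classification in this sketch is a single pass of dictionary lookups over the story's terms, so a smaller vocabulary directly reduces the work done per story, while all counting is confined to training.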

Published

2003-10-01