Use of Subject Field Codes from a Machine-Readable Dictionary for Automatic Classification of Documents

Authors

  • Elizabeth D. Liddy, Syracuse, New York
  • Woojin Paik, Syracuse, New York
  • Joseph K. Woelfel, Syracuse, New York

DOI:

https://doi.org/10.7152/acro.v3i1.12598

Abstract

We are currently developing a system whose goal is to emulate a human classifier who peruses a large set of documents and sorts them into richly defined classes based solely on the subject content of the documents. To accomplish this task, our system tags each word in a document with the appropriate Subject Field Code (SFC) from a machine-readable dictionary. The within-document SFCs are then summed and normalized, and each document is represented as a vector of the SFCs occurring in that document. These vectors are clustered using Ward's agglomerative clustering algorithm (Ward, 1963) to form classes in a document database. For retrieval, queries are likewise represented as SFC vectors and then matched to the prototype SFC vector of each cluster in the database. Clusters whose prototype SFC vectors meet a predetermined criterion of similarity to the query SFC vector are passed on to other system components for more computationally expensive representation and matching.
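The pipeline described in the abstract can be illustrated with a minimal sketch in Python. Everything specific here is an assumption made for illustration rather than a detail taken from the paper: the toy lexicon `SFC_LEXICON` and its codes are hypothetical, normalization is taken to mean dividing by the total SFC count, the cluster prototype is taken to be the centroid, similarity is taken to be cosine similarity, and the `threshold` value is arbitrary. The sketch also assigns one code per word and ignores the sense disambiguation that choosing the "appropriate" SFC would require.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon (word -> Subject Field Code); a real system
# would read these codes from a machine-readable dictionary.
SFC_LEXICON = {
    "bank": "EC", "loan": "EC", "interest": "EC",
    "virus": "MD", "vaccine": "MD", "doctor": "MD",
    "court": "LW", "judge": "LW",
}
CODES = sorted(set(SFC_LEXICON.values()))  # fixed vector dimensions


def sfc_vector(text):
    """Tag words with SFCs, sum them, and normalize by the total count
    (the normalization scheme is an assumption)."""
    counts = Counter(SFC_LEXICON[w] for w in text.lower().split()
                     if w in SFC_LEXICON)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODES]


def centroid(vectors):
    """Mean vector of a cluster, used here as its prototype (assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_clusters(vectors, k):
    """Naive Ward agglomeration: repeatedly merge the cheapest pair
    until k clusters remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def matching_clusters(query, clusters, threshold=0.8):
    """Indices of clusters whose prototype is similar enough to the query.
    The threshold is a hypothetical stand-in for the predetermined criterion."""
    q = sfc_vector(query)
    return [i for i, c in enumerate(clusters)
            if cosine(q, centroid(c)) >= threshold]


docs = ["bank loan interest", "loan interest bank",
        "virus vaccine doctor", "doctor vaccine"]
clusters = ward_clusters([sfc_vector(d) for d in docs], k=2)
selected = matching_clusters("loan from the bank", clusters)
```

In this sketch only the clusters listed in `selected` would be handed to later, more expensive matching stages, which is the filtering role the abstract assigns to the SFC vectors. The pairwise search in `ward_clusters` is written for readability and would be too slow for a large document database.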

Published

1992-10-25