Genre-Based In-Document Content Type Classification

Bei Yu, Duane Searsmith


This paper presents an in-document content classification approach that combines genre analysis with shallow natural language processing techniques to classify content at the document-segment level. Given a document in a particular genre, we can classify the content of each segment (e.g., a paragraph) based on the recognized content types and typical linguistic features of that genre. The informal evaluative document is chosen as the test genre, and online consumer reviews are used as the test data set. The classification results support our hypothesis that the content type of a segment in a document of a particular genre can be predicted from its linguistic features. This approach may be used as a component in faceted search, multi-document summarization, and other information processing applications.
