Automatic Categorization of Statute Documents

Tom Curran, Paul Thompson


Automatic classification offers publishers of large document collections the possibility of improved production efficiencies in print and online environments. In this paper we explore the possibility of automating the classification of statutory legal materials by applying machine learning software designed for automatic text categorization. Our investigation focuses on one methodology: training classifiers from a pre-classified dataset of statute documents and their associated index references. We observed that each index feature, such as 'insurance' or 'corporations', has an appended set of document locators. The documents identified by these locators make up the local collection for that index feature. The total of all documents in the dataset, whether assigned an index feature or not, makes up the global collection. The fundamental idea was to develop an algorithm based on text features whose frequency in the local collection was high but whose frequency in the global collection was moderate to low. The system is provided with a set of descriptors taken from the text of statute documents, from which it algorithmically generates a lexicon. The lexicon is evaluated by domain experts, who assess its relationship to the semantic content of the index feature being modeled. Once a satisfactory lexicon has been created, machine learning software is used to generate classification rules from the lexicon. The rules in turn generate classifications for documents in a test collection.
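The local-versus-global frequency idea can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that frequency means the fraction of documents containing a term, and the names `doc_frequency`, `build_lexicon`, `min_local`, and `max_global`, the threshold values, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import Counter


def doc_frequency(docs):
    """Return, for each term, the fraction of documents in `docs` that contain it."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each term at most once per document
    n = len(docs)
    return {term: c / n for term, c in counts.items()}


def build_lexicon(global_docs, local_ids, min_local=0.5, max_global=0.4):
    """Select terms frequent in the local collection but not in the global one.

    global_docs: dict mapping a document locator to its list of tokens.
    local_ids:   locators attached to one index feature (the local collection).
    The thresholds are assumed values, not taken from the paper.
    """
    local_df = doc_frequency([global_docs[i] for i in local_ids])
    global_df = doc_frequency(list(global_docs.values()))
    lexicon = [
        term
        for term, freq in local_df.items()
        if freq >= min_local and global_df[term] <= max_global
    ]
    # Most locally frequent terms first, ties broken alphabetically.
    return sorted(lexicon, key=lambda t: (-local_df[t], t))


# Toy global collection: six statute sections, already tokenized.
global_docs = {
    "s1": ["insurer", "policy", "premium", "shall"],
    "s2": ["policy", "insurer", "claim", "shall"],
    "s3": ["corporation", "director", "shall"],
    "s4": ["corporation", "shareholder", "shall"],
    "s5": ["highway", "vehicle", "shall"],
    "s6": ["tax", "assessment", "shall"],
}
# Locators appended to the index feature 'insurance' (its local collection).
insurance_ids = ["s1", "s2"]

lexicon = build_lexicon(global_docs, insurance_ids)
print(lexicon)  # ['insurer', 'policy', 'claim', 'premium']
```

In this toy data, 'shall' appears in every local document but also in every global document, so it is excluded. A candidate lexicon of this kind would then go to domain experts for review before any classification rules are learned from it.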
