THE INTELLIGENT CODER: DEVELOPING A MACHINE-LEARNING CLASSIFICATION SYSTEM
The ABS has developed the Intelligent Coder, a text classification application suited to the needs of a National Statistical Office (NSO) in classifying short free-text responses to large classification hierarchies, such as ANZSCO (the Australian and New Zealand Standard Classification of Occupations) and ANZSIC (the Australian and New Zealand Standard Industrial Classification).
A large amount of effort is expended by NSOs in developing complete hierarchical descriptions of statistical classifications of interest, like industry, occupation, education, commodity, language or country of origin. However, it is unreasonable to expect survey respondents to be able to volunteer their relevant code. It falls to the NSO itself to receive respondents’ descriptions of their relevant characteristics and map these descriptions to the standard classification code.
The original and most widely accepted way of mapping descriptions to classifications is clerical coding, in which officers are trained to have a full understanding of a set of classification hierarchies and then manually assign classification codes to responses. Clerical coding is expensive and time-consuming, so automated solutions must be pursued.
The ABS has long used an index-based coder for automated classification. This involves the creation of an index file: a set of rules that map the presence of particular words and phrases to the code that should be assigned. This is an attempt to mechanise the heuristics that a clerical coder might use to assign codes, and it succeeds in speeding up the classification process, as large numbers of records can be classified very quickly.
Identifying patterns in responses and codifying them as rules is itself a task that can be automated: instead of an index file being created manually, an automated procedure can be given a set of coded examples and determine the optimal rules for classification from them. This allows the creation and refinement of classification rules to proceed much more quickly, as long as coded example records exist. The Intelligent Coder is such a procedure.
The Intelligent Coder represents text as points in vocabulary space, implements a hierarchical multi-class classification algorithm, and replicates the classification algorithm to ensure that generalisation to unseen data can be judged.
Text responses are converted to numeric vectors by the bag-of-words approach, in which a vocabulary of all unique words is compiled from the available text data. An individual text record is then represented as a binary vector of the same length as the vocabulary list, with 1 for each vocabulary word that is contained in the record and 0 for each word that is not. This can be thought of as representing a record as a point in vocabulary space. The bag-of-words approach does not respect the order, the importance, or the context of words in a record. However, the presence or absence of words probably carries most of the distinguishing information in records, because descriptions provided by respondents tend to be semantically simple and terse.
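The binary bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the Intelligent Coder's own code: the function names, the whitespace tokenisation, the lower-casing and the example responses are all assumptions made for the sketch.

```python
def build_vocabulary(records):
    """List every unique word seen in the records (assumed: lower-case, split on whitespace)."""
    words = set()
    for text in records:
        words.update(text.lower().split())
    return sorted(words)


def to_binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in the record and 0 otherwise."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical occupation responses, invented for illustration.
records = ["primary school teacher", "school bus driver", "truck driver"]
vocabulary = build_vocabulary(records)
vectors = [to_binary_vector(record, vocabulary) for record in records]
```

Each vector has one position per vocabulary word, so word order and repetition are discarded, as the text notes.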
Rather than implementing a natively multiclass classification algorithm, the Intelligent Coder classifies records by combining a set of binary support vector machine classifiers. Specifically, a record begins at the root of the classification tree and is recursively classified to the most likely child node, where “most likely” is judged by two factors: the set of binary classifiers that combine to classify to the set of child nodes, and the confidence that the coder has for that child node. The binary classifiers can be combined by creating a binary classifier for every pair of child nodes; a record is then assigned to the child node chosen by the most classifiers.
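The pairwise voting and the recursive descent through the classification tree might look roughly like the following sketch. It makes several assumptions: the toy keyword rules stand in for trained binary support vector machines, the codes A, B, C, A1 and A2 are invented, and the confidence factor mentioned above is left out so that only the voting step is shown.

```python
from collections import Counter
from itertools import combinations


def keyword_rule(a, b, keywords_for_a):
    """Toy stand-in for a trained binary SVM: choose a if any keyword is present, else b."""
    def decide(words):
        return a if words & keywords_for_a else b
    return decide


def one_vs_one_vote(words, children, pairwise):
    """Run one binary classifier per pair of child nodes and return the most-voted child."""
    votes = Counter()
    for a, b in combinations(children, 2):
        votes[pairwise[(a, b)](words)] += 1
    return votes.most_common(1)[0][0]


def classify(words, tree, pairwise, node="root"):
    """Descend from the root, moving to the winning child until a leaf is reached."""
    children = tree.get(node, [])
    if not children:
        return node
    if len(children) == 1:
        best = children[0]
    else:
        best = one_vs_one_vote(words, children, pairwise)
    return classify(words, tree, pairwise, best)


# Hypothetical two-level hierarchy and pairwise classifiers.
tree = {"root": ["A", "B", "C"], "A": ["A1", "A2"]}
pairwise = {
    ("A", "B"): keyword_rule("A", "B", {"teacher"}),
    ("A", "C"): keyword_rule("A", "C", {"teacher"}),
    ("B", "C"): keyword_rule("B", "C", {"driver"}),
    ("A1", "A2"): keyword_rule("A1", "A2", {"primary"}),
}
code = classify({"primary", "school", "teacher"}, tree, pairwise)
```

With k child nodes this scheme needs k(k-1)/2 binary classifiers at that node, which stays manageable because each classifier only has to separate siblings rather than every code in the hierarchy.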
The confidence that the coder has is estimated by bagging: resampling from the training data with replacement and creating an independent coder for each resample. These coders then vote on each record, and the vote is used to evaluate the confidence of the classification.
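As a rough illustration of bagging, the sketch below resamples a small training set with replacement, builds one coder per resample, and reports the share of coders that agree on the winning code. The word-overlap coder is a deliberately simple stand-in for the real classifier, and the codes "T" and "D", the training examples, the number of coders and the seed are all assumptions made for the example. Using the vote share as the confidence measure is also an assumption; the text above does not say how the vote is converted into a confidence.

```python
import random
from collections import Counter


def train_coder(examples):
    """Stand-in coder: return the code of the training example sharing the most words."""
    def predict(text):
        words = set(text.lower().split())
        best = max(examples, key=lambda ex: len(words & set(ex[0].lower().split())))
        return best[1]
    return predict


def bagged_coders(training, n_coders, seed=0):
    """Train one coder per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    coders = []
    for _ in range(n_coders):
        resample = rng.choices(training, k=len(training))
        coders.append(train_coder(resample))
    return coders


def vote(coders, text):
    """Return the most-voted code and the fraction of coders that chose it."""
    votes = Counter(coder(text) for coder in coders)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(coders)


# Hypothetical coded examples: "T" for teaching occupations, "D" for driving occupations.
training = [
    ("primary school teacher", "T"),
    ("high school teacher", "T"),
    ("music teacher", "T"),
    ("truck driver", "D"),
    ("bus driver", "D"),
    ("delivery driver", "D"),
]
coders = bagged_coders(training, n_coders=25)
winner, confidence = vote(coders, "truck driver")
```

A record on which the resampled coders disagree receives a low vote share, which could be used to route it to clerical coding rather than accepting the automated code.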
The Intelligent Coder was trained and tested on a set of text responses to questions about occupation and industry collected between 2013 and 2015, which had been classified using an index-based coder with clerical coding for remaining records. Initial results showed that with little effort the Intelligent Coder could increase the rate of automated coding by 20% without a degradation in accuracy.
For more information, please contact Rory Tarnow-Mordi at Rory.Tarnow-Mordi@abs.gov.au.