Australian Bureau of Statistics
1504.0 - Methodological News, Dec 2013
Released at 11:30 AM (Canberra time) 11/12/2013
Data Mining and Editing
Various data mining algorithms are being explored with a view to improving our data editing processes, especially for large administrative datasets.
To develop an editing strategy for a new dataset, we need to understand the relationships between the variables, learn how to detect anomalous observations, and develop criteria for detecting and treating errors. This is often an exploratory process, using a variety of ad hoc methods.
As the ABS handles more data every year, there is an increasing emphasis on automating as much of this process as possible. One special challenge is to find appropriate edit rules: that is, to describe a set of logical constraints that error-free data should satisfy. Traditionally this has required a great deal of input from subject matter specialists and has been a very labour-intensive process.
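To make the idea of edit rules concrete, the sketch below expresses a few as boolean predicates over a record. The rules and field names (age, employed, income) are illustrative inventions, not taken from any ABS dataset.

```python
# Hypothetical edit rules expressed as predicates over a record (a dict).
# The variable names and thresholds are illustrative only.
EDIT_RULES = [
    ("child_not_employed", lambda r: not (r["age"] < 15 and r["employed"])),
    ("income_nonnegative", lambda r: r["income"] >= 0),
]

def failed_edits(record):
    """Return the names of edit rules this record violates."""
    return [name for name, ok in EDIT_RULES if not ok(record)]

# This record violates both rules: an employed 12-year-old with negative income.
record = {"age": 12, "employed": True, "income": -50}
print(failed_edits(record))
```

Automating the discovery of such predicates, rather than eliciting each one from a subject matter specialist, is the goal of the methods described below.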
Data mining offers a possible solution to this challenge, using machine learning algorithms to identify outliers and to find and describe patterns in the data. So far, we have obtained promising results from three different methods.
Cluster analysis is a collection of methods for sorting records into related groups. Hierarchical agglomerative clustering does this iteratively: start by finding the two most similar records, which form the first group; at each subsequent step, merge the closest pair of groups (treating each ungrouped record as a group of one); repeat until all records are joined into a small number of groups. This process is easy to automate, and it lets us identify which records join a nearby group quickly and which remain isolated until near the end of the process. The latter are the records most likely to be anomalous.
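The merging process above can be sketched in a few lines. This is a minimal single-linkage agglomerative clustering over one-dimensional values, recording the merge step at which each record first joins another record; the data values are invented for illustration.

```python
# Minimal single-linkage agglomerative clustering on 1-D values,
# tracking how late each record joins a multi-record cluster.
def agglomerate(values):
    """Repeatedly merge the two closest clusters; return, for each record,
    the merge step at which it first joined another record."""
    clusters = [{i} for i in range(len(values))]
    join_step = [None] * len(values)
    step = 0
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        a, b = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(values[i] - values[j])
                               for i in clusters[ab[0]] for j in clusters[ab[1]]),
        )
        step += 1
        for i in clusters[a] | clusters[b]:
            if join_step[i] is None:
                join_step[i] = step
        clusters[a] |= clusters[b]
        del clusters[b]
    return join_step

# Illustrative data: 999.0 is far from the rest, so it joins last.
data = [1.0, 1.1, 0.9, 1.2, 999.0]
steps = agglomerate(data)
print(steps.index(max(steps)))  # -> 4, the most isolated record
```

The record with the largest join step is the one that stayed isolated longest, and hence the leading candidate for anomaly review.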
Random forests fall into the class of ensemble methods. These use repetition and randomness to improve the decision making process. A large number of decision trees are generated, each one representing a simple model of the data, and each one incorporating some randomness. The average of all these models gives more precise results than any single model. Furthermore, each root-to-leaf path through a decision tree can be read as a candidate edit rule.
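The last point can be illustrated by walking a decision tree and printing each root-to-leaf path as a rule. The tree below is hand-built for the example (in practice the trees would be learned from data), and the field names and constraints are hypothetical.

```python
# A hand-built decision tree as nested dicts; field names are illustrative.
# Internal nodes split on (variable, threshold); leaves carry a constraint.
tree = {
    "test": ("age", 15),                 # split: age < 15 ?
    "lt":  {"leaf": "employed should be False"},
    "ge":  {
        "test": ("hours", 1),            # split: hours < 1 ?
        "lt":  {"leaf": "income should be 0"},
        "ge":  {"leaf": "no constraint"},
    },
}

def branches_to_rules(node, conditions=()):
    """Read each root-to-leaf path off the tree as a candidate edit rule."""
    if "leaf" in node:
        return [(" and ".join(conditions) or "always", node["leaf"])]
    var, threshold = node["test"]
    rules = []
    rules += branches_to_rules(node["lt"], conditions + (f"{var} < {threshold}",))
    rules += branches_to_rules(node["ge"], conditions + (f"{var} >= {threshold}",))
    return rules

for condition, implication in branches_to_rules(tree):
    print(f"IF {condition} THEN {implication}")
```

Each printed line has the IF-THEN shape of a traditional edit rule, which is what makes tree ensembles attractive for rule discovery.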
Association rule mining was originally applied to supermarket transactions: which items are typically purchased together? The same question can be asked of survey data and administrative data: which characteristics or responses are typically found together in the same record? There are several algorithms for discovering such associations, and we are investigating ways to express them as edit rules.
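As a toy illustration of the counting step behind such algorithms, the sketch below makes a single Apriori-style pass over a set of records, keeping the pairs of attributes whose co-occurrence frequency (support) meets a threshold. The attribute names are invented for the example.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count attribute pairs across records and keep those whose
    support (fraction of records containing both) meets min_support."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Illustrative survey-style records: sets of attributes present in each record.
transactions = [
    {"employed", "has_income", "urban"},
    {"employed", "has_income"},
    {"student", "urban"},
    {"employed", "has_income", "urban"},
]
print(frequent_pairs(transactions, min_support=0.5))
```

A strong association such as "employed" with "has_income" (support 0.75 here) suggests an edit rule: a record with one attribute but not the other may deserve scrutiny.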
These three methods, and other machine learning algorithms, are potentially valuable tools for understanding new sources of data and for streamlining our existing processes. They may also be applicable to other areas, such as modelling for small area estimation and output validation for population surveys.
This page last updated 25 March 2014