UNSUPERVISED MACHINE LEARNING FOR DATA EDITING AND CORRECTION
Many administrative and survey datasets contain missing data and errors. Subject matter knowledge may be too limited to indicate which checks are needed to identify errors, and specifying a set of such edit rules is laborious. It is also not always clear how to impute missing item values in a way that is consistent with the reported items.
Supervised learning is already being explored in the ABS for situations where training data are available from historical editing by humans. In this situation the goal is to establish a model using the pre-edited data to predict an outcome such as "reported value is incorrect". Predictions from such a model can be used to guide future editing, but supervised machine learning can replicate biases that were inherent in the manual editing processes.
For this study, the proposed direction is to use machine learning or automated modelling techniques to explore features of a dataset, for example by fitting a model for the joint density of the reported items. Such a model would assign low probability to units whose item relationships appear infrequently in the dataset. Preference would be given to methods that can report these unusual relationships to a human expert, who can then determine whether they signify an error. The expert's decision can then be incorporated into edit rules, into a revised model that highlights items for correction, and into the imputation of new item values.
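As an illustration only, the idea of flagging low-probability units might be sketched as follows for categorical items. This is a minimal sketch and not the method proposed for the study: it assumes the joint density is estimated by simple relative frequencies of item combinations, and the function name `flag_rare_units`, the item names, the toy records and the 0.05 threshold are all hypothetical.

```python
from collections import Counter


def flag_rare_units(records, items, threshold=0.05):
    """Return units whose combination of item values is rare in the dataset.

    The joint distribution of `items` is estimated by relative frequency
    (an assumption made for this sketch; a real study might fit a richer
    model). Each flagged unit is returned as (index, item values, estimated
    probability) so that a human expert can review it.
    """
    combos = [tuple(record[item] for item in items) for record in records]
    counts = Counter(combos)
    n = len(combos)
    flagged = []
    for index, combo in enumerate(combos):
        probability = counts[combo] / n
        if probability < threshold:
            flagged.append((index, dict(zip(items, combo)), probability))
    return flagged


# Hypothetical toy dataset: 100 units with two reported items.
records = (
    [{"age_group": "adult", "employment": "employed"}] * 60
    + [{"age_group": "adult", "employment": "not employed"}] * 25
    + [{"age_group": "child", "employment": "not employed"}] * 14
    + [{"age_group": "child", "employment": "employed"}] * 1
)

flagged = flag_rare_units(records, ["age_group", "employment"])
for index, values, probability in flagged:
    print(index, values, round(probability, 3))
```

In this toy data only the single unit reporting a child as employed falls below the threshold. Whether that combination is an error or a genuine rare case is left to the expert, whose decision could then be recorded as an edit rule.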
For more information, contact Philip Bell at Methodology@abs.gov.au.