MACHINE LEARNING FOR EDITING AND IMPUTATION OF BIG DATA
Within statistical agencies, a large amount of human effort and time goes into checking and correcting data before publication. As we gain access to new "big data" sources, the amount of data available may increase dramatically, making it infeasible to apply consistent human scrutiny to all of it. This could lead to missed opportunities to use existing datasets to inform public decision making.
A new project is looking at how humans and computers can work together to treat potential errors in big data. The computer will identify the types of variables and their relationships in a dataset, using approaches that are robust to common types of errors. It will then suggest rules that identify apparently anomalous values. Human input will be in identifying the right rules to use, rather than in fixing individual cases. The computer will apply these decisions across the dataset and report on the impact of changes.
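The division of labour described above can be illustrated with a minimal sketch. It is not the project's method: the function names, the toy data and the cut-off k are all assumptions made for illustration. The median and the median absolute deviation (MAD) stand in for "approaches that are robust to common types of errors", because a few gross errors barely move them.

```python
from statistics import median

def infer_type(values):
    """Classify a column as 'numeric' or 'categorical', ignoring missing values."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                       for v in present):
        return "numeric"
    return "categorical"

def suggest_range_rule(values, k=5.0):
    """Suggest (low, high) bounds as median +/- k * MAD (k is an assumed cut-off)."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    return (med - k * mad, med + k * mad)

def apply_rule(values, bounds):
    """Return the positions of non-missing values outside the approved bounds."""
    low, high = bounds
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

# Hypothetical toy data: weekly hours worked, with one keying error (400 for 40).
hours = [38, 40, 35, 42, 400, 37, None, 40, 36, 41]

rule = suggest_range_rule(hours)     # the computer proposes a rule
flagged = apply_rule(hours, rule)    # applied only once a human accepts the rule
print(f"type={infer_type(hours)}, rule={rule}, "
      f"flagged {len(flagged)} of {len(hours)} records")
```

The human decision here is a single one (accept or reject the suggested range), and the final line is the kind of impact report the computer would return after applying that decision to every record.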
The project will start by encoding the decisions that humans make easily when looking over a dataset. Unlike a human, the computer can apply these decisions consistently across the whole dataset. The work will identify a set of data features that are useful in understanding items and their relationships, and show how these can be used as a basis for decisions. At a later stage of the project, this basic scrutiny will be improved and supplemented by machine learning of the important relationships in the data. The learned relationships will also allow imputation of values that are missing or were identified as anomalous.
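As a rough sketch of how a relationship between two variables could drive imputation, the example below fills missing or flagged values from the median ratio observed in the remaining records. The median ratio is only a simple stand-in for whatever learned model the project eventually adopts, and the function name, variable names and figures are hypothetical.

```python
from statistics import median

def impute_by_ratio(x, y, flagged):
    """Fill missing or flagged y values using the median y/x ratio of the other records.

    Assumes x is present and non-zero for every record that needs imputing.
    """
    bad = set(flagged)
    # Records used to estimate the relationship: not flagged, y present, x usable.
    clean = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y))
             if i not in bad and yi is not None and xi]
    ratio = median(yi / xi for xi, yi in clean)
    return [ratio * xi if (i in bad or yi is None) else yi
            for i, (xi, yi) in enumerate(zip(x, y))]

# Hypothetical toy data: employees and weekly wages bill for five businesses.
employees = [10, 20, 30, 40, 50]
wages = [500, 1000, None, 2000, 99999]   # one missing value, one anomalous value

imputed = impute_by_ratio(employees, wages, flagged=[4])
print(imputed)
```

Using the median rather than the mean of the ratios keeps the estimate stable if some erroneous records slip past the earlier checks.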
The United Nations Economic Commission for Europe (UNECE) High Level Group for the Modernisation of Official Statistics (HLG-MOS) has recently instituted a Machine Learning project, and the ABS has proposed this project as a pilot study. For this purpose it would be useful to identify a publicly available dataset to be used in the project, ideally one that is big (with many records and many variables) and that may contain incorrect data. Suggestions are welcome; such a dataset could become part of the ABS pilot study.
For more information, please contact Phil Bell at Methodology@abs.gov.au.