This issue contains one article:
- Methods for the Repair of Administrative Data
Features important work and developments in ABS methodologies
Administrative datasets have the potential to provide valuable statistics at fine levels of disaggregation. However, statistics based on these secondary data sources will typically contain biases due to representation and measurement errors. In the September 2021 edition of Methodological News we described early research into the potential of masked deep neural networks (DNNs) for repairing biased administrative data where the biases are of a missing-not-at-random nature. Such repair depends on the availability of a sample survey that collects corresponding data items, but it does not require that the survey and administrative data can be linked. This article provides an update on that work.
Two alternative methods were investigated. Both use masked DNNs to estimate the approximate joint distribution ('structure summary') of the administrative dataset. The 'Transfer Learning' approach fine-tunes the administrative data structure summary using the sample survey data and simulates administrative data values using the fine-tuned model. The 'Distributional' approach estimates the structure summary of the survey data and uses both estimated structure summaries to minimally adjust the observed administrative data values towards the sample data model. The distributional approach was applied to simulated categorical data, and the transfer learning approach to real-world continuous economic data, in a collaboration with the Centre for Data Science at QUT.
The simulations indicated that the distributional approach can effectively mitigate bias and yield low-variance disaggregated statistics so long as the true underlying structure summaries are known and well estimated. However, this is unrealistic in practice and, in particular, the uncertainty in estimating the sample survey structure summary is problematic. The key issue with the transfer learning approach is how to determine the extent to which the administrative data model should be altered by the representative but smaller sample survey data. In both approaches, reflecting the sampling error in the adjusted estimates is difficult, and we do not at present have a computationally feasible solution.
The challenges posed by the use of masked DNNs for repair of administrative data have also led to examination of alternative approaches. These include the method described by Kim and Tam (2020). This method is powerful, although it depends on knowing whether each unit in the representative sample also belongs to the administrative dataset (e.g. as determined by unit record linkage or semi-supervised classification).
Another alternative being investigated is a calibration approach to the repair of biased administrative data. As with the masked DNN approach, a representative sample is assumed to be available, but record linkage or classification is not required. This method uses the standard calibration weighting machinery typically used in producing estimates for survey statistics. A preliminary step is to weight the administrative data to adjust for under-coverage and, if present, any time lag between the sample and administrative data collections; the result is referred to as the 'weight-adjusted' estimator. Calibrated weight and value adjustments are then determined so that the weighted estimates of means or percentiles for the target variable(s) on the administrative dataset match their corresponding weighted sample estimates, producing a 'value-adjusted' estimator. A composite weighted estimator may then be constructed by combining the high-bias, low-variance 'weight-adjusted' estimator with the low-bias, high-variance 'value-adjusted' estimator. Simulations suggest that the disaggregated estimates using this method have significantly smaller mean squared errors (MSEs) than those based on the sample data alone. Further work is being conducted to look at ways of creating a single set of calibrated weight and value adjustments given multiple target variables.
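The calibration steps described above can be illustrated with a minimal sketch. It is not the ABS implementation. It assumes the value adjustment is a single additive shift that aligns the weighted administrative mean with the sample estimate, and that the composite weight is chosen to minimise MSE while treating the two estimators as uncorrelated and the value-adjusted estimator as unbiased. All function names, the toy data, and the bias and variance inputs are hypothetical.

```python
import numpy as np


def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights`."""
    return float(np.average(values, weights=weights))


def value_adjust(y_admin, w_admin, sample_mean):
    """Shift every administrative value by one constant so that the
    weighted administrative mean equals the sample estimate.
    ASSUMPTION: a constant additive shift is the simplest possible
    value adjustment; the article does not specify its form."""
    shift = sample_mean - weighted_mean(y_admin, w_admin)
    return np.asarray(y_admin, dtype=float) + shift


def composite(est_weight_adj, est_value_adj, bias_sq, var_weight_adj, var_value_adj):
    """Combine the two estimators with the MSE-minimising weight,
    ASSUMING the value-adjusted estimator is unbiased and the two
    estimators are uncorrelated (a simplification)."""
    lam = var_value_adj / (var_value_adj + bias_sq + var_weight_adj)
    return lam * est_weight_adj + (1.0 - lam) * est_value_adj


# Hypothetical toy data: four administrative records in two domains.
y_admin = np.array([10.0, 12.0, 20.0, 22.0])
w_admin = np.array([2.0, 1.0, 1.0, 2.0])  # coverage-adjusted weights (assumed)
domain = np.array([0, 0, 1, 1])
sample_mean = 18.0                         # weighted survey estimate (assumed)

# Value adjustment is calibrated at the aggregate level ...
y_adj = value_adjust(y_admin, w_admin, sample_mean)

# ... and then carried down to a disaggregated estimate for domain 0.
in_d0 = domain == 0
est_w = weighted_mean(y_admin[in_d0], w_admin[in_d0])  # 'weight-adjusted'
est_v = weighted_mean(y_adj[in_d0], w_admin[in_d0])    # 'value-adjusted'
est_c = composite(est_w, est_v, bias_sq=4.0, var_weight_adj=0.0, var_value_adj=4.0)
print(est_w, est_v, est_c)
```

In practice the squared bias and the variances would themselves need to be estimated, and the calibration could target percentiles as well as means; the sketch fixes them as inputs only to show how the composite trades bias against variance.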
Further ABS research into the use of masked DNNs for administrative data repair will be discontinued for now. However, the potential of masked DNNs for modelling and imputation continues to be investigated. Meanwhile, with administratively sourced data continuing to grow in importance for the production of official statistics, the ABS will continue to actively explore other methods for administrative data repair.
For more information, please contact Sean Buttsworth.
Please email firstname.lastname@example.org to:
Alternatively, you can post to:
Methodological News Editor
Australian Bureau of Statistics
Locked Bag No. 10
Belconnen ACT 2617
Releases from June 2021 onwards can be accessed under research.
Releases up to March 2021 can be accessed under past releases.