WEIGHTING AND IMPUTATION FOR MISSING DATA
In preparation for the transformation program under way at the ABS, research was undertaken into the weighting and imputation methods available for dealing with missing data.
A major focus of the research was two-phase calibration estimators for treating non-response. A wide variety of such estimators were tested on simulated household and business surveys. The project recommended the response propensity calibration estimator as the first-choice weighting method for treating missing data, as it was the simplest of the best-performing estimators. In this estimator, a model is fitted to estimate the response probability of each responding unit, and the initial weight of each responding unit is then multiplied by the inverse of its estimated response probability.
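The weight adjustment step described above can be illustrated with a minimal sketch. It assumes the simplest possible response model, a weighted response rate within each class, as a stand-in for whatever model the research actually fitted. It also omits any later calibration to benchmarks. The function name, the record layout and the class labels are hypothetical.

```python
from collections import defaultdict

def propensity_adjusted_weights(units):
    """Return {id: adjusted weight} for responding units.

    Each unit is a dict with keys 'id', 'cls', 'weight', 'responded'.
    The response propensity is estimated as the weighted response rate
    within each class (an assumed, deliberately simple model).
    """
    sampled = defaultdict(float)    # total initial weight per class
    responded = defaultdict(float)  # initial weight of respondents per class
    for u in units:
        sampled[u["cls"]] += u["weight"]
        if u["responded"]:
            responded[u["cls"]] += u["weight"]

    adjusted = {}
    for u in units:
        if u["responded"]:
            propensity = responded[u["cls"]] / sampled[u["cls"]]
            # Initial weight multiplied by the inverse response propensity.
            adjusted[u["id"]] = u["weight"] / propensity
    return adjusted

sample = [
    {"id": 1, "cls": "small", "weight": 10.0, "responded": True},
    {"id": 2, "cls": "small", "weight": 10.0, "responded": False},
    {"id": 3, "cls": "large", "weight": 2.0, "responded": True},
    {"id": 4, "cls": "large", "weight": 2.0, "responded": True},
]
weights = propensity_adjusted_weights(sample)
```

In this toy sample the single responding small unit takes on the weight of its non-responding neighbour, so the adjusted weights still sum to the total initial weight of the sample.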
An issue with this estimator is that a very small estimated response propensity produces a very large weight, which can give some responses a very large influence on the final estimates. The research presented a modified boxplot method that appears to provide a suitable way of trimming extreme weight adjustments derived from estimated response probabilities.
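The details of the modified boxplot method are not given here, so the sketch below uses the standard Tukey boxplot rule instead, purely to show the general idea of capping extreme adjustments at an upper fence. The function name, the multiplier `k = 1.5` and the example values are assumptions, not the method the research recommended.

```python
import statistics

def trim_adjustments(adjustments, k=1.5):
    """Cap weight adjustments at the upper boxplot fence Q3 + k * IQR.

    This is the ordinary Tukey rule, used as an illustrative stand-in
    for the (unspecified) modified boxplot method.
    """
    q1, _, q3 = statistics.quantiles(adjustments, n=4)
    upper_fence = q3 + k * (q3 - q1)
    trimmed = [min(a, upper_fence) for a in adjustments]
    return trimmed, upper_fence

# Inverse response propensities; the last one comes from a very small
# estimated propensity and would dominate the estimates if left alone.
adjustments = [1.1, 1.2, 1.25, 1.3, 1.4, 1.5, 1.6, 9.0]
trimmed, fence = trim_adjustments(adjustments)
```

Trimming reduces the sum of the weights, so in practice the trimmed amount is often redistributed or the weights are recalibrated afterwards. That step is left out of this sketch.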
A review of the imputation methods currently used in ABS business surveys recommended rationalising the large number of methods available. Considering the deterministic imputation methods, the review found that the 39 imputation methods currently available could be replaced by three: the more general Deterministic Regression Imputation Method, the Deterministic Nearest Neighbour Donor Imputation Method and a Zero Imputation Method (i.e. a method that sets missing values to 0).
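To make the three general methods concrete, the following sketch shows a simple version of each. It assumes a single auxiliary variable, ordinary least squares for the regression, and absolute distance on that variable for the nearest neighbour search. The actual specifications of the ABS methods are not given here, and the function names, field names and data are hypothetical.

```python
def regression_impute(records, x_key, y_key):
    """Fill missing y with the fitted value from a least-squares line on x."""
    obs = [(r[x_key], r[y_key]) for r in records if r[y_key] is not None]
    n = len(obs)
    mean_x = sum(x for x, _ in obs) / n
    mean_y = sum(y for _, y in obs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [r[y_key] if r[y_key] is not None else intercept + slope * r[x_key]
            for r in records]

def nearest_neighbour_impute(records, x_key, y_key):
    """Fill missing y with the y of the reporting unit closest on x."""
    donors = [r for r in records if r[y_key] is not None]
    filled = []
    for r in records:
        if r[y_key] is None:
            donor = min(donors, key=lambda d: abs(d[x_key] - r[x_key]))
            filled.append(donor[y_key])
        else:
            filled.append(r[y_key])
    return filled

def zero_impute(records, y_key):
    """Set missing y values to 0."""
    return [0 if r[y_key] is None else r[y_key] for r in records]

# Hypothetical business records: the last unit did not report turnover.
businesses = [
    {"employees": 10, "turnover": 100.0},
    {"employees": 20, "turnover": 200.0},
    {"employees": 30, "turnover": 300.0},
    {"employees": 24, "turnover": None},
]
by_regression = regression_impute(businesses, "employees", "turnover")
by_donor = nearest_neighbour_impute(businesses, "employees", "turnover")
by_zero = zero_impute(businesses, "turnover")
```

All three functions are deterministic: the same input always yields the same imputed value, which is what distinguishes them from the stochastic methods discussed for item missingness by design.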
Some surveys do not ask all questions of all units; for example, smaller businesses may not be asked some questions because they are unlikely to have the required information readily available. This is referred to as item missingness by design. Imputation for these missing values adds a non-negligible amount to totals and, because the imputation methods contain variability, this variability needs to be estimated to give a good final estimate of variance. It was found that the imputation should therefore be done using multiple stochastic imputation methods, although in some situations the missing data could be treated using weight adjustment (i.e. the units with missing data are dropped entirely and the remaining units are weighted to compensate). Weight adjustment significantly reduces the number of replicate weights that need to be computed and stored. However, it does not produce good estimates for domains that are not benchmarked to (e.g. if benchmarks are at the Australia by industry level, the state estimates will be poor) or for variables that were reported by the dropped units (i.e. where useful responses were deleted when the units were dropped).
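One common way to carry imputation variability into the final variance estimate is Rubin's combining rules for multiple imputation, sketched below. The text above does not say which variance estimator the research used, so this is an assumption chosen for illustration only. The function name and the numbers are hypothetical.

```python
import statistics

def combine_imputations(estimates, within_variances):
    """Combine results from m stochastically imputed datasets.

    estimates: the point estimate from each imputed dataset.
    within_variances: the sampling variance estimate from each dataset.
    Returns (combined estimate, total variance) using Rubin's rules:
    total = mean within-variance + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    point = statistics.fmean(estimates)
    within = statistics.fmean(within_variances)
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return point, total

# Three imputed datasets giving slightly different totals.
point, total_variance = combine_imputations(
    estimates=[100.0, 104.0, 96.0],
    within_variances=[4.0, 4.0, 4.0],
)
```

The between-imputation term is what a single deterministic imputation cannot supply, which is why the totals affected by item missingness by design call for repeated stochastic imputations.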
A research and evaluation study examined appropriate methods for imputing categorical data items in ABS household surveys. It found that CHAID (Chi-squared Automatic Interaction Detection) could be used to quickly and automatically identify useful groups for hot-deck imputation. In the simulation study, the CHAID-selected groups identified using the particular study variable always produced the highest percentage of correct imputations, and often produced the smallest relative root mean square errors. In almost all situations, using CHAID-selected groups performed better than using manually selected groups.
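The hot-deck step can be sketched as follows, assuming the imputation groups have already been formed (for example by CHAID, which is not implemented here). Within each group a missing categorical value is replaced by the value of a randomly chosen reporting unit from the same group. The function name, field names and category labels are hypothetical.

```python
import random
from collections import defaultdict

def hot_deck_impute(records, group_key, item_key, seed=0):
    """Random hot-deck imputation within pre-defined groups.

    Returns new records with missing item values replaced by a donor
    value drawn from the same group. Raises IndexError if a group with
    missing values contains no donors.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    donors = defaultdict(list)
    for r in records:
        if r[item_key] is not None:
            donors[r[group_key]].append(r[item_key])

    imputed = []
    for r in records:
        value = r[item_key]
        if value is None:
            value = rng.choice(donors[r[group_key]])
        imputed.append({**r, item_key: value})
    return imputed

# Hypothetical household records with groups assigned beforehand.
people = [
    {"group": "A", "labour_force": "employed"},
    {"group": "A", "labour_force": "employed"},
    {"group": "A", "labour_force": None},
    {"group": "B", "labour_force": "not in labour force"},
    {"group": "B", "labour_force": None},
]
completed = hot_deck_impute(people, "group", "labour_force")
```

The more homogeneous the groups are with respect to the item being imputed, the more likely a donor's value matches the true value, which is the sense in which well-chosen groups raise the percentage of correct imputations.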