Improving prediction of dwelling occupancy for Census 2021
For some non-responding dwellings in the Census, we do not have good information about whether they were occupied on Census night. For these dwellings, a prediction about occupancy status must be made, and dwellings deemed to be occupied have occupants imputed to them. In the 2016 Census these predictions were not based on a model, and occupants were erroneously imputed to some non-responding dwellings. The Post Enumeration Survey adjusted for over- and under-count at broad levels, but these imputations still contributed to inflated small area population counts, particularly in areas with large numbers of secure apartment buildings.
For Census 2021, a prediction model has been developed to better assign an occupancy status to non-responding dwellings. Occupancy labels from the 2016 Census were used to train a model incorporating predictors from administrative data sources linked at the dwelling level. Key predictors were indicators of whether a dwelling appeared on the Personal Income Tax, Medicare or Social Services datasets. Other useful predictors included area-level occupancy counts from the previous Census, dwelling type and geography. An extreme gradient boosting algorithm, implemented in R, was used, and the models were trained, tuned and evaluated separately for each state.
We found that these models performed well at predicting long-term unoccupied dwellings, with holiday and rental homes being key examples. Prediction was less successful for short-term unoccupied dwellings, i.e. those whose occupants were away only around the date of the Census.
For an increasing number of regions, Smartmeter electricity data is available and is a strong indicator of both long- and short-term dwelling occupancy. Due to privacy concerns, this data was not used at the dwelling level. Instead, the Smartmeter data was used to derive a predicted count of occupied dwellings for each area, and these area-level counts were used as an input to the models.
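As an illustration only, the sketch below shows one way an area-level count could be attached to dwelling records as a model input without using any dwelling-level electricity data. The field names (`area_code`, `area_smartmeter_occupied_count`), the example counts and the function name are hypothetical. How the area counts were actually derived is not described here, so they are treated as given.

```python
def add_area_count_feature(dwellings, area_counts):
    """Attach an area-level predicted occupied-dwelling count to each dwelling.

    dwellings:   list of dicts, each with a hypothetical "area_code" key.
    area_counts: dict mapping area code -> predicted count of occupied
                 dwellings (assumed to be derived elsewhere from aggregate
                 electricity data).
    Areas without coverage receive None, since the data is only available
    for some regions.
    """
    enriched = []
    for dwelling in dwellings:
        row = dict(dwelling)  # copy so the input records are not modified
        row["area_smartmeter_occupied_count"] = area_counts.get(dwelling["area_code"])
        enriched.append(row)
    return enriched


# Hypothetical example data
dwellings = [
    {"dwelling_id": 1, "area_code": "A1"},
    {"dwelling_id": 2, "area_code": "A1"},
    {"dwelling_id": 3, "area_code": "B7"},  # area with no Smartmeter coverage
]
area_counts = {"A1": 120}

enriched = add_area_count_feature(dwellings, area_counts)
```

Every dwelling in the same area receives the same value, so the feature carries no information about an individual dwelling's electricity use.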
The predictions from the models take the form of probabilities of individual dwellings being occupied. In determining how best to use these probabilities, two goals were considered: (i) classifying occupancy status for individual dwellings and (ii) obtaining accurate occupancy counts for small areas. The optimal approaches for these two goals were at odds. For (i), applying a probability cut-off value to each dwelling is optimal, but for (ii) the optimal approach is to sum the predicted dwelling probabilities over the whole small area. A compromise was chosen: a stochastic process that classifies each dwelling as occupied with probability equal to its predicted occupancy probability. This approach yields expected area counts equivalent to the optimal approach for (ii), while still assigning a status to every dwelling and being simple to implement within Census processing systems.
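The trade-off between the two goals can be illustrated with a minimal sketch. The probabilities, the 0.5 cut-off, the random seed and the function names below are all hypothetical, and the stochastic step is modelled as an independent Bernoulli draw per dwelling, which is one plausible reading of the approach rather than a description of the production implementation.

```python
import random


def classify_by_cutoff(probs, cutoff=0.5):
    """Goal (i): classify each dwelling as occupied if its probability meets the cut-off."""
    return [p >= cutoff for p in probs]


def expected_occupied_count(probs):
    """Goal (ii): the expected number of occupied dwellings is the sum of the probabilities."""
    return sum(probs)


def classify_stochastically(probs, rng):
    """Compromise: mark each dwelling occupied with probability equal to its prediction."""
    return [rng.random() < p for p in probs]


# Hypothetical predicted probabilities for the dwellings in one small area
probs = [0.9, 0.8, 0.6, 0.6, 0.6, 0.3, 0.2]
rng = random.Random(2021)  # fixed seed so the sketch is reproducible

cutoff_count = sum(classify_by_cutoff(probs))
expected_count = expected_occupied_count(probs)

# Repeat the stochastic classification many times to estimate its average area count
draws = [sum(classify_stochastically(probs, rng)) for _ in range(20000)]
mean_stochastic_count = sum(draws) / len(draws)

print(cutoff_count, expected_count, round(mean_stochastic_count, 2))
```

In this example the cut-off classifies five dwellings as occupied although the probabilities sum to four, so the area count is overstated. The stochastic classification averages close to four across repeated draws, while any single draw still gives each dwelling a definite status.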
The prediction model and the use of aggregate (area-level) electricity data for determining dwelling occupancy status in the 2021 Census are still under investigation. However, early indications are that these models will significantly improve area-level dwelling occupancy counts, particularly for areas with high numbers of secure apartment buildings. When combined with an improved method for imputing occupants to non-responding dwellings, these models are expected to improve the quality of Census 2021 person counts.
For more information, please contact Sean Buttsworth at methodology@abs.gov.au.