1504.0 - Methodological News, Dec 2010  
Released at 11:30 AM (Canberra time) 09/12/2010

Methods for Imputing Age in the Census

Imputing age for partial non-respondents in the Census is a difficult problem. Age is related to almost every item in the Census, including education, employment, disability and income. It is important to impute an age that is consistent with other responses the respondent may have given. It is also important that the age of a respondent be consistent with their position in their household and with the ages of other members of the household (so that parents have a realistic age with respect to their children and so on).

The current method of Census age imputation has been in use for many Censuses and has some unfortunate properties. It depends on age distributions from the previous Census, which are five years out of date, and it tends to produce too many imputed values around 'threshold' ages associated with particular characteristics (for instance, age five, when most children start school). This method was scheduled for review and replacement in the 2011 Census, but budget cuts delayed the review. The Methodology Development Unit (MDU) is now looking at possible methods for replacing it in the 2016 Census.

Donor-based methods (hotdecking) are in use for imputing other items in the Census. These work well for imputing age in non-contact dwellings, because ages drawn from a responding Census dwelling are automatically consistent within the dwelling. Hotdecking is not suitable for partial non-response, because it is very difficult to find donors that are consistent with all the additional information the respondent has already provided.
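To make the donor-matching difficulty concrete, the following is a minimal sketch of hotdecking within imputation classes. It is an illustration only, not the Census procedure: the function name, the field names and the toy records are all assumptions. A recipient receives an age only if some donor agrees with it on every matching field, so each extra field used for matching shrinks the pool of eligible donors.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, match_keys, target="age", rng=None):
    """Fill missing `target` values by copying from a random donor that
    agrees with the recipient on every field in `match_keys`.

    Hypothetical helper for illustration; returns new records and leaves
    the input untouched.
    """
    rng = rng or random.Random()

    # Pool donor values by imputation class (the tuple of matching fields).
    donors = defaultdict(list)
    for rec in records:
        if rec.get(target) is not None:
            donors[tuple(rec[k] for k in match_keys)].append(rec[target])

    imputed = []
    for rec in records:
        out = dict(rec)
        if out.get(target) is None:
            pool = donors.get(tuple(rec[k] for k in match_keys))
            # No donor in this class: the value stays missing.
            if pool:
                out[target] = rng.choice(pool)
        imputed.append(out)
    return imputed


# Toy data (invented): two fields stand in for the many Census items.
people = [
    {"student": True, "employed": False, "age": 9},
    {"student": True, "employed": False, "age": 11},
    {"student": False, "employed": True, "age": 41},
    {"student": True, "employed": False, "age": None},  # donors exist
    {"student": True, "employed": True, "age": None},   # no matching donor
]
result = hot_deck_impute(people, ["student", "employed"], rng=random.Random(0))
```

In this toy data the fourth person receives an age of 9 or 11 from a matching donor, while the fifth is left without an impute because no donor shares both characteristics. With the full set of Census items as matching fields, empty donor pools of this kind become the common case.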

Because of the complexities associated with age imputation, we are looking instead to emerging techniques from the field of data mining, or predictive analytics. These methods allow complex models to be constructed with minimal input from the analyst. We have already conducted some experiments with a method called Bayesian Additive Regression Trees (BART), a recent addition to the regression tree family of models in which many small trees are generated and then added together to give an overall model. Using many small trees allows the model to approximate complex relationships, including interactions and additive effects. BART is implemented in an R package, which is very convenient. Unfortunately, R does not handle large volumes of data well, so only very limited testing could be done on Census data. The performance of BART in creating suitable age imputes was also less than satisfactory: in particular, it imputed a large number of ages that were inconsistent with the reported Year of Arrival.
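The sum-of-trees form described above can be sketched in a few lines. This shows only how the prediction is assembled, not how BART fits it: in BART the tree structures and leaf values are drawn from a posterior distribution by MCMC, whereas the three trees, the baseline and every leaf value below are invented for illustration, as are the field names.

```python
def tree_school(rec):
    # A single-split tree on one variable: a purely additive effect.
    return -25.0 if rec["attends_school"] else 5.0


def tree_work_income(rec):
    # A two-level tree: income matters only among the employed,
    # which is an interaction between employment and income.
    if rec["employed"]:
        return 8.0 if rec["income"] > 50_000 else 0.0
    return -3.0


def tree_children(rec):
    # Another single-split tree.
    return 6.0 if rec["has_children"] else -2.0


TREES = [tree_school, tree_work_income, tree_children]
BASELINE = 37.0  # hypothetical overall mean age


def predict_age(rec, trees=TREES, baseline=BASELINE):
    """Sum-of-trees prediction: the baseline plus each tree's leaf value."""
    return baseline + sum(tree(rec) for tree in trees)


adult = {"attends_school": False, "employed": True,
         "income": 72_000, "has_children": True}
child = {"attends_school": True, "employed": False,
         "income": 0, "has_children": False}

print(predict_age(adult))  # 56.0
print(predict_age(child))  # 7.0
```

Each tree contributes only a small adjustment, and nothing in the sum prevents the total from landing outside a range implied by another reported item.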

While there are options for tailoring BART to give more suitable imputes, the nature of the model underpinning BART means we are unlikely ever to avoid these sorts of edit failures completely. We are currently looking into other methods.
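An edit failure of the kind mentioned above can be expressed as a simple consistency check. The sketch below assumes a rule that a person cannot be younger than the number of years since they arrived, with one year of slack for birthdays that fall after Census night. The actual Census edit rules are not given here, so the rule, the function names and the Census year used in the example are all assumptions.

```python
def min_age_from_arrival(year_of_arrival, census_year):
    """Lowest age consistent with having arrived in `year_of_arrival`.

    Assumed rule: one year of slack allows for a birthday that falls
    after Census night.
    """
    return max(0, census_year - year_of_arrival - 1)


def passes_arrival_edit(age, year_of_arrival, census_year):
    # A record with no reported Year of Arrival cannot fail this edit.
    if year_of_arrival is None:
        return True
    return age >= min_age_from_arrival(year_of_arrival, census_year)


def enforce_arrival_edit(age, year_of_arrival, census_year):
    """Crude repair: raise a failing impute to the lowest consistent age."""
    if passes_arrival_edit(age, year_of_arrival, census_year):
        return age
    return min_age_from_arrival(year_of_arrival, census_year)


# Hypothetical example: an imputed age of 10 for someone who reported
# arriving in 1990, checked against a 2006 Census.
print(passes_arrival_edit(10, 1990, 2006))   # False
print(enforce_arrival_edit(10, 1990, 2006))  # 15
```

Repairing failures after the fact in this way would pile imputed ages up at the boundary, much like the clustering around 'threshold' ages seen with the current method, which is one reason a method that respects such constraints directly is preferable.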

Further information can be obtained from Claire Clarke on (08) 8237 7468 or claire.clarke@abs.gov.au