How administrative data improved the quality of the 2021 Census

How we used administrative data to improve the quality of the information we collected

Released
20/01/2023

Background

Prior to the 2021 Census, we shared our plans to use administrative data to support the Census. Following the Census, the Statistical Independent Assurance Panel, in its report on the quality of 2021 Census data, was “pleased to note data quality improvements resulting from enhancements to occupancy determination and imputation for non-responding households, most notably as a result of the use of administrative data.”

This article explains what we did and shows how administrative data improved the quality of the 2021 Census.

What is administrative data?

Administrative data is information that government departments, businesses and other organisations collect. These organisations collect information for a range of reasons, such as:

  • registrations
  • sales
  • record keeping.

Some examples of administrative data:

  • personal income tax information from the Australian Taxation Office
  • information about the number of people who use Medicare from the Department of Health.

We only collect and use administrative data for statistics and research. We don't share or release this information in a way that could identify anyone.

Methods to improve the quality of Census counts

To make sure that we count all Australians in the Census, we need to:

  • decide whether a house was occupied on Census night, when we didn’t receive a Census form
  • adjust the count for people who were missed.

Results from the 2016 Post Enumeration Survey showed that we don’t always get these decisions right: we might think a house was occupied on Census night when it was actually empty, or we might adjust the count by adding people of the wrong age. For the 2021 Census, we used information from administrative data to improve our methods.

Assessing which houses were empty (occupancy determination)

For houses where we didn’t receive a Census form and our field staff couldn’t work out if the house was occupied, we used administrative data to help us decide whether the house was empty or occupied on Census night. We explained the way we did this in a previous article, Using administrative data to improve the Census count.

We made one improvement to the approach described above: we added publicly available information about rental vacancies. If a house was listed for rent around Census time, then we were more likely to decide it was empty.

For most houses, we had good information about whether they were occupied or vacant, either from a Census form or from our field staff. We only used our statistical model to help decide occupancy for 2% (about 218,000) of all houses. The model predicted that almost three-quarters of these (72% or about 156,000) were empty.

In the states and territories that were impacted by COVID-19 lockdowns around Census time, we used our model to set occupancy for more houses (2.5% for New South Wales and Victoria, and 2.2% for the ACT) than in those that were not in lockdown (from 0.9% in Tasmania to 1.9% in the Northern Territory). This is likely because field staff found it more difficult to work out whether a house was occupied when they couldn’t knock on any doors (Census field work was contactless where COVID-19 lockdowns were in place).
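The article does not publish the occupancy model itself, but the idea of combining administrative signals into an occupancy decision can be sketched as follows. This is a minimal illustration, not the ABS model: the feature names (rental listing, electricity use, service use), the weights, the logistic form and the threshold are all assumptions.

```python
import math

# Hypothetical sketch only: feature names and weights are illustrative
# assumptions, not the ABS occupancy model.
WEIGHTS = {
    "listed_for_rent": -2.0,        # a rental-vacancy listing suggests the house was empty
    "recent_electricity_use": 1.5,  # recent electricity use suggests it was occupied
    "recent_service_use": 1.2,      # recent government-service activity suggests occupancy
}
BIAS = 0.3

def occupancy_score(signals: dict) -> float:
    """Combine administrative signals into a probability-like score
    that a non-responding dwelling was occupied on Census night."""
    z = BIAS + sum(WEIGHTS[name] for name, present in signals.items() if present)
    return 1.0 / (1.0 + math.exp(-z))  # logistic transform

def predict_occupied(signals: dict, threshold: float = 0.5) -> bool:
    return occupancy_score(signals) >= threshold
```

Under these illustrative weights, a dwelling listed for rent with no sign of recent activity scores well below the threshold and would be set to unoccupied, while one with recent electricity and service use scores well above it.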

Adjusting the Census count (imputation)

For houses where we didn’t receive a Census form and the house was occupied (based on field information or our model), we needed to adjust our count for the people we missed. Like previous Censuses, we did this using a process called imputation. Imputation is where we copy basic Census information (number of people with their age and sex) from another similar household to represent the missed people.

For the 2021 Census, we used administrative data to help us choose a representative household (known as a ‘donor’) where the people are more similar in age to those who were missed. We described our method in this article Using administrative data to improve the Census count.

For the 2021 Census, there were about 379,000 (3.5%) dwellings where we needed to make up for missed people. This was down from around 430,000 (4.4%) dwellings in the 2016 Census.
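The donor-selection idea above can be sketched in code. This is a hypothetical illustration, assuming we hold administrative age estimates for the missed household and a pool of responding households to draw from; the data layout and the distance measure are assumptions for this sketch, not the ABS method.

```python
# Illustrative hot-deck imputation: copy basic demographics from the most
# similar responding household (the "donor"), matching on administrative
# age estimates for the missed household.

def choose_donor(target_admin_ages, donors):
    """Pick the donor household whose ages best match the administrative
    age estimates for the missed household."""
    def age_distance(donor_ages):
        if len(donor_ages) != len(target_admin_ages):
            return float("inf")  # require the same household size
        return sum(abs(a - b)
                   for a, b in zip(sorted(donor_ages), sorted(target_admin_ages)))
    return min(donors, key=lambda donor: age_distance(donor["ages"]))

# Hypothetical donor pool of responding households
donors = [
    {"id": 1, "ages": [34, 36]},
    {"id": 2, "ages": [67, 70]},
    {"id": 3, "ages": [35, 33, 5]},
]

# Administrative data suggests the missed household held two older adults,
# so the older couple (donor 2) is chosen rather than a random household.
best = choose_donor([68, 71], donors)
```

Without the administrative age signal, any same-sized responding household could serve as the donor, which is how older age groups ended up overadjusted in 2016.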

Measuring improvements in the quality of Census counts

Improvements in how we adjusted for people we missed (occupancy determination and imputation)

Results from the 2021 Post Enumeration Survey showed that, by using administrative data, we did a better job of deciding whether non-responding houses were occupied and adjusting the count for the people we missed. Together, these improvements delivered Census counts that were more accurate. This was especially true for counts in inner city areas and for older Australians.

First, the 2016 Post Enumeration Survey showed that we set too many non-responding houses to occupied and, as a result, we added about 320,000 too many people when we adjusted the count (about 1.4% of the population). In the 2021 Post Enumeration Survey this reduced to about 230,000 too many people (about 0.9% of the population).

While this is only a small change at a national level, improvements were much more significant for inner city high-rise areas where it’s hard to tell if people are home. For example, Figure 1 below shows the proportion of non-responding houses that were set to occupied for the whole of Sydney, the Inner-Sydney SA4 and the Potts Point SA2. The greatest difference was for the Potts Point SA2 in central Sydney where high-rise apartments are very dense and where we probably set too many non-responding houses to occupied in 2016.

A second major improvement to the Census counts was in the age profile, where administrative data helped us choose donor houses with people who were more similar in age to the people who were missed. You can see in Figure 2 that the 2016 Post Enumeration Survey shows we added too many older people in our adjustment for the 2016 Census. In contrast, the 2021 Post Enumeration Survey shows we reduced the net overcount in these older age groups, making it more similar across all adult age groups.

Another way we can see the improvements in 2021 Census data is when we compare the Census population with the Census night population estimate from the Post Enumeration Survey. In 2016, the count of Census people under 35 was noticeably lower than the estimate from the Post Enumeration Survey (gap between shaded region and line in Figure 3a) and the count of Census people aged 55-74 was higher than the estimate from the Post Enumeration Survey (shaded region above the line in Figure 3a).

Note: Excludes Other Territories and overseas visitors.

In 2021, the count of Census people under 35 was closer to the estimate from the Post Enumeration Survey (smaller gap between the shaded region and line in Figure 3b) and the count of Census people aged 55-74 was closer to the estimate from the Post Enumeration Survey (very little shading above the line in Figure 3b).

Note: Excludes Other Territories and overseas visitors.

In summary, using administrative data helped to improve the 2021 Census count by:

  • reducing the overadjustment for missing persons, particularly for inner city high-rise areas
  • giving a more balanced adjustment for missing people across age groups, reducing the overadjustment for older people.

Checking Census data quality (quality assurance)

We used administrative data as an independent check for Census counts and occupancy rates (the proportion of houses that are occupied). This helped give us confidence that the Census data was accurate, particularly when it wasn’t what we expected.

In some areas, the occupancy rate for the 2021 Census was quite different to the occupancy rate in the 2016 Census. We predicted occupancy rates for these areas using administrative data and these rates generally matched the Census data, giving us confidence that the data was accurate.

We also used administrative data to check Census counts when they didn’t match what we expected from our official measure of Australia’s population at that time (the Estimated Resident Population, or ERP). The Census counts and ERP matched well for all of Australia, but we saw some differences in specific areas. For example, the Census count for people aged 25-39 in Tasmania was higher than we expected from ERP (difference between the dark blue and orange lines in Figure 4). When we looked at the administrative data (light blue line), the count of people aged 25-39 in Tasmania was higher than ERP as well. This gave us confidence that the Census data was accurate.

  1. Estimated Resident Population (ERP) is unrebased

Enhancing Census with administrative data

Preparing for unexpected events

When we shared our plans to use administrative data to support the Census, we included information on preparing for unexpected events. We developed and tested a method for using administrative data to fill gaps in Census data. Even though the COVID-19 pandemic was an unexpected event, it didn’t cause any significant gaps in Census data, and we didn’t need to use this method. We are well prepared if we need this in the future.

Enhancing Census income data

We have added to the income data available in the Census using linked administrative data. We used income data from the Australian Taxation Office and the Department of Social Services to provide extra information. This includes:

  • Weekly income earned in the 2020/2021 financial year (in $10 per week categories)
  • Main source of income
  • Main type of government benefit payment.

Like other data we collect, administrative data is collected under Census and statistics laws (the Census and Statistics Act 1905). This means any comparison between income reported in the Census and income in administrative data can only be made for statistical purposes, not for compliance.

When it's available, we will add a link to the analysis here.

Future plans

Given the improvements that integrated administrative data made to the 2021 Census, we will be exploring new ways to use it to support the 2026 Census. This includes:

  • reducing visits to houses that we predict to be vacant around Census time
  • filling more gaps in Census data, particularly for areas where we know it is harder to reach everyone
  • adding to the information on the Census, for example, adding information about people's main source of income or more information on where people lived since the last Census
  • potentially replacing some information currently collected on the Census, for example, removing the income range tick box question and replacing it with more detailed income data sourced from administrative records.

As part of our work to support the 2021 Census, we created two administrative datasets: one about people and one about houses. Like the Census, these are a snapshot of Australian people and houses as of August 2021. We think these datasets could be a useful addition alongside the Census snapshot in helping to understand and improve the lives of Australians. We plan to make these administrative datasets available to researchers in 2023, with release of data cubes in June and possible further releases in the second half of the year.

Appendix - Measuring the quality of our administrative data

The administrative data we used to support the 2021 Census comprised mostly government data, along with rental vacancies data and electricity use information. The government data we used was from the Multi-Agency Data Integration Project (MADIP), which combines data on healthcare, government payments and personal income tax with population demographics.

In the lead-up to the 2021 Census, we released an article, Assessing administrative data quality to enhance the 2021 Census, which demonstrated the quality of this data by comparing it to 2016 population estimates to show how well it would have captured the 2016 population.

In this section we indicate the quality of the data we used to support the 2021 Census by making a similar comparison with rebased 2021 population estimates. Note that this data has received many small improvements throughout the Census dissemination period; here we show a more recent version from late 2022.

Working out which people to include

Administrative records in MADIP cover people who were resident in Australia from January 2006 to June 2021. For this project, we were interested in the population around Census time (we used 30 June 2021, just six weeks prior to Census night).

To identify the people who were alive and living in Australia at Census time, we applied rules (see Table 1) to remove people who had either left the country or had died prior to 30 June 2021.

Table 1: Rules used to scope the administrative data to persons living in Australia at 30 June 2021

| Scoping rule | Measured by | Datasets used |
| --- | --- | --- |
| Remove people recorded as deceased prior to 30 June 2021 | Date of death | Medicare; Social Security; Death Registrations |
| Remove people recorded as having left the country prior to 30 June 2021 | Date of departure; Date of arrival | Travellers |
| Remove people who have not recently used a government service | Use of a government service in the last 1-5 years(a) | Medicare and Pharmaceutical Benefits; Social Security; Immunisation Register; Single Touch Payroll |

(a) Exact duration depends on a person's age
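The scoping rules in Table 1 amount to a filter over person records. The sketch below is illustrative only: the record layout and field names are assumptions, and a flat five-year service-use window stands in for the actual age-dependent window noted in the table.

```python
from datetime import date

REFERENCE_DATE = date(2021, 6, 30)
SERVICE_USE_WINDOW_DAYS = 5 * 365  # assumption: the real window depends on age

def in_scope(person: dict) -> bool:
    """Apply the Table 1 rules: keep a person only if they are not recorded
    as deceased, not recorded as having left the country, and have recently
    used a government service."""
    death = person.get("date_of_death")
    if death and death < REFERENCE_DATE:
        return False  # rule 1: recorded as deceased before 30 June 2021
    departed = person.get("date_of_departure")
    returned = person.get("date_of_arrival")
    if departed and departed < REFERENCE_DATE and not (returned and returned > departed):
        return False  # rule 2: left the country and not recorded as returning
    last_use = person.get("last_service_use")
    if not last_use or (REFERENCE_DATE - last_use).days > SERVICE_USE_WINDOW_DAYS:
        return False  # rule 3: no recent government service use
    return True
```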

Working out where to place people

As well as working out which people to include, we applied rules to place people in the most appropriate geographic location. We used location information from a range of administrative data sources, including Medicare, Social Security, Tax and the Immunisation Register. To pick the best location for our time point of interest (30 June 2021), we used information on how recently each location was updated, its precision, and whether it was likely to be a residential location.
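Choosing the best location from several sources is essentially a ranking problem over the three criteria named above. A minimal sketch, assuming hypothetical candidate records carrying those criteria (the field names, precision coding and ranking order are assumptions):

```python
from datetime import date

def best_location(candidates):
    """Choose the address most likely to be where the person lived on
    30 June 2021, preferring residential, precise, recently updated records."""
    def rank(candidate):
        return (
            candidate["is_residential"],  # residential beats non-residential
            candidate["precision"],       # e.g. 2 = full address, 1 = suburb only
            candidate["last_updated"],    # most recently updated wins ties
        )
    return max(candidates, key=rank)

# Hypothetical candidate addresses for one person
candidates = [
    {"source": "Medicare", "is_residential": True, "precision": 2,
     "last_updated": date(2021, 5, 1)},
    {"source": "Tax", "is_residential": False, "precision": 2,
     "last_updated": date(2021, 6, 15)},
    {"source": "Immunisation", "is_residential": True, "precision": 1,
     "last_updated": date(2021, 6, 20)},
]
```

Here the Medicare record wins: although the Tax address is fresher, it is not a residential location, and the Immunisation address is only suburb-level.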

How well does administrative data match the population?

To understand how well our administrative data represents the population, we compare it to the official ABS population count, the Estimated Resident Population (ERP).

National results

The national count of people in our administrative data (25,625,000) closely matches the ERP count (25,688,000), with a difference of around 63,000 people or 0.2% of the population. When we compare the age profiles, administrative data is very close to ERP (Figure 5).
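As a quick arithmetic check of the national figures quoted above:

```python
def pct_difference(admin_count: int, erp_count: int) -> float:
    """Percentage difference of the administrative count relative to ERP."""
    return 100.0 * (admin_count - erp_count) / erp_count

# National counts quoted above
diff = pct_difference(25_625_000, 25_688_000)
# diff is about -0.25, i.e. the administrative count sits roughly 0.2% below ERP
```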

  1. Estimated Resident Population (ERP) is rebased

If we look at the difference between administrative data and ERP more closely, we see that, for most ages, the difference (shown by the light blue line in Figure 6) falls within the range of possible error that is present in ERP (shown by the shaded band). This uncertainty in ERP is introduced when we apply the sample-based adjustment factor for undercounting in the Census, as determined by the Post Enumeration Survey.

Where the percentage difference stays within the band, it is not clear whether the administrative count or ERP is closer to the true population. One clear difference is that there are fewer babies in the administrative data than in ERP. This is due to a delay in babies being registered in the administrative data.

  1. Estimated Resident Population (ERP) is rebased

State and territory results

When we compare the administrative data with ERP at the state and territory level, we see that the count of people in administrative data is close to ERP, being within the margin of error for most states (NSW, Vic, Qld, SA and Tas). The count of people in administrative data for WA, NT and ACT is less than ERP (Figure 7).

  1. Estimated Resident Population (ERP) is rebased
  2. Error bars represent the uncertainty in ERP introduced by sampling error in the Post Enumeration Survey

Comparing in more detail, we see that the administrative data count is close to ERP in all capital cities except Melbourne, Perth and Canberra where it is lower to a significant degree (outside the margin of error). Outside of the capital cities the administrative data count is higher than ERP to a significant degree in regional NSW and Vic and is substantially lower than ERP in regional NT (about 17% lower). The large difference in regional NT is mostly due to an absence of reliable location information for people in the remote Northern Territory in administrative data (Figure 8).

  1. Estimated Resident Population (ERP) is rebased
  2. Error bars represent the uncertainty in ERP introduced by sampling error in the Post Enumeration Survey

Note: Y axis is truncated at -10
