How the data is processed

Latest release
Census methodology
Reference period: 2021
Released: 28/06/2022
Next release: Unknown
First release

Processing overview

The goal for processing the 2021 Census was to ensure the timely release of data while maintaining and improving data quality.

Data processing includes all steps from receipt of Census responses, either online or in paper form, through to the production of a clean Census data file. These steps include:

  1. Data capture
  2. Coding
  3. Frame reconciliation
  4. Derivation
  5. Imputation
  6. Editing
  7. Quality assurance
  8. Introduced random error / perturbation

Data capture

For the 2021 Census, a Data Capture Centre (DCC) was established to register, scan and capture data from the paper and online Census forms.

On arrival at the DCC, paper Census forms had their unique form ID electronically captured. These forms were scanned and validated using Intelligent Character Recognition. Similarly, after a respondent pressed 'submit' on the online Census form, the form ID was captured by the DCC. Information about form response status was communicated to field staff to ensure follow-up activities for responding dwellings ceased.

A reconciliation process was conducted to ensure that all forms received at the DCC were captured.

Online forms were encrypted and sent securely to ABS processing systems. These forms were then decrypted and loaded into systems alongside paper forms.

Coding

The Census forms collect information from respondents in a number of different ways, including written text responses, radio buttons and multi-mark check boxes. A single question, such as the Ancestry question, can combine all three. Just as there are different ways of collecting information from respondents, there are also different ways of coding data to a classification.

While responses collected through a radio button are assigned directly to a classification code during data load processes, text responses are processed through coding systems. This process includes:

  • Automatic (auto-coding) – most text responses have sufficient information for a computer to code the response directly to a classification without clerical involvement.
  • Computer assisted – computer programs are used by coding specialists to assign classification codes to like text responses in bulk. This reduces the number of records requiring manual coding.
  • Manual – Census coding staff review text responses for each individual to determine the best fit to a classification and assign a code.
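
The three coding modes can be pictured as a cascade. The following is a minimal sketch only, not the ABS coding system: the index contents, the codes and the function names are all hypothetical. Responses that match the index are coded automatically, and the remainder are grouped so that identical text can be handled in bulk before anything is passed to manual coding.

```python
from typing import Optional

# Hypothetical index of text -> classification code (illustrative values only).
AUTO_INDEX = {"italy": "C001", "chile": "C002"}


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so equivalent responses match."""
    return " ".join(text.lower().split())


def auto_code(text: str) -> Optional[str]:
    """Return a code when the response matches the index exactly, else None."""
    return AUTO_INDEX.get(normalise(text))


def route(responses: list[str]) -> tuple[dict[str, str], dict[str, int]]:
    """Split responses into auto-coded ones and a bulk queue for specialists.

    The queue counts identical unmatched responses so that one decision can
    be applied to many records (the computer-assisted step).
    """
    coded: dict[str, str] = {}
    queue: dict[str, int] = {}
    for raw in responses:
        code = auto_code(raw)
        if code is not None:
            coded[raw] = code
        else:
            key = normalise(raw)
            queue[key] = queue.get(key, 0) + 1
    return coded, queue
```

Anything still unresolved after the bulk queue has been reviewed would go to manual coding.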

The Census also undertakes a coding quality assessment process, whereby a sample of manually coded records is re-coded by another member of the coding staff. Where a record is coded differently, it is forwarded to a coding specialist for adjudication on which code is correct. For Census 2021, the sample rate was 10%.
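
The quality assessment step amounts to drawing a sample, re-coding it independently and comparing the two sets of codes. A rough sketch follows, assuming simple random sampling (the source does not say how the 10% sample was drawn) and using hypothetical function names.

```python
import random


def draw_qa_sample(record_ids, rate=0.10, seed=0):
    """Draw a simple random sample of manually coded records for re-coding."""
    rng = random.Random(seed)
    ids = list(record_ids)
    sample_size = round(len(ids) * rate)
    return rng.sample(ids, sample_size)


def needs_adjudication(first_codes, second_codes):
    """List record IDs where the second coder disagreed with the first."""
    return [
        record_id
        for record_id, code in second_codes.items()
        if first_codes[record_id] != code
    ]
```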

Frame reconciliation

The frame refers to the list of units (e.g. persons, households, businesses) in a survey population. The aim of frame reconciliation for Census is to finalise the list of Australia’s dwellings and assign the correct people to these dwellings for the Census reference night. The ABS does this by reconciling Census form data with units from the Census frame by reviewing their information and addresses. This involves:

  • ensuring Census forms are linked to the correct address
  • removing duplicate forms and people
  • removing invalid people and dwellings
  • verifying the dwelling’s occupancy when no Census form was received
  • moving dwellings and persons to their correct area if they have been placed in the wrong location.

This process ensures the correct person and household information is represented in the correct location in the Census data. This allows the ABS to produce more accurate statistics for both small and large geographical areas.
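
As an illustration of the kind of matching involved, the sketch below links forms to frame addresses, keeps one form per address and reports addresses with no form. It is a simplification under stated assumptions: the `Form` fields are hypothetical, and keeping the most recently received form is an assumed rule, not the documented ABS rule for resolving duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str
    address_id: str
    received: int  # hypothetical receipt order; larger means later


def reconcile(forms, frame_addresses):
    """Link forms to frame addresses and separate out the problem cases.

    Returns (kept, unlinked, no_form):
      kept     - one form per frame address (assumed rule: the latest wins)
      unlinked - forms whose address is not on the frame
      no_form  - frame addresses with no form, for occupancy checks
    """
    kept = {}
    unlinked = []
    for form in sorted(forms, key=lambda f: f.received):
        if form.address_id not in frame_addresses:
            unlinked.append(form)
        else:
            kept[form.address_id] = form  # a later receipt replaces an earlier one
    no_form = sorted(set(frame_addresses) - set(kept))
    return kept, unlinked, no_form
```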

Derivation

Following coding and frame reconciliation, the next step in the Census processing cycle is to apply derivations. Where no response has been provided to a question, a value is assigned based on responses to other questions. The Census derives these values from the responses of other family members present in the same dwelling.

Variables that may be derived from responses given by other family members present in the same dwelling are:

  • Country of birth of person (BPLP)
  • Country of birth of father (BPMP)
  • Country of birth of mother (BPFP)
  • Language spoken at home (LANP)

If there is insufficient information provided to derive a response for these items, they are determined to be 'not stated'.
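
A derivation of this kind could look like the sketch below. The decision rule is an assumption made for illustration (a value is derived only when every family member who answered gave the same response); the source does not spell out the actual rule.

```python
NOT_STATED = "not stated"


def derive_from_family(own_response, family_responses):
    """Fill a missing response from family members in the same dwelling.

    Assumed rule: use the family's value only when all stated responses
    agree; otherwise the item is 'not stated'.
    """
    if own_response is not None:
        return own_response  # a stated response is never overwritten
    stated = {r for r in family_responses if r is not None}
    if len(stated) == 1:
        return stated.pop()
    return NOT_STATED
```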

In addition, the derivation process is used to create new variables by combining responses from several questions. Variables which are created this way include:

  • Tenure type (TEND) - derived from responses to the Tenure type question
  • Mortgage repayments (monthly) dollar values (MRED) - derived from Tenure type and Housing costs questions
  • Rent (weekly) dollar values (RNTD) - derived from Tenure type and Housing costs questions
  • Labour force status (LFSP) - derived using responses to questions on full/part-time job, job last week, hours worked, looking for work and availability to start work
  • Core activity need for assistance (ASSNP) - derived using four Census questions related to assistance needed for self-care, movement or communication activities.

Imputation

Imputation is a statistical process for predicting values where no response was provided to a question or where a response could not be derived. The ABS imputes Census data to reduce non-response bias and deliver a robust dataset.

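The key demographic variables that require a response and are imputed if no response is given are:

  • Age (AGEP)
  • Sex (SEXP)
  • Registered marital status (MSTP)
  • Place of usual residence (PURP)
  • Place of work (POWP)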

The primary imputation method used for the 2021 Census is known as hotdecking. This involves randomly selecting a donor record and copying the relevant responses to the record requiring imputation. The donor record will have similar characteristics and must also have the required variables stated. In addition, the donor record will be located geographically as close as possible to the location of the record to be imputed. In 2021, administrative data was used in hotdecking imputation to improve the selection of donors.

Imputation occurs in two situations:

  1. Where no Census form was returned – all five key demographic variables are imputed. The remaining variables are coded ‘not stated’.
  2. Where a partially completed Census form was returned – only the key demographic variables that did not have a response are imputed. For example, if a person responded to all key demographic questions except Age, only the Age (AGEP) variable is imputed.

1. No Census form returned

Where a private dwelling was identified as occupied on Census Night but a Census form was not returned, the number of people normally in the dwelling and their key demographic variables are imputed. In these cases, the non-demographic variables are set to 'not stated' or 'not applicable'.

For these private dwellings, the hotdecking imputation process is performed. The non-responding dwellings are matched with donor dwellings and the count of people and their key demographic variables are copied from the donor record to the imputed record. The donor records must meet several conditions:

  • they must be occupied private dwellings where a form was returned, contain a maximum of 6 persons, and all of those persons responded to the key demographic questions
  • they must have a similar Dwelling structure (STRD) and Dwelling location (DLOD) to the record to be imputed
  • they must be located geographically as close as possible to the location of the record to be imputed
  • where available, they have similar sex and age group counts from administrative data.

For 2021 Census data, improvements were made to the imputation of non-responding private dwellings. Administrative data was used to help choose the donor dwellings with similar numbers of people and ages to those that did not respond. For example, administrative data may show that two males aged 30–34 years live in a house that was determined to be occupied on Census Night. We would then choose a donor house from the Census where administrative data also shows that there were two males aged 30–34 years. The key demographic Census variables from this donor dwelling are copied across for the non-responding dwelling. This method aims to give us counts of people in the right age group rather than just choosing a random household in a similar dwelling type in the same geographical area.
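
Donor selection under these conditions can be sketched as a filter followed by a random draw among the nearest candidates. This is an illustration, not the ABS implementation: the `Dwelling` fields, the use of straight-line distance, the exact-match test on dwelling structure and location, and the `pool_size` parameter are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Dwelling:
    dwelling_id: str
    structure: str        # stands in for Dwelling structure (STRD)
    location: str         # stands in for Dwelling location (DLOD)
    x: float              # hypothetical coordinates used to measure closeness
    y: float
    admin_profile: tuple  # hypothetical (sex, age group) pairs from administrative data
    persons: list         # key demographic values for each person on the form


def choose_donor(target, donors, rng, pool_size=3):
    """Pick a donor dwelling for a non-responding dwelling, or None."""
    eligible = [
        d for d in donors
        if d.structure == target.structure
        and d.location == target.location
        and 1 <= len(d.persons) <= 6
    ]
    # Prefer donors whose administrative sex and age group counts match.
    same_profile = [
        d for d in eligible
        if sorted(d.admin_profile) == sorted(target.admin_profile)
    ]
    candidates = same_profile or eligible
    if not candidates:
        return None
    candidates.sort(key=lambda d: (d.x - target.x) ** 2 + (d.y - target.y) ** 2)
    # Random choice among the closest few keeps the selection random.
    return rng.choice(candidates[:pool_size])
```

The person count and key demographic values of the chosen donor would then be copied to the non-responding dwelling.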

Where a person in a non-private dwelling did not return a form, their demographic characteristics are copied from a randomly selected person in a similar non-private dwelling using Type of non-private dwelling (NPDD).

2. Partially completed form returned

Where a partially completed form was returned, some or all of the demographic characteristics may require imputation. Characteristics are imputed using a combination of hotdecking and probability techniques.

Age (AGEP)

Where date of birth or age details are incomplete or missing, the variable Age (AGEP) is imputed based on distribution patterns found in the responding population. Variables used in the imputation of age include:

  • Sex (SEXP)
  • Relationship in household (RLHP)
  • Registered marital status (MSTP)
  • Indigenous status (INGP)
  • Type of education institution attending (TYPP)
  • Type of non-private dwelling (NPDD)

Additional variables may also be used where they are shown to correlate with age.
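
One simple way to impute from distribution patterns is to draw an age at random from respondents who share the same values on the predictor variables, which reproduces the observed age distribution within each group. The sketch below assumes that approach; the source does not describe the actual model, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def build_age_pools(respondents, keys):
    """Group the stated ages of respondents by their predictor values."""
    pools = defaultdict(list)
    for person in respondents:
        pools[tuple(person[k] for k in keys)].append(person["AGEP"])
    return pools


def impute_age(person, pools, keys, rng):
    """Draw an age from respondents with matching predictors, or None."""
    pool = pools.get(tuple(person[k] for k in keys))
    if not pool:
        return None  # no matching respondents; a coarser grouping would be needed
    return rng.choice(pool)
```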

Sex (SEXP)

If there is not enough information on the form to determine the Sex (SEXP) of the person (or it is not appropriate to do so) then each record is randomly allocated a male or female sex.

Registered marital status (MSTP)

Where Registered marital status (MSTP) is missing, this variable is imputed by finding a similar person in a similar responding dwelling based on the variables:

  • Sex (SEXP)
  • Relationship in household (RLHP)
  • Age (AGEP)
  • Dwelling type (DWTD)
  • Type of non-private dwelling (NPDD)

Registered marital status is only imputed for people aged 15 years and over and set to 'not applicable' for people aged under 15 years.

Place of usual residence (PURP)

Where a complete Place of usual residence (PURP) on Census Night is not provided, the information that is provided is used to impute an appropriate area, in this instance a Mesh Block as well as Statistical Area Level 1 and Statistical Area Level 2. A similar person in a similar dwelling is located and missing usual residence fields are copied to the imputed variable. These are based on the variables:

  • Dwelling type (DWTD)
  • Dwelling location (DLOD)
  • Type of non-private dwelling (NPDD)
  • Residential status in a non-private dwelling (RLNP)

Place of work (POWP)

Where a complete Place of work (POWP) is not provided, the information that is provided is used to impute an appropriate Destination Zone (as well as Mesh Block). A similar person is located, and missing Place of work fields are copied to the imputed variable. Depending on the level of imputation required, place of work imputation may use the following variables (where available) in its method:

  • Place of usual residence (PURP)
  • Industry of employment (INDP)
  • Method of travel to work (MTWP)

Records that have required imputation can be identified using the imputation flags:

  • Imputation flag for number of males and females in dwelling (IFNMFD)
  • Imputation flag for age (IFAGEP)
  • Imputation flag for sex (IFSEXP)
  • Imputation flag for registered marital status (IFMSTP)
  • Imputation flag for place of usual residence (IFPURP)
  • Imputation flag for place of work (IFPOWP)
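
The person-level flags can be used to measure how much of a variable was imputed in a set of records. The sketch below assumes each flag has been recoded to a boolean (True meaning imputed); the published flags use their own category codes, so that recoding step is an assumption and is not shown.

```python
# Person-level imputation flags named in the list above.
PERSON_FLAGS = ("IFAGEP", "IFSEXP", "IFMSTP", "IFPURP", "IFPOWP")


def imputation_rates(records):
    """Return the share of records imputed for each person-level flag.

    Each record is a dict; a missing flag is treated as 'not imputed'.
    """
    total = len(records)
    if total == 0:
        return {}
    return {
        flag: sum(1 for r in records if r.get(flag, False)) / total
        for flag in PERSON_FLAGS
    }
```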

Editing

Editing is a process that looks to correct errors in the data and is undertaken as part of the validation strategy to produce a consistent, valid dataset. The kinds of error which editing procedures can detect are limited to responses and codes which are invalid, or which do not align with Census definitions.

For example, if someone mistakenly states that they are 5 years old and in a registered marriage, their record is flagged for investigation and a resolution is established to ensure it aligns with the definition of 2021 Census variables.
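
An edit of this kind can be expressed as a consistency rule that flags a record for investigation rather than changing it. The sketch below is illustrative only: the function name and the value labels are hypothetical, and the age threshold of 15 comes from the rule that registered marital status is 'not applicable' for people aged under 15 years.

```python
def check_age_marital(record):
    """Return a reason string if age and marital status conflict, else None."""
    age = record.get("AGEP")
    status = record.get("MSTP")
    if age is not None and age < 15 and status not in (None, "not applicable"):
        return "aged under 15 with a registered marital status"
    return None
```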

Quality assurance

Quality assurance practices are applied across the various Census systems and processes to monitor and review data quality and ensure the accuracy, consistency and coherence of final Census outputs.

These practices include:

  • comparison of the data with previous censuses
  • comparison of the data with other sources of information including (but not limited to) Survey of Income and Housing, Construction statistics and Estimated Resident Population
  • assessing and validating the data with real world changes (for example, where new suburbs were developed between censuses, or where natural disasters impacted dwellings in specific areas).

For more information about data quality, see Managing Census quality.

Introduced random error / perturbation

Under the Census and Statistics Act 1905 it is an offence to release any information collected under the Act that is likely to enable identification of any individual or organisation. To minimise the risk of identifying individuals in aggregate statistics, a technique has been developed to randomly adjust values.

Random adjustment of the data, known as random error or perturbation, is considered to be the best technique for avoiding the release of identifiable data while maximising the range of information that can be released.

Many classifications used in ABS statistics have an uneven distribution of data throughout their categories. For example, the number of people who are Anglican or born in Italy is quite large (3,101,187 and 174,042 respectively in 2016) while the number of people who are Buddhist or born in Chile (563,675 and 26,082 respectively in 2016) is relatively small. When religion is cross-classified with country of birth, the number in the table cell who are Anglican and who were born in Italy could be small, and the number of Buddhists born in Chile even smaller. These small numbers increase the risk of identifying individuals in the statistics.

Even when variables are more evenly distributed in the classifications, the problem still occurs. The more detailed the classifications, and the more of them that are applied in constructing a table, the greater the incidence of very small cells.

When random error is applied, all values are slightly adjusted, including the total values in tables, to prevent any identifiable data being exposed. These adjustments result in small introduced random errors where the true value has been either increased or decreased by a small amount. These adjustments have a negligible impact on the underlying pattern of the statistics. The technique allows very large tables, for which there is a strong client demand, to be produced even though they contain numbers of very small cells.

For tabular outputs, these adjustments may cause the sum of rows or columns to differ by small amounts from totals. The counts are adjusted independently in a controlled manner, so the same information is adjusted by the same amount. However, tables at higher geographic levels may not be equal to the sum of the tables for the component geographic units.
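
The two properties described here, consistent adjustment of the same information and totals that do not add up exactly, can be demonstrated with a toy example. The sketch below is not the ABS perturbation algorithm: deriving the adjustment from a hash, the `secret` string, the `cell_key` identifier and the maximum shift of 2 are all assumptions made for illustration.

```python
import hashlib


def perturb(count: int, cell_key: str, secret: str = "demo-secret",
            max_shift: int = 2) -> int:
    """Adjust a count by a small amount that depends only on the cell key.

    The same cell key always yields the same adjustment, so the same
    information is published identically wherever it appears.
    """
    digest = hashlib.sha256(f"{secret}|{cell_key}".encode("utf-8")).digest()
    shift = digest[0] % (2 * max_shift + 1) - max_shift  # in [-max_shift, max_shift]
    return max(0, count + shift)


def additivity_gap(cells: dict[str, int], total_key: str) -> int:
    """Difference between the sum of adjusted cells and the adjusted total."""
    published_cells = sum(perturb(n, key) for key, n in cells.items())
    published_total = perturb(sum(cells.values()), total_key)
    return published_cells - published_total
```

Because each cell and the total are adjusted independently, the gap returned by `additivity_gap` is usually small but need not be zero.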

It is not possible to determine which individual figures have been affected by random error adjustments, but the small variance which may be associated with derived totals can, for the most part, be ignored.

Caution when aggregating finely classified data

No reliance should be placed on small cells as they are impacted by random adjustment, respondent and processing errors.

Many different classifications are used in Census tables and the tables are produced for a variety of geographical areas. The effect of the introduced random error is minimised if the statistic required is taken directly from a tabulation rather than from aggregating more finely classified data. Similarly, rather than aggregating data from small areas to obtain statistics about a larger standard geographic area, published data for the larger area should be used wherever possible.

When calculating proportions, percentages or ratios from cross-classified or small area tables, the random error introduced can be ignored except when very small cells are involved, in which case the impact on percentages and ratios can be significant.
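
A short calculation shows why small cells matter. The sketch below assumes, purely for illustration, that a cell can move by up to 2 in either direction and that the total is held fixed; the actual size of the adjustments is not stated in this document.

```python
def percentage_range(cell: int, total: int, max_shift: int = 2):
    """Lowest and highest percentage a cell could show after adjustment.

    max_shift is a hypothetical bound on the adjustment to the cell.
    """
    low = 100 * max(0, cell - max_shift) / total
    high = 100 * (cell + max_shift) / total
    return low, high
```

Under these assumptions, a cell of 3 in a total of 50 could appear as anything from 2% to 10%, whereas a cell of 3,000 in a total of 50,000 stays within a fraction of a percentage point of 6%.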

See also the Data confidentiality guide.
