2062.0 - Census Data Enhancement Project: An Update, Oct 2010

ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 15/10/2010

3 WAVE 2 OF A 5% STATISTICAL LONGITUDINAL CENSUS DATASET

An important feature of the Census Data Enhancement (CDE) project is the formation of a Statistical Longitudinal Census Dataset (SLCD) by bringing together data from the 2006 Census with data from the 2011 Census and future Censuses to build a picture of how society moves through various changes: which groups are affected by different types of change and in what way.

Wave 1 of the SLCD was created by selecting a random sample of 5% of persons from the 2006 Census of Population and Housing. Wave 2 of the SLCD will endeavour to bring together the Wave 1 records with their corresponding records in the 2011 Census.
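As a rough illustration of the sampling step only, the sketch below draws a simple random sample of 5% of records. The function name `select_wave1_sample`, the fixed seed and the integer record identifiers are hypothetical; the actual sample design used by the ABS is not described here and may differ.

```python
import random

def select_wave1_sample(person_ids, fraction=0.05, seed=2006):
    """Return a simple random sample of the given fraction of record IDs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample_size = round(len(person_ids) * fraction)
    # rng.sample draws without replacement, so no record is selected twice.
    return sorted(rng.sample(person_ids, sample_size))

# Hypothetical stand-in for 2006 Census records: 100,000 integer identifiers.
census_2006 = list(range(1, 100_001))
wave1 = select_wave1_sample(census_2006)
print(len(wave1))  # 5000
```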

Subsequent waves will be created with each new Census, providing a longitudinal dataset of information about 5% of the Australian population.

At each Census, the 5% SLCD will be augmented with a 5% sample of children who have been born and immigrants who have arrived since the previous Census. There will also be some provision for topping up the sample to maintain a dataset that is consistently 5% of the Census population at any point in time.

The third wave of the 5% SLCD will be created in 2016. For this wave, the ABS will make use of a non-identifying grouped numeric code based on name to improve the accuracy of the linked dataset as well as improve the efficiency of the linking process. The decision to use a non-identifying grouped numeric code is based on the outcomes of a 2006 CDE quality study which investigated the statistical techniques used to undertake data linkage and evaluated the feasibility of creating the 5% SLCD without using name and address. The study demonstrated that, in the absence of name and address, inclusion of a non-identifying grouped numeric code when linking records can improve accuracy and efficiency. For further information, see Assessing the Likely Quality of the Statistical Longitudinal Census Dataset (ABS cat. no. 1351.0.55.026).

The non-identifying grouped numeric code will be assigned to all records in the 5% SLCD dataset from 2011. It will be created from a combination of letters from first and last names using a secure one-way process, meaning that it cannot be reversed to identify individuals. Each code will represent approximately 2000 people and therefore will not be unique to an individual. The code will only be accessible to those ABS staff creating the linked dataset, and will not be released outside the ABS.
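The general idea of a grouped, one-way code can be sketched as follows. This is not the ABS algorithm, which is not specified here. It assumes, purely for illustration, that the first two letters of each name are combined, passed through a keyed hash (HMAC-SHA256), and reduced to one of a fixed number of groups. `SECRET_KEY`, `NUM_GROUPS` and `grouped_name_code` are hypothetical names, and 10,000 groups is chosen only because a population of about 20 million would then average about 2000 people per code.

```python
import hashlib
import hmac

# Hypothetical key; in this sketch it stands for a secret held only by linkage staff.
SECRET_KEY = b"hypothetical-key-for-illustration-only"
# Assumption: roughly 20 million people / 2000 people per code = 10,000 groups.
NUM_GROUPS = 10_000

def grouped_name_code(first_name, last_name, num_groups=NUM_GROUPS):
    """Map a few letters of a name to a numeric group code shared by many people."""
    # Assumption: two letters from each name. The source says only that the code
    # uses "a combination of letters from first and last names".
    letters = first_name.strip().upper()[:2] + last_name.strip().upper()[:2]
    digest = hmac.new(SECRET_KEY, letters.encode("utf-8"), hashlib.sha256).digest()
    # Reduce the hash to a small range so each code covers many people.
    return int.from_bytes(digest[:8], "big") % num_groups

code_a = grouped_name_code("Alice", "Smith")
code_b = grouped_name_code("Alan", "Smart")  # same letters "AL" + "SM"
print(code_a == code_b)  # True: different people can share one code
```

Because many different names map to the same code, the code narrows the set of candidate links without pointing to one person.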

The non-identifying grouped numeric code will be used in conjunction with characteristics such as age, sex, geographic region and country of birth to link records from the 5% SLCD to the 2016 Census and future Censuses using probabilistic record linkage techniques. Name and address information will not be used in the linkage process and will not be available for the 5% SLCD dataset as they are deleted at the end of Census processing.

What has changed since the 2006 Census

The formation of an SLCD was foreshadowed in 2005. The addition of a second wave of Census data to the 5% SLCD from 2006 will provide the first longitudinal view of the Census, for statistical and research purposes. The retention of a non-identifying grouped numeric code on the 5% SLCD to assist the data linking process for the future is a change being made to improve accuracy and efficiency. The code is not an identifier and does not add privacy risk.

Benefits of the Statistical Longitudinal Census Dataset

Each five-yearly Census provides a rich set of information about Australian people and households. It covers topics such as family structure; education and qualifications; presence of a severe or profound disability; work, including hours worked, occupation and industry; income and housing; country of birth; year of arrival; and Indigenous status. It gives a detailed picture of social and economic conditions at a particular point in time, and comparing successive Censuses shows how these conditions are changing over time and across population groups.

What the Statistical Longitudinal Census Dataset (SLCD) adds is the ability to study how social and economic conditions change over time at the individual level. It provides insight into the pathways that tend to lead to particular outcomes, and into how these pathways vary for different population groups. It also enables the study of the likely consequences of particular socio-economic circumstances for different population groups, as evidenced by the patterns in the longitudinal data. This can help in developing strategies that encourage positive pathways and avoid negative ones, and can help policy makers assess both the social and financial benefits of related intervention strategies.

As well as being used in its own right, the very large longitudinal Census sample can help assess the quality of transition probabilities measured in smaller, more frequent longitudinal studies. Particularly for sub-population groups, it may also allow adjustments that improve the socio-economic modelling that frequently underpins government policy making and research.
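To make the term "transition probabilities" concrete, the sketch below estimates, from linked pairs of records, the proportion of people in each Wave 1 state who are found in each Wave 2 state. The function name `transition_probabilities`, the labour force states and the six linked pairs are all made up for illustration; they are not SLCD results.

```python
from collections import Counter, defaultdict

def transition_probabilities(linked_pairs):
    """Estimate P(state in wave 2 | state in wave 1) from linked records."""
    counts = defaultdict(Counter)
    for state_wave1, state_wave2 in linked_pairs:
        counts[state_wave1][state_wave2] += 1
    # Divide each count by the total number of people starting in that state.
    return {
        origin: {dest: n / sum(dests.values()) for dest, n in dests.items()}
        for origin, dests in counts.items()
    }

# Made-up (2006 state, 2011 state) pairs for six hypothetical linked people.
pairs = [
    ("employed", "employed"), ("employed", "employed"),
    ("employed", "unemployed"), ("unemployed", "employed"),
    ("unemployed", "unemployed"), ("employed", "employed"),
]
probs = transition_probabilities(pairs)
print(probs["employed"]["employed"])  # 0.75
```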

The 5% SLCD containing 2006 and 2011 Census data will be available for statistical analysis and research purposes from 2013. Standard ABS confidentiality methods will be applied and the data will be accessible through standard ABS secure data access arrangements. No information that is likely to enable identification of an individual or household will be released (see 'Confidentiality and Privacy').

Data involved in the Statistical Longitudinal Census Dataset

The creation of the 5% SLCD itself only involves the use of data from the Census of Population and Housing.

The 2006 SLCD dataset and the 2011 Census dataset will be brought together using a statistical method referred to as 'probabilistic record linkage'. This method does not use names and addresses; instead it relies on a number of characteristics common to both datasets, such as age, sex, geographic region and country of birth. All possible linkages based on these data items are evaluated, and the records for which the linkage is most likely to be correct are brought together. For many individuals the linkage will be correct, while for some it will not. Some inaccuracy in the linkage will not generally affect statistical conclusions drawn from the linked data, although care does need to be taken in the interpretation of results.
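The evaluation of candidate linkages can be sketched with a simplified scoring scheme in the style of Fellegi and Sunter, a common formulation of probabilistic record linkage. The source does not state which formulation the ABS uses, so this is an assumption. The probabilities in `FIELDS`, the threshold, the rule that age should differ by exactly five years, the greedy one-to-one assignment and all records are hypothetical values chosen for illustration.

```python
import math

# Assumed probabilities per field: (m, u), where m = P(agree | same person)
# and u = P(agree | different people). Illustrative values only.
FIELDS = {
    "age": (0.95, 0.01),
    "sex": (0.99, 0.50),
    "region": (0.80, 0.02),
    "cob": (0.98, 0.30),  # country of birth
}

def agrees(field, rec_2006, rec_2011):
    if field == "age":
        # Assumption: a person is exactly five years older at the later Census.
        return rec_2006["age"] + 5 == rec_2011["age"]
    return rec_2006[field] == rec_2011[field]

def match_weight(rec_2006, rec_2011):
    """Sum of log-likelihood ratios over fields; higher means a more likely link."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agrees(field, rec_2006, rec_2011):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def link(wave1, census, threshold=5.0):
    """Score every pair, then greedily keep the best one-to-one links."""
    scored = sorted(
        ((match_weight(a, b), i, j)
         for i, a in enumerate(wave1) for j, b in enumerate(census)),
        reverse=True,
    )
    used_1, used_2, links = set(), set(), []
    for weight, i, j in scored:
        if weight < threshold:
            break  # remaining pairs are too unlikely to be the same person
        if i not in used_1 and j not in used_2:
            used_1.add(i)
            used_2.add(j)
            links.append((i, j, weight))
    return links

# Made-up records; the first person changed region between Censuses.
wave1 = [
    {"age": 30, "sex": "F", "region": "NSW-01", "cob": "Australia"},
    {"age": 52, "sex": "M", "region": "VIC-07", "cob": "Italy"},
]
census_2011 = [
    {"age": 57, "sex": "M", "region": "VIC-07", "cob": "Italy"},
    {"age": 35, "sex": "F", "region": "QLD-03", "cob": "Australia"},
    {"age": 35, "sex": "M", "region": "NSW-01", "cob": "India"},
]
links = link(wave1, census_2011)
print([(i, j) for i, j, _ in links])  # [(1, 0), (0, 1)]
```

In this toy data the first Wave 1 record is still linked despite the change of region, because agreement on age, sex and country of birth outweighs the single disagreement. With less distinctive records the highest-scoring pair can be the wrong person, which is the kind of linkage inaccuracy described above.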