8 Statistical data integration involves combining information from different data sources, such as administrative, survey and/or Census data, to provide new datasets for statistical and research purposes.
9 Data linking is a key part of statistical data integration and involves combining records from different source datasets using variables that are shared between the sources. Data linkage is performed on unit records that represent individual persons.
Linkage between the Temporary Visa Holder data and the 2016 Census
10 The 2016 temporary entrant records were linked to the 2016 Census of Population and Housing data using a combination of deterministic and probabilistic linkage methodologies. The linkage method used in this project is considered a silver standard linkage because encoded name and address information was used. Further information about name and address encoding can be found in Information paper: Name encoding method for Census 2016.
11 Deterministic data linkage, also known as rule-based linkage, involves linking pairs of records across two datasets that match exactly or closely on common variables.
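As an illustration only, rule-based linkage can be sketched as below. The field names (`name_code`, `dob`, `sex`), the sample records and the rule that ambiguous matches stay unlinked are assumptions made for the example, not the rules used in this project, and the sketch covers exact agreement only.

```python
def deterministic_link(left, right, keys):
    """Link records that agree exactly on every key variable.

    A left-hand record is linked only when exactly one right-hand record
    shares its key values; ambiguous matches are left unlinked.
    """
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    links = []
    for rec in left:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:
            links.append((rec["id"], candidates[0]["id"]))
    return links


# Hypothetical records; name_code stands in for an encoded name.
visa_records = [
    {"id": "V1", "name_code": "a91f", "dob": "1990-04-02", "sex": "F"},
    {"id": "V2", "name_code": "77c0", "dob": "1985-11-30", "sex": "M"},
    {"id": "V3", "name_code": "e5b2", "dob": "1993-01-15", "sex": "F"},
]
census_records = [
    {"id": "C1", "name_code": "a91f", "dob": "1990-04-02", "sex": "F"},
    {"id": "C2", "name_code": "77c0", "dob": "1985-11-03", "sex": "M"},
    {"id": "C3", "name_code": "e5b2", "dob": "1993-01-15", "sex": "F"},
    {"id": "C4", "name_code": "e5b2", "dob": "1993-01-15", "sex": "F"},
]

links = deterministic_link(visa_records, census_records, ["name_code", "dob", "sex"])
```

In this toy data only V1 is linked: V2 disagrees with C2 on date of birth, and V3 matches two Census records, so no unique link can be assigned.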
12 Probabilistic linking allows links to be assigned in spite of missing or inconsistent information, provided there is enough agreement on other variables to offset any disagreement. In probabilistic data linkage, records from two datasets are compared and brought together using several variables common to each dataset (Fellegi & Sunter, 1969).
13 A key feature of the methodology is the ability to handle a variety of linking variables and record comparison methods to produce a single numerical measure of how well two particular records match, referred to as the 'linkage weight'. This allows ranking of all possible links and optimal assignment of the link or non-link status (Solon and Bishop, 2009).
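A minimal sketch of how a linkage weight of this kind can be computed is given below, following the general Fellegi-Sunter approach. The m-probabilities (chance of agreement for a true match) and u-probabilities (chance of agreement for a non-match), the field names and the treatment of missing values are hypothetical and are not the parameters used in this project.

```python
import math

# Hypothetical (m, u) probabilities for each linking variable:
# m = P(values agree | records are a true match)
# u = P(values agree | records are not a match)
PARAMS = {
    "name_code": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "sex": (0.99, 0.5),
}


def linkage_weight(a, b, params=PARAMS):
    """Sum the log2 agreement or disagreement weights over all variables.

    A variable missing on either record contributes nothing to the total
    (an assumption made for this sketch).
    """
    total = 0.0
    for field, (m, u) in params.items():
        if a.get(field) is None or b.get(field) is None:
            continue
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts weight
    return total


visa = {"name_code": "a91f", "dob": "1990-04-02", "sex": "F"}
full_match = {"name_code": "a91f", "dob": "1990-04-02", "sex": "F"}
dob_differs = {"name_code": "a91f", "dob": "1990-02-04", "sex": "F"}
dob_missing = {"name_code": "a91f", "dob": None, "sex": "F"}

w_full = linkage_weight(visa, full_match)
w_differs = linkage_weight(visa, dob_differs)
w_missing = linkage_weight(visa, dob_missing)
```

Candidate pairs can then be ranked by weight, with pairs above a chosen cut-off assigned as links. Under these assumed parameters a missing date of birth lowers the weight less than a conflicting one does.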
17 The first step of the calibration process adjusted for missed links. The methodology adopted was originally developed to adjust for non-response in sample surveys. The concepts of non-response and non-links differ: the former is generally the result of an action by a person selected in a sample, whereas the latter is a failure to link a record, most likely because of the quality of its linking variables. However, both situations may result in under- or over-representation, and so the methodology developed to adjust for non-response is also suitable for adjusting for non-links. Unlike non-response in a sample survey, in this case many of the characteristics of the non-linked records are known, and these characteristics can therefore be used as inputs into an adjustment for unlinked records.
18 The propensity of a Temporary entrants record to be linked to a Census record was modelled using logistic regression, which estimates the probability of each record having been linked based on that record's characteristics. The logistic regression was performed separately for student visa holders, temporary skilled workers, Special Category (New Zealand citizen) visa holders, and others. Each record was then assigned an initial weight given by the inverse of the linkage probability estimated by the relevant regression model. For example, if the regression model estimated that a Temporary entrants record had a 75% chance of being successfully linked to a Census record, the initial weight would be 1 divided by 0.75, or approximately 1.33. This ensures that records in the linked dataset which share characteristics with unlinked records are given higher weights, so that the characteristics associated with unlinked records are adequately represented on the linked file.
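The inverse-probability weighting described above can be sketched as follows. The model form is a standard logistic regression, but the intercept, coefficients and predictor names are invented for illustration; the fitted models used in this project are not reproduced here.

```python
import math


def link_probability(features, coefs, intercept):
    """Predicted probability of linkage from a fitted logistic regression."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def initial_weight(p_link):
    """Initial weight is the inverse of the estimated linkage probability."""
    return 1.0 / p_link


# Hypothetical coefficients for one visa group (not estimates from the project).
INTERCEPT = math.log(3)  # gives a baseline linkage probability of 0.75
COEFS = {"recent_arrival": -0.8, "address_missing": -1.5}

baseline = {"recent_arrival": 0, "address_missing": 0}
hard_to_link = {"recent_arrival": 1, "address_missing": 1}

p_baseline = link_probability(baseline, COEFS, INTERCEPT)
p_hard = link_probability(hard_to_link, COEFS, INTERCEPT)

w_baseline = initial_weight(p_baseline)  # 1 / 0.75, about 1.33
w_hard = initial_weight(p_hard)          # lower probability, larger weight
```

Linked records that resemble hard-to-link records receive larger weights, which is how the characteristics of unlinked records come to be represented on the linked file.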
19 The second step of the calibration process uses the weights derived from the first step as an input into the calibration to known totals from the Temporary entrants dataset. This adjusts for residual bias not accounted for by the regression model, and ensures that totals from the linked dataset exactly match totals from the Temporary entrants dataset for characteristics considered to be of particular interest, such as visa group, applicant status (primary or secondary) and state/territory of residence.
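The source does not name the calibration algorithm, so the sketch below uses raking (iterative proportional fitting) as one common way to calibrate weights to known marginal totals. The records, starting weights and benchmark totals are all invented for the example.

```python
def rake(records, margins, max_iter=100, tol=1e-9):
    """Scale weights repeatedly until weighted totals match each margin.

    records: list of dicts holding category values and a starting "weight".
    margins: {variable: {category: known total}} from the benchmark dataset.
    Returns the calibrated weights in record order.
    """
    weights = [r["weight"] for r in records]
    for _ in range(max_iter):
        max_change = 0.0
        for var, totals in margins.items():
            # Current weighted total for each category of this variable.
            sums = {}
            for r, w in zip(records, weights):
                sums[r[var]] = sums.get(r[var], 0.0) + w
            # Ratio-adjust every weight towards the known category total.
            for i, r in enumerate(records):
                factor = totals[r[var]] / sums[r[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_total(records, weights, var, category):
    return sum(w for r, w in zip(records, weights) if r[var] == category)


# Hypothetical linked records carrying first-step (propensity) weights.
linked = [
    {"visa_group": "student", "status": "primary", "weight": 1.2},
    {"visa_group": "student", "status": "primary", "weight": 1.5},
    {"visa_group": "student", "status": "secondary", "weight": 2.0},
    {"visa_group": "skilled", "status": "primary", "weight": 1.1},
    {"visa_group": "skilled", "status": "secondary", "weight": 1.6},
    {"visa_group": "skilled", "status": "secondary", "weight": 1.3},
]
# Hypothetical known totals from the benchmark population.
known_totals = {
    "visa_group": {"student": 8.0, "skilled": 6.0},
    "status": {"primary": 7.0, "secondary": 7.0},
}

calibrated = rake(linked, known_totals)
```

After raking, the weighted count in each visa group and each applicant status matches the benchmark total to within the stated tolerance, while the relative sizes of the first-step weights within each cell are preserved.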
20 Following the two-step calibration process, weights are applied to the 974,803 linked records so that estimates will align to the 1,635,498 in-scope records from the Temporary entrants population. The mean weight is therefore around 1.68, though individual weights range from 1.0 to 12.5.
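The mean weight follows directly from the two record counts, because the calibrated weights sum to the in-scope population total. A quick check of the arithmetic, using only the figures quoted above (the variable names are chosen for the example):

```python
linked_records = 974_803      # records successfully linked to the Census
in_scope_records = 1_635_498  # in-scope Temporary entrants records

# Calibrated weights sum to the in-scope total, so the mean weight is the
# ratio of the two counts.
mean_weight = in_scope_records / linked_records

# Share of in-scope records that were linked.
link_rate = linked_records / in_scope_records

print(f"mean weight: {mean_weight:.2f}")
print(f"share of in-scope records linked: {link_rate:.1%}")
```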
3419.0.55.001 - Microdata: Australian Census and Temporary Entrants Integrated Dataset, 2016 Quality Declaration
Latest ISSUE Released at 11:30 AM (CANBERRA TIME) 14/02/2019 First Issue