4321.0 - Research Paper: Integration of the National Health Survey with the Multi-Agency Data Integration Project (MADIP), 2014-15
Latest ISSUE Released at 11:30 AM (CANBERRA TIME) 12/02/2020 First Issue
The Multi-Agency Data Integration Project (MADIP) is a partnership among Australian Government agencies to combine information on healthcare, education, government payments, personal income tax, and the Census to create a comprehensive picture of Australia over time. Information in MADIP is combined by linking person-level datasets to a central linking infrastructure, or ‘Spine’, that serves as a base dataset representing the ‘ever-resident’ population of Australia. The Spine used to create the MADIP asset is made up of the Medicare Consumer Directory, Personal Income Tax data, and Social Security and Related Information data. More information on MADIP, including how the data is kept secure and confidential, is available on the ABS website.
Integrating the National Health Survey (NHS) with the MADIP asset has the potential to provide rich insights into health service usage. High-level policy issues that could be informed by this linkage include:
A Privacy Impact Assessment into linking the 2014-15 NHS to MADIP was undertaken.
This paper summarises the scope, coverage and quality of the 2014-15 NHS dataset, as well as the linking methodology used and the results achieved from the linkage process undertaken in 2018-19.
Data Scope and Coverage
2014-15 National Health Survey
The NHS was conducted in all states and territories from July 2014 to June 2015. It included 19,257 people (14,560 adults and 4,697 children aged 0 to 17 years) in 14,723 private dwellings.
Table 1: Scope of the NHS
The Spine used to create the MADIP asset consists of 32,152,764 person records and includes persons who were present in any of the following datasets:
All Spine person records were used in the linkage.
Quality of NHS Linking Variables
A number of metrics can be used to assess dataset quality, including rates of data missing from the dataset (referred to as ‘missingness’). Missingness rates were calculated for the key linking variables of first name, surname, date of birth, address related information and sex. These rates were low except for date of birth and surname (see Table 2).
The quality of the NHS linking variables is considered high given the generally low missingness rates observed for these variables. This helped to achieve a high overall linkage rate (95%) involving minimal false links. In particular, the high quality of address related information compensated for the lower quality of surname data in the linkage process.
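As an illustration of the missingness metric described above, the following is a minimal sketch of how a missingness rate could be computed for each linking variable. The records, field names and the rule that `None` or a blank string counts as missing are assumptions made for the example; they are not taken from the NHS dataset.

```python
from typing import Optional

# Hypothetical records (not NHS data); None or a blank string marks a missing value.
records = [
    {"first_name": "Jessica", "surname": "Smith", "date_of_birth": "1980-03-14", "sex": "F"},
    {"first_name": "Tom", "surname": None, "date_of_birth": None, "sex": "M"},
    {"first_name": "Ali", "surname": "K", "date_of_birth": "1992-11-02", "sex": "M"},
    {"first_name": "", "surname": "Nguyen", "date_of_birth": "2001-07-30", "sex": "F"},
]


def is_missing(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or value.strip() == ""


def missingness_rate(rows: list, variable: str) -> float:
    """Percentage of rows in which `variable` is missing."""
    if not rows:
        raise ValueError("cannot compute a rate for an empty dataset")
    missing = sum(1 for row in rows if is_missing(row.get(variable)))
    return 100.0 * missing / len(rows)


rates = {
    variable: missingness_rate(records, variable)
    for variable in ("first_name", "surname", "date_of_birth", "sex")
}
```
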
Table 2: Missingness rates for linking variables
The variables used for linking NHS to the MADIP asset were name, address related information, date of birth and sex.
Names were repaired, standardised, and anonymised. Names are repaired when they closely match entries in master name indexes, and standardised so that variations are grouped under a common name (e.g. Jess and Jessie to Jessica). This helps ensure consistency between datasets and maximises linkage rates. A high proportion of surnames were missing (8.2%) or consisted of only one character (19.2%). To maximise the use of surname, a first initial (FI) surname field was created and used in the linkage.
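The name preparation steps can be sketched roughly as follows. This is an illustration only: the `COMMON_NAME` lookup stands in for a master name index, the function names are hypothetical, and the FI surname field is assumed here to be the first character of the surname (the paper does not spell out its exact construction). Anonymisation is not shown.

```python
from typing import Optional

# Hypothetical nickname lookup; a real master name index would be far larger.
COMMON_NAME = {"jess": "jessica", "jessie": "jessica"}


def standardise_first_name(name: Optional[str]) -> Optional[str]:
    """Lower-case and trim a first name, then map known variants to a common name."""
    if name is None:
        return None
    cleaned = name.strip().lower()
    if not cleaned:
        return None
    return COMMON_NAME.get(cleaned, cleaned)


def surname_first_initial(surname: Optional[str]) -> Optional[str]:
    """Assumed FI surname field: the first character of the surname, if any.

    A one-character surname and a full surname starting with the same letter
    produce the same value, so both kinds of record remain usable.
    """
    if surname is None:
        return None
    cleaned = surname.strip().lower()
    return cleaned[0] if cleaned else None
```

Under this assumption, a record holding only "S" and a record holding "Smith" agree on the FI surname field, which is how the field lets truncated surnames still contribute to the comparison.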
Address related information
The ABS Address Register provides a comprehensive list of all physical addresses in Australia and includes an Address Register ID (ARID) for each physical address. An anonymised version of ARID is used in data linkage projects. Addresses were geocoded to ARID, Mesh Block, Statistical Area Level 1 (SA1), Statistical Area Level 2 (SA2), and Statistical Area Level 4 (SA4), according to the ASGS 2016 classification. Records counted as missing address information include those with no address and those whose address could not be geocoded to the specified level of geography.
Date of Birth
For the 10.2% of records that were missing date of birth, year of birth was estimated using the person’s age on the date they completed the survey. Age data was available for every record, and the estimated year of birth is accurate to within one year (as the survey is conducted over a financial year).
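The reason the estimate is only accurate to within one year can be shown with a short sketch. Given an age in completed years and an interview date, the person's birthday may or may not have already occurred in the interview year, so two birth years are usually possible. The function name is hypothetical, and how a single year was chosen from the candidates is not stated in the paper.

```python
from datetime import date


def candidate_birth_years(age: int, interview_date: date) -> tuple:
    """Return the birth years consistent with `age` (completed years) on `interview_date`."""
    # If the birthday has already occurred this year, birth year = interview year - age.
    latest = interview_date.year - age
    # On 31 December every birthday in the calendar year has occurred, so only one year fits.
    if interview_date.month == 12 and interview_date.day == 31:
        return (latest,)
    # Otherwise the birthday may still be to come, which implies the year before.
    return (latest - 1, latest)


# Example: a respondent aged 30 who was interviewed on 15 September 2014.
years = candidate_birth_years(30, date(2014, 9, 15))
```
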
No records were missing sex; hence no data cleaning was necessary for this variable.
The linkage was completed using probabilistic linking, where records from datasets are compared and brought together using several variables common to each dataset. It consisted of seven passes, each designed to attain agreement on highly distinct information. Linkage weights were assigned based on the level of agreement of each linkage variable in each linkage pass. All potential links identified during probabilistic linkage were assessed by a decision algorithm to determine the single, best unique link to achieve a one-to-one match between the NHS data and the MADIP asset.
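The general idea can be sketched as below, using the common Fellegi-Sunter formulation of agreement weights and a simple greedy rule for one-to-one selection. Both are assumptions for illustration: the paper does not state which weight formula or decision algorithm was used, the m and u probabilities are invented, the field names are hypothetical, and the seven-pass structure is not modelled.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
M_U = {
    "first_name": (0.95, 0.01),
    "fi_surname": (0.97, 0.05),
    "year_of_birth": (0.98, 0.02),
    "sex": (0.99, 0.50),
    "mesh_block": (0.90, 0.001),
}


def field_weight(field: str, agrees: bool) -> float:
    """Fellegi-Sunter style weight: positive on agreement, negative on disagreement."""
    m, u = M_U[field]
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def pair_weight(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Sum field weights for a record pair; missing values contribute no evidence."""
    total = 0.0
    for field in fields:
        if rec_a.get(field) is None or rec_b.get(field) is None:
            continue
        total += field_weight(field, rec_a[field] == rec_b[field])
    return total


def best_one_to_one(candidates: list, cutoff: float) -> dict:
    """Greedy one-to-one selection from (weight, survey_id, spine_id) tuples.

    Pairs are taken in descending weight order; a pair is kept only if neither
    record has already been linked and its weight is at or above the cutoff.
    """
    used_survey, used_spine = set(), set()
    links = {}
    for weight, survey_id, spine_id in sorted(candidates, key=lambda c: c[0], reverse=True):
        if weight < cutoff:
            break
        if survey_id in used_survey or spine_id in used_spine:
            continue
        links[survey_id] = spine_id
        used_survey.add(survey_id)
        used_spine.add(spine_id)
    return links
```

A greedy rule is the simplest way to enforce a one-to-one result; an optimal assignment method could be substituted where competing candidates are common.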
A total of 18,287 links were achieved, giving a total linkage rate of 95.0%.
Links identified during the probabilistic linking process were assigned an estimate of precision (likelihood that the link is true). It is estimated that the precision of the linked file is 99%, with a false link rate of 1%. This is based on the calculation of the cumulative precision of probabilistic links.
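The headline figures can be checked with simple arithmetic. The linkage rate below uses the counts reported in this paper. The precision calculation is only a sketch of one plausible reading of "cumulative precision", namely the average of the link-level precision estimates; the function names and the example precision values are assumptions.

```python
NHS_RECORDS = 19_257  # persons in the 2014-15 NHS (from this paper)
LINKS = 18_287        # links achieved (from this paper)

# Linkage rate as a percentage of NHS records.
linkage_rate = 100 * LINKS / NHS_RECORDS


def file_precision(link_precisions: list) -> float:
    """Average of per-link precision estimates (assumed reading of 'cumulative precision')."""
    if not link_precisions:
        raise ValueError("no links supplied")
    return sum(link_precisions) / len(link_precisions)


def false_link_rate(link_precisions: list) -> float:
    """Expected share of links that are false: one minus the file-level precision."""
    return 1.0 - file_precision(link_precisions)
```
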
Linkage rates by Sex, Age and State/Territory are presented in Tables 3A and 3B. High linkage rates were achieved across most demographics, with lower rates achieved for younger people and those living in the Northern Territory.
Table 3A: Distribution of linkage rate by Sex and Age
Table 3B: Distribution of linkage rate by State and Territory
Using the linked data asset
Given the high linkage rates and level of precision achieved, the linked data asset is considered highly suitable for research and analysis to inform the development of health-related policies and evidence-based decisions. As such, the original survey weights calculated for the NHS can be used to give a good approximation of population estimates.
When the NHS sample is weighted to the total population, no compensation is made for Indigenous status to correct for sampling bias that may occur in Aboriginal and Torres Strait Islander population estimates. In addition, very remote areas and Discrete Aboriginal and Torres Strait Islander communities were outside the survey scope. This means that use of the analytical dataset for Aboriginal and Torres Strait Islander population research may not be appropriate and would require careful consideration.
To account for non-linkage, new weights have been developed for the linked NHS records. These adjusted weights calibrate exactly to the population estimates for the NHS and are recommended for use when producing estimates from the linked data asset.
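One simple way to see how weights for linked records can be made to reproduce the original population totals is a ratio adjustment within weighting classes, sketched below. This is an assumed illustration, not the method used to produce the adjusted weights: the function name, the record layout and the choice of weighting class are all hypothetical.

```python
from collections import defaultdict


def adjust_weights(records: list, class_key: str) -> list:
    """Scale the weights of linked records so each class total matches the full sample.

    Each record is a dict with 'weight' (original survey weight), 'linked' (bool)
    and a weighting-class value stored under `class_key`.
    """
    total = defaultdict(float)         # sum of weights for all records, by class
    linked_total = defaultdict(float)  # sum of weights for linked records, by class
    for rec in records:
        total[rec[class_key]] += rec["weight"]
        if rec["linked"]:
            linked_total[rec[class_key]] += rec["weight"]

    adjusted = []
    for rec in records:
        if not rec["linked"]:
            continue  # unlinked records are not part of the linked dataset
        cls = rec[class_key]
        factor = total[cls] / linked_total[cls]
        adjusted.append({**rec, "adjusted_weight": rec["weight"] * factor})
    return adjusted
```

With this approach the adjusted weights of linked records in each class sum to the same total as the original weights of all records in that class, so unlinked records are represented by linked records with similar characteristics.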
The 2014-15 NHS is the first ABS survey to be linked to the MADIP asset. The high linkage rate (95%) was likely due in part to the NHS data collection period in 2014-15 falling within the 2006 to 2016 reference period of the Spine. In addition, the quality of address related information and first name in the NHS was better than initially expected, which helped to achieve high linkage rates of good quality across most demographic groups.