Integration of the National Health Survey with the Multi-Agency Data Integration Project (MADIP)



The National Health Survey (NHS) is an Australia-wide health survey which collects information about the health of people, including: 

  • prevalence of long-term health conditions
  • health risk factors such as smoking, overweight and obesity, alcohol consumption and exercise
  • use of health services such as consultations with health practitioners and actions people have recently taken for their health
  • demographic and socioeconomic characteristics

The Multi-Agency Data Integration Project (MADIP) is a partnership among Australian Government agencies to combine information on healthcare, education, government payments, personal income tax, and the Census to create a comprehensive picture of Australia over time. Information in MADIP is combined by linking person-level datasets to a central linking infrastructure, or ‘Spine’, which serves as a base dataset representing the ‘ever-resident’ population of Australia. The Spine used to create the MADIP asset is made up of the Medicare Consumer Directory, Personal Income Tax data, and Social Security and Related Information data. More information on MADIP, including how the data is kept secure and confidential, is available on the ABS Website.

Integrating the NHS with the MADIP asset offers potentially rich insights into health service usage. High-level policy issues that could be informed by this linkage include:

  • the connection between people’s lifestyles, risk factors and health conditions and their use of government services
  • the impact of health status and health conditions on social and economic participation
  • the extent to which use of pharmaceuticals and medical services is consistent with appropriate pathways of care and meets clinical needs
  • patterns of healthcare service use for different patient cohorts

A Privacy Impact Assessment into linking the 2014-15 NHS to MADIP was undertaken.

This paper summarises the scope, coverage and quality of the 2014-15 NHS dataset, as well as the linking methodology used and the results achieved from the linkage process undertaken in 2018-19.

Data scope and coverage

2014-15 National Health Survey

The NHS was conducted in all states and territories from July 2014 to June 2015. It included 19,257 people (14,560 adults and 4,697 children aged 0 to 17 years) in 14,723 private dwellings.

Table 1: Scope of the NHS

| In scope | Out of scope |
| --- | --- |
| Urban and rural areas in all states and territories | Very Remote parts of Australia |
| | Discrete Aboriginal and Torres Strait Islander communities |
| Permanent residents of Australia | Persons whose usual place of residence was outside Australia |
| Overseas visitors who have been working or studying in Australia for the last 12 months or more, or intend to do so | Members of non-Australian Defence forces (and their dependents) stationed in Australia |
| Persons usually resident in a private dwelling | Non-private dwellings (e.g. motels, hotels, hospitals, nursing homes, short-stay caravan parks) |
| | Certain diplomatic personnel of overseas governments, customarily excluded from the Census and estimated resident population |
| One adult (aged 18 and over) and one child (aged 0-17) were randomly selected from each surveyed private dwelling | Visitors to private dwellings |

MADIP asset

The Spine used to create the MADIP asset consists of 32,152,764 person records and includes persons who were present in any of the following datasets:

  • Medicare Consumer Directory (MCD) during the period 2006 to 2016
  • Personal Income Tax (PIT) during the period 2010-11 to 2015-16
  • Social Security and Related Information (SSRI) during the period 2009 to 2016

All Spine person records were used in the linkage.

Quality of NHS linking variables

A number of metrics can be used to assess dataset quality, including rates of data missing from the dataset (referred to as ‘missingness’). Missingness rates were calculated for the key linking variables of first name, surname, date of birth, address related information and sex. These rates were low except for date of birth and surname (see Table 2).

The quality of the NHS linking variables is considered high given the generally low missingness rates observed for these variables. This helped to achieve a high overall linkage rate (95%) involving minimal false links. In particular, the high quality of address related information compensated for the lower quality of surname data in the linkage process.

Table 2: Missingness rates for linking variables

| Linking variable | Number of persons with missing information | Missingness rate (%) |
| --- | --- | --- |
| Date of Birth | 1,973 | 10.2 |
| First Name | 563 | 2.9 |
| Address Register ID (a) | 300 | 1.6 |
| Mesh Block (a) | 122 | 0.6 |
| SA1 (a) | 97 | 0.5 |
| SA2 (a) | 54 | 0.3 |
| SA4 (a) | 54 | 0.3 |
| State/Territory (a) | 35 | 0.2 |

(a) see Address related information below
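
The rates in Table 2 can be reproduced from the counts. The sketch below assumes the denominator is the full NHS sample of 19,257 persons, which the paper does not state explicitly; the function and variable names are illustrative only.

```python
NHS_SAMPLE_SIZE = 19_257  # persons in the 2014-15 NHS (assumed denominator)

# Counts of persons with missing information, taken from Table 2.
missing_counts = {
    "Date of Birth": 1_973,
    "First Name": 563,
    "Address Register ID": 300,
    "Mesh Block": 122,
    "SA1": 97,
    "SA2": 54,
    "SA4": 54,
    "State/Territory": 35,
}


def missingness_rate(n_missing: int, n_total: int = NHS_SAMPLE_SIZE) -> float:
    """Return the percentage of records missing a variable, to one decimal place."""
    return round(100 * n_missing / n_total, 1)


rates = {name: missingness_rate(count) for name, count in missing_counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Under that assumption, each computed rate matches the published figure to one decimal place.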

Data preparation

The variables used for linking NHS to the MADIP asset were name, address related information, date of birth and sex.


Name

Names were repaired, standardised and anonymised. Names were repaired where they closely matched entries in master name indexes, and standardised so that variations were grouped under a common name (e.g. Jess and Jessie to Jessica). Both steps help ensure consistency between datasets and maximise linkage rates. A high proportion of surnames were missing (8.2%) or consisted of only one character (19.2%). To maximise use of surname, a first initial (FI) surname field was created and used in the linkage.
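
As a rough illustration of name standardisation and the FI surname field, the Python sketch below uses a tiny hypothetical variant table in place of a master name index, and assumes the FI field is simply the first letter of the surname. The paper does not describe the actual rules, and anonymisation is not shown.

```python
from typing import Optional

# Hypothetical lookup table; a real master name index would be far larger.
NAME_VARIANTS = {"JESS": "JESSICA", "JESSIE": "JESSICA"}


def standardise_first_name(raw: Optional[str]) -> Optional[str]:
    """Upper-case, trim, and map known variants to a common name."""
    if raw is None or not raw.strip():
        return None  # treat blank values as missing
    cleaned = raw.strip().upper()
    return NAME_VARIANTS.get(cleaned, cleaned)


def surname_first_initial(raw: Optional[str]) -> Optional[str]:
    """Return the first letter of the surname, or None if it is missing."""
    if raw is None or not raw.strip():
        return None
    return raw.strip().upper()[0]
```

Under this reading, a record holding only a one-character surname produces the same FI value as a record holding the full surname, so it can still contribute to the comparison.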

Address related information

The ABS Address Register provides a comprehensive list of all physical addresses in Australia and includes an Address Register ID (ARID) for each physical address. An anonymised version of ARID is used in data linkage projects. Addresses were geocoded to ARID, Mesh Block, Statistical Area Level 1 (SA1), Statistical Area Level 2 (SA2), and Statistical Area Level 4 (SA4), according to the ASGS 2016 classification. Records counted as missing address information include those with no address and those whose address could not be geocoded to the specified level of geography.

Date of birth

For the 10.2% of records which were missing this variable, year of birth was estimated using the person’s age on the date they completed the survey. Age data was available for each record and is accurate to within one year (as the survey is conducted over a financial year).
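
The one-year uncertainty follows from simple date arithmetic: a person of a given age on the interview date may or may not have had their birthday yet that calendar year. The sketch below returns both candidate birth years; the paper does not say which value was chosen for linking, and the function name is illustrative.

```python
from datetime import date
from typing import Tuple


def candidate_birth_years(age_years: int, interview_date: date) -> Tuple[int, int]:
    """Return the two possible birth years for a person of a given age.

    Someone aged `age_years` on `interview_date` was born in
    `interview_date.year - age_years` if their birthday has already
    occurred that year, otherwise in the year before.
    """
    latest = interview_date.year - age_years
    return (latest - 1, latest)
```

For example, a respondent aged 40 when interviewed on 1 March 2015 was born in either 1974 or 1975.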


Sex

No records were missing sex, so no data cleaning was necessary for this variable.

Linkage methodology

The linkage was completed using probabilistic linking, where records from datasets are compared and brought together using several variables common to each dataset. It consisted of seven passes, each designed to attain agreement on highly distinct information. Linkage weights were assigned based on the level of agreement of each linkage variable in each linkage pass. All potential links identified during probabilistic linkage were assessed by a decision algorithm to determine the single, best unique link to achieve a one-to-one match between the NHS data and the MADIP asset.
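
The paper does not publish the weights or the decision algorithm, so the following Python sketch only illustrates the general idea under stated assumptions: Fellegi-Sunter style log-likelihood weights with hypothetical m- and u-probabilities, and a simple greedy rule for reducing candidate links to a one-to-one match. The seven passes are not modelled, and all field names and parameter values are invented for illustration.

```python
import math

# Hypothetical m- and u-probabilities per linking field:
# m = P(field agrees | records are the same person)
# u = P(field agrees | records are different people)
FIELD_PARAMS = {
    "first_name": (0.95, 0.01),
    "fi_surname": (0.97, 0.05),
    "year_of_birth": (0.98, 0.02),
    "sex": (0.99, 0.50),
    "address_id": (0.85, 0.0001),
}


def pair_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum Fellegi-Sunter log2 weights over fields present on both records."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        total += math.log2(m / u) if a == b else math.log2((1 - m) / (1 - u))
    return total


def one_to_one_links(candidates: list, threshold: float) -> dict:
    """Greedily keep the highest-weight link for each record on both sides.

    `candidates` holds (survey_id, spine_id, weight) tuples.
    """
    links, used_spine = {}, set()
    for survey_id, spine_id, weight in sorted(candidates, key=lambda c: -c[2]):
        if weight < threshold or survey_id in links or spine_id in used_spine:
            continue
        links[survey_id] = spine_id
        used_spine.add(spine_id)
    return links
```

In this sketch, agreement on a highly distinctive field such as an address identifier adds far more weight than agreement on sex, which mirrors the emphasis on highly distinct information in each pass.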

Linkage results

A total of 18,287 links were achieved, giving a total linkage rate of 95.0%.

Links identified during the probabilistic linking process were assigned an estimate of precision (likelihood that the link is true). It is estimated that the precision of the linked file is 99%, with a false link rate of 1%. This is based on the calculation of the cumulative precision of probabilistic links.
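
The headline figures can be checked with a few lines of arithmetic. The sketch below uses only numbers reported in this paper; the expected count of false links is a derived approximation, not a published figure.

```python
NHS_RECORDS = 19_257     # persons in the 2014-15 NHS
LINKED_RECORDS = 18_287  # links achieved
PRECISION = 0.99         # estimated precision of the linked file

linkage_rate = 100 * LINKED_RECORDS / NHS_RECORDS
unlinked = NHS_RECORDS - LINKED_RECORDS
# Derived approximation: share of links expected to be false.
expected_false_links = (1 - PRECISION) * LINKED_RECORDS

print(f"Linkage rate: {linkage_rate:.1f}%")
print(f"Unlinked records: {unlinked}")
print(f"Expected false links: about {expected_false_links:.0f}")
```

This gives a linkage rate of 95.0%, 970 unlinked records, and roughly 183 links expected to be false at 99% precision.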

Linkage rates by Sex, Age and State/Territory are presented in Tables 3A and 3B. High linkage rates were achieved across most demographics, with lower rates achieved for younger people and those living in the Northern Territory.

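Table 3A: Distribution of linkage rate by Sex and Age

| Age group | Total records | Linked records | Linkage rate % |
| --- | --- | --- | --- |
| Under 15 years | 3,863 | 3,570 | 92.4 |
| 15-24 years | 1,959 | 1,804 | 92.1 |
| 25-34 years | 2,486 | 2,335 | 93.9 |
| 35-44 years | 2,763 | 2,655 | 96.1 |
| 45-54 years | 2,519 | 2,426 | 96.3 |
| 55-64 years | 2,403 | 2,320 | 96.5 |
| 65-74 years | 1,915 | 1,865 | 97.4 |
| 75-84 years | 1,024 | 1,000 | 97.7 |
| 85 years and over | 325 | 312 | 96.0 |

Table 3B: Distribution of linkage rate by State and Territory

| State/Territory | Total records | Linked records | Linkage rate % |
| --- | --- | --- | --- |
| NSW | 3,275 | 3,080 | 94.0 |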

Using the linked data set

Given the high linkage rates and level of precision achieved, the linked data asset is considered highly suitable for research and analysis to inform the development of health-related policies and evidence-based decisions. As such, the original survey weights calculated for the NHS can be used to give a good approximation of population estimates.

When the NHS sample is weighted to the total population, no compensation is made for Indigenous status to correct for sampling bias that may occur in Aboriginal and Torres Strait Islander population estimates. In addition, very remote areas and Discrete Aboriginal and Torres Strait Islander communities were outside the survey scope. This means that use of the analytical dataset for Aboriginal and Torres Strait Islander population research may not be appropriate and would require careful consideration.

To account for records that could not be linked, new weights have been developed for the linked NHS records. These weights calibrate exactly to the population estimates for the NHS and are recommended for use when producing estimates from the linked data asset.
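
The paper does not describe how the adjusted weights were derived. As one simple possibility, the Python sketch below applies a ratio adjustment within groups, so that the linked records in each group carry the full original weighted total for that group. The record layout and the choice of groups are hypothetical.

```python
from collections import defaultdict


def adjust_weights(records: list) -> dict:
    """Scale linked records' weights so each group keeps its original total.

    Each record is a dict with keys `id`, `group`, `weight`, and `linked`.
    Returns {id: adjusted_weight} for linked records only.
    """
    total = defaultdict(float)         # weight of all records, per group
    linked_total = defaultdict(float)  # weight of linked records, per group
    for rec in records:
        total[rec["group"]] += rec["weight"]
        if rec["linked"]:
            linked_total[rec["group"]] += rec["weight"]

    return {
        rec["id"]: rec["weight"] * total[rec["group"]] / linked_total[rec["group"]]
        for rec in records
        if rec["linked"]
    }
```

Groups with lower linkage rates receive a larger upward adjustment, which is how an approach of this kind compensates for uneven non-linkage across demographic groups.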


The 2014-15 NHS is the first ABS survey to be linked to the MADIP asset. The high linkage rate (95%) was likely due in part to the NHS data collection period in 2014-15 falling within the 2006 to 2016 reference period of the Spine. In addition, the quality of address related information and first name in the NHS was better than initially expected, which helped to achieve high linkage rates and good-quality links across most demographic groups.

Previous catalogue number

This release previously used catalogue number 4321.0
