Construction of IHAD
This chapter describes the methods used to construct IHAD, some important technical specifications and basic outputs.
Principal Component Analysis
Principal component analysis (PCA) has been used since the first release of SEIFA to summarise Census variables related to socio-economic advantage and disadvantage. The same methodology is used to create the IHAD, modified where necessary to use binary variables. The aim of PCA is to reduce a large number of correlated variables into a smaller set of transformed variables, called "principal components". Each component is a weighted linear combination of the original variables. It is possible to extract as many components as there are variables. If the original variables are highly correlated, much of the variation can be summarised by a single principal component.
The first principal component is the weighted linear combination of variables that captures the maximum amount of variation present in the original dataset. This is calculated using the correlations between the variables. In general, variables that are strongly correlated with many others in the list will receive high weights. The first principal component is used to create the IHAD index.
The PCA used the binary candidate variables and the correlation matrix of these variables to give an indication of how significantly each variable contributes to the measurement of the unobserved latent variable of interest, namely socio-economic advantage and disadvantage. Each variable receives a loading that indicates the correlation of that variable with the index. A positive loading indicates an advantaging variable whereas a negative loading indicates a disadvantaging variable. The variables with the highest loadings are the ones that have the highest correlation with the index value.
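In standard PCA notation, the relationship between the correlation matrix, the first principal component and the loadings can be written as follows. The symbols \(R\) (the correlation matrix of the \(p\) variables), \(v_1\) (the unit-length leading eigenvector) and \(\lambda_1\) (the largest eigenvalue) are introduced here for illustration and are not taken from the source:

\(R\,v_1 = \lambda_1 v_1, \qquad \lVert v_1 \rVert = 1\)
\(L_j = \sqrt{\lambda_1}\,v_{1j}, \qquad \sum\limits_{j=1}^{p} L_j^2 = \lambda_1\)

Under this convention \(L_j\) is the correlation between variable \(j\) and the component, and \(\lambda_1 / p\) is the share of total variance that the component explains.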
Polychoric correlations were used instead of the standard Pearson correlations for the correlation matrix; this is appropriate for binary variables to ensure the correlation coefficients used in the PCA are unbiased. Using polychoric correlations is considered to be more accurate when running a PCA on discrete data such as the binary variables used in the IHAD.
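For two binary variables the polychoric correlation reduces to the tetrachoric correlation. The sketch below is one simple way to estimate it and is not necessarily the estimator used for IHAD. It assumes each binary variable is a thresholded standard normal variable and finds the latent correlation that reproduces the observed share of households with both indicators equal to 0. The function names and the 2x2 counts are hypothetical.

```python
import math
from statistics import NormalDist

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    one_minus_r2 = 1.0 - r * r
    exponent = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
    return math.exp(exponent) / (2.0 * math.pi * math.sqrt(one_minus_r2))

def bvn_cdf(h, k, rho, steps=200):
    """P(X <= h, Y <= k) for a standard bivariate normal with correlation rho.

    Uses the identity d/d(rho) Phi2(h, k; rho) = phi2(h, k; rho) and integrates
    the density from 0 to rho with Simpson's rule (steps must be even).
    """
    nd = NormalDist()
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return nd.cdf(h) * nd.cdf(k) + total * width / 3.0

def tetrachoric(n00, n01, n10, n11):
    """Latent correlation for a 2x2 table; n_xy counts households with x and y."""
    n = n00 + n01 + n10 + n11
    nd = NormalDist()
    h = nd.inv_cdf((n00 + n01) / n)  # threshold implied by P(x = 0)
    k = nd.inv_cdf((n00 + n10) / n)  # threshold implied by P(y = 0)
    target = n00 / n                 # observed P(x = 0, y = 0)
    lo, hi = -0.999, 0.999
    for _ in range(60):              # bisection: the CDF increases with rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def pearson_phi(n00, n01, n10, n11):
    """Pearson correlation of the two binary variables (phi coefficient)."""
    num = n11 * n00 - n10 * n01
    den = math.sqrt((n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11))
    return num / den

# Hypothetical counts, chosen only to illustrate the gap between the two measures.
table = (200, 100, 100, 200)
print(round(tetrachoric(*table), 3), round(pearson_phi(*table), 3))
```

For these made-up counts the Pearson coefficient is noticeably smaller in magnitude than the latent correlation, which is the attenuation that motivates using polychoric correlations for binary inputs.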
The candidate variables listed in Description of candidate IHAD variables were used in the PCA for the IHAD and removed if the absolute value of their loading was less than or equal to 0.3, on the grounds that they were not particularly strong indicators of advantage or disadvantage. This process was performed iteratively, until all of the variables had a loading above 0.3 in absolute value. This is the same procedure used to create the SEIFA. The final variables and their loadings following this process are presented in the Technical details for IHAD: variables and loadings.
The first principal component scores were derived by taking the product of each standardised variable with its respective weight, then taking the sum. For convenience and consistency with the approach taken for SEIFA, these raw component scores were then standardised to a mean of 1,000 and a standard deviation of 100 to produce the index.
The sign of the PCA weights is arbitrary, but intuitively we want more disadvantaged households to have lower scores. For example, NOCAR is a disadvantage variable and so should have a negative weight. The weights were multiplied by -1 to give advantage indicators positive weights and loadings, and disadvantage indicators negative weights and loadings. Accordingly, high index scores indicate relative advantage, and low index scores indicate relative disadvantage.
Step-by-step process
With the preceding two sections providing context, a step-by-step process for constructing IHAD is presented below:
1: Creating the initial variable list
Given the data available, we created a list of variables related to the definition of relative household socio-economic advantage and disadvantage.
2: Removing households with 10+ missing responses and imputing missing responses
We applied the IHAD scope to the dataset, and then identified households with 10 or more applicable missing responses. We removed these households from the dataset, imputed missing responses for most of the required variables, and then applied Hotdeck imputation for HIED and HEAP to create the dataset we used to construct the candidate variables.
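The two operations in this step can be sketched as follows. This is a generic illustration only: the source does not describe the donor classes or selection rules used, so the record layout, the `region` donor class, the category labels and the random donor choice are all assumptions.

```python
import random

def drop_sparse_households(households, max_missing=9):
    """Keep households with fewer than 10 missing responses (None = missing)."""
    return [
        h for h in households
        if sum(v is None for v in h["responses"].values()) <= max_missing
    ]

def hot_deck(households, field, class_key, rng):
    """Fill missing `field` values from a random donor in the same class.

    Donor values are collected before any imputation, so imputed values
    are never reused as donors. Records are updated in place.
    """
    donors = {}
    for h in households:
        value = h["responses"][field]
        if value is not None:
            donors.setdefault(h[class_key], []).append(value)
    for h in households:
        if h["responses"][field] is None and donors.get(h[class_key]):
            h["responses"][field] = rng.choice(donors[h[class_key]])
    return households

# Hypothetical records: household C has 10 missing responses and is dropped.
sparse = {f"Q{i}": None for i in range(10)}
sparse["HEAP"] = "Year 12"
households = [
    {"id": "A", "region": "R1", "responses": {"HEAP": "Bachelor", "Q0": 1}},
    {"id": "B", "region": "R1", "responses": {"HEAP": None, "Q0": 0}},
    {"id": "C", "region": "R2", "responses": sparse},
]
kept = drop_sparse_households(households)
kept = hot_deck(kept, "HEAP", "region", random.Random(0))
print([(h["id"], h["responses"]["HEAP"]) for h in kept])
```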
3: Constructing the variables
We created binary indicators from household, family, and person level variables. These indicators take a value of 1 if the characteristic is present, and 0 if it is not.
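As a minimal sketch of this step, the function below derives four of the indicators whose definitions appear in the variable tables (NOCAR, HIGHCAR, FEWBED, HIGHBED). The input field names `cars` and `bedrooms` are hypothetical, and the handling of missing or out-of-scope values is left out.

```python
def build_indicators(household):
    """Return 0/1 indicators for one household record."""
    return {
        "NOCAR": int(household["cars"] == 0),         # no car
        "HIGHCAR": int(household["cars"] >= 3),       # three or more cars
        "FEWBED": int(household["bedrooms"] <= 1),    # one or no bedrooms
        "HIGHBED": int(household["bedrooms"] >= 4),   # four or more bedrooms
    }

# Hypothetical household records.
records = [
    {"id": "H1", "cars": 0, "bedrooms": 1},
    {"id": "H2", "cars": 3, "bedrooms": 4},
    {"id": "H3", "cars": 1, "bedrooms": 3},
]
for record in records:
    print(record["id"], build_indicators(record))
```

The third record shows a household for which all four indicators are 0, since it has no characteristic at either extreme.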
4: Removing very highly correlated variables
We removed highly correlated variables to avoid over-representing any specific socio-economic characteristic. When two variables had a correlation coefficient greater than 0.8 in absolute value and were measuring conceptually similar aspects of advantage or disadvantage, we generally removed one of them. However, we applied some discretion, depending on the variables in question and the size of the correlation.
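The screening part of this step can be sketched as a scan of the correlation matrix for pairs above the 0.8 threshold. The code only flags candidate pairs; deciding which variable to drop, or whether to keep both, remains a judgement as described above. Only the 0.83 value between DISABILITY_OVER70 and DISABILITY_HH_PROP is quoted in this document; the other correlations are invented for illustration.

```python
def highly_correlated_pairs(names, corr, threshold=0.8):
    """List variable pairs whose correlation exceeds the threshold in absolute value."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) > threshold:
                flagged.append((names[i], names[j], corr[i][j]))
    return flagged

names = ["DISABILITY_OVER70", "DISABILITY_HH_PROP", "NOCAR"]
corr = [
    [1.00, 0.83, 0.30],  # 0.30 and 0.35 are made-up values
    [0.83, 1.00, 0.35],
    [0.30, 0.35, 1.00],
]
for first, second, value in highly_correlated_pairs(names, corr):
    print(f"review: {first} vs {second} (r = {value})")
```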
5: Conducting the initial PCA
We conducted principal component analysis (PCA) using the binary candidate variables and the correlation matrix of these variables, to obtain the loading for each variable on the first principal component.
6: Removing low loading variables
We excluded variables with loadings less than 0.3 in absolute value, on the grounds that they were not strong indicators of relative advantage or disadvantage. This limit is an accepted level in the PCA literature and has been used in past releases of SEIFA and IHAD. We removed variables one at a time, starting with the lowest loading variable.
7: Conducting PCA on the reduced list of variables
We conducted a PCA on the reduced variable list, and if any other variables loaded below 0.3, we repeated steps six and seven.
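Steps five to seven form a loop, which can be sketched as below. This is an illustration under stated assumptions, not the production method: power iteration stands in for a full eigen-decomposition from a statistical package, the variable names and correlation matrix are invented, and the cut-off follows step six (remove loadings below 0.3 in absolute value).

```python
import math

def first_pc_loadings(corr, iterations=500):
    """Loadings on the first principal component of a correlation matrix.

    Power iteration from the first unit vector; this assumes the first
    variable has a non-zero loading on the leading component.
    """
    p = len(corr)
    vec = [1.0] + [0.0] * (p - 1)
    eigenvalue = 1.0
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        eigenvalue = math.sqrt(sum(x * x for x in nxt))
        vec = [x / eigenvalue for x in nxt]
    return [math.sqrt(eigenvalue) * v for v in vec]

def prune_low_loadings(names, corr, threshold=0.3):
    """Drop the weakest variable one at a time until all |loadings| >= threshold."""
    names = list(names)
    corr = [row[:] for row in corr]
    while len(names) > 1:
        loadings = first_pc_loadings(corr)
        weakest = min(range(len(names)), key=lambda i: abs(loadings[i]))
        if abs(loadings[weakest]) >= threshold:
            break
        del names[weakest]
        corr = [
            [value for j, value in enumerate(row) if j != weakest]
            for i, row in enumerate(corr) if i != weakest
        ]
    return names, first_pc_loadings(corr)

# Invented correlation matrix: VAR_D is only weakly related to the others.
names = ["VAR_A", "VAR_B", "VAR_C", "VAR_D"]
corr = [
    [1.00, 0.60, -0.50, 0.10],
    [0.60, 1.00, -0.60, 0.05],
    [-0.50, -0.60, 1.00, -0.10],
    [0.10, 0.05, -0.10, 1.00],
]
kept, loadings = prune_low_loadings(names, corr)
print(kept, [round(value, 2) for value in loadings])
```

Re-running the PCA after each removal matters because the loadings of the remaining variables change once a variable leaves the list.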
8: Calculating and standardising component/index scores
We derived the first principal component scores for each household by taking the product of each variable with its respective weight, then taking the sum across all variables. Note that the weight for each variable was calculated by dividing the loading by the square root of the eigenvalue.
\(Z_{HH} = \sum\limits_{j=1}^{p} \frac{L_j}{\sqrt{\lambda}} \times X_{j,HH}\)
where,
\(Z_{HH}\) = raw score for the household
\(X_{j,HH}\) = standardised value of the j-th variable for the household
\(L_j\) = loading for the j-th variable
\(\lambda\) = eigenvalue of the principal component
\(p\) = total number of variables in the index
For convenience of presentation, we then rescaled the raw scores to a mean of 1,000 and standard deviation of 100 to create a new set of scores that are the household index scores in IHAD.
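The scoring and rescaling described in this step can be sketched as follows. The loadings, eigenvalue and standardised values are hypothetical, and the use of the population standard deviation in the rescaling is an assumption, since the source does not say which form was used.

```python
import math
from statistics import mean, pstdev

def raw_score(standardised_values, loadings, eigenvalue):
    """Sum of weight * standardised value, with weight = loading / sqrt(eigenvalue)."""
    root = math.sqrt(eigenvalue)
    return sum((l / root) * x for l, x in zip(loadings, standardised_values))

def rescale(raw_scores, target_mean=1000.0, target_sd=100.0):
    """Linearly rescale raw scores to the target mean and standard deviation."""
    mu = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: population standard deviation
    return [target_mean + target_sd * (z - mu) / sd for z in raw_scores]

# Hypothetical two-variable index: a correlation of -0.62 gives an eigenvalue
# of 1.62 and loadings of +0.9 and -0.9 on the first component.
loadings = [0.9, -0.9]
eigenvalue = 1.62
households = [[1.0, -1.0], [0.2, 0.1], [-0.5, 0.8], [-1.3, 1.1]]  # standardised values
raw = [raw_score(values, loadings, eigenvalue) for values in households]
index_scores = rescale(raw)
print([round(score) for score in index_scores])
```

Because the rescaling is linear, it changes the units of the scores but not the ordering of households.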
Note that the principal components are arbitrary with respect to their sign (positive or negative), so we set the sign of the weights and loadings so that they make intuitive sense. That is, we gave advantage indicators positive weights and loadings, and disadvantage indicators negative weights and loadings. Accordingly, high scores indicate relative advantage, and low scores indicate relative disadvantage. This is consistent with previous editions of SEIFA and IHAD.
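One way to fix the sign in code is to pick a reference variable whose direction is known and flip every loading if that variable comes out with the wrong sign. The sketch below uses NOCAR as the reference because it is a disadvantage indicator; the function name is hypothetical, and the input values are the published magnitudes for NOCAR and DEGREE with their signs reversed to mimic an unoriented PCA output.

```python
def orient_signs(names, loadings, reference="NOCAR"):
    """Flip all loadings if the reference disadvantage variable is positive."""
    if loadings[names.index(reference)] > 0:
        return [-value for value in loadings]
    return list(loadings)

names = ["NOCAR", "DEGREE"]
unoriented = [0.61, -0.76]
print(orient_signs(names, unoriented))
```

The same flip would be applied to the weights, since each weight is the loading divided by a positive constant.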
Technical details for IHAD: variables and loadings
This section gives the results of the principal component analysis carried out for IHAD, including variable loadings and percentage of variance explained. A list of variables initially considered for inclusion but removed due to high correlations with other variables or weak loadings is also provided.
IHAD summarises variables that indicate either relative socio-economic advantage or disadvantage, according to the concept described in Defining the concept behind IHAD. The final IHAD variables and loadings are listed below.
IHAD variables and loadings
IHAD indicators of disadvantage
The following variables are indicators of disadvantage. PUBLIC_RENT is the strongest indicator of disadvantage in the index.
Variable | Description | Loading |
---|---|---|
PUBLIC_RENT | Households being rented from a state or territory housing authority, or a housing co-operative/community/church group (disadvantage) | -0.84 |
LOWRENT | Households where rent payments are less than $250 per week, excluding employer landlords (excludes $0) (disadvantage) | -0.81 |
INC_LOW | Households with low annual equivalised income (between $1 and $25,999) (disadvantage) | -0.71 |
NOYEAR11_OR_HIGHER | Households where the person with the highest educational attainment left school at year 10 or below, including those who did not go to school and with Certificate level I or II (excludes those currently studying secondary education) (disadvantage) | -0.69 |
NOCAR | Households with no car (disadvantage) | -0.61 |
RETIRED_NOT_OWNED | Households with a person aged 65 years and over who does not own the home, or occupy it under a life tenure scheme (disadvantage) | -0.59 |
DISABILITY_HH_PROP | Households where more than 50% of people need assistance with core activities (disadvantage) | -0.55 |
NOYEAR12_DEPENDENT | Households with at least one dependent child and the person with the highest educational attainment left school at year 11 or below, including those who did not go to school and with Certificate level I or II (excludes those currently studying secondary education) (disadvantage) | -0.54 |
FEWBED | Households with one or no bedrooms (disadvantage) | -0.45 |
ALL_UNEMPLOYED | Households where all people aged 15 years and over are unemployed (disadvantage) | -0.44 |
YEAR11 | Households where the person with the highest educational attainment left school at year 11 (excludes those currently studying secondary education) (disadvantage) | -0.41 |
CHILDJOBLESS | Households with children aged under 15 years and parent(s) not employed (disadvantage) | -0.35 |
IHAD indicators of advantage
The following variables are indicators of advantage. DEGREE_DEPENDENT is the strongest indicator of advantage in the index.
Variable | Description | Loading |
---|---|---|
HIGHCAR | Households with three or more cars (advantage) | 0.43 |
HIGHBED | Households with four or more bedrooms (advantage) | 0.50 |
INC_HIGH | Households with high annual equivalised income (greater than $90,999) (advantage) | 0.68 |
PURCHASED | Households being purchased (advantage) | 0.75 |
DEGREE | Households where the person with the highest educational attainment has a Bachelor Degree or above (advantage) | 0.76 |
HIGH_SKILL | Households where the highest skilled employed adult works in a skill level 1 occupation (advantage) | 0.78 |
HIGHMORTGAGE | Households where mortgage repayments are greater than or equal to $2,900 per month (advantage) | 0.79 |
DEGREE_DEPENDENT | Households with at least one dependent child and the person with the highest educational attainment has a Bachelor Degree or above (advantage) | 0.81 |
The 2021 IHAD index explains 41.4% of the total variance of the variables in the final variable list. The Experimental IHAD 2016 explained 43.2% of this total variance.
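The variance explained can be approximately reproduced from the published loadings. For a PCA on a correlation matrix, the sum of squared loadings on a component equals its eigenvalue, and the share of variance explained is that eigenvalue divided by the number of variables. The sketch below applies this to the 20 loadings in the tables above; because those loadings are rounded to two decimal places, the result only matches the published figure approximately.

```python
# Final IHAD loadings as published in the tables above (rounded to 2 d.p.).
loadings = {
    "PUBLIC_RENT": -0.84, "LOWRENT": -0.81, "INC_LOW": -0.71,
    "NOYEAR11_OR_HIGHER": -0.69, "NOCAR": -0.61, "RETIRED_NOT_OWNED": -0.59,
    "DISABILITY_HH_PROP": -0.55, "NOYEAR12_DEPENDENT": -0.54, "FEWBED": -0.45,
    "ALL_UNEMPLOYED": -0.44, "YEAR11": -0.41, "CHILDJOBLESS": -0.35,
    "HIGHCAR": 0.43, "HIGHBED": 0.50, "INC_HIGH": 0.68, "PURCHASED": 0.75,
    "DEGREE": 0.76, "HIGH_SKILL": 0.78, "HIGHMORTGAGE": 0.79,
    "DEGREE_DEPENDENT": 0.81,
}

# Eigenvalue of the first component = sum of squared loadings.
eigenvalue = sum(value * value for value in loadings.values())

# Total variance of p standardised variables is p.
variance_explained = eigenvalue / len(loadings)

print(f"eigenvalue = {eigenvalue:.2f}, variance explained = {variance_explained:.1%}")
```

The result is within rounding error of the 41.4% reported for the 2021 index.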
Removal of highly correlated variables
In most cases, highly correlated variables were removed from the initial candidate list. This was done to prevent instability in the variable weights and over-representation of any specific socio-economic characteristic. When two variables had a correlation coefficient greater than 0.8 in absolute value, one of them was generally removed. However, if they were deemed to be measuring different socio-economic characteristics (e.g. education and occupation), both were retained.
Variable description | Reason for exclusion |
---|---|
Households with one or more people aged 15 years and over who are unemployed (UNEMPLOYED) (disadvantage) | Highly correlated with ALL_UNEMPLOYED which highlights disadvantaged households better |
Households with one or more people aged 70 years and over who need assistance with core activities (DISABILITY_OVER70) (disadvantage) | Highly correlated with DISABILITY_HH_PROP (0.83) and not as representative of the total population |
Households where all people aged 15 years and over have no educational attainment (NOEDU) (disadvantage) | Small prevalence and highly correlated with NOYEAR11_OR_HIGHER |
Removal of low loading variables
The following variables were initially considered for the index but were excluded when the analysis showed that they were weak indicators of relative advantage or disadvantage.
Variable | Variable description |
---|---|
OVERCROWD | Households requiring one or more extra bedrooms (based on Canadian National Occupancy Standard) (disadvantage) |
SPAREBED | Households with one or more bedrooms spare (based on Canadian National Occupancy Standard) (advantage) |
OTHER_HHLD | Households with a structure classified as "other" (e.g. caravan, tent) (disadvantage) |
MULTI_FAMILY | Multi-family households (advantage) |
HIGHRENT | Households where rent payments are more than $500 per week (advantage) |
OWNED | Households owned outright (advantage) |
ONEPARENT | Households with a one-parent family, with dependent children only (disadvantage) |
CERTIFICATE | Households where the person with the highest educational attainment has a Certificate III or IV (advantage) |
DIPLOMA | Households where the person with the highest educational attainment has an Advanced Diploma or Diploma (advantage) |
SKILL_LVL_2 | Households where the highest skilled employed adult works in a skill level 2 occupation (advantage) |
SKILL_LVL_4 | Households where the highest skilled employed adult works in a skill level 4 occupation (disadvantage) |
LOW_SKILL | Households where the highest skilled employed adult works in a skill level 5 occupation (disadvantage) |
ALL_SHORT_DISTANCE | Households where all people aged 15 years and over who are employed, travel 0 to less than 2.5 km to work (advantage) |
ALL_LONG_DISTANCE | Households where all people aged 15 years and over who are employed travel 50 to less than 250 km to work (disadvantage) |
ALL_VLONG_DISTANCE | Households where all people aged 15 years and over who are employed travel 250 or more km to work (disadvantage) |
SEP_DIVORCED | Households with one or more people aged 15 years and over separated or divorced (disadvantage) |
ENGPOOR | Households with one or more people aged 15 years and over who do not speak English well (disadvantage) |
ROM | Households with one or more people aged 15 years and over who arrived in Australia in the last 10 years (disadvantage) |
UNENGAGED_YOUTH | Households with one or more people aged between 15 and 24 years who are not working or studying (disadvantage) |
CARER | Households with one or more people aged 15 years and over who provide unpaid assistance to a person with a disability (disadvantage) |
VOLUNTEER | Households with one or more people aged 15 years and over who do voluntary work for an organisation or group (advantage) |
Distribution of the IHAD
This section presents the frequency histogram of IHAD scores. The IHAD distributions have generally similar shapes to those from Experimental IHAD 2016.
The scores for IHAD range from 613 to 1,246. The table below presents the minimum and maximum scores of each IHAD quartile. These show that there is sufficient variation in the IHAD scores to allow for the formation of these groups.
Some households will not have any indicators of advantage or disadvantage (i.e. their values for the final binary candidate variables are all 0). They will still receive an IHAD score reflecting the middle of the IHAD score distribution, which places them in quartile 2.
Household index group | Number of households* (frequency) | Number of households* (percentage) | Household index score (minimum) | Household index score (maximum) |
---|---|---|---|---|
1 | 2,307,765 | 25.0 | 613 | 943 |
2 | 2,308,638 | 25.0 | 943 | 992 |
3 | 2,338,786 | 25.3 | 992 | 1,070 |
4 | 2,275,872 | 24.7 | 1,070 | 1,246 |
* The total number of in-scope households assigned an IHAD score is 9,231,061