Construction of the indexes

Socio-Economic Indexes for Areas (SEIFA): Technical Paper
Reference period: 2021
Released: 27/04/2023

This chapter describes the methods used to construct the indexes, some important technical specifications of each index, and some basic outputs.

Principal Component Analysis

Each index is a weighted sum of SEIFA variables. As with past versions of SEIFA, principal component analysis (PCA) is used to determine the weights. This section introduces some technical concepts related to PCA to help the reader understand the SEIFA index construction process. Some references are given at the end of this section for readers interested in a comprehensive discussion of PCA.

PCA is a technique that involves summarising a large number of correlated variables into a set of new uncorrelated components, each of which is a linear combination of the original variables. There are as many principal components as there are variables. If the original variables are highly correlated, much of the variation can be summarised by a reduced set of components, enabling easier analysis. The first principal component accounts for the largest proportion of variance in the original dataset, with each following component explaining less of the variance. The principal component used for each SEIFA index is the one that can be interpreted as best explaining the variation in the concept of advantage and disadvantage for that index. As in SEIFA 2016, the first principal component was used to create each of the four indexes.

The PCA procedure gives an eigenvalue for each component, which indicates the amount of variance in the original data explained by the component. The proportion of variance explained by a principal component is its eigenvalue divided by the sum of all the eigenvalues. The 'loading' for a variable is calculated by multiplying the eigenvector by the square root of the eigenvalue. It gives a measure of the strength of the relationship between the variable and the component, though it should be noted that some sources use different definitions for the loadings and weights in PCA. The loadings are also useful in comparing results obtained from different sets of original variables (such as for the four indexes in SEIFA). Loadings for each index are presented in the following sections.

To generate the component scores (otherwise known as raw scores), each loading is converted to a weight by dividing it by the square root of the eigenvalue. The products of the weights and the standardised variable values are then summed to produce the raw scores. The raw scores for each component will then have variance equal to the eigenvalue for that component. We then rescale the raw scores to a mean of 1,000 and standard deviation of 100 to create the index scores in SEIFA; this process is known as "standardisation".
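The relationships between eigenvalue, eigenvector, loading and weight described above can be illustrated with a minimal Python sketch. The three-variable correlation matrix is made up for illustration, and power iteration is used only to keep the example free of external libraries; it is not claimed to be the routine used to produce SEIFA.

```python
import math

def first_component(corr, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix,
    approximated by power iteration (an illustrative choice only)."""
    p = len(corr)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(vec[i] * corr[i][j] * vec[j]
                     for i in range(p) for j in range(p))
    return eigenvalue, vec

# Hypothetical correlation matrix for three made-up variables.
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

eigenvalue, eigenvector = first_component(corr)

# The eigenvalues of a correlation matrix sum to the number of variables,
# so the proportion of variance explained is eigenvalue / p.
proportion_explained = eigenvalue / len(corr)

# Loading = eigenvector element multiplied by sqrt(eigenvalue).
loadings = [e * math.sqrt(eigenvalue) for e in eigenvector]

# Weight = loading divided by sqrt(eigenvalue).
weights = [l / math.sqrt(eigenvalue) for l in loadings]

# Variance of the raw scores implied by these weights: w' C w.
raw_score_variance = sum(weights[i] * corr[i][j] * weights[j]
                         for i in range(len(corr)) for j in range(len(corr)))
```

Under these definitions the weights coincide with the eigenvector elements, which is why the raw scores have variance equal to the eigenvalue.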

More detailed explanations of PCA can be found in Jolliffe (1986) and O’Rourke (2005).

Areas with no SEIFA score

Some SA1 areas do not receive an index score, either due to low populations or poor-quality data. The criteria used to identify these areas are called ‘exclusion rules’. SEIFA 2021 uses a similar exclusion rule framework to SEIFA 2016, with the aim of obtaining a reliable index score for as many areas as possible.

The 2021 exclusion rules use a two-phase approach. The first phase excludes areas (SA1s) that should not receive a SEIFA score because of the type of area, confidentiality or reliability concerns (e.g. low population or low response rates for particular key variables). The second phase excludes areas (SA1s) by looking specifically at the variables included in each index. For each SA1, if any of the variables have a low denominator count, it is deemed that there is not enough data to support a reliable calculation of an index score for that area.

• The first phase rules are applied before PCA, whereas the second phase rules are applied following the PCA when the list of variables has been finalised. The step-by-step process provides details on how this is implemented.
• SA1s excluded in the first phase will be excluded for all four indexes. The number of SA1s excluded in the second phase may be different for each index, because they have different sets of variables.
• Following on from the point above, an area can receive a score for one index and not another depending on the make-up of its variables.
• The low denominator cut-off of six is chosen based on past practice and a judgement on how many responses are required to calculate a reliable value for an area.
• The exclusion of areas is based on the confidentialised counts for each SEIFA variable to ensure the confidentiality of respondents is upheld and the reliability of the indexes is maintained.

The specific exclusion rules and the number of areas meeting each rule are summarised in the table below. Note that areas might fall into multiple categories, which is why the column sum does not equal the final total number of excluded areas.

The proportions of excluded SA1s are similar to those for SEIFA 2016.

Summary of excluded areas - first phase

| Exclusion criteria | Total SA1s excluded |
| --- | --- |
| Population = 0 | 1,357 |
| Offshore, Shipping SA1 | 24 |
| Population > 0 and ≤ 10 | 554 |
| Employed persons ≤ 5 | 2,079 |
| Classifiable(a) occupied private dwellings ≤ 5 | 2,118 |
| People in private dwellings ≤ 20% | 1,741 |
| Total excluded due to any of the rules above | 2,412 |

(a) These are dwellings where the type of household living in the dwelling could be determined during the collection process. For more information, refer to the 2021 Census Dictionary.

Summary of excluded areas - second phase

| Index | Total SA1s excluded |
| --- | --- |
| IRSD | 150 |
| IER | 127 |
| IEO | 20 |

Step-by-step process

With the preceding two sections providing context, a step-by-step process for constructing the indexes is presented below.

1: Creating the initial variable list

Given the data available, we created a list of variables related to our definition of relative socio-economic advantage and disadvantage.

2: Constructing the variables

We created all variables as proportions at the SA1 level (e.g. ‘percent of people aged 15 years and over attending secondary school’). We then standardised these proportions to a mean of zero and a standard deviation of one. The standardisation was used to prevent variables with larger prevalence, or larger ranges, from having a disproportionate influence on the index.
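As a minimal sketch of this step, the Python below builds one variable as a percentage and standardises it. The counts are invented, and the use of the population standard deviation (rather than the sample version) is an assumption made for the example.

```python
import statistics

def standardise(values):
    """Rescale values to mean 0 and standard deviation 1.
    Uses the population standard deviation (an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical numerator and denominator counts for four SA1s.
numerators = [12, 30, 45, 8]
denominators = [100, 120, 150, 80]

# Each variable is expressed as a per cent of its denominator population.
proportions = [100 * n / d for n, d in zip(numerators, denominators)]

standardised = standardise(proportions)
```

After this transformation every variable is on the same scale, so a variable with a naturally wide range contributes no more to the correlation structure than one with a narrow range.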

3: Applying first phase exclusion rules

We excluded areas (SA1s) that should not receive an index score because of the type of area, confidentiality, or reliability concerns.
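The first phase criteria listed in the summary of excluded areas can be expressed as simple checks on each SA1 record. The sketch below is illustrative only: the field names and sample records are hypothetical, and the percentage of people in private dwellings is assumed to be precomputed.

```python
def first_phase_reasons(sa1):
    """Return the first phase exclusion rules that an SA1 record meets."""
    reasons = []
    if sa1["population"] == 0:
        reasons.append("Population = 0")
    elif sa1["population"] <= 10:
        reasons.append("Population > 0 and <= 10")
    if sa1["offshore_or_shipping"]:
        reasons.append("Offshore, Shipping SA1")
    if sa1["employed_persons"] <= 5:
        reasons.append("Employed persons <= 5")
    if sa1["classifiable_dwellings"] <= 5:
        reasons.append("Classifiable occupied private dwellings <= 5")
    if sa1["pct_in_private_dwellings"] <= 20:
        reasons.append("People in private dwellings <= 20%")
    return reasons

# Hypothetical SA1 records (field names are assumptions).
sa1_records = {
    "A": {"population": 420, "offshore_or_shipping": False,
          "employed_persons": 210, "classifiable_dwellings": 160,
          "pct_in_private_dwellings": 98.0},
    "B": {"population": 8, "offshore_or_shipping": False,
          "employed_persons": 3, "classifiable_dwellings": 4,
          "pct_in_private_dwellings": 100.0},
    "C": {"population": 300, "offshore_or_shipping": False,
          "employed_persons": 150, "classifiable_dwellings": 12,
          "pct_in_private_dwellings": 15.0},
}

# An SA1 is excluded if it meets any rule; it may meet several at once.
excluded = {}
for code, record in sa1_records.items():
    reasons = first_phase_reasons(record)
    if reasons:
        excluded[code] = reasons
```

Because one area can meet several rules, counting areas per rule and summing the counts would overstate the number of distinct excluded areas.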

4: Calculating the correlation matrix

We set to missing any variables that had denominators less than our prescribed cut-off of six. Note that we did not exclude areas based on this cut-off at this stage in the process; this occurred at step nine. We calculated the correlation matrix using pairwise deletion where areas (observations) contained missing values. Pairwise deletion is a method for dealing with missing data in which each correlation is calculated from all areas that have non-missing values for both variables in the pair. This contrasts with listwise deletion, in which entire records (areas in our case) are removed from the analysis if any of their variables have missing values. Given the number of observations in our dataset and the low prevalence of missing values, the use of pairwise deletion had very little impact on the correlation matrix. However, it did provide a convenient way of implementing our second phase exclusion rules (refer to step nine).
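A minimal Python sketch of this step is given below, assuming missing values are represented by `None`. The data are invented; only the cut-off of six comes from the text.

```python
import math

def mask_low_denominators(values, denominators, cutoff=6):
    """Set a value to missing (None) when its denominator is below the cut-off."""
    return [v if d >= cutoff else None for v, d in zip(values, denominators)]

def pairwise_corr(x, y):
    """Pearson correlation using only areas where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for two variables across five SA1s.
x_raw = [1.0, 2.0, 3.0, 4.0, 5.0]
x_denominators = [50, 3, 40, 60, 70]   # second SA1 has a denominator below 6
y = [2.0, 9.0, 6.0, 8.0, 10.0]

x = mask_low_denominators(x_raw, x_denominators)

# The second SA1 is skipped for this pair only; it can still contribute
# to correlations between other variables where its values are present.
r_pairwise = pairwise_corr(x, y)
r_unmasked = pairwise_corr(x_raw, y)
```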

5: Removing very highly correlated variables

We removed highly correlated variables to avoid over-representing any specific socio-economic characteristic. When two variables had a correlation coefficient greater than 0.8 in absolute value and were measuring conceptually similar aspects of advantage or disadvantage, we generally removed one of them. However, we applied some discretion, depending on the variables in question and the size of the correlation.
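The screening part of this step can be sketched as follows. The two DEGREE correlations are those reported for the IEO later in this chapter; the third value is invented for illustration. The code only flags candidate pairs, since the decision to remove a variable also depended on conceptual similarity and discretion.

```python
def highly_correlated_pairs(correlations, cutoff=0.8):
    """Return variable pairs whose correlation exceeds the cut-off in absolute value."""
    return [pair for pair, r in correlations.items() if abs(r) > cutoff]

correlations = {
    ("DEGREE", "NOYR12ORHIGHER"): -0.83,      # reported for the IEO
    ("DEGREE", "OCC_SKILL1"): 0.82,           # reported for the IEO
    ("NOYR12ORHIGHER", "OCC_SKILL1"): -0.75,  # hypothetical value
}

flagged = highly_correlated_pairs(correlations)
```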

6: Conducting the initial PCA

Using the correlation matrix, we conducted principal component analysis (PCA) to obtain the loading for each variable on the first principal component.

7: Removing variables with low loadings

We excluded variables with loadings less than 0.3 in absolute value, on the grounds that they were not strong indicators of relative advantage or disadvantage. This cut-off is commonly accepted in the PCA literature and has been used in past releases of SEIFA. We removed variables one at a time, starting with the lowest loading variable.

8: Conducting PCA on the reduced list of variables

We conducted a PCA on the reduced variable list, and if any other variables loaded below 0.3, we repeated steps seven and eight.
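The repeat-until-stable loop in steps seven and eight can be sketched in Python as below. The function `compute_loadings` stands in for a PCA run and is hypothetical, as are the variable names and loadings in the toy example.

```python
def prune_low_loadings(variables, compute_loadings, cutoff=0.3):
    """Remove the weakest variable one at a time, re-running the PCA
    (via compute_loadings) until every loading is at least the cut-off."""
    kept = list(variables)
    removed = []
    while kept:
        loadings = compute_loadings(kept)
        weakest = min(kept, key=lambda name: abs(loadings[name]))
        if abs(loadings[weakest]) >= cutoff:
            break
        removed.append((weakest, loadings[weakest]))
        kept.remove(weakest)
    return kept, removed

def toy_loadings(names):
    """Stand-in for a PCA: loadings shift once variable D is dropped."""
    loadings = {"A": -0.80, "B": 0.70, "C": -0.28, "D": 0.10}
    if "D" not in names:
        loadings["C"] = -0.35
    return {name: loadings[name] for name in names}

kept, removed = prune_low_loadings(["A", "B", "C", "D"], toy_loadings)
```

In the toy example, C starts below the cut-off but is retained because its loading rises above 0.3 once D has been removed. This is one reason for removing variables one at a time rather than all at once.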

9: Finalising the list of variables in the index and applying second phase exclusion rules

After the final list of variables in the index was determined, we excluded any SA1s that had denominators less than our prescribed cut-off of six for any of the variables on the final variable list.
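A minimal sketch of the second phase rule follows. The SA1 codes, denominator counts and the two variable lists are hypothetical; only the cut-off of six comes from the text.

```python
def second_phase_excluded(denominators_by_sa1, final_variables, cutoff=6):
    """Return SA1s with a denominator below the cut-off for any final variable."""
    return [
        sa1
        for sa1, denominators in denominators_by_sa1.items()
        if any(denominators[name] < cutoff for name in final_variables)
    ]

# Hypothetical denominator counts per SA1 and variable.
denominators_by_sa1 = {
    "SA1_1": {"UNEMPLOYED": 120, "LOWRENT": 4, "OCC_SKILL1": 95},
    "SA1_2": {"UNEMPLOYED": 200, "LOWRENT": 60, "OCC_SKILL1": 180},
    "SA1_3": {"UNEMPLOYED": 5, "LOWRENT": 30, "OCC_SKILL1": 5},
}

# Two illustrative variable lists standing in for two different indexes.
index_x_variables = ["UNEMPLOYED", "LOWRENT"]
index_y_variables = ["UNEMPLOYED", "OCC_SKILL1"]

excluded_x = second_phase_excluded(denominators_by_sa1, index_x_variables)
excluded_y = second_phase_excluded(denominators_by_sa1, index_y_variables)
```

Here `SA1_1` is excluded from the first list only, which shows how an area can receive a score for one index and not another.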

10: Calculating and standardising component/index scores

We derived the first principal component scores for each SA1 by taking the product of each standardised variable with its respective weight, then taking the sum across all variables. Note that the weight for each variable was calculated by dividing the loading by the square root of the eigenvalue.

$$Z_{SA1} = \sum\limits_{j = 1}^{p} \frac{L_j}{\sqrt{\lambda}} \times X_{j,SA1}$$

where,

$${Z_{SA1}}$$ = raw score for the SA1

$$X_{j,SA1}$$ = standardised value of the j-th variable for the SA1

$${{L_j}}$$ = loading for the j-th variable

$$\lambda$$  = eigenvalue of the principal component

$$p$$ = total number of variables in the index

For convenience of presentation, we then rescaled the raw scores to a mean of 1,000 and standard deviation of 100 to create a new set of scores that are the SA1 index scores in SEIFA.

Note that the principal components are arbitrary with respect to their sign (positive or negative), so we set the sign of the weights and loadings so that they make intuitive sense. That is, we gave advantage indicators positive weights and loadings, and disadvantage indicators negative weights and loadings. Accordingly, high scores indicate relative advantage, and low scores indicate relative disadvantage. This is consistent with previous editions of SEIFA.
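The raw score formula and the rescaling can be illustrated with a short Python sketch. The loadings, eigenvalue and standardised values are made up, and the rescaling uses an unweighted mean and population standard deviation across SA1s, which is an assumption.

```python
import math
import statistics

def raw_score(standardised_values, loadings, eigenvalue):
    """Sum of (loading / sqrt(eigenvalue)) * standardised value over all variables."""
    return sum(l / math.sqrt(eigenvalue) * x
               for l, x in zip(loadings, standardised_values))

def rescale(raw_scores, mean=1000.0, sd=100.0):
    """Rescale raw scores to the given mean and standard deviation."""
    m = statistics.fmean(raw_scores)
    s = statistics.pstdev(raw_scores)
    return [mean + sd * (z - m) / s for z in raw_scores]

# Hypothetical loadings: two disadvantage indicators (negative) and
# one advantage indicator (positive), with a matching eigenvalue.
loadings = [-0.8, -0.7, 0.6]
eigenvalue = sum(l * l for l in loadings)

# Hypothetical standardised variable values for four SA1s.
sa1_values = {
    "A": [1.2, 0.8, -0.9],   # high disadvantage, low advantage
    "B": [-0.5, -0.3, 0.4],
    "C": [0.0, 0.0, 0.0],    # every variable at the average
    "D": [-0.7, -0.5, 0.5],
}

raw_scores = {code: raw_score(values, loadings, eigenvalue)
              for code, values in sa1_values.items()}
index_scores = dict(zip(raw_scores, rescale(list(raw_scores.values()))))
```

With this sign convention, area A (above-average disadvantage indicators and a below-average advantage indicator) receives the lowest score of the four.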

11: Creating higher geographic level indexes

We constructed indexes for geographies higher than the SA1 level using population weighted averages of the constituent SA1s. We used the following formula:

$$INDEX_{AREA} = \frac{\sum\limits_{i = 1}^{n} \left( INDEX_{SA1_i} \times POP_{SA1_i} \right)}{POP_{AREA}}$$

where,

$$INDEX$$ = Index score for each SA1 or higher level area

$$POP$$ = Population for each SA1 or higher level area

$$n$$ = Total number of SA1s (with index scores) in the higher level area

The higher level area population is the sum of the populations from the constituent SA1s that received an index score. Populations in excluded SA1s are not included in this calculation.

Although we constructed the higher level indexes from standardised SA1 level indexes, they were not standardised themselves. Therefore the higher level area indexes do not necessarily have a mean of 1,000 or standard deviation of 100. Only SA1s with index scores were used to create the higher level indexes. In a small number of cases, where a higher level area contains a number of SA1s that were excluded, its index score may not be a good representation of its entire population.

For this reason, the output spreadsheets provide the proportion of each higher area level population that was in excluded SA1s. In general, we encourage users conducting analysis at higher level areas to keep in mind that the indexes were constructed at the SA1 level, and to consider using the distribution of SA1s within the higher level areas, rather than just the one index score for each higher level area.
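The population weighted average and the share of population in excluded SA1s can be sketched as follows. The scores and populations are invented, and an excluded SA1 is represented by a score of `None`.

```python
def area_index(sa1s):
    """Population weighted average of SA1 scores for a higher level area.

    sa1s is a list of (score, population) pairs; score is None for
    excluded SA1s. Returns (index score, share of population excluded)."""
    scored = [(score, pop) for score, pop in sa1s if score is not None]
    scored_population = sum(pop for _, pop in scored)
    total_population = sum(pop for _, pop in sa1s)
    if scored_population == 0:
        return None, 1.0
    index = sum(score * pop for score, pop in scored) / scored_population
    excluded_share = (total_population - scored_population) / total_population
    return index, excluded_share

# Hypothetical higher level area made up of three SA1s, one excluded.
sa1s = [(950.0, 400), (1100.0, 600), (None, 250)]

index, excluded_share = area_index(sa1s)
```

In this example the area score is based on 1,000 of the 1,250 residents, so one fifth of the population is not represented in the score.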

This section gives the results of the principal component analysis carried out for each index, including variable loadings and percentage of variance explained. We also list the variables initially considered for inclusion but removed due to high correlations with other variables or weak loadings.

Index of Relative Socio-economic Disadvantage

The IRSD summarises variables that indicate relative disadvantage at the SA1 level, as described in the section defining the concept behind each of the four indexes. The final variable list and corresponding loadings are shown below.

| Variable name | Variable description | Variable loading |
| --- | --- | --- |
| INC_LOW | Per cent of people living in households with stated annual household equivalised income between $1 and $25,999 (approx. 1st and 2nd deciles) | -0.87 |
| CHILDJOBLESS | Per cent of families with children under 15 years of age who live with jobless parents | -0.78 |
| NOYR12ORHIGHER | Per cent of people aged 15 years and over whose highest level of education is Year 11 or lower. Includes Certificate I and II | -0.75 |
| LOWRENT | Per cent of occupied private dwellings paying rent less than $250 per week (excluding $0 per week) | -0.71 |
| UNEMPLOYED | Per cent of people (in the labour force) unemployed | -0.68 |
| OCC_LABOUR | Per cent of employed people classified as 'labourers' | -0.68 |
| DISABILITYU70 | Per cent of people aged under 70 who need assistance with core activities due to a long-term health condition, disability or old age | -0.63 |
| ONEPARENT | Per cent of one parent families with dependent offspring only | -0.58 |
| OVERCROWD | Per cent of occupied private dwellings requiring one or more extra bedrooms (based on the Canadian National Occupancy Standard) | -0.51 |
| OCC_DRIVERS | Per cent of employed people classified as Machinery Operators and Drivers | -0.51 |
| SEPDIVORCED | Per cent of people aged 15 and over who are separated or divorced | -0.51 |
| NOEDU | Per cent of people aged 15 years and over who have no educational attainment | -0.47 |
| OCC_SERVICE_L | Per cent of employed people classified as Low Skill Community and Personal Service Workers | -0.45 |
| NOCAR | Per cent of occupied private dwellings with no cars | -0.43 |
| ENGLISHPOOR | Per cent of people who do not speak English well | -0.35 |

The 2021 IRSD index explains 37% of the total variance of the variables in the final variable list. The corresponding percentages for previous indexes are: 43% (2016 IRSD), 44% (2011 IRSD), 39% (2006 IRSD) and 33% (2001 IRSD).

Removal of highly correlated variables

Of the variables considered for the IRSD, there were no two variables that had a correlation coefficient greater than 0.8 in absolute value.

The following table shows the variables that were dropped from the IRSD because their loading was below our prescribed cutoff of 0.3 in absolute value. The variables are shown in the order they were removed, with the loadings from the iteration when they were removed.

| Variable name | Variable description | Variable loading |
| --- | --- | --- |
| OCC_SALES_L | Per cent of employed people classified as Low-Skill Sales Workers | -0.27 |
| CERTIFICATE | Per cent of people aged 15 years and over whose highest level of educational attainment is a certificate III or IV qualification | -0.21 |
| FEWBED | Per cent of occupied private dwellings with one or no bedrooms | -0.01 |

Index of Relative Socio-economic Advantage and Disadvantage

The IRSAD summarises variables that indicate either relative socio-economic advantage or disadvantage, as described in the section defining the concept behind each of the four indexes. The final variable list and corresponding loadings are shown below.

| Variable name | Variable description | Variable loading |
| --- | --- | --- |
| NOYR12ORHIGHER | Per cent of people aged 15 years and over whose highest level of education is Year 11 or lower. Includes Certificate I and II | -0.85 |
| INC_LOW | Per cent of people living in households with stated annual household equivalised income between $1 and $25,999 (approx. 1st and 2nd deciles) | -0.83 |
| OCC_LABOUR | Per cent of employed people classified as 'labourers' | -0.75 |
| DISABILITYU70 | Per cent of people aged under 70 who need assistance with core activities due to a long-term health condition, disability or old age | -0.67 |
| CHILDJOBLESS | Per cent of families with children under 15 years of age who live with jobless parents | -0.65 |
| OCC_DRIVERS | Per cent of employed people classified as Machinery Operators and Drivers | -0.61 |
| LOWRENT | Per cent of occupied private dwellings paying rent less than $250 per week (excluding $0 per week) | -0.58 |
| SEPDIVORCED | Per cent of people aged 15 and over who are separated or divorced | -0.58 |
| ONEPARENT | Per cent of one parent families with dependent offspring only | -0.55 |
| UNEMPLOYED | Per cent of people (in the labour force) unemployed | -0.54 |
| OCC_SERVICE_L | Per cent of employed people classified as Low Skill Community and Personal Service Workers | -0.49 |
| CERTIFICATE | Per cent of people aged 15 years and over whose highest level of educational attainment is a certificate III or IV qualification | -0.45 |
| OVERCROWD | Per cent of occupied private dwellings requiring one or more extra bedrooms (based on the Canadian National Occupancy Standard) | -0.32 |
| NOEDU | Per cent of people aged 15 years and over who have no educational attainment | -0.32 |
| OCC_SALES_L | Per cent of employed people classified as Low Skill Sales Workers | -0.32 |
| ATUNI | Per cent of people aged 15 years and over at university or other tertiary institution | 0.35 |
| HIGHBED | Per cent of occupied private dwellings with four or more bedrooms | 0.35 |
| DIPLOMA | Per cent of people aged 15 years and over whose highest level of education attainment is a diploma qualification | 0.38 |
| HIGHRENT | Per cent of occupied private dwellings paying rent greater than $470 per week | 0.51 |
| OCC_MANAGER | Per cent of employed people classified as Managers | 0.52 |
| HIGHMORTGAGE | Per cent of occupied private dwellings paying mortgage greater than $2,800 per month | 0.69 |
| OCC_PROF | Per cent of employed people classified as Professionals | 0.74 |

INC_HIGH

0.52

HIGHMORTGAGE

0.07

Index of Education and Occupation

The IEO summarises variables related to educational qualifications and vocational skills, according to the concept described in defining the concept behind each of the four indexes. The final variable list and corresponding loadings are shown below.

| Variable name | Variable description | Variable loading |
| --- | --- | --- |
| NOYR12ORHIGHER | Per cent of people aged 15 years and over whose highest level of education is Year 11 or lower. Includes Certificate I and II | -0.87 |
| OCC_SKILL5 | Per cent of employed people who work in a Skill Level 5 occupation | -0.76 |
| OCC_SKILL4 | Per cent of employed people who work in a Skill Level 4 occupation | -0.75 |
| CERTIFICATE | Per cent of people aged 15 years and over whose highest level of educational attainment is a certificate III or IV qualification | -0.65 |
| UNEMPLOYED | Per cent of people (in the labour force) unemployed | -0.41 |
| DIPLOMA | Per cent of people aged 15 years and over whose highest level of education attainment is a diploma qualification | 0.37 |
| ATUNI | Per cent of people aged 15 years and over at university or other tertiary institution | 0.48 |
| OCC_SKILL1 | Per cent of employed people who work in a Skill Level 1 occupation | 0.90 |

The 2021 IEO index explains 46% of the total variance of the variables in the final variable list. The corresponding percentages for previous indexes are: 41% (2016 IEO), 47% (2011 IEO), 52% (2006 IEO) and 46% (2001 IEO).

Removal of highly correlated variables

DEGREE (% People aged 15 years and over with a degree or higher qualification) had high correlations with NOYR12ORHIGHER (–0.83) and OCC_SKILL1 (0.82). It was decided that the proportion of people with a degree was already well explained by the index, and DEGREE was removed.

The table below shows the variables dropped from the IEO because of low loadings. The variables are shown in the order they were removed, with the loadings from the iteration when they were removed.

| Variable name | Variable description | Variable loading |
| --- | --- | --- |
| NOEDU | Per cent of people aged 15 years and over who have no educational attainment | 0.29 |
| OCC_SKILL2 | Per cent of employed people who work in a Skill Level 2 occupation | 0.27 |
| ATSCHOOL | Per cent of people aged 15 years and over who are still attending secondary school | 0.05 |

Summary of variables included in indexes

The table below shows the final set of variables included in each index.

List of variables in each index, by socio-economic dimension

| Dimension | Index of Relative Socio-economic Disadvantage | Index of Relative Socio-economic Advantage and Disadvantage | Index of Economic Resources | Index of Education and Occupation |
| --- | --- | --- | --- | --- |
| Income | INC_LOW | INC_HIGH, INC_LOW | INC_HIGH, INC_LOW | |
| Education | NOYR12ORHIGHER, NOEDU | NOYR12ORHIGHER, NOEDU, CERTIFICATE, ATUNI, DIPLOMA | | NOYR12ORHIGHER, CERTIFICATE, ATUNI, DIPLOMA |
| Employment | UNEMPLOYED | UNEMPLOYED | UNEMPLOYED_IER | UNEMPLOYED |
| Occupation | OCC_LABOUR, OCC_DRIVERS, OCC_SERVICE_L | OCC_LABOUR, OCC_DRIVERS, OCC_SERVICE_L, OCC_SALES_L, OCC_MANAGER, OCC_PROF | | OCC_SKILL1, OCC_SKILL4, OCC_SKILL5 |
| Housing | LOWRENT, OVERCROWD | LOWRENT, OVERCROWD, HIGHRENT, HIGHBED, HIGHMORTGAGE | LOWRENT, OVERCROWD, OWNING, MORTGAGE, HIGHBED, HIGHMORTGAGE | |
| Other | CHILDJOBLESS, ONEPARENT, DISABILITYU70, ENGLISHPOOR, NOCAR, SEPDIVORCED | CHILDJOBLESS, ONEPARENT, DISABILITYU70, SEPDIVORCED | UNINCORP, ONEPARENT, LONE, GROUP, NOCAR | |

Distribution of the indexes

This section presents frequency histograms for each index at the SA1 level. The index distributions have generally similar shapes to those from SEIFA 2016.

Index of Relative Socio-economic Disadvantage

The IRSD distribution has a very long left tail. The values range from about 143 to 1207. This index contains only disadvantage indicators, so there is more scope to distinguish between disadvantaged areas than advantaged areas.

The steep peak for this distribution means that there will be little difference in the scores of SA1s in the middle deciles, and so the characteristics related to the IRSD variables may not vary much across SA1s in these middle deciles.

Index of Relative Socio-economic Advantage and Disadvantage

The scores for IRSAD range from 435 to 1273. The right-hand slope is not as steep in the IRSAD distribution as it is in the IRSD distribution. This means that the IRSAD scores of SA1s in the upper deciles are more spread out than the IRSD scores in these deciles, and this index has a greater ability to differentiate between the more advantaged areas.

Index of Economic Resources

The scores for IER range from 299 to 1315.

Index of Education and Occupation

The scores for IEO range from 407 to 1372.

Basic output: scores, ranks, deciles and percentiles

Scores

The scores are a weighted combination of the selected indicators of advantage and disadvantage which have been standardised to a distribution with a mean of 1000 and standard deviation of 100. An area with all of its indicators equal to the national average will receive a score of 1000. The score for an area will increase if an area has: an indicator of advantage that is greater than the national average; or an indicator of disadvantage that is less than the national average. Conversely, the score for an area will decrease if an area has: an indicator of disadvantage that is greater than the national average; or an indicator of advantage that is less than the national average. Indicators which are further away from the national average have a larger impact on the score.

For areas larger than SA1, the scores are a population weighted average of constituent SA1 scores, as described in step 11 of the step-by-step process.

It is important to remember that the scores are an ordinal measure (discussed in more detail in broad guidelines on appropriate use), so care should be taken when comparing scores. For example, an area with a score of 500 is not twice as disadvantaged as an area with a score of 1000; it simply has more markers of relative disadvantage.

Ranks, Deciles and Percentiles

Because the scores are an ordinal measure, it is often more appropriate to use measures other than the score itself. We have calculated ranks, deciles and percentiles and included these in the output spreadsheets. These measures are defined below.

Rank

The areas are ranked in order of their score, from lowest to highest, with rank one representing the most disadvantaged area. In the spreadsheets, rankings are provided on both a national basis and a state/territory basis. The same set of scores is used for each ranking; the scores are not recalculated for each state/territory.

Deciles

All areas are ordered from lowest to highest score. The lowest 10% of areas are given a decile number of one, the next lowest 10% of areas are given a decile number of two and so on, up to the highest 10% of areas, which are given a decile number of 10. This means that areas are divided into ten equal sized groups, depending on their score.

Percentiles

All areas are ordered from lowest to highest score. The lowest 1% of areas are given a percentile number of one, the next lowest 1% of areas are given a percentile number of two and so on, up to the highest 1% of areas, which are given a percentile number of 100. This means that areas are divided into one hundred equal sized groups, depending on their score. Sometimes deciles and percentiles are referred to generally as quantiles. Other commonly used quantiles include quintiles and quartiles, although we have not included these in the output spreadsheets. They can easily be derived using the percentiles.
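The definitions of ranks, deciles and percentiles can be sketched in Python as below. The scores are invented, and the handling of tied scores and of group boundaries (ties keep their input order; group number is the rank position scaled up and rounded up) is an assumption rather than the documented method.

```python
def rank_and_group(scores, groups):
    """Rank areas from lowest to highest score and split them into
    equal sized groups numbered 1..groups (ties keep input order)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    group_numbers = [0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position
        # Integer ceiling of position * groups / n.
        group_numbers[i] = (position * groups + n - 1) // n
    return ranks, group_numbers

# Hypothetical scores for 20 areas, listed from highest to lowest.
scores = [1000 + 7 * i for i in reversed(range(20))]

ranks, deciles = rank_and_group(scores, 10)
_, percentiles = rank_and_group(scores, 100)

# Quintiles derived from percentiles: 1-20 -> 1, 21-40 -> 2, and so on.
quintiles = [(p + 19) // 20 for p in percentiles]
```

Rank one and decile one go to the lowest scoring (most disadvantaged) area. Other quantiles can be derived from the percentiles in the same way as the quintiles here.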

Geographic output levels for SEIFA 2021

The primary unit of analysis and the smallest area for which the indexes are available is the Statistical Area Level 1 (SA1). This is the recommended unit of analysis for SEIFA 2021.

For a selection of geographic areas larger than SA1, scores have been calculated by taking population-weighted averages of constituent SA1 scores. The output spreadsheets also contain some information about the distribution of SA1 index scores within larger areas. This enables users to consider the socio-economic diversity that can exist within a larger area.

The table below summarises the output available at the different geographic levels.

Geographic output summary for SEIFA 2021

| Geographic unit | Index score | SA1 distribution information |
| --- | --- | --- |
| Statistical Area level 1 (SA1) | Yes | N/A |
| Statistical Area level 2 (SA2) | Yes | Yes |
| Statistical Area level 3 (SA3) | No | Yes |
| Statistical Area level 4 (SA4) | No | Yes |
| Local Government Area (LGA) | Yes | Yes |
| Suburbs and Localities (SAL) | Yes | Yes |
| Postal Area (POA) | Yes | Yes |
| Commonwealth Electoral Division (CED) | No | Yes |
| State Electoral Division (SED) | No | Yes |

For the geographies larger than SA1, and not in the ASGS (LGAs, SALs and POAs), a best fit correspondence of SA1s to the larger geographies was used. Local Government Areas (LGAs), Suburbs and Localities (SALs) and Postal Areas (POAs) are constructed from Mesh Blocks in the 2021 version of the ASGS. In some cases, particularly for certain SALs with small populations, the SA1 boundaries do not correspond closely to the higher level area. For this reason, SEIFA scores for SALs and POAs with small populations should be used with caution, as the scores may have been calculated from populations that do not correspond closely with the actual population in the area. Refer to ABS Maps for information useful for identifying areas that do not correspond closely to the SA1 structure.

The output spreadsheets contain specific references to the ABS publications from which the geography classifications and correspondences have been sourced.