4715.0 - National Aboriginal and Torres Strait Islander Health Survey, 2018-19 Quality Declaration 
Latest ISSUE Released at 11:30 AM (CANBERRA TIME) 11/12/2019   

This document was added or updated on 23/06/2020.

MODELLED ESTIMATES FOR SMALL AREAS


1 INTRODUCTION

This publication contains modelled estimates of health conditions and risk factors for small areas based on data from the 2018-19 National Aboriginal and Torres Strait Islander Health Survey (NATSIHS), 2016 Estimated Resident Population (ERP), the 2016 ABS Census of Population and Housing, and aggregate administrative data sources. The term “small area” generally refers to a geographical area that is smaller than a state or territory, such as Indigenous Regions and Primary Health Networks.

2 METHODOLOGY USED

A modelled estimate can be interpreted as the expected number or proportion of people with a health condition or characteristic for an area of Australia based on the demographic information available for that area. The process of producing modelled small area estimates for health conditions measured in the NATSIHS consisted of the following components, described in detail in sections 2.1 to 2.9:

    1. Identification of the outcome variables
    2. Identification of the geographical areas
    3. Selection of the predictor variables
    4. Scoping the data
    5. Creation of binary and proportion variables
    6. Aggregating observations and merging datasets
    7. Model selection
    8. Final adjustments and outputs
    9. Assessment of the modelled estimates

2.1 Identification of the outcome variables

From the NATSIHS, modelled small area estimates (counts, proportions, measure of error) have been produced for persons with the following health conditions and risk factors:
    • Self-assessed health status, by sex
    • Smoking status, by sex
    • Body Mass Index, by sex
    • Number of long term health conditions, by sex
    • Disability status, by sex
    • Psychological distress, by sex
    • Alcohol long term risk
    • Alcohol short term risk
    • Substance use
    • Number of chronic conditions, by sex
    • Specific long term health conditions, by age group

For age groups:
    • All ages
    • 0 to 39 years
    • 40 to 54 years
    • 55 years and over

For more information about the outcome variables, including definitions, see the footnotes in each Data cube, the Glossary, or Explanatory Notes on the ABS Website.

2.2 Identification of the geographical areas

The modelled estimates for small areas have been produced at the Indigenous Region (IREG) and Primary Health Network (PHN) geographies.

2.3 Selection of the predictor variables

In order to predict outcome variables, predictor variables are required on both the NATSIHS dataset and a small area dataset containing population, Census, and administrative data. Predictor variables were created if data were available for small areas for all of urban, rural, remote and very remote Australia and if there was an expectation that they might be good predictors of the outcome variables.

For age and sex predictor variables, data at the small area level were obtained from ABS ERP data from Estimates of Aboriginal and Torres Strait Islander Australians, June 2016 (Cat. No. 3238.0.55.001). This is described below in section 2.4.

For other demographic variables collected in the NATSIHS, data at the small area level were obtained from the 2016 Census of Population and Housing, as this was the most up-to-date comprehensive source of demographic data due to the depth of information at small geographical levels.

Additional variables that were available at the small area level but not collected in the NATSIHS were also included in the model. These variables included other demographic variables on the Census, geographic variables, and variables from administrative sources.

Predictor variables that relate to the geographical areas where people reside included:
    • remoteness area
    • socio-economic indexes for areas (SEIFAs) – population-weighted deciles at the Statistical Areas Level 1 (SA1) level
    • state and territory
    • section of state (major urban/other urban/bounded locality/rural balance)
    • Greater Capital City Statistical Area (GCCSA)/balance of state
    • design area type (categorises inner city, large and small urban towns, rural towns and remote areas within states and territories for designing the sample of the NATSIHS)

Sources of geographical area data included:


Predictor variables obtained from administrative data sources are described in the following table:


Predictor variable:
    • Indigenous Relative Socioeconomic Outcomes index (IRSEO) (2016)
    • Participation in vocational education and training (2017)
    • Home and Community Care Program (HACC) clients (2014-15)
    • Hospital admissions (2014-15 to 2016-17)
    • Proportion of Indigenous persons (2016)

Data source: Public Health Information Development Unit (PHIDU) November 2019 release http://phidu.torrens.edu.au/social-health-atlases/data#aboriginal-torres-strait-islander-social-health-atlas-of-australia



Within most of the types of predictor variables discussed above, several separate categories of data items were included. The variables considered for inclusion in the model are listed in the Predictor variables tab of the Data cube.

2.4 Scoping the data

The modelled estimates for small areas are applicable to persons who were usual residents of private dwellings to match the scope of the NATSIHS. They exclude:
    • residents of non-private dwellings, for example hospitals and aged care facilities

The base data source used to compile the modelled small area estimates was the ABS Estimated Resident Population (ERP) data from Estimates of Aboriginal and Torres Strait Islander Australians, June 2016 (Cat. No. 3238.0.55.001). To match the scope of the NATSIHS, the ERP data were adjusted using ratios of private to non-private dwellings calculated from the 2016 Census, and then summed to the NATSIHS population estimates by state, age and sex. These are the ‘population denominator’ estimates included in the Data cube. It is important to note that these population estimates are not official estimates; they were created solely for analysis of the NATSIHS modelled small area estimates and will not match other population data at the IREG or PHN geography level.
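The scoping adjustment described above can be sketched as follows. This is a minimal illustration rather than the ABS implementation: the function name, the data layout and all figures are hypothetical, and it assumes the Census ratio is expressed as the share of usual residents living in private dwellings and that "summed to" means rescaling each state by age by sex cell to the NATSIHS population estimate.

```python
def scope_and_benchmark(erp, private_share, benchmark_total):
    """Restrict ERP to private-dwelling scope, then rescale to a benchmark.

    erp:             small area -> ERP count for one state/age/sex cell
    private_share:   small area -> assumed Census share in private dwellings
    benchmark_total: NATSIHS population estimate for the same cell
    """
    # Step 1: remove the non-private dwelling population from each area.
    scoped = {area: erp[area] * private_share[area] for area in erp}
    # Step 2: rescale so the areas add up to the survey benchmark.
    factor = benchmark_total / sum(scoped.values())
    return {area: count * factor for area, count in scoped.items()}


# Hypothetical figures for two small areas in one state/age/sex cell.
denominators = scope_and_benchmark(
    erp={"Area A": 1000.0, "Area B": 500.0},
    private_share={"Area A": 0.95, "Area B": 0.90},
    benchmark_total=1330.0,
)
```

The resulting values play the role of the unofficial population denominators: they add up to the benchmark for the cell, but they are not expected to match published population figures for each area.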

Adjustments were also made to the Census data, specifically the predictor variables obtained from the Census (described above in section 2.3) to match the scope of the NATSIHS. Persons residing in non-private dwellings were removed from the small area dataset using persons’ dwelling type available on the Census datasets for respondents at home on Census night. However, for persons who were not at home on Census night, information is not collected to determine if the dwelling they usually reside in is a private or non-private dwelling; therefore, their records were deleted from the small area dataset. This data adjustment assumes that the people who were away from home on Census night and live in private dwellings have the same health characteristics as the people who were at home in a private dwelling.

Modelled estimates were not produced for IREGs or PHNs that are entire states. This includes the Tasmania and ACT IREGs; and the Tasmania, NT and ACT PHNs. State-level data should be obtained from the NATSIHS published data or the TableBuilder product, as the NATSIHS sample size is designed to be sufficient at State and Territory level to directly estimate these health statistics. Modelled small area estimates are not designed for use at large geographies such as states.

Additional exclusions that were applied to the data included:
    • residents of Other Territories

2.5 Creation of binary and proportion variables

On the NATSIHS dataset, outcome variables were created as binary variables to make them suitable for the type of modelling undertaken (logistic regression). For the outcome variables described in section 2.1, binary variables were created for each category of the outcome variable. For example, in the case of Body Mass Index, binary variables were created separately for underweight/normal weight, overweight, and obese. On both the NATSIHS and the small area datasets, categorical predictor variables were also created as binary variables. An observation took the value of 1 if an individual had the characteristic of interest and 0 otherwise. For example:
    1. in the case of overweight, the outcome variable for overweight took the value of 1 if an individual was overweight and 0 if the individual was not overweight

    2. in the case of labour force status, the predictor variable for employed took the value of 1 if an individual was employed and 0 if the individual was unemployed, not in the labour force or aged 0-14 years

Binary variables were also created on the small area dataset denoting quintiles of the characteristic of interest. For example:
    • for hospital admissions, a binary variable was created to denote whether the person lived in an area in the bottom quintile of admission rates

In addition, the proportion of an area’s population with the characteristic of interest was also calculated as a predictor variable. For example:
    • the proportion of an area’s population having had a hospital admission is recorded as a predictor variable for each small area
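The three kinds of derived variable described in this section can be illustrated with a short sketch. The function names, category labels and admission rates below are hypothetical, and the rank-based quintile rule is an assumption; the source does not state how quintile boundaries or ties were handled.

```python
def binary_indicators(value, categories):
    """One 0/1 variable per category of a categorical variable."""
    return {category: int(value == category) for category in categories}


def quintile(rate, all_rates):
    """Rank-based quintile: 1 is the bottom fifth, 5 is the top fifth."""
    ranked = sorted(all_rates)
    position = ranked.index(rate)  # ties take the lowest matching rank
    return position * 5 // len(ranked) + 1


# Outcome variable: one binary variable per Body Mass Index category.
bmi_flags = binary_indicators(
    "overweight", ["underweight/normal", "overweight", "obese"]
)

# Area-level predictor: flag areas in the bottom quintile of admission rates.
admission_rates = [12.0, 30.0, 18.0, 25.0, 40.0, 22.0, 35.0, 15.0, 28.0, 45.0]
bottom_quintile_flags = [int(quintile(r, admission_rates) == 1) for r in admission_rates]

# Area-level predictor: proportion of an area's population with an admission.
admitted, population = 180, 1200
proportion_admitted = admitted / population
```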

2.6 Aggregating observations and merging datasets

All data sources were aggregated to a fixed structure (cross classification cell groups) including several levels of geography, five year age group and sex. This decreases the size of the datasets (especially the Census dataset) to increase the efficiency of the modelling process.

The Census, adjusted ERP and administrative datasets were then merged into one small area dataset.
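As a rough illustration of aggregating person records to cross classification cells, the sketch below counts persons and binary characteristics within each combination of area, five year age group and sex. The record layout and field names are hypothetical; the actual cell structure also includes several levels of geography.

```python
from collections import defaultdict


def aggregate_to_cells(records, flags):
    """Collapse person records into (area, age group, sex) cells."""
    cells = defaultdict(lambda: {"persons": 0, **{flag: 0 for flag in flags}})
    for record in records:
        key = (record["area"], record["age_group"], record["sex"])
        cells[key]["persons"] += 1
        for flag in flags:
            cells[key][flag] += record[flag]  # flags are 0/1 variables
    return dict(cells)


# Three hypothetical person records with one binary predictor.
records = [
    {"area": "Area A", "age_group": "25-29", "sex": "F", "employed": 1},
    {"area": "Area A", "age_group": "25-29", "sex": "F", "employed": 0},
    {"area": "Area A", "age_group": "30-34", "sex": "M", "employed": 1},
]
cells = aggregate_to_cells(records, flags=["employed"])
```

Because each cell stores only counts, the aggregated dataset has one row per cell rather than one row per person, which is the source of the efficiency gain.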

2.7 Model selection

Models were created independently for each outcome variable and for each category of the outcome variables described in section 2.1. For example, a different model was created and selected for overweight than for obese. However, within each outcome variable the same model was used for each output classification, for example geography, age group, and sex.

The model selection method uses the small area dataset to measure the relationship between the outcome variable and possible predictor variables to determine one set of significant predictor variables. This method assumes that the relationships observed in the survey data at State and National levels also hold at the small area level. The significant predictor variables for each model are listed in the Predictor variables tab of the Data cube.

Random effects logistic regression models are used for each outcome variable. As part of any model selection process an appropriate significance level must be chosen for determining which predictor variables to include in the models. The 0.05 (95%) level is most commonly used; however, due to NATSIHS’ relatively large sample sizes, the Bayesian Information Criterion (BIC) was used to reduce the risk of over-fitting.
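The effect of using BIC with a large sample can be shown with the standard definition, BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and L the maximised likelihood. The log-likelihoods, parameter counts and sample size below are hypothetical, and how k and n were counted for the random effects models is not stated in the source.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Two hypothetical candidate models fitted to the same 10,000 records.
bic_small = bic(log_likelihood=-5200.0, n_params=8, n_obs=10_000)
bic_large = bic(log_likelihood=-5195.0, n_params=12, n_obs=10_000)

preferred = "small" if bic_small < bic_large else "large"
```

In this example the four extra predictors improve the fit slightly, but the penalty of ln(n) per parameter outweighs the gain, so the smaller model is preferred.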

To verify that the model adequately predicted the outcome variable, the models were applied to the small area data, and the results were summed to create Australia level modelled estimates and compared with reliable direct survey weighted estimates. Agreement between the summed modelled estimates and the direct estimates is known as model additivity. Where the two were not similar, additional predictor variables were included in the model until suitable model additivity was achieved.

Using the selected model for each outcome variable, a mixed estimate comprised of modelled and survey data is then produced for each small area output classification (IREG or PHN by sex or age group). A mixed/composite estimate reflects the best trade-off between the accuracy of the direct survey weighted estimate and the error associated with the modelled estimate. For a small area that happens to have a low sampling error (because of a large sample size within that small area, for example), more weight will be given to the direct estimate when calculating a modelled estimate for that small area. On the other hand, for a small area with high sampling error, more weight will be given to the model based prediction as this will be more reliable in calculating the modelled estimate for that small area. This takes advantage of what is known about the small area location from the survey to improve the modelled estimates.
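The trade-off described above can be sketched with a composite estimator of the kind commonly used in small area estimation. The weighting rule shown here (weight on the direct estimate equal to the model error divided by the sum of model error and sampling variance) is an assumption for illustration; the source does not give the ABS weighting formula, and all figures are hypothetical.

```python
def composite_estimate(direct, direct_variance, modelled, model_mse):
    """Weighted average of a direct survey estimate and a model prediction.

    The direct estimate gets more weight when its sampling variance is
    small relative to the model's mean squared error.
    """
    weight_direct = model_mse / (model_mse + direct_variance)
    estimate = weight_direct * direct + (1.0 - weight_direct) * modelled
    return estimate, weight_direct


# Area with a large sample: low sampling variance, direct estimate dominates.
large_sample, w_large = composite_estimate(0.30, 0.0001, 0.40, 0.0009)

# Area with a small sample: high sampling variance, model prediction dominates.
small_sample, w_small = composite_estimate(0.30, 0.0081, 0.40, 0.0009)
```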

2.8 Final adjustments and outputs

The modelled estimates are then adjusted so that they sum to national direct survey estimates. The adjustment also ensures that estimates for outcome variable categories within a broader outcome category, for example Body Mass Index categories, sum to the population within each small area. The associated errors resulting from the modelling process (described in Section 3), which improve on direct survey estimates’ errors, were not adjusted.
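A simplified version of the first constraint is proportional scaling of the small area counts to a national total, sketched below. The function name and figures are hypothetical, and the sketch handles only one constraint; the adjustment described above also makes categories sum to the population within each small area, which is not shown.

```python
def scale_to_total(estimates, target_total):
    """Scale small area counts so that they sum to a target total."""
    factor = target_total / sum(estimates.values())
    adjusted = {area: value * factor for area, value in estimates.items()}
    return adjusted, factor


# Hypothetical modelled counts and a hypothetical national direct estimate.
adjusted, factor = scale_to_total({"Area A": 120.0, "Area B": 80.0}, 210.0)

# Percentage change applied to every estimate by the adjustment.
percent_change = (factor - 1.0) * 100.0
```

Tracking the percentage change is useful because section 3.4 advises caution for estimates altered by more than 10% in this step.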

The modelled estimates are presented in the Data cube as:
    • counts with selected characteristic (number of persons)
    • relative error
    • proportion
    • 95% margin of error of proportion (95% MoE)
    • total population (expected number of people in each small area).

The denominators (total population) used to calculate the proportions are the unofficial population estimates for each IREG or PHN (based on adjusted ERP) described above in section 2.4.

To reduce the risk of identifying survey respondents, modelled estimates have been confidentialised to ensure they meet ABS requirements for confidentiality. Small area locations (IREGs or PHNs) with populations or modelled counts that did not meet the confidentiality rules have modelled estimates comprised solely of the modelled component, rather than the mixed/composite estimator described above. This means that no sampled contribution is included in such modelled estimates, regardless of whether sample exists in these small areas.

One facet of the adjustment process is that the ‘population denominator’ estimates for each small area will not exactly match between outcome variables. The differences are insignificant and are solely due to the adjustment process.

2.9 Assessment of the modelled estimates

Various measures were taken to examine the modelled estimates. Modelled estimates were compared with direct survey estimates from the NATSIHS for areas that were sampled. For the survey estimates, 95% Confidence Intervals (CIs) were calculated. These were plotted against the modelled estimates to see if the majority of modelled rates fell within the CIs of the NATSIHS estimates.
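The interval check described above can be sketched as follows, assuming a normal approximation in which the 95% CI is the direct estimate plus or minus 1.96 standard errors. The area names, estimates and standard errors are hypothetical.

```python
def share_within_ci(direct, modelled, z=1.96):
    """Share of areas whose modelled rate lies inside the direct 95% CI.

    direct:   area -> (direct survey estimate, standard error)
    modelled: area -> modelled estimate
    """
    inside = 0
    for area, (estimate, standard_error) in direct.items():
        lower = estimate - z * standard_error
        upper = estimate + z * standard_error
        inside += lower <= modelled[area] <= upper
    return inside / len(direct)


direct = {
    "Area A": (0.30, 0.03),
    "Area B": (0.25, 0.02),
    "Area C": (0.40, 0.05),
    "Area D": (0.35, 0.01),
}
modelled = {"Area A": 0.32, "Area B": 0.27, "Area C": 0.36, "Area D": 0.40}
coverage = share_within_ci(direct, modelled)
```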

Relative root mean squared errors (RRMSEs) (described in section 3.4) of the modelled estimates were examined to ensure that the majority were of suitable quality.

The number, range, and applicability of predictor variables included in the models used to create the small area estimates were considered.

Comparisons among the small area estimates and choropleth maps were produced to assess whether the modelled estimates aligned with expectations. The modelled estimates were also compared against available ABS National level data for the same purpose.

Please see section 5 for a quality summary for the modelled small area estimates.

3 ACCURACY OF RESULTS

The process undertaken in producing modelled estimates overcomes much of the volatility at the IREG or PHN levels caused by sampling error. However, it should be remembered that the modelled estimates produced are still subject to errors.

The errors associated with the modelled small area estimates fall into three categories, as follows:
    1. sampling error
    2. non-sampling error
    3. modelling error

These errors are combined into an overall measure of accuracy, the relative root mean squared error (RRMSE), described in section 3.4.

3.1 Sampling Error

Sampling error is introduced into estimates because the NATSIHS data were collected from only a sample of dwellings. Therefore, they are subject to sampling variability; that is, modelled estimates may differ from those that would have been produced if all dwellings had been included in NATSIHS. The smaller the sample obtained within a small area, the greater the sampling error associated with that small area's modelled estimates will be.

3.2 Non-Sampling Error

The imprecision due to sampling error should not be confused with inaccuracies due to imperfections occurring in the survey process. Such imperfections include mistakes made in reporting by respondents and recording by interviewers, and errors made in coding and processing data. Inaccuracies of this kind are referred to as non-sampling error, and they occur in any enumeration, whether it be a full count (Census) or a sample. Unlike the other sources of error, non-sampling error is not measurable and is therefore not accounted for in the measured error (direct or modelled) that accompanies these estimates. Every effort is made to reduce non-sampling error to a minimum through careful design of questionnaires, intensive training and supervision of interviewers, and rigorous procedures, as detailed in the Explanatory Notes.

3.3 Modelling Error

Modelling error is introduced by model misspecification. This can occur when the choice of model is incorrect, a key predictor variable is left out or an inappropriate predictor variable is included. Therefore, the predictor variables selected for the models may result in inaccurate modelled estimates for certain small areas, particularly those small areas where there is not a strong correlation between the available predictor variables and the health conditions. The models that have been chosen have been tested against a range of possible alternative models; however, they are only the most preferred models given the data available at the time.

3.4 Relative Root Mean Squared Error (RRMSE) and Margin of Error (MoE)

A measure of the quality of the modelled estimates is the RRMSE. The RRMSE is used as a measure of prediction error informing how well the models predict the outcome variables. In its calculation it also inherits some aspects of modelling and sampling error. The RRMSE generally decreases as the population size increases, and is used to assess the reliability of modelled estimates.

As a general rule of thumb, estimates with RRMSEs less than 25% are considered reliable for most purposes, estimates with RRMSEs between 25% and 50% should be used with caution and estimates with RRMSEs greater than 50% are considered too unreliable for general use.

In the case of estimates of proportions, estimates with 95% MoEs greater than 10 percentage points are considered too unreliable for general use.

Estimates that were altered by more than 10% due to the adjustment process described in section 2.8 should be used with caution, as the RRMSE and 95% MoE are likely to be smaller than the true error for these estimates.
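The rules of thumb in this section can be combined into a simple screening function, sketched below. The function name and labels are hypothetical, and the way the RRMSE, MoE and adjustment rules are combined (the strictest rule wins) is an assumption, since the source states each rule separately.

```python
def reliability_flag(rrmse_pct, moe_pct_points=None, adjustment_pct=0.0):
    """Apply the rule-of-thumb thresholds to one modelled estimate.

    rrmse_pct:      RRMSE as a percentage
    moe_pct_points: 95% MoE in percentage points (proportions only)
    adjustment_pct: percentage change applied by the section 2.8 adjustment
    """
    if rrmse_pct > 50 or (moe_pct_points is not None and moe_pct_points > 10):
        return "too unreliable for general use"
    if rrmse_pct >= 25 or abs(adjustment_pct) > 10:
        return "use with caution"
    return "reliable for most purposes"


flag_a = reliability_flag(6.0, moe_pct_points=3.5)
flag_b = reliability_flag(30.0)
flag_c = reliability_flag(12.0, moe_pct_points=11.0)
flag_d = reliability_flag(8.0, moe_pct_points=4.0, adjustment_pct=12.0)
```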

4 USING MODELLED ESTIMATES

The small area modelled estimates can be interpreted as the expected number or proportion of people with a health condition or characteristic for a typical area in Australia with the same characteristics. For some small area locations (IREGs or PHNs), there will be differences between the modelled estimates and the actual number of people with the characteristic of interest. One explanation for this is that significant local information about particular small areas exists but has not been collected for all areas and cannot be incorporated into the models. This sort of information is usually not measurable, and relies on local or expert knowledge.

Small area modelled estimates should be viewed as a tool that, when used in conjunction with local area knowledge and with consideration of the modelled estimates’ reliability, can provide useful information to assist in making decisions for small geographic areas. Care needs to be taken to ensure decisions are not based on inaccurate estimates. The provided modelled small area estimates can be aggregated to larger regions (such as regional planning regions) to help improve decision making. Small area estimates can be aggregated together using an approximation formula outlined in section 6. Aggregation of small areas should be done taking into account local knowledge about these areas.

5 QUALITY SUMMARY FOR MODELLED ESTIMATES

The quality of the modelled estimates was assessed according to the following criteria:

1. the number, range, and applicability of predictor variables included in the models
2. consistency with national direct survey estimates. For example, whether modelled estimates for circulatory system diseases increased proportionally with age
3. median RRMSE, as a measure of prediction accuracy

These culminated in an overall reliability assessment, which has three categories:
    • reliable, meaning the modelled estimates are suitable for general use
    • less reliable, meaning the modelled estimates should be used with caution
    • unreliable, meaning the modelled estimates are unsuitable for general use. Modelled estimates assessed as unreliable are not published.


Reliability assessment table: IREG and PHN estimates


Outcome variable | Number and range of predictor variables | Consistency with National data | Median RRMSE (all persons estimates - IREG) | Median RRMSE (all persons estimates - PHN) | Overall Reliability Assessment
Self-assessed health status - Excellent/ Very good | Reliable | Reliable | 5.5% (Reliable) | 5.2% (Reliable) | Reliable
Self-assessed health status - Good | Less reliable | Reliable | 4.2% (Reliable) | 3.8% (Reliable) | Less Reliable
Self-assessed health status - Fair/ Poor | Reliable | Reliable | 9.0% (Reliable) | 7.2% (Reliable) | Reliable
Smoking status - Current daily smoker | Reliable | Reliable | 6.0% (Reliable) | 6.9% (Reliable) | Reliable
Smoking status - Other | Reliable | Reliable | 4.3% (Reliable) | 3.4% (Reliable) | Reliable
BMI - Underweight/ Normal | Reliable | Reliable | 5.0% (Reliable) | 4.9% (Reliable) | Reliable
BMI - Overweight | Reliable | Reliable | 3.2% (Reliable) | 3.0% (Reliable) | Reliable
BMI - Obese | Reliable | Reliable | 4.4% (Reliable) | 3.8% (Reliable) | Reliable
BMI - Overweight/ Obese | Reliable | Reliable | 2.3% (Reliable) | 1.9% (Reliable) | Reliable
Number of long term conditions - One or two | Reliable | Reliable | 3.9% (Reliable) | 3.6% (Reliable) | Reliable
Number of long term conditions - Three or more | Reliable | Reliable | 6.0% (Reliable) | 4.4% (Reliable) | Reliable
Number of long term conditions - One or more | Reliable | Reliable | 3.2% (Reliable) | 2.4% (Reliable) | Reliable
Number of long term conditions - None | Reliable | Reliable | 5.3% (Reliable) | 6.4% (Reliable) | Reliable
Disability - Has disability | Reliable | Reliable | 6.0% (Reliable) | 5.1% (Reliable) | Reliable
Disability - No disability | Reliable | Reliable | 3.4% (Reliable) | 3.3% (Reliable) | Reliable
Psychological distress - Low/ Moderate | Reliable | Reliable | 3.8% (Reliable) | 4.0% (Reliable) | Reliable
Psychological distress - High/ Very high | Reliable | Reliable | 8.5% (Reliable) | 7.5% (Reliable) | Reliable
Alcohol long term risk - Exceeded | Reliable | Reliable | 10.9% (Reliable) | 9.4% (Reliable) | Reliable
Alcohol long term risk - Did not exceed | Reliable | Reliable | 10.3% (Reliable) | 8.1% (Reliable) | Reliable
Alcohol long term risk - Other | Reliable | Reliable | 4.1% (Reliable) | 4.4% (Reliable) | Reliable
Alcohol short term risk - Exceeded | Reliable | Reliable | 5.4% (Reliable) | 5.1% (Reliable) | Reliable
Alcohol short term risk - Did not exceed | Reliable | Reliable | 12.3% (Reliable) | 10.0% (Reliable) | Reliable
Alcohol short term risk - Other | Reliable | Reliable | 6.4% (Reliable) | 6.8% (Reliable) | Reliable
Substances use - Has used substances | Reliable | Reliable | 7.4% (Reliable) | 6.5% (Reliable) | Reliable
Substances use - Has not used substances | Reliable | Reliable | 5.1% (Reliable) | 5.0% (Reliable) | Reliable
Number of chronic conditions - One or two | Reliable | Reliable | 4.4% (Reliable) | 3.4% (Reliable) | Reliable
Number of chronic conditions - Three or more | Reliable | Reliable | 11.1% (Reliable) | 8.0% (Reliable) | Reliable
Number of chronic conditions - One or more | Reliable | Reliable | 4.2% (Reliable) | 3.3% (Reliable) | Reliable
Number of chronic conditions - None | Reliable | Reliable | 2.7% (Reliable) | 3.2% (Reliable) | Reliable
Endocrine, nutritional and metabolic diseases | Reliable | Reliable | 6.8% (Reliable) | 6.4% (Reliable) | Reliable
Circulatory system diseases | Reliable | Reliable | 6.2% (Reliable) | 5.5% (Reliable) | Reliable
Respiratory system diseases | Reliable | Reliable | 7.3% (Reliable) | 5.3% (Reliable) | Reliable
Neoplasms | Unreliable | Not assessed | Not assessed | Not assessed | Unreliable



6 ESTIMATING AGGREGATED AREAS

The following formulas describe the estimation of aggregated areas. This may be done for one of two reasons:
    1. Estimates are required for a bespoke small area of interest
    2. Where the error (RRMSE) for an area is unacceptably high, aggregating areas can decrease the error

Note that the error formula is an approximation only, and that these should only be used where alternative modelled estimates are not available. Aggregation of the modelled small area estimates to large geographies such as capital city or state/territory level is not recommended. If you require capital city or state/territory level data for the characteristics of health conditions provided here at small area level, then use of NATSIHS published data (or use of the TableBuilder product) is recommended.

The following formulae are used to estimate the count for an aggregated area.




The following formula may be used to approximate the RRMSE for an aggregated area.



The following formula may then be used to derive an approximate 95% MoE for an aggregated area.
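As a hedged illustration of the three steps above, the sketch below sums the counts, approximates the aggregated RRMSE by combining the area-level root mean squared errors as if they were independent, and derives a 95% MoE for the aggregated proportion as 1.96 times the RRMSE times the proportion. These expressions are assumptions chosen for illustration and may differ from the published formulas; in particular, they ignore any correlation between the errors of neighbouring areas. All figures are hypothetical.

```python
import math


def aggregate_areas(areas):
    """Combine small area estimates into one aggregated area.

    areas: list of (count, rrmse as a fraction, total population)
    Assumes errors are independent across areas.
    """
    count = sum(c for c, _, _ in areas)
    population = sum(p for _, _, p in areas)
    # RMSE of each count is rrmse * count; combine in quadrature.
    rmse = math.sqrt(sum((c * r) ** 2 for c, r, _ in areas))
    rrmse = rmse / count
    proportion = count / population
    moe_95 = 1.96 * rrmse * proportion
    return {"count": count, "rrmse": rrmse, "proportion": proportion, "moe_95": moe_95}


# Two hypothetical areas, each with an RRMSE of 10%.
combined = aggregate_areas([(300.0, 0.10, 1000.0), (400.0, 0.10, 1000.0)])
```

Under these assumptions the aggregated RRMSE is about 7.1%, lower than the 10% of either contributing area, which illustrates why aggregation can reduce an unacceptably high error.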