This document was added or updated on 23/06/2020.
MODELLED ESTIMATES FOR SMALL AREAS
2. Identification of the geographical areas
3. Selection of the predictor variables
4. Scoping the data
5. Creation of binary and proportion variables
6. Aggregating observations and merging datasets
7. Model selection
8. Final adjustments and outputs
9. Assessment of the modelled estimates
From the NATSIHS, modelled small area estimates (counts, proportions and measures of error) have been produced for persons with the following health conditions and risk factors:
For age groups:
2.2 Identification of the geographical areas
The modelled estimates for small areas have been produced at the Indigenous Region (IREG) and Primary Health Network (PHN) geographies.
2.3 Selection of the predictor variables
In order to predict outcome variables, predictor variables are required on both the NATSIHS dataset and a small area dataset containing population, Census, and administrative data. Predictor variables were created if data were available for small areas for all of urban, rural, remote and very remote Australia and if there was an expectation that they might be good predictors of the outcome variables.
For age and sex predictor variables, data at the small area level were obtained from ABS ERP data from Estimates of Aboriginal and Torres Strait Islander Australians, June 2016 (Cat. No. 3238.0.55.001). This is described below in section 2.4.
For other demographic variables collected in the NATSIHS, data at the small area level were obtained from the 2016 Census of Population and Housing, as this was the most up-to-date comprehensive source of demographic data due to the depth of information at small geographical levels.
Additional variables that were available at the small area level but not collected in the NATSIHS were also included in the model. These variables included other demographic variables on the Census, geographic variables, and variables from administrative sources.
Predictor variables that relate to the geographical areas where people reside included:
Predictor variables obtained from administrative data sources are described in the following table:
Within most types of predictor variables (as discussed above), several separate categories of data items were included. The variables considered for inclusion in the model are listed in the Predictor variables tab of the Datacube.
2.4 Scoping the data
The modelled estimates for small areas are applicable to persons who were usual residents of private dwellings to match the scope of the NATSIHS. They exclude:
The base data source used to compile the modelled small area estimates was the ABS Estimated Resident Population (ERP) data from Estimates of Aboriginal and Torres Strait Islander Australians, June 2016 (Cat. No. 3238.0.55.001). Adjustments were made to the ERP data using ratios of private to non-private dwellings calculated from the 2016 Census, to match the scope of the NATSIHS; the adjusted data were then benchmarked to the NATSIHS state by age by sex population estimates. These are the ‘population denominator’ estimates included in the Data cube. It is important to note that these population estimates are not official estimates; they were created solely for analysis of the NATSIHS modelled small area estimates and will not match other population data at the IREG or PHN geography level.
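The two-step adjustment described above can be sketched as follows. All area names, ratios, and counts here are illustrative, not actual ABS data; the real adjustment operates on full area by age by sex cells.

```python
# Sketch of the population-denominator adjustment: (1) scale ERP to the
# private-dwelling scope using Census-derived ratios, (2) benchmark the
# result to survey population estimates. Values are hypothetical.

# ERP counts for two hypothetical small areas within one state, by age group
erp = {("area_A", "0-14"): 1200, ("area_A", "15+"): 2800,
       ("area_B", "0-14"): 600,  ("area_B", "15+"): 1400}

# Census-derived share of residents living in private dwellings
private_ratio = {"area_A": 0.95, "area_B": 0.90}

# Step 1: restrict ERP to the private-dwelling scope of the survey
scoped = {k: count * private_ratio[k[0]] for k, count in erp.items()}

# Step 2: pro-rata benchmark to (hypothetical) survey population
# estimates by age group, so cells sum to the benchmarks
benchmark = {"0-14": 1700, "15+": 4000}
totals = {}
for (area, age), count in scoped.items():
    totals[age] = totals.get(age, 0) + count
denominator = {k: count * benchmark[k[1]] / totals[k[1]]
               for k, count in scoped.items()}
```

After step 2, the denominators within each age group sum exactly to the benchmark, which is why they will not match other published population data for the same areas.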
Adjustments were also made to the Census data, specifically the predictor variables obtained from the Census (described above in section 2.3) to match the scope of the NATSIHS. Persons residing in non-private dwellings were removed from the small area dataset using persons’ dwelling type available on the Census datasets for respondents at home on Census night. However, for persons who were not at home on Census night, information is not collected to determine if the dwelling they usually reside in is a private or non-private dwelling; therefore, their records were deleted from the small area dataset. This data adjustment assumes that the people who were away from home on Census night and live in private dwellings have the same health characteristics as the people who were at home in a private dwelling.
Modelled estimates were not produced for IREGs or PHNs that are entire states. This includes the Tasmania and ACT IREGs; and the Tasmania, NT and ACT PHNs. State-level data should be obtained from the NATSIHS published data or the TableBuilder product, as the NATSIHS sample size is designed to be sufficient at State and Territory level to directly estimate these health statistics. Modelled small area estimates are not designed for use at large geographies such as states.
Additional exclusions that were applied to the data included:
2.5 Creation of binary and proportion variables
On the NATSIHS dataset, outcome variables were created as binary variables to make them suitable for the type of modelling undertaken (logistic regression). For the outcome variables described in section 2.1, binary variables were created for each category of the outcome variable. For example, in the case of Body Mass Index, binary variables were created separately for underweight/normal weight, overweight, and obese. On both the NATSIHS and the small area datasets, categorical predictor variables were also converted to binary variables. An observation took the value of 1 if an individual had a characteristic of interest and 0 otherwise. For example:
2. in the case of labour force status, the predictor variable for employed took the value of 1 if an individual was employed and 0 if the individual was unemployed, not in the labour force or aged 0-14 years
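The recoding described above can be sketched in a few lines. The record layout and variable names are illustrative only:

```python
# Sketch of turning categorical survey variables into 0/1 indicators.
# Field names and categories are hypothetical stand-ins.

records = [
    {"bmi_cat": "obese", "lf_status": "employed"},
    {"bmi_cat": "overweight", "lf_status": "unemployed"},
    {"bmi_cat": "underweight/normal", "lf_status": "nilf"},
]

for r in records:
    # Outcome: one binary variable per BMI category
    for cat in ("underweight/normal", "overweight", "obese"):
        r[f"bmi_{cat}"] = 1 if r["bmi_cat"] == cat else 0
    # Predictor: employed = 1; unemployed, not in the labour force,
    # or aged 0-14 years = 0
    r["employed"] = 1 if r["lf_status"] == "employed" else 0
```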
2.6 Aggregating observations and merging datasets
All data sources were aggregated to a fixed structure (cross-classification cell groups) comprising several levels of geography, five-year age group, and sex. This decreases the size of the datasets (especially the Census dataset), increasing the efficiency of the modelling process.
The Census, adjusted ERP and administrative datasets were then merged into one small area dataset.
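The aggregation and merge steps can be sketched as below. The cell key (area, age group, sex) mirrors the fixed structure described above; the data values are made up:

```python
from collections import defaultdict

# Sketch of aggregating unit records to cross-classification cells
# (area x five-year age group x sex), then merging sources on the
# same cell key. All values are illustrative.

census = [("area_A", "20-24", "F"), ("area_A", "20-24", "F"),
          ("area_A", "20-24", "M"), ("area_B", "25-29", "F")]

cells = defaultdict(int)
for cell in census:
    cells[cell] += 1          # person count per cell

# Merge with (hypothetical) adjusted-ERP counts keyed the same way
erp = {("area_A", "20-24", "F"): 180, ("area_A", "20-24", "M"): 150,
       ("area_B", "25-29", "F"): 95}
small_area = {cell: {"census_n": n, "erp": erp.get(cell)}
              for cell, n in cells.items()}
```

Working with cell counts rather than unit records keeps the merged small area dataset a fraction of the size of the full Census file.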
2.7 Model selection
Models were created independently for each outcome variable and each category of the outcome variables described in section 2.1. For example, a different model is created and selected for overweight than for obese. However, within each outcome variable the same model is used for every output classification (geography, age group, and sex).
The model selection method uses the small area dataset to measure the relationship between the outcome variable and possible predictor variables to determine one set of significant predictor variables. This method assumes that the relationships observed in the survey data at State and National levels also hold at the small area level. The significant predictor variables for each model are listed in the Predictor variables tab of the Data cube.
Random effects logistic regression models are used for each outcome variable. As part of any model selection process an appropriate significance level must be chosen for determining which predictor variables to include in the models. The 0.05 (95%) level is most commonly used; however, due to NATSIHS’ relatively large sample sizes, the Bayesian Information Criterion (BIC) was used to reduce the risk of over-fitting.
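The role of the BIC in guarding against over-fitting can be illustrated with a small arithmetic example. The log-likelihoods below are made-up values for two candidate models fitted to the same data:

```python
import math

# BIC = k * ln(n) - 2 * ln(L_hat); lower is better.
# Illustrative comparison of a small and a large candidate model.

def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood

n = 10_000
small_model = bic(log_likelihood=-4200.0, n_params=5, n_obs=n)
large_model = bic(log_likelihood=-4195.0, n_params=15, n_obs=n)

# The ten extra parameters only improve the fit slightly, so the
# ln(n) complexity penalty makes BIC prefer the smaller model.
preferred = "small" if small_model < large_model else "large"
```

Because the penalty per parameter grows with ln(n), BIC is stricter than a fixed 0.05 significance test on large samples, which is why it was preferred here.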
To verify that the model adequately predicted the outcome variable, the models were applied to small area data, summed to create Australia level modelled estimates and compared with reliable direct survey weighted estimates. This property is known as model additivity. Where model additivity was not similar, additional predictor variables were included in the model until suitable model additivity was achieved.
Using the selected model for each outcome variable, a mixed estimate comprising modelled and survey data is then produced for each small area output classification (IREG or PHN by sex or age group). A mixed/composite estimate reflects the best trade-off between the accuracy of the direct survey weighted estimate and the error associated with the modelled estimate. For a small area that happens to have a low sampling error (for example, because of a large sample size within that small area), more weight is given to the direct estimate when calculating the modelled estimate for that area. Conversely, for a small area with high sampling error, more weight is given to the model-based prediction, as this is more reliable for that area. This takes advantage of what is known about the small area from the survey to improve the modelled estimates.
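The weighting behaviour described above can be sketched with a standard variance-based composite form. This is a generic textbook formulation, not the exact ABS estimator; all numbers are illustrative:

```python
# Sketch of a composite (mixed) estimator: the weight on the direct
# survey estimate rises as its sampling variance falls.
# Generic form only, with hypothetical proportions and variances.

def composite(direct, model, var_direct, mse_model):
    w = mse_model / (mse_model + var_direct)   # weight on the direct estimate
    return w * direct + (1 - w) * model

# Well-sampled area (low sampling variance): leans on the direct estimate
well_sampled = composite(direct=0.30, model=0.20,
                         var_direct=0.0004, mse_model=0.0036)

# Poorly-sampled area (high sampling variance): leans on the model
poorly_sampled = composite(direct=0.30, model=0.20,
                           var_direct=0.0036, mse_model=0.0004)
```

With identical inputs apart from the variances, the well-sampled area's estimate lands close to the direct value of 0.30, while the poorly-sampled area's estimate lands close to the modelled value of 0.20.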
2.8 Final adjustments and outputs
The modelled estimates are then adjusted so that they sum to national direct survey estimates. The adjustment also ensures that estimates for the categories within a broader outcome variable (for example, the Body Mass Index categories) sum to the population within each small area. The associated errors resulting from the modelling process (described in section 3), which improve on the errors of direct survey estimates, were not adjusted.
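The benchmarking step can be sketched as a simple pro-rata scaling. The actual ABS adjustment also reconciles categories within each small area; the version below shows only the national constraint, with made-up counts:

```python
# Sketch of benchmarking: scale small area modelled counts so they sum
# to the national direct survey estimate. Illustrative values only.

modelled = {"area_A": 480.0, "area_B": 310.0, "area_C": 250.0}  # sums to 1040
national_direct = 1000.0

factor = national_direct / sum(modelled.values())
adjusted = {area: count * factor for area, count in modelled.items()}
```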
The modelled estimates are presented in the Data cube as:
The denominators (total population) used to calculate the proportions are the unofficial population estimates for each IREG or PHN (based on adjusted ERP) described above in section 2.4.
To mitigate against the identification of survey respondents, modelled estimates have been confidentialised to ensure they meet ABS requirements for confidentiality. Small area locations (IREGs or PHNs) with populations or modelled counts that did not meet the confidentiality rules have modelled estimates comprised solely of the modelled component, rather than the mixed/composite estimator described above. This means that no sampled contribution is included in such modelled estimates, regardless of whether sample exists in these small areas.
One facet of the adjustment process is that the ‘population denominator’ estimates for each small area will not exactly match between outcome variables. The differences are insignificant and are solely due to the adjustment process.
2.9 Assessment of the modelled estimates
Various measures were taken to examine the modelled estimates. Modelled estimates were compared with direct survey estimates from the NATSIHS for areas that were sampled. For the survey estimates, 95% Confidence Intervals (CIs) were calculated. These were plotted against the modelled estimates to see if the majority of modelled rates fell within the CIs of the NATSIHS estimates.
Relative root mean squared errors (RRMSEs) (described in section 3.4) of the modelled estimates were examined to ensure that the majority were of suitable quality.
The number, range, and applicability of predictor variables included in the models used to create the small area estimates were considered.
Comparisons among the small area estimates were made, and choropleth maps were produced, to assess whether the modelled estimates aligned with expectations. The data were also confronted with available ABS national level data as a further consistency check.
Please see section 5 for a quality summary for the modelled small area estimates.
3 ACCURACY OF RESULTS
The process undertaken in producing modelled estimates overcomes much of the volatility at the IREG or PHN levels caused by sampling error. However, it should be remembered that the modelled estimates produced are still subject to errors.
The errors associated with the modelled small area estimates fall into three categories, as follows:
1. sampling error
2. non-sampling error
3. modelling error
These errors are combined into an overall measure of accuracy, the relative root mean squared error (RRMSE), described in section 3.4.
3.1 Sampling Error
Sampling error is introduced into estimates because the NATSIHS data were collected from only a sample of dwellings. Therefore, they are subject to sampling variability; that is, modelled estimates may differ from those that would have been produced if all dwellings had been included in NATSIHS. The smaller the sample obtained within a small area, the greater the sampling error associated with that small area's modelled estimates will be.
3.2 Non-Sampling Error
The imprecision due to sampling error should not be confused with inaccuracies due to imperfections in the survey process. Such imperfections include mistakes made in reporting by respondents and recording by interviewers, and errors made in coding and processing data. Inaccuracies of this kind are referred to as non-sampling error, and they occur in any enumeration, whether it be a full count (such as the Census) or a sample. Unlike the other sources of error, non-sampling error is not measurable and therefore is not accounted for in the measured error (direct or modelled) that accompanies these estimates. Every effort is made to reduce non-sampling error to a minimum through careful design of questionnaires, intensive training and supervision of interviewers, and rigorous procedures, as detailed in the Explanatory Notes.
3.3 Modelling Error
Modelling error is introduced by model misspecification. This can occur when the choice of model is incorrect, a key predictor variable is left out or an inappropriate predictor variable is included. Therefore, the selected predictor variables chosen in the models may result in inaccurate modelled estimates for certain small areas, particularly those small areas where there isn’t a strong correlation between the available predictor variables and the health conditions. The models that have been chosen have been tested against a range of possible alternative models; however, they are only the most preferred models subject to available data at the time.
3.4 Relative Root Mean Squared Error (RRMSE) and Margin of Error (MoE)
The RRMSE is a measure of the quality of the modelled estimates. It is used as a measure of prediction error, indicating how well the models predict the outcome variables, and in its calculation it also inherits aspects of modelling and sampling error. The RRMSE generally decreases as the population size increases, and is used to assess the reliability of modelled estimates.
As a general rule of thumb, estimates with RRMSEs less than 25% are considered reliable for most purposes, estimates with RRMSEs between 25% and 50% should be used with caution and estimates with RRMSEs greater than 50% are considered too unreliable for general use.
In the case of estimates of proportions, estimates with 95% MoEs greater than 10 percentage points are considered too unreliable for general use.
Estimates that were altered by more than 10% due to the adjustment process described in section 2.8 should be used with caution, as the RRMSE and 95% MoE are likely to be smaller than the true error for these estimates.
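The rule-of-thumb thresholds above translate directly into a small classification function. The function name is illustrative; RRMSE is expressed as a fraction and the MoE in percentage points:

```python
# Sketch applying the reliability thresholds described above.

def reliability(rrmse, moe_pp=None):
    """Classify an estimate by its RRMSE (fraction) and, for proportions,
    its 95% MoE in percentage points."""
    if rrmse > 0.50 or (moe_pp is not None and moe_pp > 10):
        return "too unreliable for general use"
    if rrmse > 0.25:
        return "use with caution"
    return "reliable for most purposes"
```

Note that a proportion with a modest RRMSE can still be flagged as unreliable if its MoE exceeds 10 percentage points.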
4 USING MODELLED ESTIMATES
The small area modelled estimates can be interpreted as the expected number or proportion of people with a health condition or characteristic for a typical area in Australia with the same characteristics. For some small area locations (IREGs or PHNs), there will be differences between the modelled estimates and the actual number of people with the characteristic of interest. One explanation for this is that significant local information about particular small areas exists but has not been collected for all areas and cannot be incorporated into the models. This sort of information is usually not measurable, and relies on local or expert knowledge.
Small area modelled estimates should be viewed as a tool that, when used in conjunction with local area knowledge and consideration of the modelled estimates’ reliability, can provide useful information to assist decision making for small geographic areas. Care needs to be taken to ensure decisions are not based on inaccurate estimates. The modelled small area estimates can be aggregated to larger regions (such as regional planning regions) to help improve decision making, using the approximation formula outlined in section 6. Aggregation of small areas should take into account local knowledge about those areas.
5 QUALITY SUMMARY FOR MODELLED ESTIMATES
The quality of the modelled estimates was assessed according to the following criteria:
1. the number, range, and applicability of predictor variables included in the models
2. consistency with national direct survey estimates. For example, whether modelled estimates for circulatory system diseases increased proportionally with age
3. median RRMSE, as a measure of prediction accuracy
These culminated in an overall reliability assessment, which has three categories:
Reliability assessment table: IREG and PHN estimates
6 ESTIMATING AGGREGATED AREAS
The following formulas describe the estimation of aggregated areas. This may be done for one of two reasons:
2. Where the error (RRMSE) for an area is unacceptably high, aggregating areas can decrease the error
Note that the error formula is an approximation only, and these formulas should only be used where alternative modelled estimates are not available. Aggregation of the modelled small area estimates to large geographies, such as capital city or state/territory level, is not recommended. If you require capital city or state/territory level data for the health conditions provided here at the small area level, use of the NATSIHS published data (or the TableBuilder product) is recommended.
The following formulae are used to estimate the count for an aggregated area.
The following formula may be used to approximate the RRMSE for an aggregated area.
The following formula may then be used to derive an approximate 95% MoE for an aggregated area.
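The formulas themselves are not reproduced in this page, so the sketch below uses a commonly applied approximation consistent with the descriptions above: counts add directly, the aggregate RRMSE assumes errors are independent across areas, and the 95% MoE uses a normal-approximation factor of 1.96. This is a generic reconstruction under those stated assumptions, not the ABS formulas themselves, and all input values are illustrative:

```python
import math

# Sketch of aggregating small area estimates under an independence
# assumption. Counts and RRMSEs below are hypothetical.

def aggregate_count(counts):
    """Count for the aggregated area: sum of the component counts."""
    return sum(counts)

def aggregate_rrmse(counts, rrmses):
    """Approximate RRMSE: root of summed squared absolute errors,
    relative to the aggregated count (assumes independent errors)."""
    total = sum(counts)
    rmse = math.sqrt(sum((r * c) ** 2 for c, r in zip(counts, rrmses)))
    return rmse / total

def moe_95(count, rrmse):
    """Approximate 95% MoE via the normal approximation."""
    return 1.96 * rrmse * count

counts, rrmses = [400.0, 600.0], [0.30, 0.20]
agg = aggregate_count(counts)
agg_rrmse = aggregate_rrmse(counts, rrmses)
agg_moe = moe_95(agg, agg_rrmse)
```

Under this approximation the aggregated RRMSE is lower than either component's, illustrating why aggregation can bring an unacceptably high error down to a usable level.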