The DataLab is an interactive data analysis solution for users to run advanced statistical analyses using detailed microdata. The DataLab environment contains recent versions of analytical software, including R, Python, SAS and STATA. Controls are in place in the DataLab to protect against the identification of individuals and organisations. All output from DataLab sessions is cleared by an ABS officer before it is released.
The data items available in the DataLab are detailed in the Data Item List in the Data downloads section.
For more information, including prerequisites for DataLab access, see the DataLab page.
Weights
The microdata are provided as a single flat file that contains both the person and household data. Each record has a person weight (FINWTPR) and household weight (FINWTHH) indicating how many population units are represented by the sample unit. The weights produce estimates that are designed to represent the demographic spread of the entire population and correct for bias in survey selection and response. Use the person weight when analysing counts of people and the household weight when analysing counts of households.
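The choice of weight can be illustrated with a minimal sketch. FINWTPR and FINWTHH are the weight names given above; the records, the EMPLOYED and RENTER items and all values are hypothetical, and the household count assumes each record represents a distinct household.

```python
# Hypothetical records: only FINWTPR and FINWTHH are real item names.
# EMPLOYED and RENTER are invented flags (1 = yes, 0 = no) for illustration.
records = [
    {"FINWTPR": 1200.0, "FINWTHH": 800.0, "EMPLOYED": 1, "RENTER": 1},
    {"FINWTPR": 950.0, "FINWTHH": 700.0, "EMPLOYED": 0, "RENTER": 0},
    {"FINWTPR": 1500.0, "FINWTHH": 1100.0, "EMPLOYED": 1, "RENTER": 1},
]


def weighted_count(rows, weight, condition=lambda row: True):
    """Sum the named weight over the rows that satisfy the condition."""
    return sum(row[weight] for row in rows if condition(row))


# Counts of people use the person weight.
persons_employed = weighted_count(records, "FINWTPR", lambda r: r["EMPLOYED"] == 1)
persons_total = weighted_count(records, "FINWTPR")
proportion_employed = persons_employed / persons_total

# Counts of households use the household weight
# (assumes one record per household in this toy data).
households_renting = weighted_count(records, "FINWTHH", lambda r: r["RENTER"] == 1)

print(persons_employed, round(proportion_employed, 3), households_renting)
```

Summing the weights, rather than counting records, is what turns sample units into population estimates.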
Reliability of estimates
Two types of error are possible in estimates based on a sample survey:
- non-sampling error
- sampling error
Non-sampling error is caused by factors other than those related to sample selection. It can occur at any stage of the survey process and includes factors such as questions being misunderstood and selected people who do not respond (e.g. refusals, non-contact).
Sampling error is the expected difference that can occur between the published estimates and the values that would have been produced if the whole population had been surveyed. Sampling error is the result of random variation and can be estimated using measures of variance in the data.
Measures of sampling error, including standard error (SE), relative standard error (RSE) and margin of error (MoE), can be estimated using the replicate weights. The GSS uses 60 replicate groups for both household and person weights, labelled RWH01 to RWH60 (household) and RWP01 to RWP60 (person).
Overview of replication methods
ABS household surveys employ complex sample designs and weighting which require special methods for estimating the variance of survey statistics. Variance estimators for a simple random sample are not appropriate for this survey microdata.
A class of techniques called 'replication methods' provide a general process for estimating variance for the types of complex sample designs and weighting procedures employed in ABS household surveys. The ABS uses a method called the Group Jackknife Replication Method.
A basic idea behind the replication approach is to split the sample into G replicate groups. One replicate group is then dropped from the file and a new set of weights is produced for the remaining sample. This is repeated for all G replicate groups to provide G sets of replicate weights. For each set of replicate weights, the statistic of interest is recalculated and the variance of the full sample statistic is estimated using the variability among the replicate statistics.
The statistics calculated from these replicates are called replicate estimates. Replicate weights provided on the microdata file enable the variance of survey statistics, such as means and medians, to be calculated relatively simply. Further technical explanation can be found in Section 4 of Research Paper: Weighting and Standard Error Estimation for ABS Household Surveys (Methodology Advisory Committee).
How to use replicate weights
To calculate the standard error of any statistic derived from the survey data, the method is as follows:
- Calculate the estimate of the statistic of interest using the main weight.
- Repeat the calculation above for each replicate weight, substituting the replicate weight for the main weight, to create G replicate estimates. For example, with 60 replicate weights you will have 60 replicate estimates.
- Use the full-sample estimate and the G replicate estimates from the two steps above as inputs to the formula below to calculate the estimate of the Standard Error (SE) for the statistic of interest.
\(\mathrm{SE}_{(y)} = \sqrt{\frac{G - 1}{G} \sum_{g=1}^{G} (y_{(g)} - y)^2}\)
- \(G =\) number of replicate groups
- \(g =\) the replicate group number
- \(y_{(g)} =\) replicate estimate for group g (the estimate of y calculated using the replicate weight for g)
- \(y =\) the weighted estimate of y from the sample
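The procedure and formula above can be sketched in Python as follows. This is a minimal illustration, not an ABS-supplied implementation. The function names, the item Y and the toy data are hypothetical, and the toy file carries only three replicate weights so the arithmetic is easy to follow. The real file has 60 replicate weights, which are supplied on the file and not constructed by the user.

```python
import math


def weighted_total(rows, weight, value):
    """Weighted total of `value` using the named weight column."""
    return sum(row[weight] * row[value] for row in rows)


def replicate_se(full_estimate, replicate_estimates):
    """SE = sqrt((G - 1) / G * sum((y_g - y)^2)) over the G replicate estimates."""
    G = len(replicate_estimates)
    squared_diffs = sum((y_g - full_estimate) ** 2 for y_g in replicate_estimates)
    return math.sqrt((G - 1) / G * squared_diffs)


def estimate_with_se(rows, value, main_weight, replicate_weights):
    """Return the full-sample estimate and its replicate-based standard error."""
    y = weighted_total(rows, main_weight, value)  # estimate using the main weight
    y_reps = [weighted_total(rows, w, value) for w in replicate_weights]  # G replicate estimates
    return y, replicate_se(y, y_reps)


# Toy data (hypothetical values): Y is an invented 0/1 item, G = 3.
toy_rows = [
    {"FINWTPR": 100, "RWP01": 0, "RWP02": 150, "RWP03": 150, "Y": 1},
    {"FINWTPR": 100, "RWP01": 150, "RWP02": 0, "RWP03": 150, "Y": 0},
    {"FINWTPR": 100, "RWP01": 150, "RWP02": 150, "RWP03": 0, "Y": 1},
]
estimate, se = estimate_with_se(toy_rows, "Y", "FINWTPR", ["RWP01", "RWP02", "RWP03"])
print(estimate, se)

# On the real file, the person replicate weight names would be RWP01 to RWP60.
gss_person_reps = [f"RWP{g:02d}" for g in range(1, 61)]
```

The same `estimate_with_se` call works for any statistic that can be recomputed under a different weight. Only `weighted_total` would need to be swapped, for example for a weighted mean or median.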
From the standard error you can then derive the following measures of sampling error: the relative standard error (RSE) and the margin of error (MoE) of the estimate.
\(\mathrm{Relative\ Standard\ Error\ (RSE)} = \frac{\mathrm{SE}}{\mathrm{Estimate}}\)
\(\mathrm{Margin\ of\ Error\ (MoE)} = 1.96 \times \mathrm{SE}\)
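As a short worked sketch of these two formulas, the snippet below uses hypothetical values for the estimate and its SE. The multiplier 1.96 corresponds to a 95% confidence level under a normal approximation, and the interval shown is the usual estimate plus or minus the MoE.

```python
def relative_standard_error(se, estimate):
    """RSE = SE / estimate (multiply by 100 to express it as a percentage)."""
    if estimate == 0:
        raise ValueError("RSE is undefined for an estimate of zero")
    return se / estimate


def margin_of_error(se, z=1.96):
    """MoE = z * SE; z = 1.96 gives a 95% margin of error."""
    return z * se


# Hypothetical inputs: an estimate of 5,000 persons with an SE of 250.
estimate, se = 5000.0, 250.0
rse = relative_standard_error(se, estimate)
moe = margin_of_error(se)
interval = (estimate - moe, estimate + moe)

print(f"RSE = {rse:.1%}, MoE = {moe:.0f}, interval = ({interval[0]:.0f}, {interval[1]:.0f})")
```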
Multi-response items
A number of questions included in the survey allowed respondents to provide one or more responses. Each response category for these 'multi-response questions' (or data items) is treated as a separate data item. These items share the same general identifier (Data item name), suffixed by a letter in sequence beginning with A.
The sum of individual multi-response categories may be greater than the applicable population, as respondents are able to select more than one response. Multi-response data items can be identified in the data item list where the words <multiple response> appear next to the data item name.
Continuous items
Some continuous data items are allocated special codes for certain responses (for example, 99999999 = Not known/not stated). Any special codes for continuous (summation) data items are listed in the Data Item List (DIL) and will be found in the categorical version of the continuous item. However, note that a label for 0 in the DIL (for example, 'Did not visit' or 'Did not do') does not necessarily mean that zeros are excluded from the ranges, as they may still be important in some calculations. Reference should be made to the categorical version of the item to identify which codes are specifically excluded.
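The handling of special codes can be sketched as follows. The item name HOURS and all values are hypothetical. The code 99999999 follows the example above, but the codes that apply to a given item should be confirmed against the DIL. Zeros are deliberately kept in this sketch.

```python
NOT_STATED = 99999999  # example special code; confirm the codes for each item in the DIL

# Hypothetical continuous item HOURS with one special-coded record.
records = [
    {"FINWTPR": 1000.0, "HOURS": 10},
    {"FINWTPR": 2000.0, "HOURS": 0},           # a genuine zero, kept in the mean
    {"FINWTPR": 1000.0, "HOURS": NOT_STATED},  # special code, excluded
]


def weighted_mean(rows, value, weight, special_codes=()):
    """Weighted mean of `value`, skipping rows that hold a special code."""
    valid = [r for r in rows if r[value] not in special_codes]
    total_weight = sum(r[weight] for r in valid)
    if total_weight == 0:
        raise ValueError("no valid records to average")
    return sum(r[weight] * r[value] for r in valid) / total_weight


mean_hours = weighted_mean(records, "HOURS", "FINWTPR", special_codes={NOT_STATED})
naive_mean = weighted_mean(records, "HOURS", "FINWTPR")  # special code treated as a value

print(round(mean_hours, 2), round(naive_mean, 2))
```

Leaving the special code in the calculation inflates the mean by several orders of magnitude, so it is worth checking the maximum value of any continuous item before analysis.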