METHODOLOGY
SELECTION OF SAMPLE
Data in the Census microdata files represent samples of dwelling, family and person records from the 2016 Census of Population and Housing. Systematic sampling techniques were utilised to ensure a representative sample across states and territories in each microdata file.
Detailed Microdata and Basic CURF files
The Detailed Microdata file contains a 5% sample of dwelling records, taken from occupied private dwellings and non-private dwellings, and their associated family and person records. That is, the Detailed Microdata file provides a sample of five occupied private and non-private dwelling records in every hundred from the Census, and their associated family and person records.
The 1% Basic CURF provides a sample of one private dwelling record in every hundred from the Census, and the associated family and person records. Dwellings with more than six usual residents were removed from the sample to ensure confidentiality of large dwellings. For non-private dwellings the sampling is applied to persons present, where one person in every hundred is selected and the associated dwelling records included on the file.
The data are released under the Census and Statistics Act 1905, which has provision for the release of individual level records, i.e. unit records, where the information is not likely to enable the identification of a particular person or organisation. Accordingly, there are no names or addresses on the microdata files, and other steps, including the following list of actions, are taken to maintain respondent confidentiality.
In both the Detailed Microdata and the Basic CURF:
- Records from the Other Territories, comprising Jervis Bay, Cocos (Keeling) and Christmas Islands, have been excluded from sampling, as have migratory, shipping and off-shore statistical areas; and
- Some data items that were collected in the Census have been excluded from the files.
In the Basic CURF, additional confidentiality measures were undertaken:
- Large households, i.e., with seven or more usual residents, have been replaced in the sample to ensure confidentiality of large households. A dwelling from a similar geographic region with similar size (up to six residents) was chosen via random sampling as a replacement for each large household;
- The level of detail of certain data items has been reduced by grouping, ranging or top coding values; and
- Where necessary, minor edits were made to individual records.
Given the nature of the changes made and the relatively small number of records involved, the effect on data for analysis purposes is considered negligible. These changes also mean that estimates produced from the microdata files may differ from those published in Census products (Quickstats, DataPacks, Community Profiles and TableBuilder) or DataLab output.
Data included on the microdata files comprise the key output items for the 2016 Census, including person demographics, labour force, education, family and dwelling characteristics. For a full list of available data items in TableBuilder, Detailed Microdata and Basic CURF files, please see the Data Item Lists on the Downloads tab.
CHANGES FROM PREVIOUS CENSUS MICRODATA FILES
Five new data items have been included on the 2016 Detailed Microdata file, four of which are also on the Basic CURF. These are:
- Indigenous Status (INGP) on the persons level
- Indigenous Household Indicator (INGDWTD) on the dwelling level
- Form type (FTPP) on the persons level
- Status in Employment (SIEMP), which is a new item for the 2016 Census and replaces Employment Type (EMTP), which was used in 2011 Census output.
- Type of Non-Private Dwelling (NPDD) on the dwelling level (available on the Detailed Microdata file only).
The following data items underwent changes to their classifications in the 2016 Census:
- Ancestry (ANC1P, ANC2P)
- Birthplace of Mother (BPFP)
- Birthplace of Father (BPMP)
- Income classifications for persons (INCP), family (FINF, FINASF, FIDF) and household (HIND, HINASD, HIDD, HIED)
- Religious Affiliation (RELP)
- Year of Arrival in Australia (YARP), to accommodate the years between the 2011 and 2016 Censuses.
For more information about these data items, refer to the Census Dictionary, 2016 (cat. no. 2901.0).
ESTIMATION PROCEDURE
An estimate of the total for an item can be obtained by totalling the item for the relevant Census microdata file and then multiplying the result by 20 for the Detailed Microdata file, or by 100 for the Basic CURF. Note that this estimate of total will not correspond exactly to the total that would be obtained from the full Census, firstly because of the sampling error arising due to the microdata files containing only a sample of Census records, and secondly, in the Basic CURF, because of the exclusion of large households.
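The weighting rule above can be sketched in a few lines. This is a minimal illustration only: the function name, the dictionary keys and the sample count are hypothetical, and only the weights of 20 and 100 come from the text.

```python
# Weights stated in the text: 20 for the 5% Detailed Microdata file,
# 100 for the 1% Basic CURF. The dictionary keys are illustrative labels.
WEIGHTS = {"detailed": 20, "basic_curf": 100}

def estimate_total(sample_total, file_type):
    """Scale a total observed on a microdata file up to a Census-level estimate."""
    return sample_total * WEIGHTS[file_type]

# Hypothetical sample total of 1,250 for some item.
print(estimate_total(1250, "detailed"))    # 25000
print(estimate_total(1250, "basic_curf"))  # 125000
```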
Averages from the microdata files, such as the proportion of persons falling into a particular category, can be used as estimates of the corresponding averages in the Census. For example, the proportion of Australian born persons who are students is estimated by the proportion of students observed among Australian born persons on the microdata files. If the denominator of such a proportion is known from the full Census, it can be multiplied by the estimated proportion to give an estimate of the numerator. For example, the total number of Australian born students could be estimated by multiplying the above proportion by the full-Census count of Australian born persons. This is an alternative to counting the Australian born students on a microdata file and multiplying by the weight (for example, by 20 on the Detailed Microdata file), and may be preferred in some circumstances since it is consistent with the known full-Census count.
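The two ways of estimating a numerator total can be compared with a short sketch. All figures below are invented for illustration and the function name is hypothetical; only the method (sample proportion multiplied by a known full-Census denominator) follows the text.

```python
def ratio_estimate(sample_numerator, sample_denominator, census_denominator):
    """Estimate a numerator total as (sample proportion) x (known Census denominator)."""
    proportion = sample_numerator / sample_denominator
    return proportion * census_denominator

# Hypothetical figures: 2,000 students among 10,000 Australian born persons
# on the Detailed Microdata file, and a known full-Census count of 201,000
# Australian born persons.
students_ratio = ratio_estimate(2000, 10000, 201000)
students_weighted = 2000 * 20  # simple weighted count, for comparison

print(round(students_ratio))  # 40200
print(students_weighted)      # 40000
```

The two estimates differ because the weighted sample denominator (10,000 x 20 = 200,000) does not exactly match the full-Census denominator in this made-up example.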
Household, family and person estimates are available for private dwellings in both Census microdata files. For the Detailed Microdata file, person and household estimates are available for non-private dwellings. For the Basic CURF, only person estimates are available for non-private dwellings, due to the differing sampling methodologies. Family records are not applicable for non-private dwellings in either file.
RELIABILITY OF ESTIMATES
The sampling error should be taken into account when interpreting estimates from the Census microdata files. A measure of the likely difference between an estimate from the Census microdata files and the corresponding full Census value is given by the standard error (SE) of the estimate. The SE indicates the extent to which an estimate might have varied by chance because only a sample of persons was included. There are about two chances in three that a sample estimate will differ by less than one SE from the full Census value, and about 19 chances in 20 that the difference will be less than two SEs. Another measure of sampling variability is the relative standard error (RSE), which is obtained by expressing the SE as a percentage of the estimate to which it refers.
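As a rough sketch of how the SE and RSE described above might be used, the following computes an RSE and the ranges lying within one and two SEs of an estimate. The function names and the figures are hypothetical.

```python
def relative_standard_error(estimate, se):
    """RSE: the SE expressed as a percentage of the estimate."""
    return 100.0 * se / estimate

def interval(estimate, se, n_se):
    """Range lying within n_se standard errors of the estimate."""
    return (estimate - n_se * se, estimate + n_se * se)

estimate, se = 50000, 1500  # hypothetical estimate and standard error

print(relative_standard_error(estimate, se))  # 3.0
print(interval(estimate, se, 1))  # (48500, 51500): about two chances in three
print(interval(estimate, se, 2))  # (47000, 53000): about 19 chances in 20
```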
Non-sampling errors may occur in any statistical collection - a full count or a sample - and should not be confused with imprecision due to sampling error, which is measured by the SE. Non-sampling errors in both Census microdata files are differences due to the exclusion of large dwellings, while in the Census as a whole there may be inaccuracies that occur because of imperfections in reporting by respondents, errors made in collection (such as when recording responses) and errors made in processing the Census data. It is not possible to quantify non-sampling error, but every effort is made to reduce it to a minimum. For the following examples, non-sampling error is assumed to be zero. In practice, the potential for non-sampling error adds to the uncertainty in the estimates that is caused by sampling variability.
Standard error calculation
Both Census microdata files can be treated, for the purposes of standard error calculations, as a simple random sample of dwellings from the private dwelling population. For some analytic purposes, the non-private dwelling population has only a minor influence on results, and it is sufficient to include each person counted in a non-private dwelling as a separate 'dwelling' when calculating standard errors.
Dwelling level estimates
Estimates of the SE of averages for dwelling-level items can be obtained using standard formulae for a simple random sample. These standard error formulae require computing the average value of an item of interest per dwelling on the Census microdata file. The formula for $\bar{y}$, the estimated average of an item that takes value $y_d$ for dwelling d out of n sampled dwellings in a geographic area, is:
$$\bar{y} = \frac{1}{n} \sum_d y_d$$
where $\sum_d$ represents summing over the n dwellings.
The standard error estimate is given by the following formula:
$$SE(\bar{y}) = \sqrt{\frac{1}{n(n-1)} \sum_d (y_d - \bar{y})^2}$$
The estimate of the total count for this item, $\hat{Y}$, and its corresponding SE estimate, $SE(\hat{Y})$, are obtained by multiplying the average per dwelling by the number of dwellings in the geographic area. The number of dwellings is approximated with minimal error by:
$$N = wn$$
where w is the weight (20 on the Detailed Microdata file and 100 on the Basic CURF) since the construction of the Census microdata file ensures proportional representation of geographic areas.
The formulae are as follows:
$$\hat{Y} = N\bar{y} = w \sum_d y_d$$
$$SE(\hat{Y}) = N \, SE(\bar{y})$$
Note that the geographic area to be used in these calculations should be the smallest geographic area containing the dwellings in question. For example, estimates for a single state should use state as the geographic area.
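The dwelling-level calculation can be sketched as follows. This is an illustration under stated assumptions rather than the official procedure: it assumes the usual simple-random-sample variance estimator with divisor n - 1 and no finite population correction, and the function name and data are hypothetical.

```python
import math

def dwelling_level_estimates(y, w):
    """y: item values, one per sampled dwelling in the geographic area.
    w: file weight (20 for the Detailed Microdata file, 100 for the Basic CURF)."""
    n = len(y)
    mean = sum(y) / n
    # Sample variance with divisor n - 1 (assumed form; no finite population correction).
    s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
    se_mean = math.sqrt(s2 / n)
    n_dwellings = w * n               # approximate number of dwellings in the area
    total = n_dwellings * mean        # estimated total for the item
    se_total = n_dwellings * se_mean  # SE of the estimated total
    return mean, se_mean, total, se_total

# Hypothetical item values for five sampled dwellings in one area.
mean, se_mean, total, se_total = dwelling_level_estimates([2, 0, 1, 3, 4], w=20)
print(mean, total)         # 2.0 200.0
print(round(se_total, 2))  # 70.71
```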
Person level estimates
The above formulae can be applied to totals of persons by treating the $y_d$ as person counts within the dwelling, i.e. $y_d$ is the number of persons from dwelling d with the characteristic of interest. This makes $\bar{y}$ the average number of persons per dwelling having this characteristic, and $\hat{Y}$ the total number of persons in the geographic area with this characteristic.
Family level estimates
Similarly, estimates for family-level items can be obtained by treating the $y_d$ as family counts within the dwelling, i.e. $y_d$ is the number of families from dwelling d with the characteristic of interest, $\bar{y}$ is the average number of families per dwelling having the characteristic, and $\hat{Y}$ is the total number of families in the geographic area with the characteristic.
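To apply the dwelling-level formulae to person or family characteristics, the unit records first need to be collapsed to one count per sampled dwelling. A minimal sketch follows; the record layout (`dwelling_id`, `student`) and the function name are hypothetical and do not reflect the actual file structure.

```python
from collections import Counter

def counts_per_dwelling(records, dwelling_ids, has_characteristic):
    """Return one count per sampled dwelling, keeping dwellings with a zero count."""
    counts = Counter(r["dwelling_id"] for r in records if has_characteristic(r))
    # Counter returns 0 for dwellings with no matching records.
    return [counts[d] for d in dwelling_ids]

# Hypothetical person records from three sampled dwellings.
persons = [
    {"dwelling_id": "A", "student": True},
    {"dwelling_id": "A", "student": True},
    {"dwelling_id": "B", "student": False},
    {"dwelling_id": "C", "student": True},
    {"dwelling_id": "C", "student": False},
]
y = counts_per_dwelling(persons, ["A", "B", "C"], lambda r: r["student"])
print(y)  # [2, 0, 1]
```

Dwellings with no persons (or families) having the characteristic must stay in the list with a count of zero, since n in the formulae is the number of all sampled dwellings in the geographic area.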
Clustering of the person sample
For some person level variables, it may be a reasonable approximation to treat the Census microdata files as simple random samples of persons, even though they are in fact samples of dwellings. This would involve letting d in the above formulae indicate persons rather than dwellings, and replacing n by the number of sampled persons in the geographic area of interest. Person level means and associated standard errors could then be obtained by a standard tabulation package applied to the person level data.
Unfortunately, doing this will typically give an underestimate of the actual SE. The extent of this underestimation depends on how clustered the variable of interest is within dwellings - that is, on how often similar values of the variable tend to occur together in the same dwelling. The understatement of standard error will be greatest for variables that are highly clustered within dwellings, such as birthplace.
For this reason, it would be appropriate when treating the Census microdata files as a sample of persons to obtain a measure of the effect of clustering for the variables being investigated. A suitable measure is the design factor (DEFT), given by the ratio of the SE calculated correctly (with dwellings as units) to the SE calculated treating persons as units. Standard errors from the person level analysis can then be adjusted by this factor.
The SE ignoring clustering will be denoted by $SE_p$, with the subscript p indicating that it is calculated at the person level. As an example, consider the estimated total of Australian born persons. $SE_p$ can be obtained by taking the person level Census microdata file and creating a variable taking the value 1 for Australian born persons and 0 otherwise. This variable is then used to estimate the total and its SE.
An example using the 2011 Census microdata files showed that the standard error produced ignoring clustering underestimates the actual standard error by a factor of 2, i.e. the design factor is 2. Users could expect that other totals (e.g. for geographic regions) for the variable 'Australian-born' would have a similar design factor.
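The design factor can be illustrated on a toy data set. The sketch below is not the ABS calculation: the data are invented, the helper name is hypothetical, and the SE of a total is assumed to take the simple-random-sample form with divisor n - 1 and no finite population correction.

```python
import math

def se_of_total(values, w):
    """SE of an estimated total, treating each entry of `values` as one sampled unit."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return w * n * math.sqrt(s2 / n)

# Hypothetical sample: each inner list is one dwelling, each entry is a person
# (1 = Australian born, 0 = otherwise). Values are strongly clustered within dwellings.
dwellings = [[1, 1, 1], [0, 0], [1, 1], [0, 0, 0], [1], [0]]

y_dwelling = [sum(d) for d in dwellings]           # counts per dwelling
y_person = [x for d in dwellings for x in d]       # one indicator per person

se_correct = se_of_total(y_dwelling, w=20)         # dwellings as units
se_p = se_of_total(y_person, w=20)                 # persons as units (ignores clustering)
deft = se_correct / se_p

print(round(deft, 2))  # 1.71
```

Both calculations estimate the same total (120 in this example); only the SE differs, and the person-level SE would be multiplied by the design factor to correct it.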
Standard errors for proportions and differences
Proportions
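Simple approximations can be used to estimate the standard error for a ratio of counts. If $\hat{Y}_1$ and $\hat{Y}_2$ are estimated totals for two nested categories (i.e. category 2 is a subset of category 1) then writing
$$RSE(\hat{Y}) = SE(\hat{Y}) / \hat{Y}$$
for the relative standard error gives the following approximation:
$$RSE(\hat{Y}_2 / \hat{Y}_1) \approx \sqrt{RSE(\hat{Y}_2)^2 - RSE(\hat{Y}_1)^2}$$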
This formula depends on the two categories being nested, and should not be used for distinct categories.
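A short numerical sketch of the nested-category approximation follows. It assumes the commonly used form in which the squared RSE of the proportion is the squared RSE of the subset total minus the squared RSE of the enclosing total; the function name and the RSE values are hypothetical.

```python
import math

def rse_of_proportion(rse_subset, rse_whole):
    """Approximate RSE of (subset total / enclosing total) for nested categories.
    Assumed form: sqrt(RSE(subset)^2 - RSE(whole)^2)."""
    return math.sqrt(rse_subset ** 2 - rse_whole ** 2)

# Hypothetical RSEs expressed as fractions: 5% for the subset total,
# 3% for the enclosing total.
rse = rse_of_proportion(0.05, 0.03)
print(round(rse, 4))  # 0.04
```

Because the subset total is smaller than the enclosing total, its RSE would normally be the larger of the two, which keeps the quantity under the square root positive.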
Differences
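If two totals $\hat{Y}_1$ and $\hat{Y}_2$ are for distinct categories (e.g. in comparing estimates across states), then the difference between the two totals has the following SE approximation:
$$SE(\hat{Y}_1 - \hat{Y}_2) \approx \sqrt{SE(\hat{Y}_1)^2 + SE(\hat{Y}_2)^2}$$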
While this formula will only be exact for differences between separate and uncorrelated (unrelated) characteristics or sub-populations, it is expected to provide a good approximation for most differences likely to be of interest.
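The difference approximation can be sketched numerically. It is assumed here to take the usual form for uncorrelated estimates, in which the squared SEs add; the function name and figures are hypothetical.

```python
import math

def se_of_difference(se_1, se_2):
    """Approximate SE of the difference between two totals for distinct categories.
    Assumed form: sqrt(SE_1^2 + SE_2^2), exact only for uncorrelated estimates."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)

# Hypothetical state-level estimates and their SEs.
total_a, se_a = 120000, 3000
total_b, se_b = 100000, 4000

difference = total_a - total_b
se_diff = se_of_difference(se_a, se_b)
print(difference, se_diff)  # 20000 5000.0
```

In this made-up case the difference is four SEs from zero, well beyond the two-SE range that covers about 19 chances in 20.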
Regression estimates
One use of the sample file will be to examine relationships between variables using regression methods. By treating the dwelling as the sample unit, standard regression packages can be used unweighted and the resulting standard errors and test statistics will be good estimates. For example, a regression model could be derived for $y_d$, the number of persons in the dwelling needing assistance with core activities, against various characteristics such as $x_d$, the number of persons in the dwelling aged over 65 years, to fit the linear regression model:
$$y_d = \alpha + \beta x_d + \varepsilon_d$$
Measures of model fit and of significance of the parameters from the standard package will then be appropriate. Unfortunately, such a linear model may not adequately describe the relationships between variables at a dwelling level.
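A dwelling-level regression of this kind might be sketched as below. This is an illustration only: it assumes a single explanatory variable (a model of the form y = alpha + beta * x + error), fits it by unweighted ordinary least squares in plain Python rather than a statistical package, and uses invented data.

```python
import math

def fit_simple_ols(x, y):
    """Unweighted least squares fit of y = alpha + beta * x, one observation per dwelling.
    Returns (alpha, beta, se_beta)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    # Residual variance with n - 2 degrees of freedom, then the SE of the slope.
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return alpha, beta, se_beta

# Hypothetical dwelling-level data:
# x = persons aged over 65 years, y = persons needing assistance with core activities.
x = [0, 1, 2, 0, 1, 2]
y = [0, 0, 1, 0, 1, 1]
alpha, beta, se_beta = fit_simple_ols(x, y)
print(alpha, beta, round(se_beta, 4))  # 0.0 0.5 0.1768
```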
If a similar regression is performed treating person as the sample unit, the resulting standard errors and measures of significance could be inaccurate or misleading. This arises because the persons in the sample are clustered within dwellings, and so their responses may be "correlated" or affected by similar influences such as characteristics of the dwelling. The extent to which the measures of significance are affected will depend on how clustered the variable is likely to be within dwellings.
If a person level analysis is performed, such as a 'logistic analysis' of the probability of a person having a given characteristic, then the effect of clustering should be taken into account when interpreting the outcomes. In particular, SEs are likely to be understated, as discussed in the section Clustering of the person sample, and this will tend to increase the apparent significance of modelled effects.
Techniques are available to perform valid analyses at the person level for a sample that is clustered within dwellings, treating persons as being subject to both person and dwelling effects. These techniques include 'multi-level', 'random effect' and 'mixed' modelling (Footnotes 1 and 2).
By using these techniques, models can be used that do a better job of describing the actual relationships between variables at both person and dwelling level. Statistical packages are widely available to validly perform such analyses.
___________________________
Footnote 1 Goldstein, H., 1995, 'Multilevel Statistical Models', 2nd ed., Edward Arnold, London; Halsted Press, New York.
Footnote 2 Snijders, Tom A. B. and Bosker, Roel J., 1999, 'Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling', SAGE, London.