7 An extract of the PIT dataset containing selected variables has been used in constructing the microdata product and the data for each individual has been linked across these three subsets using an encrypted person identifier, the Scrambled Tax File Number.
Expanded Analytical Business Longitudinal Database (EABLD)
8 The EABLD is the longitudinal business-level unit record data file created by the ABS in 2015. The Integrated Dataset used an extract of the EABLD for 2011-12 containing selected variables. The linking variable between the PIT dataset and the EABLD extract was the Australian Business Number (ABN), as issued by the ATO. For further information on the data sources and the linking methodology, refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
9 This microdata product aims to represent information on all employee earnings and jobs in Australia throughout the reference period of 1 July 2011 to 30 June 2012. The scope includes:
10 Employees who meet one of the following conditions are excluded from coverage in the microdata product.
12 For further information on scope and coverage of the Integrated Dataset, refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
DATA LINKING METHODOLOGY
13 The Integrated Dataset was created through a two-stage process. The first stage involved linking the component files (Client Register, Client Dataset and PAYG) within the PIT dataset, and the second stage involved integrating the linked PIT dataset with the EABLD.
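The two stages can be sketched as a pair of key-based joins. This is a minimal illustration only, not the ABS linking procedure: the field names `stfn` and `abn`, the record layout, and the choice to drop records without a match are all assumptions made for the example.

```python
def link_pit(client_register, client_dataset, payg):
    """Stage 1: join the three PIT subsets on the scrambled TFN.

    Field name 'stfn' is hypothetical. Unmatched PAYG jobs are dropped
    here, which is an assumption of this sketch.
    """
    register = {r["stfn"]: r for r in client_register}
    clients = {r["stfn"]: r for r in client_dataset}
    linked = []
    for job in payg:
        stfn = job["stfn"]
        if stfn in register and stfn in clients:
            # Later dicts win on key clashes, so job fields take priority.
            linked.append({**register[stfn], **clients[stfn], **job})
    return linked


def link_to_eabld(linked_pit, eabld):
    """Stage 2: attach business characteristics via the ABN.

    Field name 'abn' is hypothetical.
    """
    businesses = {b["abn"]: b for b in eabld}
    out = []
    for rec in linked_pit:
        biz = businesses.get(rec["abn"])
        if biz is not None:
            out.append({**rec, **biz})
    return out
```

Each output record then describes one job, carrying the person-level variables from the PIT subsets and the business-level variables from the EABLD extract.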
14 For details of the data linking process refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
15 This Employee Earnings and Jobs microdata product comprises, in part, tax data supplied by the ATO to the ABS under the Taxation Administration Act 1953, which requires that the ABS only use the data for the purpose of administering the Census and Statistics Act 1905. Any discussion of data limitations or weaknesses is in the context of using the data for statistical purposes, and is not related to the ability of the data to support the ATO's core operational requirements.
16 Data cleaning was undertaken on the PIT data in order to remove duplicate records, remove invalid PAYG records (jobs with less than $1 in gross payments), and derive data items which aligned with ABS standards and classifications, where possible. Duplicate records were identified as those where all variables were identical. Demographic variables (age and sex) were checked to ensure that they were referenced to 30 June 2012. Variables such as occupation were checked to ensure that they adhered to the ABS classifications and any erroneous or invalid codes were removed.
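The first two cleaning rules (exact duplicates and jobs with less than $1 in gross payments) can be sketched as follows. The field name `gross_payment` and the record layout are assumptions for illustration; the actual processing is described in the Information Paper.

```python
def clean_payg(records):
    """Drop exact duplicates and PAYG jobs with under $1 in gross payments.

    'gross_payment' is a hypothetical field name.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # A duplicate is a record whose variables are all identical.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        if rec["gross_payment"] < 1:
            continue  # invalid PAYG record
        cleaned.append(rec)
    return cleaned
```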
17 For the purposes of this microdata product, minimal data cleaning was required on the EABLD extract. In creating the EABLD, transformation of source data was required to ensure that the contents adhered to the ABS standards and classifications.
18 For details of the data cleaning process refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
19 In order to mitigate risks of disclosure, only a sample (10%) of the person-level records on the Integrated Dataset has been included in the microdata product. The sample has been chosen to be representative of the key characteristics of employees using a stratified sample design.
20 Key aspects of the sample design are:
21 A 10% simple random sample was taken from each stratum.
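A stratified 10% simple random sample can be sketched as below. The stratum function is supplied by the caller and is hypothetical, and the rule of taking at least one record per stratum is an assumption of this sketch rather than a documented feature of the design.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction=0.10, seed=0):
    """Draw a simple random sample of `fraction` within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Assumption: take at least one record from every stratum.
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))  # without replacement
    return sample
```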
22 The weighting ensured there was broad representativeness at the Statistical Area Level 4 by 1-digit Occupation and Age by Sex by 1-digit Occupation levels.
23 The microdata output contains:
24 Weighting is the process of adjusting a sample to infer results for the relevant population. To do this, a 'weight' is allocated to each sample unit - in this case, person (employee) records. The weight can be considered an indication of how many employees in the relevant population are represented by each person in the sample.
25 Estimates of the total number of persons with the specified characteristic should be obtained by summing the PERSON weights assigned to each linked record, using the variable called SWEIGHT.
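As a minimal sketch, an estimated person count is the sum of SWEIGHT over the records that have the characteristic of interest. The record layout and the `occupation` field used in the usage note are hypothetical.

```python
def weighted_count(records, has_characteristic, weight_var="SWEIGHT"):
    """Estimate the number of persons with a characteristic by summing weights."""
    return sum(r[weight_var] for r in records if has_characteristic(r))
```

For example, `weighted_count(records, lambda r: r["occupation"] == "2")` would estimate the number of employees in a hypothetical occupation group coded "2".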
26 Weights were calculated by calibrating to the following benchmarks:
27 This calibration ensures that the weighted sample estimates of total earnings in each of these groups match the total earnings for these groups according to the full Integrated Dataset.
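The effect of calibration can be illustrated with a simple ratio adjustment within non-overlapping groups. This is a sketch under stated assumptions, not the method used for this product: it starts from a design weight of 10 (the inverse of the 10% sampling fraction), uses hypothetical field names, and assumes every group has non-zero sample earnings. Calibrating to several overlapping benchmarks at once needs an iterative or regression-based technique.

```python
from collections import defaultdict


def calibrate(records, group_of, benchmark_totals, design_weight=10.0):
    """Scale weights so weighted earnings in each group match a benchmark.

    'earnings' and 'weight' are hypothetical field names.
    """
    weighted = defaultdict(float)
    for r in records:
        weighted[group_of(r)] += design_weight * r["earnings"]
    out = []
    for r in records:
        g = group_of(r)
        # Ratio of the benchmark total to the design-weighted sample total.
        factor = benchmark_totals[g] / weighted[g]
        out.append({**r, "weight": design_weight * factor})
    return out
```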
28 Replicate weights can be used in the following manner to estimate the variance of the full sample statistic.
29 Using the replicate weights, sub-samples are repeatedly selected from the whole sample and the statistic of interest calculated for each of them. The variance of the full sample statistic is then estimated using the variability among the replicate statistics calculated from these sub-samples. The sub-samples are called 'replicate groups', and the statistics calculated from these replicate groups are called 'replicate estimates'.
30 The replicate weights for the Employee Earnings and Jobs microdata product were created using the jackknife method of replication. Each record in the Employee Earnings and Jobs microdata product has 60 replicate weights attached to it.
31 The formulae for calculating the SE and RSE of an estimate using the jackknife replicate weights are:
SE(y) = sqrt( (59/60) * sum over g = 1, ..., 60 of (y(g) - y)^2 )
where:
g = 1, ..., 60 (the replicate group number)
y(g) = weighted estimate, having applied the weights for replicate group g
y = weighted estimate from the full sample weight
RSE(y) % = SE(y)/y * 100.
32 The 95% Margin of Error is calculated as MoE(y) = SE(y)*1.96.
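The calculation of SE, RSE and 95% Margin of Error for a weighted total can be sketched as below. The sketch assumes the usual delete-a-group jackknife variance, (G - 1)/G times the sum of squared differences between each replicate estimate and the full-sample estimate, with G the number of replicate weights (60 for this product). SWEIGHT is the full-sample weight named above; the replicate weight variable names are not given here, so they are passed in as a parameter.

```python
import math


def weighted_total(records, value_var, weight_var):
    """Weighted total of value_var using the given weight variable."""
    return sum(r[value_var] * r[weight_var] for r in records)


def jackknife_summary(records, value_var, replicate_weight_vars,
                      full_weight_var="SWEIGHT"):
    """Return the estimate with its jackknife SE, RSE (%) and 95% MoE."""
    y = weighted_total(records, value_var, full_weight_var)
    reps = [weighted_total(records, value_var, w) for w in replicate_weight_vars]
    g = len(reps)
    # Assumed delete-a-group jackknife variance formula.
    variance = (g - 1) / g * sum((y_g - y) ** 2 for y_g in reps)
    se = math.sqrt(variance)
    return {
        "estimate": y,
        "se": se,
        "rse_percent": se / y * 100,
        "moe_95": se * 1.96,
    }
```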
33 This method can also be used when modelling relationships from unit record data. In modelling, the full sample would be used to estimate the parameter being studied (such as a regression coefficient) and the 60 replicate groups would be used to provide 60 replicate estimates of the survey parameter. The variance of the estimate of the parameter from the full sample is then approximated, as above, by the variability of the replicate estimates.
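For modelling, the same idea applies to a fitted parameter. The sketch below estimates a weighted least squares slope with the full sample weight, re-estimates it with each replicate weight, and derives the SE from the spread of the replicate slopes. The (G - 1)/G jackknife factor and the data layout (parallel lists of values and weights) are assumptions for illustration.

```python
import math


def weighted_slope(x, y, w):
    """Slope of a weighted least squares line of y on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


def slope_with_se(x, y, full_weights, replicate_weights):
    """Full-sample slope and its jackknife SE from replicate slopes."""
    b = weighted_slope(x, y, full_weights)
    reps = [weighted_slope(x, y, w) for w in replicate_weights]
    g = len(reps)
    variance = (g - 1) / g * sum((b_g - b) ** 2 for b_g in reps)
    return b, math.sqrt(variance)
```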
SOURCES OF ERROR
34 Potential sources of error, including sampling and non-sampling errors, should be kept in mind when interpreting statistics from this product.
35 Sampling error occurs because only a proportion of the total population is used to produce estimates that represent the whole population. Sampling error refers to the fact that, for a given sample size, each sample will produce different results, which will usually not be equal to the population value. Given the large sample size for the Employee Earnings and Jobs microdata product (1 in 10 employees) and the stratified random sampling method used, sampling error will be relatively small in general, as quantified by the relative standard errors of estimates.
36 Non-sampling error is caused by factors other than those related to using a sample in developing statistical outputs. It refers to the presence of any factor that would result in the data values not accurately reflecting the 'true' value for the population. Such errors can occur at any stage of a collection (census, sample or administrative data) and are not easily identifiable or quantifiable.
37 The administrative data used in developing this microdata product is extensive in its scope, breadth, and utility, but it also contains missing and erroneous data, as well as data not suitable for the creation of official statistics without intervention. All these contribute to non-sampling errors. Simple editing strategies and cleaning have been applied to the administrative data used in this experimental output.
38 Non-sampling errors in this microdata product include but are not limited to those related to:
39 The Census and Statistics Act 1905 provides the authority for the ABS to collect statistical information, and requires that statistical output shall not be published or disseminated in a manner that is likely to enable the identification of a particular person or organisation. The confidentiality of respondents and businesses was maintained throughout the process. Access to taxation data is tightly controlled within the ABS. Policies and guidelines governing the disclosure of information were implemented and followed in order to maintain the confidentiality of individuals and businesses.
40 Some techniques used to minimise the risk of identifying individuals and businesses in this microdata product are collapsing of categories (e.g., geography collapsed to state/territory level for the smaller states/territories of Tasmania, Northern Territory and Australian Capital Territory) and perturbation.
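Collapsing of geography can be sketched as a simple recode. The state abbreviations, the argument names and the idea that records elsewhere keep a finer geography code are assumptions made for the example.

```python
# Smaller states/territories whose geography is reported at state level only.
SMALL_STATES = {"TAS", "NT", "ACT"}


def collapse_geography(state, detailed_code):
    """Return the state for smaller states/territories, else the finer code."""
    return state if state in SMALL_STATES else detailed_code
```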
41 Perturbation involves making small random adjustments to values and is considered the most satisfactory technique for mitigating the risk of identification while maximising the range of information that can be released. The two earnings variables, 'Total earnings from all jobs held in reference period' and 'Gross payment amount per job held during the reference period', have been perturbed. Perturbation has had a negligible impact on the underlying distribution of the variables.
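The general idea of perturbation can be illustrated with small multiplicative noise. The perturbation method actually applied to this product is not described here, so the uniform noise and the 1% bound below are purely hypothetical.

```python
import random


def perturb(values, max_relative_change=0.01, seed=0):
    """Apply a small random relative adjustment to each value.

    Hypothetical scheme: uniform noise within +/- max_relative_change.
    """
    rng = random.Random(seed)
    return [
        v * (1 + rng.uniform(-max_relative_change, max_relative_change))
        for v in values
    ]
```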
COHERENCE OF OUTPUTS ACROSS OTHER ABS COLLECTIONS
42 Analysis was conducted to assess the comparability of aggregate statistics produced from the full Integrated Dataset (experimental statistics) with those from related ABS household and business survey collections. They were found to be broadly coherent; however, differences were identified due to differences in scope, sample design, collection methodology and processing approaches. Moreover, the Integrated Dataset is based on data collected for administrative purposes, whereas ABS collections are designed to create statistical outputs.
43 For further information on the coherence of the experimental statistics with ABS estimates refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
44 As the microdata product is a subset of the Integrated Dataset, similar differences between statistics produced from the microdata and those from other ABS surveys can be expected.