6311.0.55.001 - Microdata: Employee Earnings and Jobs, Australia, 2011-12

ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 18/01/2016 First Issue

EXPLANATORY NOTES

1 The Employee Earnings and Jobs microdata product contains a 10% sample (at the person level) of the integrated employer-employee file.

FILE STRUCTURE

2 The structure of the microdata product is hierarchical:
1. Person (Employee)
2. Job (along with business characteristics relating to that job)

3 For persons who had a missing job record, a 'dummy' (job) record has been created to maintain the integrity of the file structure. Data items for these records have 'Not known' values if relevant, or have been given a zero value. These records are identified by the data item Dummy Job record data flag (DUMJOBF) having a value of 1.

4 The same applies to business data items where a job could not be linked to a business. These records are identified by the data item No business data available flag (NOBUSDAT) having a value of 1.

DATA SOURCES

5 Person and business level data for 2011-2012, sourced from the Personal Income Tax data and the Expanded Analytical Business Longitudinal Database respectively, were used to construct the Integrated Dataset from which the microdata product was created.

Personal Income Tax (PIT) dataset

6 The PIT dataset contains person level unit record data compiled by the Australian Taxation Office (ATO) and consists of three subsets:
- Client Register;
- Client Dataset; and
- Individual Pay As You Go (PAYG) Dataset.

7 An extract of the PIT dataset containing selected variables has been used in constructing the microdata product and the data for each individual has been linked across these three subsets using an encrypted person identifier, the Scrambled Tax File Number.

Expanded Analytical Business Longitudinal Database (EABLD)

8 The EABLD is the longitudinal business level unit record data file created by the ABS in 2015. The Integrated Dataset used an extract of the EABLD for 2011-12 containing selected variables.
The linking variable between the PIT dataset and the EABLD extract was the Australian Business Number (ABN) as issued by the ATO. For further information on the data sources and the linking methodology, refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

SCOPE

9 This microdata product aims to represent information on all employee earnings and jobs in Australia throughout the reference period of 1 July 2011 to 30 June 2012. The scope includes:
- All persons who were an employee at any point in the reference period as recorded on either an Individual Tax Return (ITR) or an Individual PAYG summary;
- All jobs as reported in an Individual PAYG summary during the reference period; and
- All businesses which provided an Individual PAYG summary to an employee in the reference period.

COVERAGE

10 Employees who meet one of the following conditions are excluded from coverage in the microdata product.
- Employees who did not report earnings on an ITR for any of the following reasons:
  - Did not submit an ITR for any of the reasons outlined on pages 6 and 7 of the Individual Tax Return Instructions 2012;
  - Did not submit an ITR for any other reason; or
  - Submitted an ITR but did not report their applicable earnings.
- Employees who did not receive an Individual PAYG summary from an employer for any reason including:
  - They worked for cash in hand or other payments not recorded on an Individual PAYG summary;
  - They conducted illicit activities not recorded on Individual PAYG summaries; or
  - They did not supply their Tax File Number to their employer.

11 There were no businesses excluded on the basis of coverage.

12 For further information on scope and coverage of the Integrated Dataset, refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).
DATA LINKING METHODOLOGY

13 The Integrated Dataset was created through a two stage process. The first stage involved linking the component files (Client Register, Client Dataset and PAYG) within the PIT dataset, and the second stage involved integrating the linked PIT dataset with the EABLD.

14 For details of the data linking process refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

DATA CLEANING

15 This Employee Earnings and Jobs microdata product is comprised, in part, of tax data supplied by the ATO to the ABS under the Taxation Administration Act 1953, which requires that the ABS only use the data for the purpose of administering the Census and Statistics Act 1905. Any discussion of data limitations or weaknesses is in the context of using the data for statistical purposes, and is not related to the ability of the data to support the ATO's core operational requirements.

16 Data cleaning was undertaken on the PIT data in order to remove duplicate records, remove invalid PAYG records (jobs with less than $1 in gross payments), and derive data items which aligned with ABS standards and classifications, where possible. Duplicate records were identified as those where all variables were identical. Demographic variables (age and sex) were checked to ensure that they were referenced to 30 June 2012. Variables such as occupation were checked to ensure that they adhered to the ABS classifications and any erroneous or invalid codes were removed.

17 For the purposes of this microdata product, minimal data cleaning was required on the EABLD extract. In creating the EABLD, transformation of source data was required to ensure that the contents adhered to the ABS standards and classifications.
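The two record-level rules in paragraph 16 (drop records where all variables are identical, and drop jobs with less than $1 in gross payments) can be illustrated with a minimal sketch. This is not the ABS processing code: the field names person_id, abn and gross_payment and the sample values are hypothetical, chosen only for illustration.

```python
def clean_payg_records(records):
    """Drop exact duplicates and jobs with gross payments under $1.

    `records` is a list of dicts; all field names here are hypothetical.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # A duplicate is a record whose every variable matches an earlier one.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        # Jobs with less than $1 in gross payments are treated as invalid.
        if rec["gross_payment"] < 1:
            continue
        cleaned.append(rec)
    return cleaned


# Illustrative (made-up) job records.
sample = [
    {"person_id": "A", "abn": "111", "gross_payment": 52000.0},
    {"person_id": "A", "abn": "111", "gross_payment": 52000.0},  # exact duplicate
    {"person_id": "B", "abn": "222", "gross_payment": 0.5},      # under $1
    {"person_id": "B", "abn": "333", "gross_payment": 1.0},
]
cleaned = clean_payg_records(sample)
```

In this sketch a record counts as a duplicate only when every field matches, which mirrors the "all variables were identical" rule; records that differ in any field are kept.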
18 For details of the data cleaning process refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

SAMPLE DESIGN

19 In order to mitigate risks of disclosure only a sample (10%) of the records (person level) on the Integrated Dataset have been included in the microdata product. The sample has been chosen to be representative of the key characteristics of Employees using a stratified sample design.

20 Key aspects of the sample design are:
- The sample was stratified by Statistical Area Level 4, Occupation groups at the 1 digit level and ranges of total annual employee earnings.
- Strata were constructed to have a minimum size of 100 persons.

21 A 10% simple random sample was taken from each stratum.

22 The weighting ensured there was broad representativeness at the Statistical Area Level 4 by 1 digit Occupation and Age by Sex by 1 digit Occupation levels.

23 The microdata output contains:
- 1,033,031 persons which when weighted represent 10,333,171 persons.
- 1,387,945 job records.
- 315,674 businesses (stored at the job level). This number consists of 257,045 where business information is available and 58,629 dummy businesses allocated to jobs where business information was not available. The latter can be identified by the data item 'No business data available flag' equalling 1.

WEIGHTING

Sample weights

24 Weighting is the process of adjusting a sample to infer results for the relevant population. To do this, a 'weight' is allocated to each sample unit - in this case, person (employee) records. The weight can be considered an indication of how many employees in the relevant population are represented by each person in the sample.

25 Estimates of the total number of persons with the specified characteristic should be obtained by summing the PERSON weights assigned to each linked record, using the variable called SWEIGHT.
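As a minimal sketch of the estimation rule in paragraph 25, the weighted number of persons with a given characteristic is the sum of SWEIGHT over the sample persons who have it. SWEIGHT is the weight variable named above; the sex field and the weight values below are made up for illustration.

```python
def weighted_person_count(persons, has_characteristic):
    """Sum SWEIGHT over the sample persons that satisfy the predicate."""
    return sum(p["SWEIGHT"] for p in persons if has_characteristic(p))


# Illustrative person-level records (values are not from the product).
persons = [
    {"sex": "F", "SWEIGHT": 10.5},
    {"sex": "M", "SWEIGHT": 9.75},
    {"sex": "F", "SWEIGHT": 10.25},
]

# Estimated number of female employees represented by this toy sample.
est_female = weighted_person_count(persons, lambda p: p["sex"] == "F")

# Estimated total number of employees represented.
est_total = weighted_person_count(persons, lambda p: True)
```

Counting unweighted sample records instead would understate the population by roughly a factor of ten, since each sampled person represents about ten employees.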
26 Weights were calculated by calibrating to the following benchmarks:
- Total Earnings for each Statistical Area Level 4 by Occupation group (including 'Not known' SA4 and/or 'inadequately described' Occupation); and
- Total Earnings for each Age range by Sex by Occupation group (including 'inadequately described' Occupation).

27 This calibration ensures that the weighted sample estimates of total earnings in each of these groups match the total earnings for these groups according to the full Integrated Dataset.

Replicate Weights

28 Replicate weights can be used in the following manner to estimate the variance of the full sample statistic.

29 Using the replicate weights, sub-samples are repeatedly selected from the whole sample and the statistic of interest calculated for each of them. The variance of the full sample statistic is then estimated using the variability among the replicate statistics calculated from these sub-samples. The sub-samples are called 'replicate groups', and the statistics calculated from these replicate groups are called 'replicate estimates'.

30 The replicate weights for the Employee Earnings and Jobs microdata product were created using the jackknife method of replication. Each record in the Employee Earnings and Jobs microdata product has 60 replicate weights attached to it.

31 The formulae for calculating the standard error (SE) and relative standard error (RSE) of an estimate using the jackknife replicate weights are:

SE(y) = sqrt( (59/60) × Σ (y(g) - y)² ), with the sum taken over g = 1, ..., 60

RSE(y) % = SE(y) / y × 100

where:
g = 1, ..., 60 (the number of replicate groups)
y(g) = weighted estimate, having applied the weights for replicate group g
y = weighted estimate from the full sample weight.

32 The 95% Margin of Error is calculated as MoE(y) = SE(y) × 1.96.

33 This method can also be used when modelling relationships from unit record data. In modelling, the full sample would be used to estimate the parameter being studied (such as a regression coefficient) and the 60 replicate groups would be used to provide 60 replicate estimates of the survey parameter.
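The SE, RSE and margin of error calculations described in paragraphs 31 and 32 can be sketched as follows. This is an illustration under stated assumptions, not ABS code: it assumes the delete-a-group jackknife variance form with a factor of (G - 1)/G for G replicate groups (59/60 for this product), it uses a toy file with 3 replicate groups rather than 60, and the field names earnings and REPWT are hypothetical (only SWEIGHT is named in these notes).

```python
import math


def jackknife_se(full_estimate, replicate_estimates):
    """Standard error from replicate estimates (delete-a-group jackknife).

    Assumes the variance factor (G - 1) / G, where G is the number of
    replicate groups.
    """
    g = len(replicate_estimates)
    factor = (g - 1) / g
    variance = factor * sum((y_g - full_estimate) ** 2 for y_g in replicate_estimates)
    return math.sqrt(variance)


def rse_percent(full_estimate, se):
    """Relative standard error as a percentage of the estimate."""
    return se / full_estimate * 100


def margin_of_error_95(se):
    """95% margin of error."""
    return se * 1.96


def weighted_total(records, weight_of):
    """Weighted total of earnings using the weight returned by weight_of."""
    return sum(r["earnings"] * weight_of(r) for r in records)


# Toy records: each has a full sample weight and 3 replicate weights.
# In each replicate, one group is dropped (weight 0) and the rest scaled up.
records = [
    {"earnings": 40, "SWEIGHT": 10, "REPWT": [0, 15, 15]},
    {"earnings": 50, "SWEIGHT": 10, "REPWT": [15, 0, 15]},
    {"earnings": 60, "SWEIGHT": 10, "REPWT": [15, 15, 0]},
]
n_groups = 3

y = weighted_total(records, lambda r: r["SWEIGHT"])
y_reps = [weighted_total(records, lambda r, g=g: r["REPWT"][g]) for g in range(n_groups)]

se = jackknife_se(y, y_reps)
rse = rse_percent(y, se)
moe = margin_of_error_95(se)
```

The same pattern applies to any statistic: compute it once with the full sample weight and once with each replicate weight, then pass the results to jackknife_se.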
The variance of the estimate of the parameter from the full sample is then approximated, as above, by the variability of the replicate estimates.

SOURCES OF ERROR

34 Potential sources of error, including sampling and non-sampling errors, should be kept in mind when interpreting statistics from this product.

Sampling Error

35 Sampling error occurs because only a proportion of the total population is used to produce estimates that represent the whole population. For a given sample size, each possible sample will produce different results, which will usually not be equal to the population value. Given the large sample size for the Employee Earnings and Jobs microdata product (1 in 10 employees), and the stratified random sampling method used, sampling error will be relatively small in general, as quantified by the relative standard errors of estimates.

Non-sampling Error

36 Non-sampling error is caused by factors other than those related to using a sample in developing statistical outputs. It refers to the presence of any factor that would result in the data values not accurately reflecting the 'true' value for the population. Such errors can occur at any stage of a collection (census, sample or administrative data) and are not easily identifiable or quantifiable.

37 The administrative data used in developing this microdata product is extensive in its scope, breadth, and utility, but it also contains missing and erroneous data, as well as data not suitable for the creation of official statistics without intervention. All of these contribute to non-sampling error. Simple editing strategies and cleaning have been applied to the administrative data used in this experimental output.
38 Non-sampling errors in this microdata product include but are not limited to those related to:

Linking accuracy
During the construction of the Integrated Dataset a number of approaches were taken to allocate each ABN within a complex business structure to a single set of business characteristics. Further investigation into the allocation method is required as part of the future Linked Employer-Employee Database (LEED) development. For further detail on the accuracy of the linking, refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

Coverage
Employees who did not report earnings on an ITR or did not receive an Individual PAYG summary from an employer were excluded from coverage in the Integrated Dataset. There were no businesses excluded on the basis of coverage.

Non-response
This refers to blank fields in the PIT and PAYG forms received from the ATO. No attempt was made to impute the missing values as they were not sufficient to impact the analytical value of the dataset. For further details regarding missing values for key variables refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

Response errors
These are errors caused by respondents intentionally or accidentally providing inaccurate responses. They are hard to detect and to quantify. The extent of occurrence of this error has not been assessed in this experimental exercise.

Processing errors
These are errors that occur in the process of data collection, data entry, coding, editing and output. Once again, these are hard to identify and quantify and have not been assessed in this experimental exercise.
CONFIDENTIALITY

39 The Census and Statistics Act, 1905 provides the authority for the ABS to collect statistical information, and requires that statistical output shall not be published or disseminated in a manner that is likely to enable the identification of a particular person or organisation. The confidentiality of respondents and businesses was maintained throughout the process. Access to taxation data is tightly controlled within the ABS. Policies and Guidelines governing the disclosure of information were implemented and followed in order to maintain the confidentiality of individuals and businesses.

40 Some techniques used to minimise the risk of identifying individuals and businesses in this microdata product are collapsing of categories (e.g., geography collapsed to state/territory level for the smaller states/territories of Tasmania, Northern Territory and Australian Capital Territory) and perturbation.

41 Perturbation involves making small random adjustments to values and is considered the most satisfactory technique for mitigating the risk of identification while maximising the range of information that can be released. The two earnings variables, Total earnings from all jobs held in reference period and Gross payment amount per job held during the reference period, have been perturbed. Perturbation has had a negligible impact on the underlying distribution of the variables.

COHERENCE OF OUTPUTS ACROSS OTHER ABS COLLECTIONS

42 Analysis was conducted to assess the comparability of aggregate statistics produced from the full Integrated Dataset (experimental statistics) and those from related ABS household and business survey collections. They were found to be broadly coherent; however, differences were identified due to the differences in scope, sample design, collection methodology and processing approaches.
Moreover, the Integrated Dataset is based on data collected for administrative purposes, whereas ABS collections are designed to create statistical outputs.

43 For further information on the coherence of the experimental statistics with ABS estimates refer to the Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12 (cat. no. 6311.0).

44 As the microdata product is a subset of the Integrated Dataset, similar differences between statistics produced from the microdata and those from other ABS surveys can be expected.