6311.0 - Information Paper: Construction of Experimental Statistics on Employee Earnings and Jobs from Administrative Data, Australia, 2011-12  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 11/12/2015  First Issue


The LEED Foundation Projects integrated person and business level data for the 2011-12 financial year to construct the Integrated Dataset. The two data sources are the Personal Income Tax (PIT) data and the Expanded Analytical Business Longitudinal Database (EABLD).

Any discussion of limitations or weaknesses in relation to ATO or ABR data is in the context of using the information for statistical purposes, and is not related to the ability of the data to support the ATO or ABR's core operational requirements.


The PIT dataset is person level unit record data compiled by the ATO. It is provided to the ABS in three subsets, and the data for each individual can be linked using an encrypted person identifier, the Scrambled Tax File Number (STFN). For the purpose of the LEED Foundation Projects, select variables were extracted from each of the subsets, as described below.

Client Register

The Client Register contains demographic information for each person who has submitted an Individual Tax Return (ITR), or had some other transaction with the ATO, at some point in time. The register is continually updated using information from various sources, such as ITRs. As a result, the information is referenced to the date on which the extract was taken by the ATO and provided to the ABS. The reference date for the Client Register file for the LEED Foundation Projects is July 2014.

The following key variables are sourced from the Client Register:

    • Scrambled TFN (STFN) of employee
    • Geocoded address (State and territory and Statistical Area Level 4).

Client Dataset

The Client Dataset contains personal income information for each person who lodged an ITR with the ATO. The data are aggregated to the person level and include items such as Earnings (see Explanatory Notes, paragraphs 36-51) and Occupation in main job (see Explanatory Notes, paragraphs 62-63). Data provided to the ABS by the ATO are from taxation returns processed up to 16 months after the end of the financial year (i.e. returns processed up to 31 October 2013 for the financial year ending 30 June 2012).
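The 16-month processing window can be checked with a little date arithmetic. The following is a minimal sketch, assuming the cutoff falls on the last day of the month, as in the 31 October 2013 example; the function name `processing_cutoff` is hypothetical and not part of any ATO or ABS system.

```python
import calendar
from datetime import date


def processing_cutoff(fy_end: date, months: int = 16) -> date:
    """Return the last day of the month that is `months` months after `fy_end`.

    Assumption: the cutoff is a month-end date, as in the source's example.
    """
    # Count months from year 0 so that year rollover is handled by divmod.
    total_months = fy_end.year * 12 + (fy_end.month - 1) + months
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


cutoff = processing_cutoff(date(2012, 6, 30))
print(cutoff)  # 2013-10-31
```

For the financial year ending 30 June 2012 this reproduces the 31 October 2013 cutoff quoted above.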

The following key variables were extracted from the Client Dataset:
    • STFN of employee
    • Sex
    • Age
    • Occupation in main job
    • Salary or wages
    • Allowances, earnings, tips, directors fees etc.
    • Employer lump sum payments
    • Attributed personal services income
    • Employee share schemes, total assessable discount amount
    • Total reportable fringe benefits amounts
    • Reportable employer superannuation contributions.

For further information on the ITR data items referred to above, please refer to the Individual Tax Return Instructions 2012 on the ATO website.

Individual Pay As You Go (PAYG) Dataset

This contains job level information reported by employers about gross payments made to employees and the start and end dates for each job. This dataset contains both the STFN of the employee and the Australian Business Number (ABN) of the employer.

The following key variables were extracted from the Individual PAYG Dataset:
    • STFN of employee
    • ABN of employer
    • Gross payment amounts
    • Employment date information.
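The way the three PIT subsets relate through the STFN can be illustrated with a small sketch. The records, field names and the function `link_person_records` below are hypothetical, not the ATO's or ABS's actual schema or linkage method, and the sketch assumes a simple union of every STFN seen in any subset.

```python
# Hypothetical sample records; field names are illustrative assumptions.
client_register = [
    {"stfn": "A1", "state": "NSW"},
    {"stfn": "B2", "state": "VIC"},
]
client_dataset = [
    {"stfn": "A1", "sex": "F", "age": 34, "salary_or_wages": 52000},
    {"stfn": "B2", "sex": "M", "age": 51, "salary_or_wages": 78000},
]
payg_jobs = [
    {"stfn": "A1", "abn": "11111111111", "gross_payment": 30000},
    {"stfn": "A1", "abn": "22222222222", "gross_payment": 22000},
    {"stfn": "B2", "abn": "11111111111", "gross_payment": 78000},
]


def link_person_records(register, dataset, jobs):
    """Build one record per person, keyed by STFN, with a list of jobs."""
    people = {}

    def person(stfn):
        # Create an empty person record the first time an STFN is seen.
        return people.setdefault(stfn, {"stfn": stfn, "jobs": []})

    # Person-level subsets contribute one set of attributes per STFN.
    for row in register:
        person(row["stfn"]).update(row)
    for row in dataset:
        person(row["stfn"]).update(row)
    # The job-level subset contributes zero or more jobs per STFN.
    for job in jobs:
        details = {k: v for k, v in job.items() if k != "stfn"}
        person(job["stfn"])["jobs"].append(details)
    return people


linked = link_person_records(client_register, client_dataset, payg_jobs)
print(len(linked["A1"]["jobs"]))  # 2
```

The person-level subsets supply one set of attributes per STFN, while the PAYG subset can supply several jobs per STFN, each carrying the ABN of the employer.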


The business unit record data used in the LEED Foundation Projects comes from the EABLD, construction of which was completed in 2015. The EABLD is based on the ABS Business Register (ABSBR), which uses the ABS Units Model to describe the characteristics of businesses and the structural relationships between related businesses.

For further information on the construction of the EABLD, refer to Information Paper: Construction of the Expanded Analytical Business Longitudinal Database, 2001-02 to 2012-13 (cat. no. 8171.0).

For further information on the ABS Units Model, refer to the Appendix of the Standard Economic Sector Classifications of Australia, 2008 (cat. no. 1218.0).

The LEED Foundation Projects used an extract of the EABLD for the 2011-12 financial year containing selected variables (as described below). The EABLD extract contains all businesses registered up to and including the 2011-12 financial year. These are separated into two populations, as described below.

Non-profiled population (simple businesses)

The majority of businesses have simple structures and the unit registered for an ABN will satisfy ABS statistical reporting requirements. These businesses form the non-profiled population in the EABLD extract. The ABN is the statistical unit used in the LEED Foundation Projects to represent businesses in the non-profiled population.

Profiled population (complex businesses)

For those businesses where the ABN is not considered suitable for ABS statistical requirements, the ABS maintains its own units structure (the ABS Units Model) through direct contact with businesses. This population, known as the profiled population, typically consists of large, diverse businesses with complex structures. For businesses in the profiled population, statistical units include the Enterprise Group (EG) and the Type of Activity Unit (TAU). The range of activities carried out across the EG can be very diverse. The TAU is established to represent a grouping of one or more businesses within the EG that covers all the operations within an industry sub-division and for which a basic set of financial, production and employment data can be reported. The TAU is the statistical unit used in the LEED Foundation Projects to represent EGs in the profiled population, such that each TAU of a complex business (EG) is considered to be a separate business.

The following key variables were extracted from the EABLD:

    • ABN (non-profiled population) or TAU (profiled population) of business
    • Type of Legal Organisation (TOLO)
    • Standard Institutional Sector Classification Australia (SISCA)
    • Employment size
    • Industry (ANZSIC)
    • Business turnover.
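The choice of statistical unit for the two populations can be sketched as follows. The rows, field names (`profiled`, `tau_id`, `eg_id`) and the function `statistical_unit` are hypothetical illustrations, not the actual EABLD layout.

```python
# Hypothetical EABLD-style rows; field names are illustrative assumptions.
eabld_extract = [
    # Non-profiled (simple) business: the ABN is the statistical unit.
    {"abn": "11111111111", "profiled": False, "eg_id": None, "tau_id": None},
    # One profiled Enterprise Group (EG) with two TAUs.
    {"abn": "22222222222", "profiled": True, "eg_id": "EG-01", "tau_id": "TAU-001"},
    {"abn": "22222222222", "profiled": True, "eg_id": "EG-01", "tau_id": "TAU-002"},
]


def statistical_unit(row):
    """Return (unit_type, unit_id): TAU for profiled rows, ABN otherwise."""
    if row["profiled"]:
        return ("TAU", row["tau_id"])
    return ("ABN", row["abn"])


# Each distinct unit is treated as a separate business.
units = {statistical_unit(row) for row in eabld_extract}
print(len(units))  # 3
```

In this example the single EG contributes two businesses, one per TAU, and the simple business contributes one, identified by its ABN.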


The PIT data were cleaned to remove duplicate records, remove invalid PAYG records (jobs with less than $1 in gross payment), and derive data items that align with ABS standards and classifications. Duplicate records were identified as those where all variables were identical. Demographic variables (e.g. age and sex) were checked to ensure that they were referenced to 30 June 2012. Variables such as occupation were checked to ensure that they adhered to the ABS classifications, and any erroneous or invalid codes were removed. After this cleaning, there were 12,734,746 records on the Client Register and Client Dataset (combined), and 13,316,438 records on the Individual PAYG Dataset.
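Two of the cleaning rules described above, dropping exact duplicates and dropping PAYG jobs with less than $1 in gross payment, can be sketched as below. The sample records, field names and the function `clean_payg` are hypothetical, and the sketch does not cover the derivation of data items or the classification checks.

```python
def clean_payg(records):
    """Drop exact duplicate records and jobs with under $1 in gross payment."""
    seen = set()
    cleaned = []
    for rec in records:
        # A record is a duplicate only if every variable is identical.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        # Invalid PAYG record: less than $1 in gross payment.
        if rec["gross_payment"] < 1:
            continue
        cleaned.append(rec)
    return cleaned


# Hypothetical sample; field names are illustrative assumptions.
sample = [
    {"stfn": "A1", "abn": "111", "gross_payment": 30000.0},
    {"stfn": "A1", "abn": "111", "gross_payment": 30000.0},  # exact duplicate
    {"stfn": "B2", "abn": "222", "gross_payment": 0.5},      # under $1
    {"stfn": "B2", "abn": "333", "gross_payment": 1.0},      # kept
]

cleaned = clean_payg(sample)
print(len(cleaned))  # 2
```

Comparing every field means that two jobs differing in any variable, such as the employer ABN, are both retained.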

Negligible data cleaning was required on the EABLD extract. After this cleaning, there were 6,917,943 records on the EABLD extract.