1 This publication uses national Vocational Education and Training (VET) in schools data for 2006, Census of Population and Housing data for 2011 and Australian Census Longitudinal Dataset data for 2006 and 2011.
Vocational education and training in schools data
2 Data on VET in Schools are collected from administrative sources held by senior secondary assessment authorities, sometimes known as Boards of Studies, or state training authorities in each state and territory. These authorities submit the data to the National Centre for Vocational Education Research where a national dataset is compiled.
3 Data are inclusive of all persons aged 15-19 years who were enrolled in a VET in Schools module or unit of competency in 2006.
4 A module is a self-contained block of learning which can be completed on its own or as part of a course, and which may also result in the attainment of one or more units of competency.
5 A unit of competency is a component of a competency standard. It is a statement of a key function or role in a particular job or occupation.
Census of Population and Housing
6 The Census is undertaken by the ABS every five years, and is collected under the authority of the Census and Statistics Act 1905. For information about the 2011 Census, including collection methodology, please refer to the information provided on the Census 2011 Reference and Information section of the ABS website. Information about the data quality of the Census is also available on the ABS website under Census Data Quality.
7 The scope of the first issue of this publication is restricted to persons who were enrolled in a VET in Schools module or unit of competency in 2006, attending school in Year 11 in 2006, and who also responded to the 2011 Census of Population and Housing.
8 Statistical data integration involves combining information from different administrative and/or survey sources to provide new datasets for statistical and research purposes. Further information on data integration is available on the National Statistical Service website – Data Integration.
9 Data linking is a key part of statistical data integration and involves the technical process of combining records from different source datasets using variables that are shared between the sources. Data linkage is typically performed on records that represent individual persons, rather than aggregates. The most common methods used link records on exact matches for common variables ('deterministic' linkage), or close matches ranked by probabilities that the variables used will result in a true match ('probabilistic' linkage).
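The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only: the field names, the example records and the m- and u-probabilities are hypothetical, and the log-ratio scoring shown is one common way of ranking close matches (in the style of Fellegi and Sunter), not a method described in this publication.

```python
import math

# Hypothetical probabilities, for illustration only:
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "sex": (0.99, 0.50),
    "dob": (0.95, 0.01),
    "postcode": (0.90, 0.05),
}

def deterministic_match(rec_a, rec_b):
    """Deterministic linkage: every common variable must agree exactly."""
    return all(rec_a[field] == rec_b[field] for field in FIELD_PROBS)

def match_score(rec_a, rec_b):
    """Probabilistic linkage: sum a log-ratio weight for each field.

    Agreement adds log2(m / u); disagreement adds log2((1 - m) / (1 - u)).
    Higher scores indicate a more likely true match.
    """
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

student = {"sex": 1, "dob": "1990-03-14", "postcode": "2600"}
candidate_exact = {"sex": 1, "dob": "1990-03-14", "postcode": "2600"}
candidate_close = {"sex": 1, "dob": "1990-03-14", "postcode": "2601"}

exact_is_link = deterministic_match(student, candidate_exact)
close_is_link = deterministic_match(student, candidate_close)
exact_score = match_score(student, candidate_exact)
close_score = match_score(student, candidate_close)
```

Under deterministic linkage the second candidate is simply rejected, whereas the probabilistic score still ranks it as a plausible, though weaker, match.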
Linking 2006 Vocational Education and Training in Schools data to 2011 Census of Population and Housing data
10 VET in Schools records were linked to Census of Population and Housing records through exact matches on responses for common variables ('deterministic' linkage). For example, Sex was common to both datasets, with possible responses of '1' (Male) or '2' (Female). If a record had a response of '1' on both datasets, it would be one step closer to becoming a link. As name and address were not available, matches were sought on various combinations of Postcode, Locality code, Statistical Area 2, Statistical Local Area, Date of birth, Age, Sex, and Country of birth. Every combination used to search for links included, as a minimum, at least one geographical element, Sex, and either Date of birth or Age.
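A minimal sketch of exact matching on one combination of variables is given below. The record layout, identifiers and values are invented for illustration and do not reflect the structure of either dataset.

```python
from collections import defaultdict

def link_on(vet_records, census_records, keys):
    """Return {vet_id: [census_id, ...]} for exact matches on all `keys`.

    Records with a missing value (None) on any key cannot be matched.
    """
    index = defaultdict(list)
    for rec in census_records:
        values = tuple(rec.get(k) for k in keys)
        if None not in values:
            index[values].append(rec["id"])

    matches = {}
    for rec in vet_records:
        values = tuple(rec.get(k) for k in keys)
        if None not in values and values in index:
            matches[rec["id"]] = list(index[values])
    return matches

# Hypothetical example records.
vet = [
    {"id": "v1", "postcode": "2600", "sex": 1, "dob": "1990-03-14"},
    {"id": "v2", "postcode": "2600", "sex": 2, "dob": "1990-07-01"},
    {"id": "v3", "postcode": None, "sex": 1, "dob": "1989-11-30"},
]
census = [
    {"id": "c1", "postcode": "2600", "sex": 1, "dob": "1990-03-14"},
    {"id": "c2", "postcode": "2600", "sex": 2, "dob": "1990-07-01"},
    {"id": "c3", "postcode": "2600", "sex": 2, "dob": "1990-07-01"},
]

# One combination: a geographic element, Sex, and Date of birth.
matches = link_on(vet, census, ("postcode", "sex", "dob"))

# Only VET records that match exactly one Census record are unique links.
unique_links = {v: c[0] for v, c in matches.items() if len(c) == 1}
```

In this example, `v2` matches two Census records and so is not a unique link, while `v3` cannot be matched on this combination because its Postcode is missing.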
11 Unique links were taken from each combination of variables and then ranked in ascending order of the duplicate rate of each combination. This duplicate rate was calculated as the number of vocational education and training in schools records that linked to MORE THAN one Census record divided by the number of vocational education and training in schools records that linked to AT LEAST one Census record. Where records matched on more than one combination of variables in the set of unique links, the match from the combination with the lower duplicate rate was kept. The reasoning is that a higher duplicate rate indicates that the characteristics used for matching are more common in the populations being matched, and links made on more common characteristics are more likely to be false.
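The duplicate rate and the ranking step can be sketched as follows. The function names, variable combinations and match results are hypothetical; the sketch only mirrors the calculation described above and is not the code used for this publication.

```python
def duplicate_rate(matches):
    """matches: {vet_id: [census_id, ...]} for one combination of variables.

    Rate = VET records linked to MORE THAN one Census record
           / VET records linked to AT LEAST one Census record.
    """
    linked = [ids for ids in matches.values() if len(ids) >= 1]
    if not linked:
        return 0.0
    multiple = [ids for ids in linked if len(ids) > 1]
    return len(multiple) / len(linked)

def select_links(matches_by_combo):
    """Keep each VET record's unique link from the combination with the
    lowest duplicate rate in which that record has a unique link."""
    ranked = sorted(
        matches_by_combo,
        key=lambda combo: duplicate_rate(matches_by_combo[combo]),
    )
    chosen = {}
    for combo in ranked:
        for vet_id, census_ids in matches_by_combo[combo].items():
            if len(census_ids) == 1 and vet_id not in chosen:
                chosen[vet_id] = (census_ids[0], combo)
    return chosen

# Hypothetical match results for two combinations of linking variables.
strict = ("postcode", "sex", "dob")
loose = ("sla", "sex", "age")
matches_by_combo = {
    strict: {"v1": ["c1"], "v2": ["c2"], "v3": ["c3", "c4"], "v4": ["c5"]},
    loose: {"v1": ["c9"], "v3": ["c3", "c4", "c6"], "v5": ["c7"], "v6": ["c7", "c8"]},
}

rates = {combo: duplicate_rate(m) for combo, m in matches_by_combo.items()}
links = select_links(matches_by_combo)
```

Here `v1` has a unique link under both combinations, and the link from the combination with the lower duplicate rate is the one retained.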
12 The duplicate rates were quite high for the records linked through this process due to the large area geographic variables available for linking. In order to preserve the quality of the linked dataset, only links with lower duplicate rates were kept for analysis. The links that were rejected happened to be those made with Age instead of Date of birth, or Statistical Local Area in place of another geographic variable.
13 Information about data linkage methods used in similar studies can be found in Research Paper: Assessing the Feasibility of Linking 2011 Vocational Education and Training in Schools Data to 2011 Census Data (cat. no. 1351.0.55.044).
14 At the completion of the linkage process, 51.01% (77,730 out of 152,367) of the in-scope vocational education and training in schools records were successfully linked to Census records. This link rate is relatively low compared with similar projects in which education data were linked to the Census; see, for example, Research Paper: Assessing the Quality of Linking School Enrolment Records to 2011 Census Data: Deterministic Linkage Methods (cat. no. 1351.0.55.045). The link rate could potentially be raised by being less strict with the combinations of linking variables and the duplicate rate cut-off. However, the small increase in the link rate using these methods would be outweighed by a loss in accuracy.
15 While only unique links with acceptable duplicate rates were kept, these links still have a small chance of being false. This chance of error is influenced by a few factors. The first factor is the amount of missing or invalid information for the linking variables used. Matches can only be made on valid responses, and any of the unique links could potentially have been duplicated among the records with missing or invalid information had that information been present. The table below shows the proportion of in-scope records in each dataset that have missing information for the variables used for linkage.
The locality codes used for linking were State Suburb (SSC) codes. For more information about SSCs, see Australian Statistical Geography Standard (ASGS): Volume 3 - Non ABS Structures (cat. no. 1270.0.55.003).
2006 vocational education and training in schools
2011 Census of Population and Housing
|Statistical Area 2|
|Date of birth|
|Country of birth|
While both sources of data are population counts, students in 2006 may not have filled in a Census form in 2011 because they were no longer a resident of Australia, were abroad temporarily at the time of collection, or were missed for another reason. As with missing information, the people who were missing from the 2011 Census could have created duplicate records for the links that were considered unique. Additionally, there would have been persons in the 2011 Census who did not have a chance to take part in vocational education and training in schools, even though they would have been eligible, because they arrived in Australia after 2006, were abroad for that year, or were missing for another reason. As this group may have similar characteristics to the persons in the 2011 Census who may have done vocational education and training in schools, some of them could have been linked, increasing the chance of false links.
Another potential quality issue stems from the fact that some groups of people may be less likely to link due to their characteristics. These include persons in remote areas who may have poor address information, persons in densely populated areas where many people share similar characteristics, and persons who may not fill in enrolment or Census forms correctly due to language barriers or other reasons. This potential bias can result in a linked dataset that is not representative of the input data and therefore not appropriate for analysis.
In order to check the representativeness of the linked data, frequencies were run on demographic variables and compared to the input data. The analysis revealed that some groups were under-represented in the linked data. The groups most affected included:
- students in remote areas, particularly in the Northern Territory
- Aboriginal and Torres Strait Islander students
- students born outside of Australia, particularly those born in China and surrounding territories
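A representativeness check of this kind can be sketched by comparing each group's share of the linked records with its share of the input records. The variable name, categories and counts below are invented for illustration.

```python
from collections import Counter

def shares(records, variable):
    """Proportion of records falling in each category of `variable`."""
    counts = Counter(rec[variable] for rec in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def representation_ratio(input_records, linked_records, variable):
    """Linked share divided by input share for each group.

    A ratio below 1 means the group is under-represented in the linked data.
    """
    input_shares = shares(input_records, variable)
    linked_shares = shares(linked_records, variable)
    return {
        group: linked_shares.get(group, 0.0) / share
        for group, share in input_shares.items()
    }

# Hypothetical input data: 6 records in major cities, 4 in remote areas.
input_records = [{"remoteness": "Major city"}] * 6 + [{"remoteness": "Remote"}] * 4
# Hypothetical linked data: 4 major city records, 1 remote record.
linked_records = [{"remoteness": "Major city"}] * 4 + [{"remoteness": "Remote"}] * 1

ratios = representation_ratio(input_records, linked_records, "remoteness")
```

In this made-up example, remote records make up 40% of the input data but only 20% of the linked data, giving a ratio of 0.5.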
In order to account for the groups that were under-represented and for the low link rate, the linked data was weighted to match the input data. This process is explained in the section below.
Weighting is the process of adjusting a sample to infer results for the relevant population. To do this, a 'weight' is allocated to each sample unit - in this case, student records. The weight can be considered an indication of how many students in the relevant population are represented by each person in the sample. Weights were created for linked records to enable population estimates to be produced.
The estimates in this publication are obtained by assigning a 'weight' to each linked record. The weight is a value which indicates how many student population records are represented by the linked record. Weights aim to adjust for the fact that the linked student records may not be representative of all the student records. Weighting was used to ensure better representation of population sub-groups and to enhance the reliability of linked education data for longitudinal and cross-sectional analysis.
Weights were benchmarked to the following population groups:
- Postcode (129 groups), with large Postcodes, those with 800 persons or more, weighted individually and smaller Postcodes grouped together by state
- Sex, age, and Indigenous Status (50 groups)
- Country of birth (6 groups), with Australia, New Zealand, the United Kingdom, China and surrounding territories, other countries, and not stated / missing responses as separate groups.
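This publication does not state the algorithm used to benchmark the weights. One common way to make weights agree with several sets of benchmark totals at once is raking (iterative proportional fitting), sketched below as an assumption rather than as the method actually used. The records, variable names and benchmark totals are invented.

```python
def rake(records, benchmarks, iterations=50):
    """Adjust weights so weighted totals match each set of benchmarks.

    records:    list of dicts, one per linked record
    benchmarks: {variable: {category: population total}}
    Assumes every benchmark category occurs in at least one record.
    """
    weights = [1.0] * len(records)
    for _ in range(iterations):
        for variable, targets in benchmarks.items():
            # Current weighted total for each category of this variable.
            totals = {}
            for rec, w in zip(records, weights):
                totals[rec[variable]] = totals.get(rec[variable], 0.0) + w
            # Scale each weight so the category totals hit the benchmarks.
            weights = [
                w * targets[rec[variable]] / totals[rec[variable]]
                for rec, w in zip(records, weights)
            ]
    return weights

# Hypothetical linked records and benchmark totals from the input data.
linked = [
    {"sex": 1, "cob": "Australia"},
    {"sex": 1, "cob": "Other"},
    {"sex": 2, "cob": "Australia"},
    {"sex": 2, "cob": "Australia"},
]
benchmarks = {
    "sex": {1: 6, 2: 4},
    "cob": {"Australia": 8, "Other": 2},
}

weights = rake(linked, benchmarks)
weighted_total = sum(weights)
```

Each pass rescales the weights to one set of benchmarks in turn, and repeating the passes brings the weighted totals into line with all benchmark sets at once when they are mutually consistent.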
The weights have a mean value of 1, a median value of 1.2 and range between 1.1 and 9.9.
The weighted total of the 152,367 in-scope linked records was 154,281.
USE OF THE DATA
Despite the efforts made to assure the quality of the linked dataset and to weight it to make it representative, there is still a chance that some of the links made were false and that certain groups were either under-represented or over-represented after weighting. The statistics presented in this publication should be treated as experimental estimates and interpreted with caution.