9 Statistical data integration involves combining information from different administrative and/or survey sources to provide new datasets for statistical and research purposes. Further information on data integration is available on the National Statistical Service website – Data Integration.
10 Data linking is a key part of statistical data integration and involves the technical process of combining records from different source datasets using variables that are shared between the sources. Data linkage is typically performed on records that represent individual persons, rather than aggregates. The most common methods link records either on exact matches for common variables ('deterministic' linkage), or on close matches ranked by the probability that the variables used will result in a true match ('probabilistic' linkage).
Linking 2006 Vocational Education and Training in Schools data to 2011 Census of Population and Housing data
11 VET in Schools records were linked to Census records through exact matches on responses for common variables ('deterministic' linkage). For example, one variable common to both datasets was Sex, which had the possible responses of '1' (Male) or '2' (Female). If a record had a response of '1' on both datasets, it would be one step closer to becoming a link. As name and address were not available, matches were sought on various combinations of Postcode, Locality code, Statistical Area 2, Statistical Local Area, Date of birth, Age, Sex, and Country of birth. Every combination used to search for links kept, as a minimum, at least one geographical element, Sex, and either Date of birth or Age.
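The exact-match step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the project: the function name match_on, the field names (id, sex, dob, postcode) and the sample records are all invented for the example, and records with a missing value for any linking variable are assumed to be unable to match.

```python
from collections import defaultdict

def match_on(vet_records, census_records, keys):
    """Return {vet_id: [census_id, ...]} for records that agree exactly on every key.

    Hypothetical helper: records are plain dicts with an "id" field.
    """
    # Index Census records by their tuple of linking-variable values.
    index = defaultdict(list)
    for rec in census_records:
        values = tuple(rec.get(k) for k in keys)
        if None in values:
            continue  # assumption: a missing value can never match
        index[values].append(rec["id"])

    matches = {}
    for rec in vet_records:
        values = tuple(rec.get(k) for k in keys)
        if None in values:
            continue
        if values in index:
            matches[rec["id"]] = list(index[values])
    return matches

# Invented sample data, for illustration only.
vet = [
    {"id": "v1", "sex": "1", "dob": "1990-03-14", "postcode": "2600"},
    {"id": "v2", "sex": "2", "dob": "1989-11-02", "postcode": "3000"},
    {"id": "v3", "sex": "1", "dob": None, "postcode": "2600"},
]
census = [
    {"id": "c1", "sex": "1", "dob": "1990-03-14", "postcode": "2600"},
    {"id": "c2", "sex": "2", "dob": "1989-11-02", "postcode": "3000"},
    {"id": "c3", "sex": "2", "dob": "1989-11-02", "postcode": "3000"},
]

# One combination: a geographic element, Sex, and Date of birth.
matches = match_on(vet, census, ["postcode", "sex", "dob"])
```

Here v1 matches a single Census record, v2 matches two (a duplicate, so not a unique link), and v3 cannot match because its date of birth is missing.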
12 Unique links were taken from each combination of variables, and the combinations were then ranked in ascending order of their duplicate rate. This duplicate rate was calculated as the number of VET in Schools records that linked to MORE THAN one Census record divided by the number of VET in Schools records that linked to AT LEAST one Census record. Where a record matched on more than one combination of variables in the set of unique links, the match from the combination with the lower duplicate rate was kept. The reasoning is that a higher duplicate rate indicates that the linking variables capture characteristics that are more common in the populations being matched, and links made on more common characteristics are more likely to be false.
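The duplicate rate and the ranking rule can be illustrated with a short Python sketch. It is a simplified reading of the paragraph above rather than the actual implementation: the function names, the combination labels and the sample matches are invented, ties between combinations are broken alphabetically as an assumption, and only conflicts on the VET in Schools side are handled.

```python
def duplicate_rate(matches):
    """Share of linked VET records that matched more than one Census record."""
    linked = [ids for ids in matches.values() if len(ids) >= 1]
    if not linked:
        return None  # rate is undefined when nothing linked
    duplicated = sum(1 for ids in linked if len(ids) > 1)
    return duplicated / len(linked)

def unique_links(matches):
    """Keep only VET records that matched exactly one Census record."""
    return {vet_id: ids[0] for vet_id, ids in matches.items() if len(ids) == 1}

def resolve(matches_by_combination):
    """Pick, per VET record, the unique link from the lowest-duplicate-rate combination."""
    rated = []
    for name, matches in matches_by_combination.items():
        rate = duplicate_rate(matches)
        if rate is not None:
            rated.append((rate, name, unique_links(matches)))
    # Ascending duplicate rate; the name tie-break is an assumption of this sketch.
    rated.sort(key=lambda item: (item[0], item[1]))

    final = {}
    for rate, name, links in rated:
        for vet_id, census_id in links.items():
            # setdefault keeps the link from the earlier (lower-rate) combination.
            final.setdefault(vet_id, (census_id, name, rate))
    return final

# Invented matches for two hypothetical combinations of linking variables.
matches_by_combination = {
    "dob_postcode": {"v1": ["c1"], "v2": ["c2"]},
    "age_sla": {"v1": ["c7"], "v2": ["c2", "c3"], "v3": ["c4"], "v4": ["c5", "c6"]},
}
final_links = resolve(matches_by_combination)
```

In this example the "dob_postcode" combination has a duplicate rate of 0 and "age_sla" a rate of 0.5, so v1 and v2 take their links from "dob_postcode", v3 is linked only through "age_sla", and v4 has no unique link at all. A cut-off on the duplicate rate, as described in the next paragraph, could then be applied by filtering final_links on the stored rate.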
13 The duplicate rates were quite high for the records linked through this process because the geographic variables available for linking covered large areas. To preserve the quality of the linked dataset, only links with lower duplicate rates were kept for analysis. The links that were rejected were those made with Age instead of Date of birth, or with Statistical Local Area in place of another geographic variable.
14 Information about data linkage methods used in similar studies can be found in the Research Paper: Assessing the Feasibility of Linking 2011 Vocational Education and Training in Schools Data to 2011 Census Data (cat. no. 1351.0.55.044).
Linkage results
15 At the completion of the linkage process, 50.52% (84,412 out of 167,088) of the in-scope VET in Schools records were successfully linked to Census records. This link rate is relatively low compared with similar projects in which education data were linked to the Census, for example the Research Paper: Assessing the Quality of Linking School Enrolment Records to 2011 Census Data: Deterministic Linkage Methods (cat. no. 1351.0.55.045). The link rate could be raised by being less strict with the combinations of linking variables and the duplicate rate cut-off. However, the small increase in the link rate gained this way would be outweighed by a loss in accuracy.
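As a quick check of the arithmetic, the link rate is the number of linked records divided by the number of in-scope records. The sketch below uses the two counts quoted above; the function name link_rate is invented for the example.

```python
def link_rate(linked, in_scope):
    """Return the link rate as a percentage of in-scope records."""
    if in_scope <= 0:
        raise ValueError("in_scope must be positive")
    return 100 * linked / in_scope

rate = link_rate(84_412, 167_088)
print(f"{rate:.2f}%")  # prints 50.52%
```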
16 While only unique links with acceptable duplicate rates were kept, these links still have a small chance of being false. This chance of error is influenced by a few factors. The first factor is the amount of missing or invalid information for the linking variables used. Matches can only be made on valid responses, and any of the unique links could potentially have been duplicated among the records with missing or invalid information had that information been present. The table below shows the proportion of in-scope records in each dataset that have missing or incomplete information for the variables used for linkage.