1504.0 - Methodological News, Sep 2014  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 25/09/2014   

Uniqueness Analysis

The Data Integration, Access and Confidentiality Methodology Unit (DIACMU) is currently developing methods to evaluate the feasibility of linking datasets before any linking is carried out, and to help identify disclosure risks in linked datasets. One recently developed method is a “uniqueness analysis” of the input datasets.

Data linking involves bringing together records that belong to the same unit from two or more datasets. The process produces a unit record file containing analysis fields from the input datasets for the population common to them. It is a cost-effective method of acquiring more comprehensive statistics. The recent release of the Australian Census Longitudinal Dataset (ACLD) was an important milestone for data linking in the ABS.

Ideally, datasets should be linked with a high degree of accuracy and coverage. Data linking is only feasible if the datasets carry linking variables that can uniquely identify individual record pairs belonging to the same unit. The more record pairs that are uniquely identified by a combination of linking variables, the more likely it is that high-quality links will be established. It is therefore important to ascertain the likely success of a linking project before undertaking it.

Uniqueness analysis determines the proportion of records on a single file that are uniquely identified by their values on a combination of variables. It provides a guide to the upper bound on the proportion of records that could be uniquely linked using the available variables (Conn and Bishop, 2005). For example, if 80% of records on File A can be uniquely identified but only 50% on File B, then the upper bound for the match rate would be 50%. It is an upper bound rather than an estimate because errors or changes in the linking fields across the two datasets can prevent otherwise unique records from being linked. The analysis helps inform whether a linking project is feasible and provides insight into the optimal linking strategy. The method extends the work of Conn and Bishop in the following ways:

1. investigating the marginal improvement in the percentage of uniquely identified records by increasing the number of variables in the combination of potential linking variables

2. taking into account non-response in linking variables in calculating that percentage.
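
The calculation described above can be illustrated with a minimal Python sketch. It is not the DIACMU implementation: the function names `uniqueness_rate` and `marginal_gains`, the toy records and the variable names (`sex`, `birth_year`, `postcode`) are all hypothetical. In particular, the article does not say how non-response is handled, so the sketch assumes that a record with a missing value on any linking variable counts as not uniquely identified.

```python
from collections import Counter


def uniqueness_rate(records, variables, missing=None):
    """Proportion of records uniquely identified by their values on `variables`.

    Assumption (not from the article): a record with a missing value on any
    of the variables is counted as NOT uniquely identified, but it still
    contributes to the denominator.
    """
    if not records:
        return 0.0
    keys = []
    for rec in records:
        key = tuple(rec.get(v, missing) for v in variables)
        # Non-response on any variable: the record gets no usable key.
        keys.append(None if any(x is missing for x in key) else key)
    counts = Counter(k for k in keys if k is not None)
    unique = sum(1 for k in keys if k is not None and counts[k] == 1)
    return unique / len(records)


def marginal_gains(records, ordered_variables):
    """Uniqueness rate as variables are added one at a time, in the given order."""
    return [
        (tuple(ordered_variables[:i]), uniqueness_rate(records, ordered_variables[:i]))
        for i in range(1, len(ordered_variables) + 1)
    ]


# Hypothetical toy files; None marks non-response.
file_a = [
    {"sex": "F", "birth_year": 1980, "postcode": "2600"},
    {"sex": "F", "birth_year": 1980, "postcode": "2601"},
    {"sex": "M", "birth_year": 1975, "postcode": "2600"},
    {"sex": "M", "birth_year": 1975, "postcode": None},
    {"sex": "F", "birth_year": 1990, "postcode": "2602"},
]
file_b = [
    {"sex": "F", "birth_year": 1980, "postcode": "2600"},
    {"sex": "M", "birth_year": 1975, "postcode": "2600"},
    {"sex": "F", "birth_year": 1990, "postcode": "2602"},
    {"sex": "F", "birth_year": 1990, "postcode": "2602"},
]

linking_vars = ["sex", "birth_year", "postcode"]
rate_a = uniqueness_rate(file_a, linking_vars)   # 0.8
rate_b = uniqueness_rate(file_b, linking_vars)   # 0.5

# The file with the lower uniqueness rate limits the achievable match rate.
upper_bound = min(rate_a, rate_b)                # 0.5

gains_a = marginal_gains(file_a, linking_vars)
```

With this toy data, File A reaches 80% uniqueness and File B 50%, so the upper bound is 50%, mirroring the example in the text. On File A, `gains_a` shows the rate rising from 0.0 with `sex` alone, to 0.2 once `birth_year` is added, to 0.8 with all three variables. The order in which variables are added is a choice made by the analyst and affects the marginal figures.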

It is envisaged that a uniqueness analysis will be conducted on linked datasets to discover the relationship between the percentage of uniquely identified records and the linkage accuracy.

Besides data linking, DIACMU is also investigating methods to mitigate disclosure risks more efficiently when disseminating data through TableBuilder and DataAnalyser. Linked datasets released on TableBuilder and DataAnalyser include the ACLD and the Australian Census and Migrants Integrated Dataset. A uniqueness analysis on linked datasets can help quickly identify disclosure risks prior to their release. The uniqueness analysis therefore has potential applications beyond determining the feasibility of linking datasets. It also gives DIACMU a guide to the best way of ensuring the relevance of linked datasets while maintaining confidentiality.

References
Conn, L & Bishop, G (2005). Exploring Methods for Creating a Longitudinal Dataset, cat. no. 1352.0.55.076, Australian Bureau of Statistics, Canberra.


Further Information
For more information, please contact Charles Au (02 6252 5990, charles.au@abs.gov.au).
