Record linkage is the act of bringing together records from two files that belong to, or are likely to belong to, the same unit (e.g. person, student, business). Record linkage is an appropriate technique when data sets need to be joined to enhance dimensions such as time and breadth or depth of detail. For example, the Australian Census Longitudinal Database (ACLD), created by linking the 2006 and 2011 Australian Population Censuses, allows longitudinal analysis (ABS, 2013a). Record linkage offers opportunities for new and enhanced statistical output and analysis at relatively low cost.
With these new opportunities comes the associated problem of linkage errors. The prevalence of linkage errors is often difficult to estimate because the errors themselves (e.g. linking records that belong to two different people) may not be detected. Links can be declared deterministically, using a set of pre-defined rules, or probabilistically, where evidence for a link being a match is weighed against the evidence that it is not a match. Both methods are widely used at the ABS. This paper describes methods of estimating the prevalence of linkage errors for deterministic and probabilistic linking. It is envisaged that these methods will be used as part of the quality assurance process for record linkage at the ABS.
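The contrast between the two ways of declaring a link can be sketched in a few lines of Python. This is only an illustration, not the ABS implementation: the linking fields, the agreement probabilities in FIELDS and the threshold are all hypothetical, and the log-ratio weight used here is one common way (in the style of Fellegi and Sunter) of weighing evidence for a match against evidence for a non-match.

```python
import math

# Hypothetical linking fields. For each field: (m, u), where
# m = assumed probability the field agrees for a match and
# u = assumed probability the field agrees for a non-match.
FIELDS = {
    "sex": (0.95, 0.50),
    "birth_year": (0.90, 0.02),
    "postcode": (0.80, 0.01),
}


def deterministic_link(a, b):
    """Pre-defined rule: declare a link only if every field agrees."""
    return all(a[f] == b[f] for f in FIELDS)


def match_weight(a, b):
    """Sum of log-ratios: agreement adds evidence for a match,
    disagreement adds evidence against it."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def probabilistic_link(a, b, threshold=3.0):
    """Declare a link if the total weight reaches an assumed threshold."""
    return match_weight(a, b) >= threshold


# Two toy records that agree on everything except postcode.
rec_a = {"sex": "F", "birth_year": 1980, "postcode": "2600"}
rec_b = {"sex": "F", "birth_year": 1980, "postcode": "2601"}

det = deterministic_link(rec_a, rec_b)
prob = probabilistic_link(rec_a, rec_b)
```

In this toy case the exact-agreement rule rejects the pair because of the postcode disagreement, while the weighted approach still declares a link because agreement on the other fields outweighs it under the assumed probabilities.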
First we present some necessary background to record linkage.
A match is a pair of records that belong to the same unit. A non-match is a pair of records that do not belong to the same unit. The population of interest in record linkage is the complete set of matches. Perfect linkage occurs when all matches are linked and no non-matches are linked. Perfect linkage would be possible if a unique person identifier were available on the files. Perfect linkage of a person’s record could also be possible with name and address. In many situations, however, name and address are not available, and the linking fields that are available do not uniquely identify a unit, are missing or contain errors.
Perfect linkage is typically not possible and linkage errors occur. Linkage errors can have negative consequences for the validity of analysis of the linked file. The two types of linkage errors are missed records and incorrect links. A missed record is a record that was not linked to any record even though its match exists. Commonly used measures for missed records are the Link Rate, which is the number of linked records divided by the total number of matches that exist, and the Match Rate, which is the proportion of all matches that are linked.
The impact of a low Link Rate on analysis is analogous to the impact of non-response on analysis in the sample survey context: the linked records may not be representative of the matches. For example, because some linking variables are not applicable to children (e.g. marital status, highest educational attainment, and industry of occupation), we have frequently found that children’s records are more likely to be missed than adult records. To minimise the potential for one sub-group to be under-represented on the linked file, a reasonable approach is to use as many linking variables as possible to differentiate between matches and non-matches. Explicitly considering linking variables for children’s records would be important in this regard. Calibrating the weights of linked records to known population totals, possibly calculated directly from one of the files used in linking, can reduce the bias due to missed records.
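As a minimal sketch of the calibration idea, the Python below applies simple post-stratification, which is only one form of calibration and is not necessarily the method used by the ABS. The group labels, counts and population totals are invented for illustration.

```python
from collections import Counter


def poststratified_weights(linked_groups, population_totals):
    """Give each linked record the weight N_g / n_g, where N_g is the known
    population total for its group and n_g is the number of linked records
    in that group."""
    counts = Counter(linked_groups)
    return [population_totals[g] / counts[g] for g in linked_groups]


# Hypothetical linked file: children are under-represented (2 of 5 linked)
# compared with adults (8 of 10 linked).
linked_groups = ["adult"] * 8 + ["child"] * 2
population_totals = {"adult": 10, "child": 5}

weights = poststratified_weights(linked_groups, population_totals)
weighted_child_total = sum(
    w for w, g in zip(weights, linked_groups) if g == "child"
)
```

After weighting, the linked children's records sum to the known child total, so the under-representation of children no longer distorts estimated totals by group. Any differences between linked and missed records within a group are not corrected by this step.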
A link is either correct (i.e. a match) or incorrect (i.e. a non-match). A commonly used measure of linkage error is Precision, which is the proportion of links that are matches. Incorrect links create a type of measurement error and can bias analysis.
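The three measures defined so far can be computed directly when the true match status is known, as in a toy example. The sketch below assumes one-to-one linking, so that the number of linked records equals the number of links, and uses made-up record identifiers. In practice the set of matches is not observed, which is why these quantities have to be estimated.

```python
def linkage_metrics(links, matches):
    """links: set of (id_a, id_b) pairs declared as links.
    matches: set of (id_a, id_b) pairs that truly belong to the same unit."""
    correct = links & matches  # links that are matches
    return {
        "link_rate": len(links) / len(matches),
        "match_rate": len(correct) / len(matches),
        "precision": len(correct) / len(links),
    }


# Toy data: five true matches, four declared links, two of them incorrect.
matches = {(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")}
links = {(1, "a"), (2, "b"), (3, "d"), (4, "c")}

metrics = linkage_metrics(links, matches)
```

Here the Link Rate is 4/5, the Match Rate is 2/5 and the Precision is 2/4. The gap between Link Rate and Match Rate arises because the Link Rate counts incorrect links in its numerator.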
Detailed studies of linking Census records containing only categorical variables suggest that analytic conclusions based on a linked file with Precision = 95% are often not substantively different from those based on a perfectly linked file. The impacts of the bias due to incorrect links, and ways to correct it, have been studied, but these methods are still new and further advancement in the literature would be required before they are adopted by the ABS.
There is typically a trade-off between Precision and Link Rate: accepting more links typically increases the Link Rate and decreases the Precision. This trade-off has meaning to the extent that an increase in the Link Rate will reduce the potential for bias due to missed records while a decrease in the Precision will increase the potential for bias due to incorrect links. While bias is very difficult to estimate, the trade-off between Precision and Link Rate is still a useful way to compare two competing linking strategies or to decide if a linking strategy is worthwhile at all. This is illustrated later in this paper.
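The trade-off can be illustrated with a toy set of candidate pairs, each carrying a hypothetical linkage score and a true match status that would not be known in practice. Lowering the score threshold accepts more links. The scores, the threshold values and the total of eight matches are assumptions made purely for illustration.

```python
def tradeoff(scored_pairs, n_matches, thresholds):
    """Return (threshold, link_rate, precision) for each threshold.
    scored_pairs: list of (score, is_match) for candidate pairs."""
    rows = []
    for t in thresholds:
        accepted = [is_match for score, is_match in scored_pairs if score >= t]
        n_links = len(accepted)
        n_correct = sum(accepted)
        precision = n_correct / n_links if n_links else float("nan")
        rows.append((t, n_links / n_matches, precision))
    return rows


# Hypothetical candidate pairs: higher scores are mostly, not always, matches.
scored_pairs = [
    (9.0, True), (8.0, True), (7.0, True), (6.0, False),
    (5.0, True), (4.0, False), (3.0, True), (2.0, False),
]
n_matches = 8  # assumed: 5 matches among the candidates plus 3 never proposed

rows = tradeoff(scored_pairs, n_matches, thresholds=[7.0, 5.0, 2.0])
for threshold, link_rate, precision in rows:
    print(f"threshold={threshold}: Link Rate={link_rate:.3f}, Precision={precision:.3f}")
```

As the threshold falls from 7 to 2 in this example, the Link Rate rises while the Precision falls, which is the pattern used to compare competing linking strategies.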
Unfortunately, Precision is not easy to estimate. Even a clerical review of a link cannot always be relied upon to decide whether the link is a match or a non-match. Link Rate is often easy to estimate accurately after files have been linked, because the number of links is observed and the total number of matches that exist can usually be accurately approximated. Both Precision and Link Rate are difficult to estimate in the situation where the files’ linking variables are known but the files themselves are not available. This paper will describe a framework for estimating Precision and Link Rate. The framework is model-based and does not require clerical review.