2.3.2 PROBABILISTIC LINKING
Probabilistic linking allows links to be assigned in spite of missing or inconsistent information, provided there is enough agreement on other variables to offset any disagreement. In probabilistic data linkage, records from two datasets are compared and brought together using several variables common to each dataset (Fellegi & Sunter, 1969). A key feature of the methodology is the ability to handle a variety of linking variables and record comparison methods to produce a single numerical measure of how well two particular records match. This allows ranking of all possible links and optimal assignment of the link or non-link status (Solon & Bishop, 2009).
Within a blocking pass, records on the two files which agree on the specified blocking variables are compared on a set of linking fields. Each linking field has associated field weights, which are calculated prior to comparison. Field weights indicate the amount of information (agreement, disagreement, or missing values) a linking field provides about whether the records belong to the same or a different person (true match status). Field weights are based on two probabilities associated with each linking field: first, the probability that the field values agree on a record pair given that the two records belong to the same person (match); and second, the probability that the field values agree on a record pair given the two records belong to different persons (unmatch). These are called m and u probabilities (or match and unmatch probabilities) and are defined as:
m = P(fields agree | records belong to the same entity)
u = P(fields agree | records belong to different entities)
Given that the m and u probabilities require knowledge of the true match status of record pairs, they cannot be known exactly, but rather must be estimated. For estimating m and u probabilities for the ACLD, the ABS used the Expectation Maximisation (EM) algorithm (see Samuels, 2012). In some instances the EM algorithm is deemed unsuitable, or fails to converge on an estimate, and in such cases m and u probabilities are based on those of similar linking projects. Note that m and u probabilities are calculated for each pass, conditional on agreement on the specified blocking fields, as all records compared will agree on blocking variables.
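The estimation step can be illustrated with a minimal sketch of the EM algorithm for the Fellegi–Sunter model. It assumes binary agree/disagree outcomes and conditional independence between linking fields, and it is not the ABS implementation: the function name estimate_m_u, the starting values and the clamping constant are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def _clamp(x: float, eps: float = 1e-6) -> float:
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)


def estimate_m_u(
    patterns: Sequence[Sequence[int]],
    n_iter: int = 100,
    p_init: float = 0.1,   # assumed starting share of true matches
    m_init: float = 0.9,   # assumed starting m for every field
    u_init: float = 0.1,   # assumed starting u for every field
) -> Tuple[List[float], List[float], float]:
    """Estimate m, u and the match proportion p from agreement patterns.

    Each pattern holds one 0/1 indicator per linking field
    (1 = the field agrees on that record pair).
    """
    n, k = len(patterns), len(patterns[0])
    m, u, p = [m_init] * k, [u_init] * k, p_init
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agrees in enumerate(gamma):
                like_m *= m[j] if agrees else 1.0 - m[j]
                like_u *= u[j] if agrees else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate m, u and p as posterior-weighted proportions.
        sum_m = sum(post)
        sum_u = n - sum_m
        m = [
            _clamp(sum(g * row[j] for g, row in zip(post, patterns)) / max(sum_m, 1e-12))
            for j in range(k)
        ]
        u = [
            _clamp(sum((1.0 - g) * row[j] for g, row in zip(post, patterns)) / max(sum_u, 1e-12))
            for j in range(k)
        ]
        p = _clamp(sum_m / n)
    return m, u, p
```

In this sketch the patterns would be built from the record pairs compared within a single blocking pass, so the resulting estimates are conditional on agreement on the blocking fields. A fixed iteration count stands in for a convergence check.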
Match (m) and unmatch (u) probabilities are then converted to agreement and disagreement field weights. They are as follows:
Agree = log2(m / u)
Disagree = log2([1 - m] / [1 - u])
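As a small worked illustration, the two formulas can be evaluated directly. The function name field_weights and the m and u values below are assumptions chosen for the example; they are not ACLD estimates.

```python
import math
from typing import Tuple


def field_weights(m: float, u: float) -> Tuple[float, float]:
    """Return (agreement weight, disagreement weight) for one linking field."""
    if not (0.0 < m < 1.0 and 0.0 < u < 1.0):
        raise ValueError("m and u must lie strictly between 0 and 1")
    agree = math.log2(m / u)
    disagree = math.log2((1.0 - m) / (1.0 - u))
    return agree, disagree


# Hypothetical values for illustration only.
# Date of birth: chance agreement is rare (small u), so agreement carries
# a lot of information.
dob_agree, dob_disagree = field_weights(m=0.90, u=0.01)
# Sex: chance agreement is common (u = 0.5) but the field is reliably
# reported (m close to 1), so disagreement carries a lot of information.
sex_agree, sex_disagree = field_weights(m=0.99, u=0.50)
```

With these illustrative inputs the Date of Birth agreement weight is larger than the Sex agreement weight, while the Sex disagreement weight is the more strongly negative of the two.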
These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework. First, agreement weights are positive and disagreement weights are negative whenever m is greater than u, which is the case in practice for any field that is useful for linking. Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a field (for example, Date of Birth) will result in a large agreement weight being applied when two records do agree.
Third, the magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well-reported and stable over time (for example, Sex) then disagreement on the variable will yield a large negative weight. For each record pair comparison, the field weights from each linking field are summed to form an overall record pair comparison weight or 'linkage weight'.
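The summation into a linkage weight can be sketched as follows. The record layout (dictionaries keyed by field name), the function name linkage_weight and the choice to give a missing value a weight of zero are assumptions made for this example; the source does not state which weight is assigned to missing values.

```python
import math
from typing import Dict, Optional, Tuple

Record = Dict[str, Optional[str]]


def linkage_weight(
    rec_a: Record,
    rec_b: Record,
    probs: Dict[str, Tuple[float, float]],
) -> float:
    """Sum field weights over all linking fields for one record pair.

    probs maps each linking field name to its (m, u) probabilities.
    """
    total = 0.0
    for field, (m, u) in probs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            # Assumption: a missing value is treated as uninformative.
            continue
        if a == b:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total
```

Exact equality is used as the comparison here for brevity; other comparison functions could be substituted field by field.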
Before calculating m and u probabilities for some variables it is first necessary to define what constitutes agreement. Typical comparison functions include:
- Exact match (e.g. Sex). Agreement occurs only when the two field values are identical. This criterion is used for most linking fields.
- Logical movement (e.g. Highest year of schooling). Agreement occurs when the two numerical field values are identical, with interpolated weights attributed as the field values increase/decrease in a pre-specified direction.
- Numeric difference (e.g. Age). A pair may be defined to agree if their field values differ by an amount less than or equal to a specified maximum difference.
For further details on comparison functions used for probabilistic linkage, see Christen & Churches (2005).
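Two of the comparison functions listed above, exact match and numeric difference, can be sketched as simple predicates. The function names and the treatment of missing values as non-agreement are assumptions for illustration; logical movement is omitted because its interpolation rule is not specified here.

```python
from typing import Optional


def exact_match(a: Optional[str], b: Optional[str]) -> bool:
    """Agree only when both values are present and identical."""
    return a is not None and b is not None and a == b


def numeric_difference(
    a: Optional[float], b: Optional[float], max_diff: float
) -> bool:
    """Agree when both values are present and differ by at most max_diff."""
    if a is None or b is None:
        return False
    return abs(a - b) <= max_diff
```

For example, numeric_difference(34, 36, max_diff=2) treats the pair as agreeing, while exact_match("F", "M") does not.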
Alternatively, near or partial agreement may be factored into the linking process at the point where m and u probabilities are converted to weights. For example, a person’s age on equivalent records will frequently be an exact match, and the m and u probabilities are calculated based on this definition. During linkage, however, a partial agreement weight was given to ages that differed by no more than two years, to cater for persons who may have understated their age in 2006 and overstated it in 2011, or vice versa.
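One way a partial agreement weight of this kind might be computed is sketched below. The source does not say how the partial weight was derived, so the linear interpolation between the agreement and disagreement weights is purely an assumption, as are the function name age_weight and its parameters.

```python
import math


def age_weight(age_a: int, age_b: int, m: float, u: float, max_diff: int = 2) -> float:
    """Return a field weight for age with partial agreement.

    m and u are assumed to be estimated under the exact-match definition.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1.0 - m) / (1.0 - u))
    diff = abs(age_a - age_b)
    if diff == 0:
        return agree      # exact agreement: full agreement weight
    if diff <= max_diff:
        # Assumption: the weight falls linearly from the agreement weight
        # at diff = 0 to the disagreement weight at diff = max_diff + 1.
        return agree - (agree - disagree) * diff / (max_diff + 1)
    return disagree       # outside the tolerance: full disagreement weight
```

Under this assumed rule, a one-year difference receives a weight closer to full agreement than a two-year difference, and anything beyond two years receives the full disagreement weight.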
Blocking fields, linking fields, comparator types, and m and u probabilities are used as input parameters for the linking software. Records which agree on the blocking variable(s) are compared on all linking fields.
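A blocking pass of this kind can be sketched as an index lookup: records on one file are grouped by their blocking values, and each record on the other file is compared only with records in the same group. This is an illustration rather than the behaviour of any particular linking software; the function name candidate_pairs, the dictionary record layout and the exclusion of records with missing blocking values are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]


def candidate_pairs(
    file_a: Iterable[Record],
    file_b: Iterable[Record],
    blocking_fields: Sequence[str],
) -> Iterator[Tuple[Record, Record]]:
    """Yield the record pairs that agree on every blocking field."""
    # Index file B by its blocking key.
    index: Dict[tuple, List[Record]] = defaultdict(list)
    for rec_b in file_b:
        key = tuple(rec_b.get(f) for f in blocking_fields)
        if None not in key:  # assumption: missing blocking values are not blocked
            index[key].append(rec_b)
    # Look up each file A record; only pairs within the same block are compared.
    for rec_a in file_a:
        key = tuple(rec_a.get(f) for f in blocking_fields)
        for rec_b in index.get(key, []):
            yield rec_a, rec_b
```

Each pair yielded would then be compared on all linking fields and given a linkage weight. Running several passes with different blocking fields gives records that disagree on one blocking variable another chance to be compared.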