3302.0.55.004 - Information Paper: Death registrations to Census linkage project - Methodology and Quality Assessment, 2011-2012

ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 18/09/2013 First Issue

M AND U PROBABILITIES

Within a blocking pass, records on the two files which agree on the specified blocking variables are compared on a number of linking fields. Each linking field has associated 'field weights', which are calculated prior to comparison. Field weights indicate the amount of information (agreement, disagreement, or missing values) a linking field provides about whether the records belong to the same or a different person (true match status). Field weights are based on two probabilities associated with each linking field: first, the probability that the field values agree on a record pair given that the two records belong to the same person (match); and second, the probability that the field values agree on a record pair given the two records belong to different persons (unmatch). These are called m and u probabilities (or match and unmatch probabilities) and are defined below.

m = P(fields agree | records belong to the same entity)
u = P(fields agree | records belong to different entities)

Given that the m and u probabilities require knowledge of the true match status of record pairs, they cannot be known exactly, but rather must be estimated. The ABS uses a number of techniques to estimate m and u probabilities. For the series of 2011 linking projects, the Expectation Maximisation (EM) algorithm was used (see Samuels, 2012). In some instances the EM algorithm is deemed unsuitable, or fails to converge on an estimate, and in such cases m and u probabilities are based on those of similar linking projects. Note that m and u probabilities are calculated for each pass, conditional on agreement on the specified blocking fields, as all records compared will agree on blocking variables.
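To make the estimation step more concrete, the following is a minimal sketch of an EM estimator for m and u. It is not the ABS implementation described in Samuels (2012). It assumes that linking fields agree or disagree independently given match status, it ignores missing values, and it is run on synthetic data. The function names, starting values and the simulated probabilities are all illustrative assumptions.

```python
import random
from collections import Counter


def estimate_m_u(patterns, n_iter=300, p=0.1, m_start=0.9, u_start=0.1):
    """Estimate m and u for each field from 0/1 agreement patterns using EM.

    patterns: list of tuples, one per record pair, 1 = field agrees.
    p: starting guess for the proportion of pairs that are true matches.
    Assumes conditional independence of fields given match status.
    """
    counts = Counter(patterns)          # only distinct patterns need processing
    k = len(next(iter(counts)))
    n = sum(counts.values())
    m, u = [m_start] * k, [u_start] * k

    def clamp(x):
        # Keep probabilities away from 0 and 1 so likelihoods stay positive.
        return min(max(x, 1e-6), 1 - 1e-6)

    for _ in range(n_iter):
        # E-step: posterior probability that each pattern comes from a match.
        post = {}
        for pat in counts:
            like_m, like_u = p, 1 - p
            for j, agrees in enumerate(pat):
                like_m *= m[j] if agrees else 1 - m[j]
                like_u *= u[j] if agrees else 1 - u[j]
            post[pat] = like_m / (like_m + like_u)

        # M-step: re-estimate p, m and u as posterior-weighted proportions.
        match_mass = sum(c * post[pat] for pat, c in counts.items())
        p = clamp(match_mass / n)
        m = [clamp(sum(c * post[pat] * pat[j] for pat, c in counts.items())
                   / match_mass) for j in range(k)]
        u = [clamp(sum(c * (1 - post[pat]) * pat[j] for pat, c in counts.items())
                   / (n - match_mass)) for j in range(k)]
    return m, u, p


def simulate(n, p, m, u, seed=0):
    """Generate synthetic agreement patterns with known m, u and match rate p."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


# Synthetic truth (illustrative values only, not from Appendix C).
true_m = [0.95, 0.90, 0.85, 0.92]
true_u = [0.05, 0.10, 0.20, 0.02]
pairs = simulate(20000, 0.2, true_m, true_u)
m_hat, u_hat, p_hat = estimate_m_u(pairs)
```

On well-separated synthetic data such as this, the estimates should land close to the simulated values. Real linkage data may violate the independence assumption, which is one reason an EM estimate can be unsuitable or fail to converge.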

As a new feature of the suite of 2011 Census Data Enhancement projects, m and u probabilities were also calculated for missing data on a linking field. These capture the probability that the linking field is missing on either dataset (or on both datasets) for a pair belonging to the same individual (match) and for a pair belonging to two different individuals (unmatch), respectively. The m and u probabilities used in this project are presented in Appendix C: Linking m and u probabilities for each pass.

Match (m) and unmatch (u) probabilities are then converted to agreement, disagreement and missing field weights. The formulae used to convert m and u probabilities to field weights are a small extension of the Fellegi and Sunter (1969) linking methodology, which now also provides weights for missing data. They are as follows.

Agree = log₂(m ÷ u)

Missing = log₂(m_missing ÷ u_missing)

Disagree = log₂([1 - m - m_missing] ÷ [1 - u - u_missing])
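As a minimal sketch, the three formulae above can be written as a single function. The function name and the probabilities passed to it are hypothetical and are not taken from Appendix C. The sketch assumes that every probability is strictly positive and that the disagreement probabilities (one minus the agreement and missing probabilities) are also positive.

```python
import math


def field_weights(m, u, m_missing, u_missing):
    """Return (agree, missing, disagree) base-2 log weights for one linking field.

    m, u: probability of agreement among matches and non-matches.
    m_missing, u_missing: probability the field is missing among matches
    and non-matches. All inputs are assumed to be strictly positive.
    """
    agree = math.log2(m / u)
    missing = math.log2(m_missing / u_missing)
    # Disagreement is what remains after agreement and missingness.
    disagree = math.log2((1 - m - m_missing) / (1 - u - u_missing))
    return agree, missing, disagree


# Hypothetical field: reliable among matches, rarely agrees by chance,
# and equally likely to be missing for matches and non-matches.
agree, missing, disagree = field_weights(m=0.95, u=0.01,
                                         m_missing=0.02, u_missing=0.02)
```

With equal missing probabilities for matches and non-matches, the missing weight is zero, so a missing value neither supports nor counts against the link.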

These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework. First, in practice agreement weights are positive and disagreement weights are negative, because a useful linking field agrees more often, and disagrees less often, among matches than among non-matches. Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a field (for example, Date of Birth) will result in a large agreement weight being applied when two records do agree. Third, the magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well reported and stable over time (for example, Sex), then disagreement on the variable will yield a large negative weight. For each record pair comparison, the field weights from each linking field are summed to form an overall record pair comparison weight.
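The summation and the properties described above can be illustrated with a short sketch. The field names, the probabilities in PROBS and the helper functions are hypothetical, chosen only to mimic a rarely coinciding field (Date of Birth) and a stable, well-reported field (Sex). They are not the values used in this project.

```python
import math

# Hypothetical probabilities per field: (m, u, m_missing, u_missing).
PROBS = {
    "date_of_birth": (0.90, 0.003, 0.03, 0.03),
    "sex": (0.97, 0.50, 0.01, 0.01),
}


def weights(m, u, m_missing, u_missing):
    """Agreement, missing and disagreement weights for one linking field."""
    return {
        "agree": math.log2(m / u),
        "missing": math.log2(m_missing / u_missing),
        "disagree": math.log2((1 - m - m_missing) / (1 - u - u_missing)),
    }


def pair_weight(outcomes):
    """Sum field weights for one record pair.

    outcomes maps a field name to 'agree', 'disagree' or 'missing'.
    """
    return sum(weights(*PROBS[field])[outcome]
               for field, outcome in outcomes.items())


dob = weights(*PROBS["date_of_birth"])
sex = weights(*PROBS["sex"])

both_agree = pair_weight({"date_of_birth": "agree", "sex": "agree"})
sex_disagrees = pair_weight({"date_of_birth": "agree", "sex": "disagree"})
```

Under these assumed values, agreement on Date of Birth earns a much larger weight than agreement on Sex, because chance agreement on Date of Birth is rare. Disagreement on Sex is penalised more heavily than disagreement on Date of Birth, because Sex is assumed to be reported more reliably.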