M AND U PROBABILITIES

m = P(fields agree | records belong to the same entity)
u = P(fields agree | records belong to different entities)

Because the m and u probabilities depend on the true match status of record pairs, they cannot be known exactly and must be estimated. The ABS uses a number of techniques to estimate them. For the series of 2011 linking projects, the Expectation Maximisation (EM) algorithm was used (see Samuels, 2012). In some instances the EM algorithm is deemed unsuitable, or fails to converge on an estimate; in such cases the m and u probabilities are based on those of similar linking projects. Note that m and u probabilities are calculated for each pass, conditional on agreement on the specified blocking fields, because all record pairs compared in a pass agree on the blocking variables.

As a new feature of the suite of 2011 Census Data Enhancement projects, m and u probabilities for missing data on a linking field were calculated. These capture the probability that a linking field is missing on either dataset (or on both), for a pair belonging to the same individual (match) and for a pair belonging to two different individuals (unmatch) respectively. The m and u probabilities used in this project are presented in Appendix C: Linking m and u probabilities for each pass.

Match (m) and unmatch (u) probabilities are then converted to agreement, disagreement and missing field weights. The formulae used for this conversion are a small extension of the Fellegi and Sunter (1969) linking methodology that also provides weights for missing data. They are as follows.

Agree = log2(m ÷ u)
Missing = log2(m_missing ÷ u_missing)
Disagree = log2([1 - m - m_missing] ÷ [1 - u - u_missing])

These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework. First, in practice agreement weights are always positive and disagreement weights are always negative. Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a field (for example, Date of Birth) will result in a large agreement weight applied when two records do agree. The magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well-reported and stable over time (for example, Sex) then disagreement on the variable will yield a large negative weight.

For each record pair comparison, the field weights from each linking field are summed to form an overall record pair comparison weight.
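The conversion from probabilities to field weights, and the summation into a record pair weight, can be illustrated with a minimal Python sketch. The function names, field names and probability values below are hypothetical and chosen only for illustration; they are not the values published in Appendix C, and this is not the ABS implementation.

```python
import math


def field_weights(m, u, m_missing, u_missing):
    """Convert m and u probabilities for one linking field into
    agreement, disagreement and missing weights (base-2 logs)."""
    return {
        "agree": math.log2(m / u),
        "missing": math.log2(m_missing / u_missing),
        "disagree": math.log2((1 - m - m_missing) / (1 - u - u_missing)),
    }


def pair_weight(outcomes, weights):
    """Sum the field weights for one record pair.

    outcomes maps each linking field to "agree", "disagree" or "missing".
    weights maps each linking field to the dict from field_weights().
    """
    return sum(weights[field][outcome] for field, outcome in outcomes.items())


# Hypothetical probabilities (assumed for illustration only).
weights = {
    # Chance agreement on date of birth is rare, so u is small.
    "date_of_birth": field_weights(m=0.95, u=0.01, m_missing=0.02, u_missing=0.02),
    # Sex is stable and well reported (high m), but chance agreement is common.
    "sex": field_weights(m=0.98, u=0.50, m_missing=0.01, u_missing=0.01),
}

# A record pair that agrees on date of birth but disagrees on sex.
total = pair_weight({"date_of_birth": "agree", "sex": "disagree"}, weights)
```

With these assumed inputs, agreement on date of birth earns a much larger positive weight than agreement on sex, because chance agreement on date of birth is far less likely. Disagreement on sex earns a large negative weight, because the field is assumed to be reliably reported.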