The ABS is investigating Locality Sensitive Hashing (LSH) as a more robust and automated blocking method for probabilistic linking projects.
The need for better blocking
When linking two or more datasets, comparing every record with every other record is usually infeasible. Instead, blocking rules are used to partition the records from the input datasets into smaller groups, or blocks, so that comparisons are only made between records within the same block. Blocking rules must strike a balance: they need to minimise the total number of comparisons while maximising the proportion of true matches that are found.
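As a minimal sketch of rule-based blocking (the records and the blocking key — first letter of surname plus year of birth — are invented for illustration, not an actual ABS blocking scheme):

```python
from itertools import combinations

# Hypothetical records to be linked.
records = [
    {"id": 1, "surname": "Smith", "yob": 1980},
    {"id": 2, "surname": "Smyth", "yob": 1980},
    {"id": 3, "surname": "Jones", "yob": 1975},
    {"id": 4, "surname": "Smith", "yob": 1975},
]

def block_key(rec):
    """An illustrative blocking rule: first letter of surname + year of birth."""
    return (rec["surname"][0], rec["yob"])

# Partition records into blocks by their key.
blocks = {}
for rec in records:
    blocks.setdefault(block_key(rec), []).append(rec)

# Compare only within blocks: here 1 comparison instead of 6 all-pairs.
pairs = [
    (a["id"], b["id"])
    for group in blocks.values()
    for a, b in combinations(group, 2)
]
```

Records 1 and 2 (a likely true match despite the surname typo) land in the same block and are compared; the remaining pairs are never compared at all. The trade-off described above is visible even at this scale: a coarser key produces more comparisons, a finer key risks splitting true matches across blocks.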
Currently, every linkage requires the manual creation of a bespoke blocking scheme, and blocking schemes need to be reassessed and potentially redeveloped every time the input data changes. LSH may offer a way to automate this process for all linkages.
What is Locality Sensitive Hashing?
LSH [1] is a technique from computer science for identifying similar records within large datasets without directly comparing every record. The resulting blocks are such that two similar records are very likely to be in at least one of the same blocks, and two dissimilar records are unlikely to be in any of the same blocks.
Min-Hash and the Jaccard similarity
Min-Hash LSH is a well-known variant which corresponds to the Jaccard similarity measure. The Jaccard similarity between two sets A and B is defined as the size of their intersection divided by the size of their union:
\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}.\]
How Min-Hash LSH works
To use Min-Hash LSH, the records to be linked are broken up into sets of “shingles” – short overlapping strings of a fixed length. For each record, a min-hash is then calculated as the minimum value of a hash function applied to each shingle for that record. Conveniently, the probability of a single min-hash agreeing for records A and B is exactly the Jaccard similarity of A and B. A single min-hash is generally not sufficient for the purposes of blocking, so a number of min-hashes are calculated using independently selected hash functions (e.g. by seeding each one with a different value).
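The shingling and min-hash steps can be sketched in Python. This is an illustrative sketch only: the choice of MD5 as the underlying hash, the shingle length, and the seeding scheme are assumptions, not a prescribed implementation.

```python
import hashlib

def shingles(text, k=2):
    """Break a string into its set of overlapping length-k substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def min_hash(shingle_set, seed):
    """Minimum hash value over the shingles, for one seeded hash function."""
    return min(
        int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
        for s in shingle_set
    )

def signature(text, num_hashes=20, k=2):
    """Min-hash signature: one min-hash per independently seeded hash function."""
    sh = shingles(text, k)
    return [min_hash(sh, seed) for seed in range(num_hashes)]
```

Because each hash function is deterministic, two records with the same shingle set always receive the same signature, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying shingle sets.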
These min-hashes are then organised into b groups of r min-hashes each. Records can be placed into b blocks defined by agreement on all r min-hashes for that group, which results in the following probability of two records being compared:
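The banding step, and the comparison probability it induces, can be sketched as follows. The values b = 20 and r = 5 are illustrative choices, and `band_blocks` assumes signatures of length at least b × r:

```python
from collections import defaultdict

def band_blocks(signatures, b, r):
    """Place each record into b blocks, keyed by a band of r min-hashes."""
    blocks = defaultdict(list)
    for rec_id, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            blocks[key].append(rec_id)
    return blocks

def comparison_probability(j, r, b):
    """Probability that two records with Jaccard similarity j share a block."""
    return 1 - (1 - j ** r) ** b
```

With b = 20 and r = 5, `comparison_probability(0.8, 5, 20)` is above 0.99 while `comparison_probability(0.2, 5, 20)` is below 0.01: the probability is an S-shaped curve in the Jaccard similarity, so similar pairs are almost always compared and dissimilar pairs almost never are.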
\[P = 1 - \left( 1 - J(A,B)^r \right)^b.\]
Research questions
Our investigation will aim to answer the following questions about the method:
- Whether LSH can match or exceed the coverage of our current blocking rules while keeping the number of comparisons acceptable.
- Whether LSH is easily adapted to different linkage projects.
For further information please contact Aymon Wuolanne.
References
[1] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613.