The ABS is investigating Locality Sensitive Hashing (LSH) as a more robust and automated blocking method for probabilistic linking projects.
The need for better blocking
When linking two or more datasets, comparing every record with every other record is usually infeasible. Instead, blocking rules are used to partition the records from the input datasets into smaller groups, or blocks, so that comparisons are only made between records within the same block. Blocking rules must strike a balance: they need to minimise the total number of comparisons while maximising the proportion of true matches that are found.
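As a minimal sketch of rule-based blocking (the records and the blocking key — first letter of surname plus year of birth — are invented for illustration, not an actual ABS blocking scheme):

```python
from itertools import combinations

# Hypothetical records to be linked.
records = [
    {"id": 1, "surname": "Smith", "yob": 1980},
    {"id": 2, "surname": "Smyth", "yob": 1980},
    {"id": 3, "surname": "Jones", "yob": 1975},
    {"id": 4, "surname": "Smith", "yob": 1975},
]

def block_key(rec):
    """An illustrative blocking rule: first letter of surname + year of birth."""
    return (rec["surname"][0], rec["yob"])

# Partition records into blocks by their key.
blocks = {}
for rec in records:
    blocks.setdefault(block_key(rec), []).append(rec)

# Compare only within blocks: here 1 comparison instead of 6 all-pairs.
pairs = [
    (a["id"], b["id"])
    for group in blocks.values()
    for a, b in combinations(group, 2)
]
```

Records 1 and 2 (a likely true match despite the surname typo) land in the same block and are compared; the remaining pairs are never compared at all. The trade-off described above is visible even at this scale: a coarser key produces more comparisons, a finer key risks splitting true matches across blocks.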
Currently, every linkage requires the manual creation of a bespoke blocking scheme, and blocking schemes need to be reassessed and potentially redeveloped every time the input data changes. LSH may offer a way to automate this process for all linkages.
What is Locality Sensitive Hashing?
LSH [1] is a technique from computer science for identifying similar records within large datasets without directly comparing every record. The resulting blocks are such that two similar records are very likely to be in at least one of the same blocks, and two dissimilar records are unlikely to be in any of the same blocks.
Min-Hash and the Jaccard similarity
Min-Hash LSH is a well-known variant which corresponds to the Jaccard similarity measure. The Jaccard similarity between two sets A and B is defined as the size of their intersection divided by the size of their union:
\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}.\]
How Min-Hash LSH works
To use Min-Hash LSH, the records to be linked are broken up into sets of “shingles” – short overlapping strings of a fixed length. For each record, a min-hash is then calculated as the minimum value of a hash function applied to each shingle for that record. Conveniently, the probability of a single min-hash agreeing for records A and B is exactly the Jaccard similarity of A and B. A single min-hash is generally not sufficient for the purposes of blocking, so a number of min-hashes are calculated using independently selected hash functions (e.g. by seeding each one with a different value).
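The shingling and min-hash steps can be sketched in Python. This is an illustrative sketch only: the choice of MD5 as the underlying hash, the shingle length, and the seeding scheme are assumptions, not a prescribed implementation.

```python
import hashlib

def shingles(text, k=2):
    """Break a string into its set of overlapping length-k substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def min_hash(shingle_set, seed):
    """Minimum hash value over the shingles, for one seeded hash function."""
    return min(
        int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
        for s in shingle_set
    )

def signature(text, num_hashes=20, k=2):
    """Min-hash signature: one min-hash per independently seeded hash function."""
    sh = shingles(text, k)
    return [min_hash(sh, seed) for seed in range(num_hashes)]
```

Because each hash function is deterministic, two records with the same shingle set always receive the same signature, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying shingle sets.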
These min-hashes are then organised into b groups of r min-hashes each. Records can be placed into b blocks defined by agreement on all r min-hashes for that group, which results in the following probability of two records being compared:
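The banding step, and the comparison probability it induces, can be sketched as follows. The values b = 20 and r = 5 are illustrative choices, and `band_blocks` assumes signatures of length at least b × r:

```python
from collections import defaultdict

def band_blocks(signatures, b, r):
    """Place each record into b blocks, keyed by a band of r min-hashes."""
    blocks = defaultdict(list)
    for rec_id, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            blocks[key].append(rec_id)
    return blocks

def comparison_probability(j, r, b):
    """Probability that two records with Jaccard similarity j share a block."""
    return 1 - (1 - j ** r) ** b
```

With b = 20 and r = 5, `comparison_probability(0.8, 5, 20)` is above 0.99 while `comparison_probability(0.2, 5, 20)` is below 0.01: the probability is an S-shaped curve in the Jaccard similarity, so similar pairs are almost always compared and dissimilar pairs almost never are.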
\[P = 1 - \left( 1 - J(A,B)^r \right)^b.\]
Research questions
Our investigation will aim to answer the following questions about the method:
- Whether LSH can match or exceed the coverage of our current blocking rules while keeping the number of comparisons acceptable.
- Whether LSH is easily adapted to different linkage projects.
For further information please contact Aymon Wuolanne.
References
[1] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613.