Methodological News, June Quarter 2026

Features important work and developments in ABS methodologies

Released

10/06/2026

Release date and time

10/06/2026 11:30am AEST

Generative AI (GenAI) Evaluation Framework to support responsible use of AI

Customised GenAI systems built on Large Language Models (LLMs) offer new opportunities to support the production and dissemination of official statistics. For example, conversational agents can help clients navigate survey and data services, improving overall user experience. At the same time, these systems can produce low quality responses, such as incorrect, incomplete, irrelevant, unclear, or poorly toned answers, or struggle with multilingual or atypical queries. If not carefully managed, this can make services unreliable or less accessible, ultimately eroding public trust.

The GenAI Evaluation Framework addresses these risks by focusing on quality (contextual relevance, clarity, grounding, tone), performance (accuracy, latency, cost), and safety and ethics (confidentiality, detection of toxic content).

The framework consists of two components: 

(1) GenAI Evaluation Workflow: a reusable, flexible evaluation workflow that embeds evaluation throughout the AI development cycle as a continuous and iterative process, and  

(2) Evaluation suite: a set of code-based tools in a Git repository, designed to support each stage of the workflow.

Figure 1 - GenAI Evaluation Workflow 

Description

The workflow shows the process for evaluating Generative AI applications, such as Conversation Chatbots, LLM-based coders, and Summarisation tools, which take user queries as input and generate outputs using LLMs.

In the “Evaluation Dataset Preparation” stage, a human domain expert designs scenarios and uses them to guide an LLM to generate a set of meaningful, ground-truth, question-answer pairs (“Q&A Generator”).

Next, the “LLM Inference pipeline” runs the target application on the question part of each question-answer pair to generate the model's output.

In the “Evaluation Metrics Selection” stage, a human domain expert selects the appropriate metrics across quality (e.g., relevance, accuracy), performance (e.g., cost, latency) and safety or ethical considerations.

In the “Evaluation Runner” stage, an “LLM-as-Judge” approach is used in a secure cloud environment. In this process, a separate LLM scores the output answers against the original ground-truth answers in the question-answer pairs, using the selected metrics.

The resulting scores help human domain experts and application developers identify issues and improve the Generative AI Application.

The workflow balances human domain expertise, which is needed to ensure ethical alignment, with a structured engineering approach.

Business areas can design scenarios for the question-answer generator (“Q&A Generator”) to ensure evaluation questions reflect diverse user personas with different education levels, languages, and tones. Domain experts also assist with the selection of appropriate metrics (“Evaluation Metrics Selection”).

Once customised, the workflow is executed as a structured, repeatable end-to-end process that includes:  

Automated, scenario-based generation of evaluation datasets, in the form of ground-truth, question-answer pairs (“Q&A Generator”) 
Automated scoring against selected metrics using “LLM-as-Judge” (“Evaluation Runner”)
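The end-to-end process above can be illustrated with a minimal Python sketch. It is not the ABS Evaluation suite: all names (`QAPair`, `generate_qa_pairs`, `token_overlap_judge`, `run_evaluation`, `summarise`) are hypothetical, the question-answer generator is a template rather than an LLM, and the judge is a simple token-overlap score standing in for an LLM-as-Judge call. Latency is recorded as one example of a performance metric.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class QAPair:
    scenario: str
    question: str
    ground_truth: str


def generate_qa_pairs(scenarios: list[str]) -> list[QAPair]:
    """Stand-in for the Q&A Generator. A real version would prompt an LLM
    with each expert-designed scenario; here a fixed template is used."""
    return [
        QAPair(s, f"Where can I find {s}?", f"See the ABS website for {s}.")
        for s in scenarios
    ]


def token_overlap_judge(answer: str, ground_truth: str) -> float:
    """Placeholder for LLM-as-Judge (assumption): the fraction of
    ground-truth tokens that also appear in the answer."""
    truth = set(ground_truth.lower().split())
    found = set(answer.lower().split())
    return len(truth & found) / len(truth) if truth else 0.0


def run_evaluation(
    pairs: list[QAPair],
    application: Callable[[str], str],
    metrics: dict[str, Callable[[str, str], float]],
) -> list[dict]:
    """Run the target application on each question and score its output
    against the ground-truth answer with every selected metric."""
    rows = []
    for pair in pairs:
        start = time.perf_counter()
        answer = application(pair.question)
        latency = time.perf_counter() - start
        scores = {name: fn(answer, pair.ground_truth) for name, fn in metrics.items()}
        rows.append({"question": pair.question, "answer": answer,
                     "latency_s": latency, **scores})
    return rows


def summarise(rows: list[dict], metric_names: list[str]) -> dict[str, float]:
    """Average each metric over the evaluation dataset."""
    return {name: mean(row[name] for row in rows) for name in metric_names}


if __name__ == "__main__":
    pairs = generate_qa_pairs(["labour force data", "census tables"])
    toy_application = lambda question: "See the ABS website."  # hypothetical app
    rows = run_evaluation(pairs, toy_application, {"overlap": token_overlap_judge})
    print(summarise(rows, ["overlap", "latency_s"]))
```

Because the metrics are passed in as a dictionary of scoring functions, a domain expert's selection changes only the configuration, not the runner itself.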

Future work will focus on applying and refining the GenAI Evaluation Workflow and on expanding the Evaluation suite, including adding a user interface to support customisation by domain experts.

The Evaluation Framework is being applied to relevant ABS use cases and aligns with broader Responsible AI initiatives across the Australian Public Service. 

For further information please contact Ilana Lichtenstein.

Evaluating Grounding Approaches for Large Language Models on Tabular Statistical Data

ABS statistics are used in research, policy, planning, and public information contexts. Statistical outputs are organised through a set of predefined products, often through large collections of tabular datasets. Exploring available information, understanding relationships across tables, and extracting useful statistical insights can require familiarity with product structure and conventions, including variable definitions, metadata, geographic hierarchies, and table layouts.

With the growing adoption of large language models (LLMs), new opportunities arise for natural language interaction with statistical information, providing an alternative way for users to explore and retrieve information from complex tabular datasets. This has the potential to broaden access to high-quality statistical information across a wider range of users and use cases.

This research investigates two LLM grounding approaches: conventional Retrieval-Augmented Generation (RAG) and Knowledge Graph-RAG (KG-RAG). A subset of ABS tabular statistical data, covering selected demographic and socioeconomic variables and standard geographic levels (SA2, SA3, SA4, and Local Government Area boundaries), is curated for both approaches, which are then evaluated for accuracy and coverage. This curation is necessary because tabular data is not readily available in a form that can be reliably interpreted by language models.
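The article does not define how accuracy and coverage are calculated. As an illustration only, the Python sketch below assumes coverage is the share of questions for which a system returns any answer, and accuracy is the share of answered questions that match the expected value, both reported by question type. The function name `accuracy_and_coverage` and the records in the example are hypothetical and are not ABS results.

```python
from collections import defaultdict
from typing import Optional


def accuracy_and_coverage(
    records: list[tuple[str, str, Optional[str]]],
) -> dict[str, dict[str, float]]:
    """records holds (question_type, expected, predicted) tuples, where
    predicted is None when the system could not produce an answer.

    Assumed definitions (not taken from the source):
      coverage = answered / asked
      accuracy = correct / answered
    """
    totals = defaultdict(lambda: {"asked": 0, "answered": 0, "correct": 0})
    for question_type, expected, predicted in records:
        counts = totals[question_type]
        counts["asked"] += 1
        if predicted is not None:
            counts["answered"] += 1
            if predicted == expected:
                counts["correct"] += 1

    results = {}
    for question_type, counts in totals.items():
        coverage = counts["answered"] / counts["asked"]
        accuracy = counts["correct"] / counts["answered"] if counts["answered"] else 0.0
        results[question_type] = {"accuracy": accuracy, "coverage": coverage}
    return results


if __name__ == "__main__":
    # Invented example records for illustration.
    example = [
        ("lookup", "5", "5"),
        ("lookup", "7", None),
        ("ranking", "A", "B"),
        ("ranking", "A", "A"),
    ]
    print(accuracy_and_coverage(example))
```

Reporting the two measures separately keeps a system that answers few questions very precisely distinguishable from one that answers many questions less reliably.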

The RAG approach curates and transforms tabular statistical tables into text summaries that are indexed and retrieved before generating a response. The KG-RAG approach curates the same tabular statistical tables into a knowledge graph representation of variables, dimensions, and geographic relationships, and executes graph queries against the underlying data. Both approaches use the same LLM generation model.
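The contrast between the two approaches can be sketched in Python on a toy table. This is a simplified illustration under stated assumptions, not the project's implementation: the rows and their values are invented, token overlap stands in for semantic retrieval over an index, a nested dictionary stands in for a knowledge graph and its query language, and the shared LLM generation step is omitted. All function names are hypothetical.

```python
import re

# Invented rows: (region, variable, value). Not ABS data.
ROWS = [
    ("Region A", "population", 1200),
    ("Region A", "median_age", 38),
    ("Region B", "population", 800),
    ("Region B", "median_age", 41),
]


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# RAG side: tables become text summaries that are retrieved before generation.
def to_summaries(rows):
    return [
        f"The {variable.replace('_', ' ')} of {region} is {value}."
        for region, variable, value in rows
    ]


def retrieve(question: str, summaries: list[str], k: int = 1) -> list[str]:
    """Return the k summaries sharing the most tokens with the question
    (a stand-in for embedding-based retrieval)."""
    query = tokens(question)
    ranked = sorted(summaries, key=lambda s: len(query & tokens(s)), reverse=True)
    return ranked[:k]


# KG-RAG side: the same tables become a graph that is queried directly.
def to_graph(rows):
    graph: dict[str, dict[str, int]] = {}
    for region, variable, value in rows:
        graph.setdefault(region, {})[variable] = value
    return graph


def graph_lookup(graph, region: str, variable: str):
    """Direct lookup: exact value or None, with no text matching involved."""
    return graph.get(region, {}).get(variable)


def graph_rank(graph, variable: str) -> list[str]:
    """Ranking query: regions ordered from highest to lowest value."""
    pairs = [(region, values[variable]) for region, values in graph.items()
             if variable in values]
    return [region for region, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]


if __name__ == "__main__":
    print(retrieve("What is the population of Region B?", to_summaries(ROWS)))
    graph = to_graph(ROWS)
    print(graph_lookup(graph, "Region B", "population"))
    print(graph_rank(graph, "median_age"))
```

In this sketch the graph functions return values that can be traced to a specific row, while the retrieval function returns free text that a generation model would still need to interpret.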

Evaluation shows that each approach performs better on different types of questions across two key metrics: accuracy and coverage. The KG-RAG approach achieves greater accuracy on direct lookup, ranking, and set membership tasks, reflecting the inherent strengths of its knowledge graph-based design in supporting traceable statistical retrieval and verifiable outputs. The RAG approach, through its semantic text retrieval, is better suited to capturing semantically varied natural language questions and achieves broader answer coverage on questions comparing regions or describing characteristics of areas. These findings suggest that hybrid architectures, where questions are routed to the most appropriate retrieval system, can improve both answer accuracy and coverage across a broader range of question types in large-scale production systems.
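A hybrid architecture of this kind needs a routing step. The Python sketch below shows one possible form, assuming a simple keyword-based classifier; the article does not say how routing would be done, and a production system might use an LLM or a trained classifier instead. The patterns, labels, and the names `classify` and `route` are hypothetical.

```python
import re

# Question types where the article reports higher KG-RAG accuracy.
STRUCTURED_PATTERNS = {
    "ranking": re.compile(r"\b(highest|lowest|top|rank|most|least)\b", re.I),
    "set_membership": re.compile(r"\b(which|list)\b.*\b(in|within|belong)\b", re.I),
    "lookup": re.compile(r"\bwhat (is|was)\b|\bhow many\b", re.I),
}

# Comparative or descriptive questions, where RAG showed broader coverage.
DESCRIPTIVE = re.compile(
    r"\b(compare|describe|difference|characteristics|similar)\b", re.I
)


def classify(question: str) -> str:
    """Assign a coarse question type using keyword rules (an assumption)."""
    if DESCRIPTIVE.search(question):
        return "descriptive"
    for label, pattern in STRUCTURED_PATTERNS.items():
        if pattern.search(question):
            return label
    # Unrecognised phrasing falls back to text retrieval.
    return "descriptive"


def route(question: str) -> str:
    """Send structured questions to KG-RAG and everything else to RAG."""
    return "rag" if classify(question) == "descriptive" else "kg_rag"


if __name__ == "__main__":
    for q in [
        "What is the median age of Region A?",
        "Which region has the highest population?",
        "Compare Region A and Region B",
    ]:
        print(q, "->", route(q))
```

Falling back to RAG for unrecognised questions reflects the finding that text retrieval handles semantically varied phrasing more broadly; whether that is the right default would need to be tested on real queries.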

The project also highlights specific curation challenges in preparing tabular statistical data for LLM grounding. Source metadata, including variable naming, multi-level table structures, and embedded reference markers, is structured for human interpretation rather than machine processing. The metadata required transformation and normalisation before it could be reliably used by either grounding approach.
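As a rough illustration of this kind of transformation, the Python sketch below strips reference markers from column labels and flattens a two-level header into one machine-friendly name per column. The marker formats handled, such as (a), [1], and *, and the example header are assumptions chosen for illustration; the function names are hypothetical and the actual curation rules are not described in this article.

```python
import re


def normalise_label(label: str) -> str:
    """Remove assumed footnote markers such as (a), [1], * and convert
    the remaining text to snake_case."""
    text = re.sub(r"\(\s*[a-z]\s*\)|\[\s*\d+\s*\]|[*†‡]", "", label, flags=re.I)
    text = re.sub(r"[^0-9a-zA-Z]+", "_", text).strip("_")
    return text.lower()


def flatten_headers(header_rows: list[list[str]]) -> list[str]:
    """Combine multi-level header rows into one name per column.

    Upper-level rows are forward-filled so that a merged cell spanning
    several columns applies to each of them; the last row is left as is.
    """
    filled = []
    for index, row in enumerate(header_rows):
        if index == len(header_rows) - 1:
            filled.append(list(row))
            continue
        last = ""
        out = []
        for cell in row:
            if cell.strip():
                last = cell
            out.append(last)
        filled.append(out)

    names = []
    for parts in zip(*filled):
        cleaned = [normalise_label(part) for part in parts if part.strip()]
        names.append("_".join(c for c in cleaned if c))
    return names


if __name__ == "__main__":
    header = [
        ["Region", "Persons (a)", "", "Median age [1]"],
        ["", "Male", "Female*", ""],
    ]
    print(flatten_headers(header))
```

Dropping the markers from column names loses the link to the footnotes they point to, so a fuller pipeline would likely store that information separately rather than discard it.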

This work demonstrates that the effectiveness of grounding approaches depends not only on selecting the right architecture for the question type, but also on ensuring that the underlying data is AI-ready.

For further information please contact Ali Behnaz.

Contact us

Please email methodology@abs.gov.au to:

contact authors for further information
provide comments or feedback
be added to or removed from our electronic mailing list.

Alternatively, you can post to:

Methodological News Editor
Methodology Division
Australian Bureau of Statistics
Locked Bag No. 10
Belconnen ACT 2617

The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.

Previous releases

Releases from June 2021 onwards can be accessed under research.

Releases up to March 2021 can be accessed under past releases.
