Customised GenAI systems built on Large Language Models (LLMs) offer new opportunities to support the production and dissemination of official statistics. For example, conversational agents can help clients navigate survey and data services, improving the overall user experience. At the same time, these systems can produce low-quality responses (incorrect, incomplete, irrelevant, unclear, or poorly toned answers) and can struggle with multilingual or atypical queries. If not carefully managed, this can make services unreliable or less accessible, ultimately eroding public trust.
The GenAI Evaluation Framework addresses these risks by focusing on quality (contextual relevance, clarity, grounding, tone), performance (accuracy, latency, cost), and safety and ethics (confidentiality, detection of toxic content).
The framework consists of two components:
(1) GenAI Evaluation Workflow: a reusable, flexible evaluation workflow that embeds evaluation throughout the AI development cycle as a continuous and iterative process, and
(2) Evaluation suite: a set of code-based tools in a Git repository, designed to support each stage of the workflow.
Figure 1 - GenAI Evaluation Workflow
Description
The workflow shows the process for evaluating Generative AI applications, such as conversational chatbots, LLM-based coders, and summarisation tools, which take user queries as input and generate outputs using LLMs.
In the “Evaluation Dataset Preparation” stage, a human domain expert designs scenarios and uses them to guide an LLM (the “Q&A Generator”) to generate a set of meaningful ground-truth question-answer pairs.
Next, the “LLM Inference pipeline” runs the target application on the question from each question-answer pair to generate the model's output.
In the “Evaluation Metrics Selection” stage, a human domain expert selects the appropriate metrics across quality (e.g., relevance, accuracy), performance (e.g., cost, latency) and safety or ethical considerations.
In the “Evaluation Runner” stage, an “LLM-as-Judge” approach is used in a secure cloud environment. In this process, a separate LLM scores the output answers against the original ground-truth answers in the question-answer pairs, using the selected metrics.
The resulting scores help human domain experts and application developers identify issues and improve the Generative AI Application.
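As an illustration of how the resulting scores might be turned into a list of issues, the sketch below averages judge scores per metric and flags any metric whose average falls below a threshold. The 1 to 5 scale, the `summarise` function, the result layout, and the threshold of 3.0 are all assumptions made for this example; they are not taken from the Evaluation suite.

```python
from statistics import mean


def summarise(results, threshold=3.0):
    """Average scores per metric and list the metrics below the threshold.

    `results` is a hypothetical layout: one dict per evaluated question,
    each holding a "scores" dict that maps metric name to a 1-5 score.
    """
    by_metric = {}
    for result in results:
        for metric, score in result["scores"].items():
            by_metric.setdefault(metric, []).append(score)

    averages = {metric: mean(scores) for metric, scores in by_metric.items()}
    flagged = sorted(m for m, avg in averages.items() if avg < threshold)
    return averages, flagged


# Made-up scores for two evaluated questions (illustrative only).
example_results = [
    {"scores": {"relevance": 5, "tone": 2}},
    {"scores": {"relevance": 4, "tone": 3}},
]
averages, flagged = summarise(example_results)
print(averages)  # per-metric averages
print(flagged)   # metrics that may need attention
```

A per-metric summary of this kind gives domain experts and developers a shared starting point for deciding which aspect of the application to investigate first.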
The workflow combines human and domain expertise, needed to ensure ethical alignment, with a structured engineering approach.
Business areas can design scenarios for the question-answer generator (“Q&A Generator”) to ensure evaluation questions reflect diverse user personas with different education levels, languages, and tones. Domain experts also assist with the selection of appropriate metrics (“Evaluation Metrics Selection”).
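One way to picture scenario design is as a grid of persona attributes that is expanded into prompts for the question-answer generator. The sketch below is a minimal example under that assumption; the `Persona` class, the attribute values, the topic, and the prompt wording are hypothetical and do not come from the Evaluation suite.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    """A hypothetical user persona used to vary evaluation questions."""
    education_level: str
    language: str
    tone: str


def build_personas(education_levels, languages, tones):
    """Return every combination of the supplied persona attributes."""
    return [
        Persona(level, language, tone)
        for level, language, tone in product(education_levels, languages, tones)
    ]


def generator_prompt(topic, persona):
    """Build the instruction that would be sent to the Q&A-generating LLM."""
    return (
        f"Write one question about '{topic}' as asked by a user with "
        f"{persona.education_level} education, writing in {persona.language} "
        f"in a {persona.tone} tone. Then give the ground-truth answer."
    )


# Illustrative attribute values chosen for this example only.
personas = build_personas(
    education_levels=["secondary", "tertiary"],
    languages=["English", "Vietnamese"],
    tones=["formal", "frustrated"],
)
prompts = [generator_prompt("completing a business survey", p) for p in personas]
print(len(prompts))  # 2 * 2 * 2 = 8 prompts
```

Expanding a small grid like this keeps the expert's input short while still covering each combination of education level, language, and tone.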
Once customised, the workflow is executed as a structured, repeatable end-to-end process that includes:
- Automated, scenario-based generation of evaluation datasets, in the form of ground-truth question-answer pairs (“Q&A Generator”)
- Automated scoring against selected metrics using “LLM-as-Judge” (“Evaluation Runner”)
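The scoring step can be sketched as a loop that runs the target application on each question, asks a judge model to rate the answer against the ground truth, and parses the rating. This is a minimal sketch, not the Evaluation suite's implementation: the function names, the 1 to 5 scale, the `Score: <n>` reply format, and the stub application and judge (which stand in for real LLM calls) are all assumptions.

```python
import re


def judge_prompt(metric, question, reference, answer):
    """Build the instruction sent to the judge LLM for one metric."""
    return (
        f"Rate the candidate answer for {metric} on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply in the form 'Score: <n>'."
    )


def parse_score(reply):
    """Extract the 1-5 score from a judge reply, or raise ValueError."""
    match = re.search(r"score\s*:\s*([1-5])\b", reply, re.IGNORECASE)
    if match is None:
        raise ValueError(f"No score found in judge reply: {reply!r}")
    return int(match.group(1))


def run_evaluation(qa_pairs, app, judge, metrics):
    """Run the app on each question and score its answer on each metric."""
    results = []
    for pair in qa_pairs:
        answer = app(pair["question"])
        scores = {
            metric: parse_score(
                judge(judge_prompt(metric, pair["question"], pair["answer"], answer))
            )
            for metric in metrics
        }
        results.append({"question": pair["question"], "answer": answer, "scores": scores})
    return results


# Stubs standing in for the target application and the judge LLM.
def stub_app(question):
    return "You can lodge the form online."


def stub_judge(prompt):
    return "Score: 4"


qa_pairs = [{"question": "How do I lodge my form?", "answer": "Lodge it online."}]
results = run_evaluation(qa_pairs, stub_app, stub_judge, ["relevance", "grounding"])
print(results[0]["scores"])
```

Passing the application and the judge in as plain callables keeps the loop independent of any particular model provider, so the same runner can be reused when either one changes.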
Future work will focus on applying and refining the GenAI Evaluation Workflow and on expanding the Evaluation suite, including adding a user interface to support customisation by domain experts.
The Evaluation Framework is being applied to relevant ABS use cases and aligns with broader Responsible AI initiatives across the Australian Public Service.
For further information please contact Ilana Lichtenstein.