About the Whole of Australian Government Occupation Coding Service

Learn more about the Coding Service and how to use it. Includes privacy, security, and assurance information.

Public Beta release

The WoAG Occupation Coding Service is now ready to support real-world data coding. However, as with any service based on machine learning (ML), it may return unexpected results for ambiguous inputs. We encourage feedback to help us continue refining the service for all users.

What is the Coding Service?

The ABS is excited to launch the ABS WoAG Occupation Coding Service (the Coding Service), a new digital service that makes it easier and faster to code occupation data to the latest official standard classifications.

We have also created a video explaining what the ABS WoAG Occupation Coding Service is and how it works.

The WoAG Occupation Coding Service ('the Coding Service') is a new capability designed to code job title and tasks text to official statistical classifications. In this context, ‘coding’ means assigning numbered and labelled categories to free text descriptions of job titles and tasks. The Coding Service is trained by supervised machine learning to make fast and accurate predictions of the best code matches for this input data.

Using cloud compute for power and scale to achieve highly automated, high-quality coding, the service makes coding data to the latest classifications easier, faster and cheaper. Coding to a standard classification ensures the data is meaningful and comparable across data sources, and enables consistent data coding across the Australian data landscape. This means data assets are more coherent for research and analysis, and policy makers get better information.

The service can be integrated into agency systems, forms and survey instruments. It can code single records or small batches of data in real time, and can also code large datasets in batches, up to many millions of records.

The Coding Service can code data to:

  • the Occupation Standard Classification for Australia (OSCA) 2024, and
  • the Australian and New Zealand Standard Classification of Occupations (ANZSCO) 2022 Australian Update.

The ABS will maintain the service to ensure the most up-to-date occupation models are available.

Machine Learning models

The Coding Service uses Hierarchical Support Vector Machine (HSVM) models trained by Machine Learning (ML) to code data. ML is a branch of artificial intelligence (AI) that uses algorithms to analyse, draw conclusions and then make predictions from patterns in data. The algorithms are trained to spot patterns, which helps improve coding accuracy. 

The models make predictions by learning how the ABS applies the occupation classifications against the way people describe their jobs and the tasks they do. During the training process, the models learn to map certain words, or combinations of words, to specific occupation codes for given input text. As the models are trained with both occupation title and task inputs, the best outcomes are achieved when both inputs are included for coding and contain enough detail to support the right coding decision.

The models were trained with millions of de-identified responses to occupation of employment questions in previous Australian Censuses. An advantage of this is that the training data represents the Australian population, and the service can recognise and code occupations described in the language and broad vocabulary used by Australians themselves - including misspellings and common colloquial terms. For example, if a person types ‘brickie’ into an online form, the service will recognise that this belongs to OSCA code 371131 - Bricklayer.

The models consider individual words, as well as bigrams, or combinations of two consecutive words, to help capture contextual meaning and word order. For example, the phrase ‘General Manager’ carries a meaning that is distinct from the words ‘General’ or ‘Manager’ on their own, which may appear across many unrelated occupations.
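The idea of combining single words with bigrams can be illustrated with a minimal sketch. This is not the ABS feature pipeline, which is not described here; the function name and the simple lower-casing and whitespace splitting are assumptions made for illustration only.

```python
def unigrams_and_bigrams(text: str) -> list[str]:
    """Return lower-cased single words, followed by consecutive two-word pairs."""
    words = text.lower().split()
    # Pair each word with the word that follows it to form bigrams.
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams


features = unigrams_and_bigrams("General Manager")
print(features)  # ['general', 'manager', 'general manager']
```

The bigram 'general manager' is kept as its own feature, so a model can treat it differently from 'general' or 'manager' appearing alone.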

When working out the best match for an input, the models draw on the structure and distribution of the input occupation title and task text in the training data. This approach gives the models flexibility to learn from natural variations in language. It also means that the quality, frequency, and specificity of training examples directly shape model behaviour. In particular, the models are more likely to predict occupations that are more common in the training data, especially when coding short or ambiguous inputs.

Training data is analysed to address biases and ensure the models balance efficiency with measures of quality, including accuracy and matching of input text to the classification. While the models behind the service have been extensively trained and tested to support high-quality coding, we cannot validate or guarantee correct results for all possible inputs. As noted above, feedback during Public Beta will help us continue refining the service for all users.  

How the service works

The Coding Service is connected and consumed through a standard and secure Application Programming Interface (API) endpoint. This is a software-to-software interface that allows two applications to exchange data with each other via programming scripts. Users may need support from their agency's technology services to integrate the Coding Service into their systems.

The programming code written in an agency’s system sends a request to the Coding Service, via an API, to assign classification codes and titles to specified input text. The ABS system returns classification codes and titles back through the API to the sender.

Single record or small batch coding

  • Single record (synchronous) coding is the real-time coding of text inputs for a single instance via the API. The request includes a text input of a job title and a description of that job’s tasks. The response indicates whether the request was successful and, if so, provides the best matched occupation code and title for the job, and a confidence score for that match.
  • The single record coding function can be embedded into online forms or surveys for synchronous (real-time) coding of text entries. Programming code for integration is available in the User Guide.
  • Synchronous coding of up to 300 records (small batch coding) can also be carried out via the API. The service generates a response to each entry in the request, indicating whether the entry was successfully coded and, if so, providing the best matched occupation code and title for that job, with its associated confidence score.
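As a rough sketch of preparing data for small batch coding, the example below splits a list of records into groups of no more than 300. It makes no API call. The field names `occupation_title` and `tasks`, and both function names, are hypothetical; the actual request format is defined in the User Guide.

```python
from typing import Iterator

MAX_SYNC_RECORDS = 300  # small batch limit described above


def make_record(title: str, tasks: str) -> dict[str, str]:
    # Hypothetical field names; check the User Guide for the real request format.
    return {"occupation_title": title.strip(), "tasks": tasks.strip()}


def small_batches(
    records: list[dict[str, str]], size: int = MAX_SYNC_RECORDS
) -> Iterator[list[dict[str, str]]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


records = [
    make_record(
        "primary school teacher",
        "teach literacy and numeracy to primary school students",
    )
] * 650
batch_sizes = [len(batch) for batch in small_batches(records)]
print(batch_sizes)  # [300, 300, 50]
```

Each yielded batch could then be sent as one synchronous request, while much larger datasets may be better suited to large batch coding.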

Large batch coding

  • The Coding Service can also be used for large batch coding or recoding, up to millions of records. This is asynchronous coding and can take from 5 minutes to a couple of hours, depending on a range of factors such as number of concurrent users, order in the queue, size of the data file to be coded and connection speed.
  • A file of records is sent via an API request and coded by the service. The output is a web address returned in the sender’s programming code. The web address hosts a downloadable file of all the coded and un-coded records, which opens by clicking the address.

Understanding model confidence

In machine learning, model confidence typically refers to the estimated probability the model assigns to its predicted outcome, indicating how certain it is about that prediction. The Coding Service models make predictions according to the structure of the OSCA and ANZSCO classifications. These are hierarchical structures with five levels:

  • major group (1-digit code) - broadest level
  • sub-major group (2-digit code)
  • minor group (3-digit code)
  • unit group (4-digit code)
  • occupation (6-digit code) - most detailed level

(To learn more about the occupation classification structure, see the OSCA Structure.)

During prediction, the model uses the input text (job title and tasks) to predict the most appropriate code, starting at the major group (1-digit code) and making its way down the classification structure to the occupation level (6-digit code). At each level, the model will choose the most likely code based on the information it learned during training.
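To make the levels concrete, the sketch below lists the groups that a code sits within. It assumes that the code for each level is a prefix of the codes beneath it, as the digit counts above suggest; the function name is hypothetical and this is not part of the service.

```python
# Level names and code lengths as listed above.
LEVELS = [
    ("major group", 1),
    ("sub-major group", 2),
    ("minor group", 3),
    ("unit group", 4),
    ("occupation", 6),
]


def code_path(code: str) -> dict[str, str]:
    """Map each level, down to the level of `code`, to the matching code prefix."""
    valid_lengths = {digits for _, digits in LEVELS}
    if not (code.isascii() and code.isdigit()) or len(code) not in valid_lengths:
        raise ValueError(f"not a valid classification code: {code!r}")
    return {name: code[:digits] for name, digits in LEVELS if digits <= len(code)}


print(code_path("371131"))
# {'major group': '3', 'sub-major group': '37', 'minor group': '371', 'unit group': '3711', 'occupation': '371131'}
```

Read from left to right, the result mirrors the top-down order in which the model narrows its prediction.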

When using the service, the confidence score that accompanies a returned code represents a conditional probability. The score is not an overall measure of the model’s confidence across all levels of the classification. Instead, it highlights the model's relative strength of confidence within the relevant single classification group (i.e., at the same level of the classification as the final predicted code).

For example, a confidence score of 75% for a given 6-digit occupation means the model is 75% confident in its prediction relative to the other 6-digit occupations within the same Unit Group as the occupation predicted. It does not denote the model's confidence at the higher levels of the classification. Confidence scores can inform users of the strength of a prediction relative to nearest neighbour occupations, that is, occupations at the same level and group.
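The difference between a conditional score and an overall score can be shown with a toy calculation. All numbers below are made up, and treating each level's score as a probability that can be multiplied along the path is an assumption for illustration only; the service's internal calculation is not described here.

```python
from math import prod

# Made-up conditional confidences for one prediction, one per level of the hierarchy.
level_confidence = {
    "major group": 0.90,
    "sub-major group": 0.80,
    "minor group": 0.95,
    "unit group": 0.85,
    "occupation": 0.75,
}

# The returned score relates only to the final level of the prediction.
returned_score = level_confidence["occupation"]

# Under the illustrative assumption above, confidence in the whole path would be lower.
whole_path = prod(level_confidence.values())

print(f"{returned_score:.2f}")  # 0.75
print(f"{whole_path:.2f}")      # 0.44
```

In this toy case a 75% score says the predicted occupation is a strong choice among its neighbours in the same Unit Group, but says nothing about whether the Unit Group itself was the right one.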

Using the service

How to register

Important context for model use

The occupation coding models are optimised for use in an Australian context. They use, and will only accept, an English character set. They are designed to code responses that relate to individuals employed in legal occupations that fall within the definition and scope of official ABS occupation classifications. For example, ‘retiree’ and ‘homemaker’ do not return codes, as these are not occupations defined in official ABS occupation classifications.

The contextual assumption of the input text is that the text relates to and describes a job. The models can recognise a broad vocabulary, and will attempt to code all input text regardless of context, so users need to ensure a contextual fit between their input data and the coding task being undertaken.

For example, if a person describes their job as a ‘prisoner’, the service assumes that the occupation to be coded involves working with prisoners in some way, and codes to ‘Correctional Officer’. Likewise, the input text ‘baby’ codes to ‘Nanny’.

If a person describes two jobs in a single text response, the service will attempt to code the provided text to a best fit single occupation code at the most detailed level. This will reflect the training data and will depend on how many times the two jobs were present together in the training data. The service will default to whatever is most commonly found in the training data.  

Tips for getting the best predictions out of the Coding Service

  • Enter both a job title and its tasks as input texts. This is how the Coding Service performs best as this is how the models were trained. The models should be able to predict a 6-digit occupation code if input text contains brief, specific details and descriptions of both a job and its tasks. It may not code as accurately with just one of these inputs included.
  • When the input text is limited or vague, the model will produce a lower quality coding outcome. It may predict a code at a less detailed level (e.g. a 3-digit code representing the minor group) or fail to produce a code altogether.
  • The more specific the job title and task information sent to the coder, the better the quality of the predictions.
  • Be specific. Sending the Coding Service the occupation text ‘teacher’ with the tasks text ‘teaching’ is too vague. A better occupation entry might be ‘primary school teacher’ and a more detailed description of the tasks might be ‘teach literacy, numeracy, etc. to primary school students’.
  • Spell check your input text. The model has learned to handle common misspellings of certain words. However, the cleaner the input text, the better the chance of getting a quality prediction.
  • The Coding Service provides high-quality, probabilistic coding using well-tested models. However, appropriate review by users is encouraged before using coded outputs. The Coding Service returns confidence scores to help inform that review.
  • While the models behind the service have been extensively trained to support high-quality classification, we cannot validate or guarantee correct results for all possible inputs. Where users encounter unexpected results, please let us know. Your feedback helps us continue refining performance and improving the service for all users.
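One way a user might organise review of coded outputs is sketched below. The `CodedRecord` shape, the 0.60 threshold and the representation of confidence as a fraction between 0 and 1 are all hypothetical choices for illustration; they are not part of the service, and each agency would need to set its own review rules.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 0.60  # hypothetical cut-off, to be chosen by the user
FULL_DETAIL_DIGITS = 6   # occupation level, the most detailed level


@dataclass
class CodedRecord:
    code: Optional[str]          # None when no code was returned
    confidence: Optional[float]  # assumed here to be a fraction from 0 to 1


def needs_review(record: CodedRecord, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag records that were not coded, were coded at a less detailed level, or scored low."""
    if record.code is None or record.confidence is None:
        return True
    if len(record.code) < FULL_DETAIL_DIGITS:
        return True
    return record.confidence < threshold


# Made-up outputs for illustration only.
print(needs_review(CodedRecord("371131", 0.92)))  # False
print(needs_review(CodedRecord("371", 0.88)))     # True
print(needs_review(CodedRecord(None, None)))      # True
```

Because the confidence score is relative to neighbouring codes at the same level, a threshold of this kind is a prompt for human review, not a guarantee of correctness.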

Coding to the Occupation Standard Classification for Australia (OSCA)

There are currently two ways of coding occupation data to OSCA 2024 - the WoAG Occupation Coding Service and the OSCA Coder. The Coding Service uses machine learning and connects via an API for automated coding, while the OSCA Coder is a downloadable program that uses rule-based indexes to match job titles. 

WoAG Occupation Coding Service

  • The WoAG Occupation Coding Service is trained by supervised machine learning to make fast and accurate predictions of the best code match for input occupation title and tasks data. For registered users, this service is consumed through a standard and secure API endpoint which enables it to be integrated into agency systems, forms and survey instruments.
  • Connecting to the API requires script-based instructions to be followed by users, and data to be entered in specific formats. Once established, the API endpoint enables automated, quick and easy use.
  • The service supports consistent coding of records to OSCA 2024 and ANZSCO 2022. Entering a job title and task text into the service will return a best match prediction of the most appropriate occupation code and title for that job. The models draw on their training data to assess the possible codes and select the best one.
  • The service allows users to quickly and easily adopt OSCA as the standard classification for their occupation data. Large batch coding enables existing data assets to be recoded to OSCA 2024 in a single, fast, automated process.
  • Previous classification versions will also be available for use, starting from ANZSCO 2022.
  • Where input job title and task data is retained in a dataset, the service can recode occupation datasets to OSCA that were previously coded to ANZSCO. If previously coded datasets do not retain these text inputs for job titles or tasks (for example, if the dataset now contains only ANZSCO codes), it is not possible to use the Coding Service to recode this data to OSCA. Users will need to use the published concordance file to remap the data to the OSCA classification codes.
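For datasets that hold only codes, remapping through a concordance might look like the sketch below. The layout of the published concordance file is not described here, so the dictionary structure, the function name and the placeholder codes are all assumptions. The sketch remaps only codes with a single target and sets aside the rest for a manual decision.

```python
def remap_with_concordance(
    codes: list[str], concordance: dict[str, list[str]]
) -> tuple[dict[str, str], list[str]]:
    """Return (remapped codes, codes needing a manual decision)."""
    remapped: dict[str, str] = {}
    unresolved: list[str] = []
    for code in codes:
        targets = concordance.get(code, [])
        if len(targets) == 1:
            remapped[code] = targets[0]
        else:
            # No target, or more than one possible target: leave for manual review.
            unresolved.append(code)
    return remapped, unresolved


# Placeholder codes only; these are not real ANZSCO or OSCA codes.
toy_concordance = {"A00001": ["O10001"], "A00002": ["O20001", "O20002"]}
remapped, unresolved = remap_with_concordance(
    ["A00001", "A00002", "A00003"], toy_concordance
)
print(remapped)    # {'A00001': 'O10001'}
print(unresolved)  # ['A00002', 'A00003']
```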

OSCA Coder

  • The OSCA Coder is a Windows-based structured coding system that codes occupation information to OSCA 2024 v1.0. Users are required to code each description individually by manually entering the information into the coder.
  • It assigns codes according to prescribed rules incorporated into a detailed coding index. The index is an extensive list of occupation titles which incorporates tasks, goods and industry information, as needed, to determine the appropriate OSCA code.
  • The OSCA Coder is available on request as a set of downloadable files. Users who require coding to ANZSCO 2022 and earlier versions of ANZSCO may also request the ANZSCO Coder.
  • Learn more about OSCA on the ABS website or contact occupation.standards.and.classifications@abs.gov.au for more information.

Searching the classification

There are also occupation classification search tools available. However, these are not statistical coding tools. Entering an occupation name into the OSCA Search or ANZSCO Search tool will return all matching classification labels that contain this term. The search tools can be used to locate a job in the classification hierarchy, see the range of occupations in a category, and provide contextual information for manual coding decisions or Coding Service output review.

Privacy and security

The WoAG Occupation Coding Service is private, safe and secure. It has been designed with strong protections in place to make sure it cannot be misused or accessed by other systems. It has also been assessed under the Infosec Registered Assessors Program (IRAP), affirming the security of the environment.

  • Users of the Coding Service cannot interact with the AI technologies used to develop coding models.
  • The models are wholly developed within ABS owned and controlled infrastructure within Australia, following strong privacy and security practices within existing ABS safeguards, while ensuring responsible and transparent use of ML technology.
  • The machine learning models underpinning the service are trained entirely within a dedicated, segregated, private and highly secure ABS account.
  • The models are static and cannot learn from user input data. The service does not re-use or retain the text input from user requests.
  • The service does not access any data beyond the specific text submitted for coding. It does not store, share, or use any personal information. The data submitted for coding is securely processed and then deleted once a response is provided. 

These are conscious design choices that reflect our commitment to privacy, security, and trust. 

Assurance

The WoAG Occupation Coding Service uses AI in accordance with applicable legislation, regulations, frameworks and policies.

The service complies with the Policy for the responsible use of AI in government. This policy advises the use of existing frameworks through which to consider the risks of developing and using AI, including cyber security, privacy, and ethics. Data security and cyber security are critical for the ABS and are essential elements of the AI technology used in the service. The ABS has governance procedures to ensure that AI is used safely and responsibly when creating the Coding Service models, in accordance with government policy and ethics principles.

The models were assessed against the newly developed Australian Government Pilot AI assurance framework to assure the models’ quality and fitness for purpose. The ABS has taken a risk-managed approach to the use of AI, addressing risks that include the potential for algorithmic bias, lack of model transparency and unintended consequences. The forms of AI used in the Coding Service, including Hierarchical Support Vector Machine (HSVM) models, ensure the AI is transparent, traceable, and contained, which lowers potential risks.

The WoAG Occupation Coding Service is purpose-built for one specific task: coding data to official occupation classifications. It cannot do anything outside of this single purpose.

The models are tightly scoped, using highly controlled, narrow machine learning capability. They can only return occupation codes and titles. They cannot generate responses, provide insights, or perform any function beyond this specific task.

The models are trained on historical ABS data. They remain static until manually updated or retrained in a controlled process. They do not dynamically learn from, or adapt to, the data that users send to be coded. They cannot update themselves with previously unseen words.
