1504.0 - Methodological News, Mar 2021  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 29/03/2021   

Text analytics

As part of its research program in text analytics – the automatic extraction of information content from textual data sources – the ABS is investigating the potential use of Named Entity Recognition (NER) in statistical production. NER is a capability in the field of Natural Language Processing that detects references to entities in text and classifies them into predefined categories of entities with similar attributes, such as person, business, location, product or event. A named entity is one that is identified by a distinctive word or phrase. For example, a person or business may be identified by a name, a location by an address, and an event by a date.

Many approaches to NER have been developed over time, including rules-based identification and categorisation, unsupervised methods such as clustering, and feature-based supervised learning. Recently, the use of neural networks has delivered state-of-the-art results in entity extraction, and this approach is being investigated for use in ABS applications.
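To make the idea of rules-based identification and categorisation concrete, the following is a minimal sketch in Python. It is not the ABS method: the lookup lists, the date pattern, the label names and the sample sentence are all invented for illustration, and a neural approach would learn such patterns from annotated data rather than rely on hand-written rules.

```python
import re

# Hypothetical gazetteers: tiny hand-made lookup lists, for illustration only.
GAZETTEER = {
    "LOCATION": ["Canberra", "Sydney"],
    "BUSINESS": ["Acme Pty Ltd"],
}

# Dates written as dd/mm/yyyy stand in for the "event identified by a date" case.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def find_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    found = []
    # Rule 1: exact lookup of known names in each category.
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), label, match.group()))
    # Rule 2: a regular expression for date-like strings.
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.end(), "DATE", match.group()))
    return sorted(found)


# Invented sample sentence.
sample = "Acme Pty Ltd opened a store in Canberra on 29/03/2021."
entities = find_entities(sample)
for start, end, label, surface in entities:
    print(f"{label:<9}{surface} [{start}:{end}]")
```

A rule set like this is easy to inspect but only finds what its lists and patterns anticipate, which is one reason learned models are attractive for free-text sources.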

The two current use cases for NER in the ABS involve identifying and extracting descriptive information about:

    • Consumer products (such as type, model, version, size) contained within retail scanner transaction records.
    • Job vacancies (such as position title, required skills, location) contained in web-scraped online job advertisements.

The job vacancy use case has been chosen as the initial focus area for investigation. Following a survey of the literature in December 2020, an experiment was conducted using web-scraped Australian job advertisement data from Seek, which is publicly available on the Kaggle data science portal. The motivation for this work is to create entity-level data about jobs that can be used in the ABS Labour Market Analysis Project (LMAP) to deliver new insights on the impact of COVID-19 on employment and labour demand. This project also uses GLIDE – an advanced knowledge discovery system under development by the ABS.

The Seek dataset required manual annotation to create training data for the machine learning process. The Python package spaCy was then used to generate a basic NER language model to extract key job characteristics. The next stage of research aims to develop a machine-interpretable concept model of job classes, and to iteratively improve the performance of the language models for use in the LMAP.
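The manual annotation step can be pictured as marking phrases in each advertisement and attaching a label to each. The sketch below is an assumption-laden illustration, not the annotation scheme used in the experiment: the advertisement text, the label names (JOB_TITLE, LOCATION, SKILL) and the helper `to_training_example` are all hypothetical. It converts labelled phrases into character-offset spans, a layout of the kind that spaCy NER training examples can be built from.

```python
def to_training_example(text, annotations):
    """Convert (phrase, label) annotations into character-offset spans.

    Hypothetical helper, not part of spaCy. Each phrase is located at its
    first occurrence in the text; a missing phrase raises ValueError.
    """
    spans = []
    for phrase, label in annotations:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"annotated phrase not found: {phrase!r}")
        spans.append((start, start + len(phrase), label))
    spans.sort()
    # Reject overlapping spans so that each character carries at most one label.
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping entity spans")
    return text, {"entities": spans}


# Invented advertisement text and labels, for illustration only.
ad_text = "Data Analyst wanted in Melbourne. Python and SQL skills required."
annotations = [
    ("Data Analyst", "JOB_TITLE"),
    ("Melbourne", "LOCATION"),
    ("Python", "SKILL"),
    ("SQL", "SKILL"),
]

text, labels = to_training_example(ad_text, annotations)
for start, end, label in labels["entities"]:
    print(label, repr(text[start:end]))
```

In a spaCy v3 pipeline, a pair like this could plausibly be turned into a training example with `Example.from_dict(nlp.make_doc(text), labels)`; how the ABS experiment actually prepared its data is not described here.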

For further information, please contact Phil Newbold at methodology@abs.gov.au.

The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.