1504.0 - Methodological News, June 2019  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 20/06/2019   

A NEW MACHINE LEARNING APPROACH FOR ENTITY RESOLUTION

The problem of entity resolution (“is this thing the same as that thing?”) has many applications in ABS work, particularly in the data integration and validation processes.

An internship organised by Australian Postgraduate Research (APR) between December 2018 and April 2019 allowed an intern from the Australian National University (ANU) to work on these issues with the ABS. The intern explored methods for linking business records across time in a simulated population using machine learning.

This kind of linkage is often investigated via latent variable methods. We assume that each person, business, or other entity has invisible (latent) features which influence the presence or absence of relationships. We then use the observable data to estimate, simultaneously, these features and the rules that determine how they affect relationships between entities. For instance, if a company employs a high percentage of staff who have previously worked in mining, we might deduce that this company is also in mining.
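As a rough illustration of the latent variable idea, the sketch below gives each entity a small vector of hidden features and uses a single matrix as the "rule" that turns a pair of feature vectors into a relationship probability. This bilinear form is only one common choice and is an assumption here, not a description of the project's model. The values are random placeholders; in practice both the feature vectors and the matrix would be estimated from observed relationships.

```python
import numpy as np

rng = np.random.default_rng(0)

n_entities, n_features = 5, 3

# Latent features: one hidden vector per entity.
# Random placeholders here; a real model would learn them from data.
entity_features = rng.normal(size=(n_entities, n_features))

# The "rule": how pairs of latent features interact for one relationship type.
# Also a random placeholder that would normally be learned.
interaction = rng.normal(size=(n_features, n_features))


def relationship_probability(i, j):
    """Probability that entity i is related to entity j (bilinear model)."""
    score = entity_features[i] @ interaction @ entity_features[j]
    return 1.0 / (1.0 + np.exp(-score))  # logistic link maps score to [0, 1]


print(relationship_probability(0, 1))
```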

The project considered several latent variable tensor learning methods, including RESCAL, holographic embedding, and complex embedding, before settling on TransE. TransE is a relatively simple method, but it performed well at linking business records in the simulated population and scales well to large data sets. Linkage performance was further improved by a “blocking” approach, which breaks the full-sized problem into smaller, more manageable problems.
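To make these two ideas concrete, here is a minimal sketch of the standard TransE scoring rule and of a simple form of blocking. The article does not say how the project configured either, so everything specific below is an assumption made for illustration: the hand-set vectors, the `postcode` blocking key, and the record identifiers are all hypothetical. TransE treats a relationship as a translation in the embedding space, so a (head, relation, tail) triple is plausible when head + relation lands close to tail.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def transe_score(head, relation, tail):
    """TransE plausibility: closer to 0 means head + relation is nearer to tail."""
    return -float(np.linalg.norm(head + relation - tail))


def candidate_pairs(records, key):
    """Blocking: only pair up records that share the same value of `key`."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[record[key]].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hand-set toy embeddings (hypothetical; real ones are learned by training).
head = np.array([1.0, 0.0])
relation = np.array([0.0, 1.0])
good_tail = np.array([1.0, 1.0])  # exactly head + relation
bad_tail = np.array([3.0, 1.0])   # two units away from head + relation

print(transe_score(head, relation, good_tail))
print(transe_score(head, relation, bad_tail))

# Hypothetical records with a hypothetical blocking key.
records = {
    "a1": {"postcode": "2600"},
    "a2": {"postcode": "2600"},
    "b1": {"postcode": "3000"},
    "b2": {"postcode": "3000"},
    "c1": {"postcode": "4000"},
}

pairs = list(candidate_pairs(records, "postcode"))
print(pairs)  # 2 candidate pairs, versus 10 if every record were compared with every other
```

The trade-off with blocking is that true matches whose blocking keys disagree are never compared, so the key needs to be chosen from fields that are reliable in the data at hand.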

We are now working to develop more sophisticated test data sets which will include realistic longitudinal phenomena and relationships between entities (e.g. family members working in the same business) along with complications such as multiple job-holders and enterprise group structures. These will be generated by a simulation approach with no use of confidential inputs, allowing them to be shared freely without any requirement for clearance processes. We expect that this will improve our ability to test and validate machine learning on realistic data, and make it easier to draw on external expertise without needing to share private data.

For more information, please contact Junmei Jing at Methodology@abs.gov.au.
