1504.0 - Methodological News, Dec 2020

ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 11/12/2020

Unsupervised machine learning for text analysis

In April 2020, the ABS trialled the application of unsupervised machine learning (ML) for the automatic analysis of free-text responses on two surveys relating to the impacts of COVID-19. The aim was to extract meaningful themes from these responses as quickly as possible.

Free-text is text or comments that users provide in response to open-ended questions. There is often no common structure or conceptual basis underlying the themes presented in free-text responses, and the themes cannot be anticipated prior to the survey. It is therefore difficult to automate the process of coding free-text responses, and manual coding is both resource-intensive and time-consuming.

The first survey, Business Impacts of COVID-19, asked businesses if they had been adversely affected by COVID-19 (yes or no) and if so, how (free-text response). The second, the ABS Staff Wellbeing Survey, asked staff how they were currently feeling (respondents could select from a range of very good to very bad); respondents were then given an opportunity to explain, in a free-text response, why they were feeling this way. In both surveys, analysis of the free-text responses required a quick turnaround (less than three weeks).

Text analysis involves three separate processes:

Preprocessing, where the text responses are cleaned, punctuation is removed, upper case letters are converted to lower case, and words are reduced to their stems (e.g. ‘working’ is reduced to ‘work’).
Encoding (or embedding), in which the processed text is converted to a numerical form (such as a vector) that can be represented as a point in some space. Depending on the encoding scheme used, similar text will be represented by points that are close together in the space.
Grouping, in which text is grouped based on the distance between the points in the space. This step involves the use of ML to classify similar responses into groups or clusters.

Specifically, unsupervised ML can be used to group similar responses together. It is intended to work in situations where no pre-labelled data exists on which to train a model. Rather, the approach seeks to detect patterns or themes in the data without any pre-existing examples to learn from (in contrast with supervised approaches).

Two unsupervised ML approaches were tested. The first method used ‘clustering’: a method where similar text responses are grouped together into clusters. This method was tested on Business Impacts of COVID-19 Survey data and provided some insights into common effects of COVID-19 on businesses. Two predominant clusters were ‘staff working from home’ and ‘closure’ of things.

The second method used probability distributions to identify major themes in text responses by identifying the most common words associated with a topic. This method was used to analyse the Staff Wellbeing Survey. Common themes identified when staff were feeling very good were ‘sun shining, good weather’ and ‘no commute time, sleep’. A common topic identified for staff who were feeling very bad was ‘personal pressure, children’.

Both methods were quick and processed hundreds of text responses in under ten minutes. While the methods were not perfect in grouping or clustering all text responses, they did work well to reduce the amount of manual analysis required. The clustering approach for Cycle 1 of the Business Impacts of COVID-19 Survey data successfully clustered about 30% of the free-text responses into coherent groups. The probability distribution method for the Staff Wellbeing Survey outperformed clustering analysis and successfully classified over 90% of free-text responses.
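The three text analysis processes described earlier (preprocessing, encoding, and grouping by distance) can be illustrated with a minimal Python sketch. The article does not say which stemmer, encoding scheme or clustering algorithm the ABS used, so everything here is an assumption chosen for brevity: a crude suffix-stripping stemmer, a bag-of-words count vector, cosine distance, a greedy grouping rule, a hypothetical threshold of 0.6, and made-up sample responses.

```python
import math
import re
from collections import Counter

# Assumption: a toy suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def preprocess(text):
    """Lower-case the text, drop punctuation, and crudely stem each word."""
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    stems = []
    for word in words:
        for suffix in SUFFIXES:
            # Strip a suffix only if at least three letters remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        stems.append(word)
    return stems


def encode(tokens):
    """Encode tokens as a sparse bag-of-words vector (word -> count)."""
    return Counter(tokens)


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def group(vectors, threshold=0.6):
    """Greedy grouping: a response joins the first cluster whose seed
    (its first member) lies within the distance threshold; otherwise
    it starts a new cluster. The threshold is a hypothetical value."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_distance(vec, vectors[cluster[0]]) <= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up responses for illustration only, not survey data.
responses = [
    "Staff are working from home.",
    "All staff now work from home!",
    "We closed the shop.",
    "Shop closed since March.",
]
vectors = [encode(preprocess(r)) for r in responses]
clusters = group(vectors)
for cluster in clusters:
    print([responses[i] for i in cluster])
```

The greedy rule is used only to keep the sketch short and free of dependencies. In practice, an established stemmer, a weighted encoding and a standard clustering algorithm would likely give more coherent groups.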
Given further development, these methods show great potential for use in other ABS applications.

For more information, please contact Lisa-Maree Gulino at methodology@abs.gov.au.