Time series clustering
The ABS is analysing the impact of the COVID-19 pandemic on the Australian economy using new methods, technologies and sources of data. One new method under investigation is the application of specialised clustering techniques to find patterns across large numbers of different time series. A time series is simply a set of measurements made sequentially in time. Clustering is the task of grouping objects in such a way that those in the same group (called a cluster) are more similar to each other (according to some objective measure) than to those in other groups.
Clustering different time series by their structure exposes similar features that may be caused by common events or behaviours in the underlying system of interest. This approach is more efficient than manually inspecting each individual time series, and it also avoids biases caused by subjective interpretation. For example, given data about personal mobility by country or state, can regions be identified that have been similarly affected by the lockdown measures imposed by governments, based on the types of business/industry or the demographics of residents in these areas? Or given thousands of time series of merchandise exports and imports data, can categories be found that give a better understanding of the factors influencing global trade patterns?
There are many techniques for time series clustering, but any approach must address three main issues: how to represent the time series in a compact way that still preserves its essential characteristics; how to measure the similarity between time series under conditions that may cause distortions; and what algorithm to use for the clustering process.
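To illustrate why the choice of similarity measure matters under distortion, the following minimal Python sketch compares the Euclidean distance with a textbook dynamic time warping (DTW) distance. It is an illustration only, not the method used by the ABS: the series `early`, `late` and `flat` are made-up values, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Point-by-point distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Textbook dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Made-up series: the same bump occurring two time steps apart, and a flat line
early = [0, 0, 5, 5, 0, 0, 0, 0]
late = [0, 0, 0, 0, 5, 5, 0, 0]
flat = [0] * 8

print(euclidean(early, late), euclidean(early, flat))
print(dtw(early, late), dtw(early, flat))
```

In this toy case the Euclidean distance rates the shifted bump as further from `early` than the flat line is, whereas DTW aligns the two bumps and treats them as identical. Which behaviour is preferable depends on whether the timing of an event is itself of interest.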
To explore some of these issues, an undergraduate student from Monash University undertook a research project during the 2020-21 summer break as part of the ABS Vacation Student Program. The objectives were to evaluate the effectiveness, performance and computational demands of alternative approaches for clustering time series, using publicly available Apple and Google personal mobility data and open source R software packages in an AWS cloud computing environment. The work covered different representation schemes (wavelet transform, ARIMA), similarity metrics (Euclidean, dynamic time warping) and clustering algorithms (partitioning, hierarchical decomposition).
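As a rough illustration of how a representation scheme, a similarity metric and a clustering algorithm fit together, the sketch below compresses each series with a Haar wavelet approximation and then applies average-linkage hierarchical clustering with Euclidean distance. The project itself used R packages; this Python version is an assumption made for illustration, the mobility-style values are invented, and the function names are hypothetical.

```python
import math

def haar_approx(series, levels):
    """Keep only the Haar approximation coefficients after `levels` steps.

    Each step replaces adjacent pairs by (a + b) / sqrt(2), halving the length.
    Assumes the series length is divisible by 2 ** levels.
    """
    out = list(series)
    for _ in range(levels):
        out = [(out[i] + out[i + 1]) / math.sqrt(2)
               for i in range(0, len(out) - 1, 2)]
    return out

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Average-linkage hierarchical clustering, stopped when k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        # Merge the two clusters with the smallest average pairwise distance
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Invented index values for four regions (baseline = 100)
series = [
    [100, 98, 60, 58, 62, 60, 80, 82],      # sharp drop, partial recovery
    [100, 102, 62, 60, 58, 62, 78, 80],     # similar drop and recovery
    [100, 100, 98, 102, 100, 98, 100, 102], # roughly unchanged
    [100, 98, 100, 100, 102, 100, 98, 100], # roughly unchanged
]

compact = [haar_approx(s, levels=2) for s in series]  # 8 values -> 2 coefficients
clusters = agglomerative(compact, k=2)
print(clusters)
```

Here each eight-point series is summarised by two coefficients before clustering, and the two regions with a drop and recovery are grouped separately from the two that stay near baseline. A partitioning algorithm or a different representation such as fitted ARIMA parameters could be substituted at the corresponding step.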
Some key findings from the project were that:
- The large volume of data involved makes it essential to reduce the dimensionality of the problem – by appropriately summarising the time series – for large-scale computation.
- The results of clustering depend critically on how the similarity of time series is quantified – whether the focus is on trends, seasonal/weekly patterns, or volatility. It is therefore not possible to identify one clear ‘best’ method, as the most suitable technique will depend on the goals of the analysis.
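The second finding can be illustrated with a small Python sketch in which the same series is matched to a different neighbour depending on whether similarity is judged by trend or by volatility. The two summary features used here (least-squares slope and standard deviation of period-to-period changes) are assumptions chosen for illustration rather than the measures used in the project, and the data are invented.

```python
from statistics import mean, pstdev

def trend(series):
    """Least-squares slope of the series against time index 0..n-1."""
    n = len(series)
    t_bar, y_bar = (n - 1) / 2, mean(series)
    num = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den

def volatility(series):
    """Standard deviation of period-to-period changes."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return pstdev(diffs)

def nearest(name, data, feature):
    """Name of the other series whose feature value is closest to that of `name`."""
    target = feature(data[name])
    others = (k for k in data if k != name)
    return min(others, key=lambda k: abs(feature(data[k]) - target))

# Invented series: each is reduced to a single number per feature
data = {
    "smooth_rise": [10, 12, 14, 16, 18, 20, 22, 24],
    "noisy_rise": [10, 16, 11, 19, 15, 23, 19, 27],
    "noisy_flat": [15, 20, 14, 21, 15, 20, 14, 21],
}

print(nearest("noisy_rise", data, trend))
print(nearest("noisy_rise", data, volatility))
```

Judged by trend, the noisy rising series is closest to the smooth rising one; judged by volatility, it is closest to the noisy flat one. Summarising each series by a handful of such features also reduces the dimensionality of the problem, as noted in the first finding.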
Further work is planned over the course of 2021 to improve and deploy basic methods and software for a wide range of analyses, starting with an investigation of merchandise trade data in April 2021. Research will then extend into the application of machine learning to time series classification and event detection.
For further information, please contact Ric Clarke at firstname.lastname@example.org.