Web scraping in the Australian CPI



The Consumer Price Index (CPI) provides a general measure of change in the prices of goods and services acquired by Australian households. It is an important measure of inflation that informs monetary and fiscal policy and is used extensively by governments, academics and economists for macro-economic analysis.

To ensure the CPI remains relevant and contemporary, the Australian Bureau of Statistics (ABS) has introduced a number of enhancements and innovations to the way it produces the CPI. A significant change was the use of new data sources to replace traditional methods of collecting price information.

Traditionally, the ABS sent field officers to stores to collect prices from retailers, or collected prices over the phone and manually from retailers' websites. This approach saw around 100,000 prices collected each quarter for use in the CPI.

In recent years, the ABS has replaced this form of ‘manual’ collection with administrative data provided by businesses and government agencies. An example of this was the introduction of transactions ‘scanner’ data in 2014. The use of these new data sources has resulted in a significant increase in the number of prices used and the range of products included in the CPI.

This article focuses on a new form of data collection known as web scraping. The ABS has been utilising web scraping technology to collect prices since 2016. Web scraping is currently used by the ABS to collect approximately 1 million prices per week from around 65 different retailers.

Explaining web scraping

With the growth of online retailing, prices and product information are often available from websites. Collecting this information online for the CPI lowers the burden on retailers to provide the data to the ABS and reduces the need to send ABS officers to collect the prices in stores.

Manual collection from websites is time consuming and limits the number of prices and products included in the CPI. Automating the collection of prices from websites using programming languages and specialised software enables large-scale data collection from retail websites. This is referred to as web scraping.

Web scraping is a technique employed to extract large amounts of data from websites. The ABS uses purpose-built programs to scan the websites of retailers, find the relevant information and store the information as a time series. The process runs automatically and as frequently as desired (daily/weekly), providing high frequency information on all products available on the website.
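As a much-simplified illustration of the extraction step, the sketch below parses a stored HTML fragment using Python's standard-library `HTMLParser`. The retailer markup, class names and products are all hypothetical, and a production scraper would fetch live pages and cope with far messier markup; this only shows the core idea of locating the relevant fields and storing them as structured data.

```python
from html.parser import HTMLParser

# Sample page fragment standing in for a retailer's product listing
# (the retailer, markup and class names are hypothetical).
SAMPLE_HTML = """
<div class="product"><span class="name">Shiraz 750ml</span>
<span class="price">$18.99</span></div>
<div class="product"><span class="name">Lager 6-pack</span>
<span class="price">$21.50</span></div>
"""

class ProductParser(HTMLParser):
    """Collect (name, price) records from 'product' blocks."""
    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None   # which span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "name":
            self.products.append({"name": data.strip()})
        elif self._field == "price":
            # Strip the currency symbol and store the price as a number
            self.products[-1]["price"] = float(data.strip().lstrip("$"))
        self._field = None

    def handle_endtag(self, tag):
        self._field = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.products)
```

Run repeatedly (daily or weekly) and stamped with the collection date, records like these accumulate into the kind of time series described above.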

Web scraping has been identified by several other countries as having an important role in the ongoing production of their CPIs. Statistical agencies including Statistics Netherlands, the Office for National Statistics in the United Kingdom, Statistics New Zealand, Statistics Canada and many European Union statistical agencies have begun web scraping price data. This has resulted in the release of a number of research papers and experimental series.

Use of web scraping in the Australian CPI

The ABS introduced web scraped data into the CPI in 2017 using the average price of a product over a given period of time (generally a month).

Prior to introducing web scraped data, the ABS compared 12 months of web scraped prices with the manually collected prices to confirm that the two sources were consistent and that web scraped prices were suitable for measuring price change. Once this was determined, the manually collected prices for certain products were replaced with web scraped data and manual collection ceased for these products. This approach has the following benefits:

  1. An average price computed from web scraped data collected twice weekly is more representative than a single price collected once per month or quarter;
  2. It enables more products to be included in the CPI basket; and
  3. It reduces CPI collection costs.
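The averaging in point 1 can be sketched as follows. The product, the prices and the use of a simple arithmetic mean are all illustrative assumptions; the ABS's actual averaging formula is not specified here.

```python
from statistics import mean

# Twice-weekly web scraped prices for one hypothetical product over a month.
scraped = {
    "2019-03": [18.99, 18.99, 17.50, 17.50, 18.99, 18.99, 18.99, 18.99],
}

# One price per product per period enters the index: the monthly average
# replaces what was previously a single manually collected observation.
monthly_price = {month: round(mean(prices), 2)
                 for month, prices in scraped.items()}
print(monthly_price)  # {'2019-03': 18.62}
```

Because the average draws on every collected observation, a temporary discount (here, two weeks at $17.50) is reflected in the monthly price rather than being caught or missed by a single collection day.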

At this stage, web scraped data is predominantly used in the CPI for alcohol, a small number of clothing products and car parts, accounting for around 5 per cent of the CPI. Figure 1 shows the range of data sources used in the CPI and the approximate contribution (or weight) of each.

Figure 1: Data sources used in the CPI

The pie chart splits the CPI into four data-source categories:

  • Manual collection (57%): takeaway food, clothing, furniture, appliances, travel, household services, rents and new dwellings.
  • Other administrative data (22%): electricity, gas, child care, fuel, pharmaceuticals and insurance.
  • Scanner data (16%): groceries and tobacco.
  • Web scraped data (5%): alcohol, clothing and car parts.

Not every product is available online (hairdressing services, for example), and it is not feasible to web scrape prices for products such as restaurant meals. However, the ABS estimates that there is the potential to use web scraped data for up to 20-25 per cent of the CPI. Greater use of web scraped data and other forms of new data could provide the following benefits:

  1. Enhance the CPI through the use of a larger number of prices and inclusion of a greater range of products in the CPI basket.
  2. Lower the costs of producing the CPI by reducing the proportion that is collected manually to below 40 per cent of the CPI.
  3. Enable the production of new statistics, including regional CPIs, spatial price indexes and a monthly CPI.

Although web scraping provides many benefits and opportunities, it also presents some challenges that the ABS and other national statistical organisations are investigating. Some of these challenges include:

  • Maintaining the web scrapers as websites change.
  • Handling the large number of new and disappearing products, known as product churn.
  • Developing automated methods to process and analyse the large volumes of data collected.
  • Managing the absence of expenditure data, which is needed to identify the most commonly purchased products and to weight them in the CPI.

Future work on web scraped data

While the use of web scraped data has enhanced the CPI, the ABS recognises that more can be done to make greater use of its potential. In late 2018 the ABS began research into using automated methods to process web scraped data for clothing and footwear products.

The methods being investigated are similar to those used to process the scanner data in the CPI. These methods have the benefit of using the full range of prices collected for each clothing and footwear product and dealing with the significant challenge of product churn for these products. These methods make use of clusters to group together similar products in order to measure price change over time. The ABS will publish an additional paper with the findings from this investigation at a later time.
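A much-simplified sketch of the cluster idea follows. This is not the ABS's actual method for scanner or web scraped data, only a toy illustration: products are grouped under a shared cluster key, and price change is measured as the ratio of the cluster's mean price between periods, so individual products can enter or leave without breaking the series. The observations and cluster keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scraped observations: (period, cluster_key, price).
# In practice clusters would be derived from product attributes
# (type, brand tier, size); here the cluster key is given directly.
observations = [
    ("2019-Q1", "mens-tshirt", 25.0), ("2019-Q1", "mens-tshirt", 35.0),
    ("2019-Q2", "mens-tshirt", 27.5), ("2019-Q2", "mens-tshirt", 38.5),
]

def cluster_index(observations, base, current):
    """Price movement per cluster: ratio of mean cluster prices.

    Because the ratio is taken at cluster level, products appearing or
    disappearing between periods (churn) do not break the comparison.
    """
    by_cluster = defaultdict(lambda: defaultdict(list))
    for period, key, price in observations:
        by_cluster[key][period].append(price)
    return {
        key: mean(periods[current]) / mean(periods[base])
        for key, periods in by_cluster.items()
        if base in periods and current in periods
    }

print(cluster_index(observations, "2019-Q1", "2019-Q2"))
# mens-tshirt: mean(27.5, 38.5) / mean(25.0, 35.0) = 33.0 / 30.0 = 1.10
```

The design choice worth noting is that the unit of comparison is the cluster, not the individual product, which is what makes the approach robust to the rapid turnover seen in clothing and footwear ranges.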

In addition to investigating automated methods, the ABS is continuing to identify products where manually collected prices can be replaced by web scraped prices. The ABS is also investigating the use of machine learning to automate the classification of the web scraped data.
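To make the classification task concrete, the sketch below trains a tiny multinomial naive Bayes classifier from scratch to assign product descriptions to expenditure classes. The training examples and class labels are invented for illustration, and the ABS's actual machine learning approach is not specified here; a real system would use a far larger training set and a production-grade library.

```python
from collections import Counter, defaultdict
import math

# Tiny labelled sample mapping product descriptions to hypothetical
# CPI expenditure classes; real training sets would be far larger.
training = [
    ("cabernet sauvignon 750ml bottle", "alcohol"),
    ("pale ale 6 pack cans", "alcohol"),
    ("mens cotton t-shirt", "clothing"),
    ("womens denim jeans", "clothing"),
]

# Count word frequencies per class for a multinomial naive Bayes model.
word_counts = defaultdict(Counter)   # class -> word frequencies
class_counts = Counter()             # class -> number of examples
vocab = set()
for text, label in training:
    class_counts[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def classify(text):
    """Return the class with the highest log posterior probability."""
    words = text.split()
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in words:
            # Add-one (Laplace) smoothing so unseen words do not
            # zero out an otherwise plausible class.
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("shiraz 750ml bottle"))  # -> "alcohol"
```

Even though "shiraz" never appears in the training data, the shared words "750ml" and "bottle" are enough to pull the description towards the alcohol class, which is the property that makes this style of classifier useful for noisy scraped product names.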
