BIG DATA PLAYS A BIG ROLE IN THE FUTURE OF STATISTICS
The ease of access to internet-based services has fuelled explosive growth in human interaction in the digital world. A large part of our everyday lives is now spent online, for business, education, social contact or just entertainment.
In addition, a web of connected digital devices is emerging that operates independently of personal interaction. These devices - such as mobile phones, smart energy meters, traffic flow meters or point-of-sale scanners - are all continually producing data snapshots as part of their ongoing operation, making them all sources of potentially new and highly detailed datasets.
Much of this human- and machine-generated data - collectively referred to as 'big data' - is now accessible to national statistical agencies. It can be assembled and analysed to create a richer, more dynamic and better focused statistical picture of society, the economy and the environment.
When combined with traditional data sources - from surveys or existing administrative processes - these new big data sources enable new forms of statistical analysis, with potential benefits for government service delivery and the development of well-informed policy.
Government policy development and evaluation is increasingly hindered by the persistence of what have been called 'wicked problems'. These are social, economic or environmental issues that are difficult to define clearly - definitions may even change over time - and involve complex and often hidden interdependencies among many possible causes.
No one statistical dataset is likely to encompass all the information relevant to a particular wicked problem. A fuller understanding may require a unique combination of sources - drawing from government and private sector data, from surveys, administrative collections and data created by digital devices.
In this context, the analysis of big data promises new insights into long-standing policy problems, and may even provide the means to reduce the cost and improve the timeliness of mainstream statistical products.
The whole concept of big data is relatively new, and many of the statistical techniques required to make the best use of it are still being developed and tested by national statistical organisations and academia.
The ABS is already developing methodological and technological foundations for what it believes will be a new paradigm in official statistics, drawing heavily on recent advances in mathematical and computer sciences and underpinned by a number of significant advances in the representation, storage, transformation and integration of data.
There is a range of technical, legislative and privacy challenges to address. Not all government organisations may have appropriate legislation in place to enable the sharing of administrative datasets; private organisations may regard the data they hold as commercially sensitive and be unwilling to share it; and - most importantly - considerable methodological work may be required to create processes that maintain the confidentiality and privacy of individuals when a number of datasets are combined.
In order to demonstrate and validate its research into the possibilities of big data, the ABS has produced some prototype solutions that apply new methods and technologies to real business problems. These have been developed to meet the analytical and computational demands of big data.
The centrepiece is the Graphically Linked Information Discovery Environment (GLIDE) - an integrated platform for the linking and analysis of multiple datasets from diverse sources. The basis of the approach used in GLIDE is that statistical concepts and data are represented and stored in the form of an 'information network'. This structure can be drawn upon to retrieve and visualise both unit-level and aggregate data, and it allows users to explore a linked dataset from different analytical perspectives - including structural, relational, spatial, temporal and schematic.
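To make the idea of an 'information network' more concrete, the following is a minimal sketch in Python of a graph in which entities are nodes and relationships are labelled edges. It is not GLIDE's actual design, which is not described here; the class name `InformationNetwork`, its methods, and the node identifiers and attributes are all assumptions made for illustration. It shows how one structure can answer both a unit-level question (who does this firm employ?) and an aggregate one (how many jobs per industry?).

```python
from collections import defaultdict


class InformationNetwork:
    """Toy graph: nodes carry attributes, edges carry a relation label.

    Hypothetical illustration only, not the GLIDE implementation.
    """

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbours(self, source, relation):
        """Unit-level retrieval: targets linked to `source` by `relation`."""
        return [t for r, t in self.edges.get(source, []) if r == relation]

    def count_by(self, relation, attr):
        """Aggregate retrieval: count edges of one relation,
        grouped by an attribute of the source node."""
        totals = defaultdict(int)
        for source, links in self.edges.items():
            for r, _ in links:
                if r == relation:
                    totals[self.nodes[source][attr]] += 1
        return dict(totals)


# Hypothetical example data
net = InformationNetwork()
net.add_node("firm:A", kind="business", industry="Retail")
net.add_node("firm:B", kind="business", industry="Mining")
net.add_node("person:1", kind="person")
net.add_node("person:2", kind="person")
net.add_edge("firm:A", "employs", "person:1")
net.add_edge("firm:A", "employs", "person:2")
net.add_edge("firm:B", "employs", "person:2")

employees_of_a = net.neighbours("firm:A", "employs")   # unit-level view
jobs_by_industry = net.count_by("employs", "industry")  # aggregate view
```

Because relationships are stored explicitly rather than implied by table layout, new relations or node types can be added without restructuring what is already there, which is one plausible reason for favouring a network representation when sources are diverse.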
GLIDE currently hosts a prototype dataset derived from linked administrative data sources relating to businesses and persons held by the ABS and the Australian Taxation Office, including:
- Personal Income Tax
- Business Income Tax
- Pay As You Go
- Business Activity Statements
- ABS Business Register.
Existing data about employers is mostly collected from ABS business surveys and is generally unconnected to data about employees, which is usually collected through ABS household surveys. The ability to identify the employer-employee relationship in a statistical dataset greatly enhances the type of analysis that can be conducted.
The prototype linked dataset offers the potential to explore a range of business and labour force issues, such as:
- how employer and employee characteristics affect firm productivity
- the impact of firm births and deaths on economic activity
- which industries are creating and shedding jobs
- how many people hold multiple jobs, and in which types of jobs and employers multiple job-holding is concentrated
- which industries mainly provide short-term employment
- firm turnover at regional levels.
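As a rough illustration of how an employer-employee link supports questions like those listed above, the sketch below joins job records to employer records on a shared identifier and then derives two simple measures. The record layouts, the field names `person_id`, `employer_id` and `industry`, and the data values are all invented for the example; they do not reflect the structure of the actual administrative sources.

```python
from collections import Counter, defaultdict

# Hypothetical employer records (business-side source).
employers = {
    "E1": {"industry": "Retail"},
    "E2": {"industry": "Hospitality"},
    "E3": {"industry": "Mining"},
}

# Hypothetical job records tying a person to an employer (person-side source).
jobs = [
    {"person_id": "P1", "employer_id": "E1"},
    {"person_id": "P1", "employer_id": "E2"},
    {"person_id": "P2", "employer_id": "E2"},
    {"person_id": "P3", "employer_id": "E3"},
]


def link_jobs(jobs, employers):
    """Attach employer attributes to each job via the shared employer_id."""
    linked = []
    for job in jobs:
        employer = employers.get(job["employer_id"])
        if employer is None:
            continue  # unmatched record; a real process would report these
        linked.append({**job, **employer})
    return linked


def multiple_job_holders(linked):
    """Return people with more than one job, with the industries involved."""
    by_person = defaultdict(list)
    for row in linked:
        by_person[row["person_id"]].append(row["industry"])
    return {p: sorted(inds) for p, inds in by_person.items() if len(inds) > 1}


linked = link_jobs(jobs, employers)
holders = multiple_job_holders(linked)
jobs_per_industry = Counter(row["industry"] for row in linked)
```

Neither measure could be computed from the employer records or the job records alone; both depend on the link between them.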
The ABS has developed GLIDE as a proof-of-concept implementation of the sorts of technical components needed to represent and work with multiple data sources. Following the development of a business strategy, the intention is to expand it further into a robust production system that can support the exploratory analysis of large, complex, interconnected datasets. In addition, the ABS is investigating the use of ‘machine reasoning’ methods to mine such data automatically for new insights, and natural language processing to extract information efficiently from the content of text documents.
This prototype information discovery environment helps users find, manipulate and visualise data drawn from a number of sources, and the intention is to develop it further into a robust tool that can support a range of valid statistical analyses. The goal is a data framework that enables analysis from a number of different perspectives – spatial, temporal, relational, structural or schematic – in a form that can be adjusted and evolved as required.
As part of further developing the potential of big data, the ABS is also:
- trialling the use of the linked data standards developed by the World Wide Web Consortium (W3C)
- evaluating the use of the W3C vocabularies framework for developing prototype models of statistical units
- investigating the application of automatic ‘machine reasoning’ methods such as first-order logic
- looking at natural language processing as an aid to extracting information from datasets.
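The combination of linked data and machine reasoning mentioned above can be sketched very simply. In the example below, facts are held as subject-predicate-object triples, the basic shape used by W3C linked data standards, and a single if-then rule is applied repeatedly until nothing new can be derived. This is only a small fragment of first-order logic, written in plain Python rather than with any W3C tooling, and every identifier in it (`employedBy`, `inIndustry`, `worksIn`, and the entity names) is hypothetical.

```python
# Facts as (subject, predicate, object) triples, in the spirit of linked data.
# All identifiers below are hypothetical.
facts = {
    ("person:1", "employedBy", "firm:A"),
    ("firm:A", "inIndustry", "industry:Retail"),
    ("person:2", "employedBy", "firm:B"),
    ("firm:B", "inIndustry", "industry:Mining"),
}


def infer_works_in(facts):
    """Apply one rule until no new facts appear (forward chaining):

    employedBy(p, f) and inIndustry(f, i)  =>  worksIn(p, i)
    """
    known = set(facts)
    while True:
        new = {
            (p, "worksIn", i)
            for (p, rel1, f) in known if rel1 == "employedBy"
            for (f2, rel2, i) in known if rel2 == "inIndustry" and f2 == f
        } - known
        if not new:
            return known
        known |= new


closure = infer_works_in(facts)
derived = sorted(t for t in closure if t[1] == "worksIn")
```

The derived `worksIn` facts were never stated in the input; they follow from the rule. A production reasoner would handle many rules and far larger fact sets, but the principle of deriving new statements from existing ones is the same.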
The ABS acknowledges the ongoing support and assistance of the Australian Taxation Office in the creation of the linked employer-employee dataset.