ANALYSING VERY LARGE DATASETS
Government agencies and businesses are accumulating large databanks that potentially have considerable value for statistical purposes. The ABS has initiated projects to explore and exploit that potential:
- We are negotiating protocols with government agencies to set standards for the storage, vetting and documentation of administrative data.
- We are considering the implications of electronic commerce and business-to-business data exchange for our future data capture and estimation systems.
- We are exploring the statistical application of particular datasets, such as the Australian Taxation Office's Business Activity Statement and the scanner data collected at supermarket checkouts.
Over the years, Methodology Division and other parts of the ABS have developed a large array of mathematical tools and computer software to help us analyse datasets collected through the bureau's own censuses and sample surveys. The question arises, however, whether those tools and software will remain appropriate when we must deal with very large "by-product" datasets.
- Can the statistical analysis software we use today cope with very large (and rapidly growing) datasets? If not, what new computing tools do we need to store, transport, browse and transform these datasets?
- How might our traditional models have to be changed to deal with datasets that have not been assembled using ABS classifications, variable definitions and collection methods? What statistical methods do we need to navigate and manipulate these datasets? What methods do we need to assess their quality?
- How might our research strategies have to change? Might we do the bulk of our exploratory analyses on sampled datasets, then validate our preferred or final model against the full dataset? Can emerging techniques for data mining and data visualisation help us?
During 2000 and 2001, a group of staff from Methodology Division, ESG Strategic Development Section and Technology Applications Branch are thinking about these questions. We shall be contacting other statistical agencies and owners of large datasets, reviewing recent software developments, and ransacking the statistical literature for techniques and tools that look promising.
For more information, please contact Ken Tallis on (02) 6252 7290.