1504.0 - Methodological News, Sep 2015  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 24/09/2015   

A New Analytical Platform to Explore Linked Data

The Emerging Data and Methods (EDM) section has recently completed research work on identifying true and spurious firm deaths using a combination of traditional and new techniques. In particular, a prototype Graphically Linked Information Discovery Environment (GLIDE) was created from tax data sources and ABS Business Register data using a network-oriented Semantic Web approach. This enabled the application of multilevel modelling and Bayesian Network methods to the analysis of the network of employer-employee interactions among firms and people. The results of this work have been released as an ABS research paper, A New Analytical Platform to Explore Linked Data, available from the website.

The Semantic Web framework provides an alternative approach to data representation, linking and retrieval that can unlock the full potential of interconnected and multi-dimensional datasets. Instead of organising datasets in a structured row-column tabular form, the Semantic Web approach models information in the form of a network of entities and relationships. The relationships are given strong computable semantics by precisely specifying their logical properties in a machine-interpretable format.
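As a rough illustration of this idea, the sketch below stores information as subject-predicate-object triples and attaches one machine-interpretable property (a declared inverse) to a relationship. It is a minimal sketch only: the identifiers, the predicate names employs and employedBy, and the helper functions are hypothetical and are not taken from GLIDE or from any ABS dataset.

```python
# Hypothetical entities and relationships, stored as (subject, predicate, object).
TRIPLES = {
    ("firm:A", "employs", "person:1"),
    ("firm:A", "employs", "person:2"),
    ("firm:B", "employs", "person:2"),
}

# Machine-interpretable semantics (assumed for illustration):
# "employs" is declared to be the inverse of "employedBy".
INVERSE_OF = {"employs": "employedBy"}


def infer_inverses(triples, inverse_of):
    """Return the triples plus those entailed by the declared inverse properties."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate in inverse_of:
            inferred.add((obj, inverse_of[predicate], subject))
    return inferred


def objects(triples, subject, predicate):
    """List every object linked to `subject` through `predicate`, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


GRAPH = infer_inverses(TRIPLES, INVERSE_OF)

# The query runs over a relationship that was never stored explicitly;
# it exists only because the inverse property was declared.
EMPLOYERS_OF_PERSON_2 = objects(GRAPH, "person:2", "employedBy")
print(EMPLOYERS_OF_PERSON_2)
```

The point of the sketch is that the question "who employs person 2?" is answered from the declared logic of the relationship rather than from a pre-built table keyed on people.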

The Semantic Web approach opens up new avenues of data exploration, visualisation and network analysis. One example of this has been demonstrated in the prototype GLIDE by using it to derive network statistics and create models to distinguish true firm deaths from spurious ones. The ABS has an established process for identifying firm exits, but is not able to distinguish the type of exit – whether it is due to restructuring, merger/takeover or a genuine death.
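The article does not list the network statistics that were derived, so the following is an assumed example of the kind of measure an employer-employee network makes available: the largest share of an exiting firm's staff who reappear together at a single other firm. A high share might suggest a restructure or takeover rather than a genuine death. The data, the firm and person labels, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical employer-employee links in two periods: person -> firm.
BEFORE = {"p1": "firmA", "p2": "firmA", "p3": "firmA", "p4": "firmB"}
AFTER = {"p1": "firmC", "p2": "firmC", "p3": "firmD", "p4": "firmB"}


def max_successor_share(exiting_firm, before, after):
    """Largest share of the exiting firm's staff who reappear at one other firm."""
    staff = [person for person, firm in before.items() if firm == exiting_firm]
    if not staff:
        return 0.0
    # Count where each former employee is found in the later period.
    destinations = Counter(
        after[person]
        for person in staff
        if person in after and after[person] != exiting_firm
    )
    if not destinations:
        return 0.0
    return max(destinations.values()) / len(staff)


# firmA no longer appears in AFTER; two of its three staff moved to firmC.
SHARE_FIRM_A = max_successor_share("firmA", BEFORE, AFTER)
print(round(SHARE_FIRM_A, 3))
```

A statistic of this kind could then enter a model as a firm-level covariate alongside the usual business register variables.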

Both multilevel logistic regression and Bayesian Network (BN) models were used to distinguish true and spurious firm deaths. Multilevel models were developed both with and without network statistics, with the data partitioned into modelling (training) and prediction (test) subsets to assess the quality of out-of-sample predictions from the models. The model with network statistics performed substantially better (95% accuracy vs 74% accuracy). Significant variables were then incorporated in a BN model. This approach took account of the relationships between all the variables, achieved similar prediction accuracy with a subset of variables, and also handled observations with missing values in the test data. The intention was not to compare the two methods on prediction outcomes, but to build on the multilevel modelling results to provide a statistical framework for the BN model.
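To illustrate how a Bayesian Network can still produce a prediction when a variable is missing, the sketch below computes the posterior probability of a true death by summing out whatever is unobserved. It is a deliberately simplified sketch, not the model from the research paper: the structure assumes the two indicators are conditionally independent given the outcome, and the variable names and probabilities are invented for illustration.

```python
# Hypothetical prior and conditional probability tables; illustrative numbers only.
P_TRUE_DEATH = 0.5

# P(indicator is True | true_death), keyed by the value of true_death.
CPT = {
    "high_successor_share": {True: 0.1, False: 0.8},
    "tax_registration_cancelled": {True: 0.9, False: 0.6},
}


def posterior_true_death(evidence):
    """Return P(true_death | evidence), summing out indicators not in `evidence`."""
    scores = {}
    for death in (True, False):
        prior = P_TRUE_DEATH if death else 1.0 - P_TRUE_DEATH
        likelihood = 1.0
        for name, table in CPT.items():
            value = evidence.get(name)
            if value is None:
                # Missing indicator: summing over its two values gives a factor of 1.
                continue
            p_true = table[death]
            likelihood *= p_true if value else 1.0 - p_true
        scores[death] = prior * likelihood
    return scores[True] / (scores[True] + scores[False])


# Observation with both indicators present.
FULL = posterior_true_death(
    {"high_successor_share": True, "tax_registration_cancelled": True}
)
# Observation where the second indicator is missing.
PARTIAL = posterior_true_death({"high_successor_share": True})
print(round(FULL, 3), round(PARTIAL, 3))
```

No record is dropped and no value is imputed; the missing indicator simply contributes no evidence, which is one reason a BN can score test observations that a regression model would have to exclude.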

The analytical results show that statistical production needs to account for spurious firm deaths. Failing to do so can result in continuing enterprises being incorrectly classified as deaths, which degrades both the quality of the survey frame and the accuracy of the resulting statistics. The conclusion is that the Semantic Web is a useful approach for statistical purposes, and that network analysis can be used to distinguish true firm deaths from spurious ones effectively.

Further Information

For more information, please contact Andreas Mayer or Joseph Chien (methodology@abs.gov.au).
