1160.0 - ABS Confidentiality Series, Aug 2017  
Latest ISSUE Released at 11:30 AM (CANBERRA TIME) 23/08/2017  First Issue

UNDERSTANDING RE-IDENTIFICATION


This page contains the following:
What is re-identification?
Re-identification risks in today’s context
Managing the risk of re-identification
Re-identification in aggregate data
Re-identification in microdata

WHAT IS RE-IDENTIFICATION?

Re-identification is the act of determining the identity of a person or organisation even though directly identifying information has been removed. It may be possible using other publicly or privately held information about the individual or organisation. It is a type of disclosure, or breach of confidentiality, that can occur when someone has access to either aggregate data or microdata (unit record data). While this part of the Series focuses on re-identification, the discussion also applies to the other main disclosure risk, attribute disclosure. This is because the risk of re-identifying an individual is likely to increase if an attribute about them is revealed, for example a particular level of income that is common to a group of 15-18 year olds.


RE-IDENTIFICATION RISKS IN TODAY'S CONTEXT

As ever more data become available, there is growing interest in how data, and especially unit record data, can be accessed securely and used effectively for research and policy making. Providing data in open environments is an important part of the Australian Government Public Data Policy Statement, but open data may not always be the most appropriate manner for providing data for research (particularly when the requirements for utility conflict with confidentiality).

For datasets that cannot be made openly accessible, strategies to manage confidentiality and disclosure risks when providing access to the data should consider:
  • How the dataset could be used to re-identify an individual or organisation
  • Whether information available elsewhere could be combined with the dataset to re-identify a person or organisation.

Emerging data sources and analytical methods may also increase disclosure risk, and this risk needs to be carefully managed. Some of these data sources and methods are explained below.

Administrative data

Administrative datasets contain direct identifiers such as name, address and Tax File Number that allow an agency to identify the people accessing a government service or program. Because this information is usually collected from everyone who accesses a service, these datasets may cover a large proportion of the population. Even when the directly identifying information is removed, people are still at higher risk of being re-identified from other information held about them when they are known to be in a dataset, or when the dataset is large.

Integrated datasets

Multiple information sources about people and organisations can be combined (i.e. integrated), forming rich and deep repositories of information and presenting opportunities for detailed analysis. Integrated datasets also present re-identification risks similar to those of administrative datasets, and the larger range of information held for each record may increase those risks.
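The effect of a larger range of information per record can be illustrated with a minimal sketch. The records, field names and the helper `count_unique_records` below are hypothetical and are not drawn from any ABS dataset or method; the sketch simply counts how many records become unique as more characteristics are considered together.

```python
from collections import Counter

# Hypothetical de-identified records: no names, only characteristics.
records = [
    {"age_group": "30-39", "postcode": "2600", "occupation": "nurse"},
    {"age_group": "30-39", "postcode": "2600", "occupation": "teacher"},
    {"age_group": "30-39", "postcode": "2601", "occupation": "nurse"},
    {"age_group": "40-49", "postcode": "2600", "occupation": "nurse"},
    {"age_group": "40-49", "postcode": "2600", "occupation": "nurse"},
]


def count_unique_records(rows, keys):
    """Count rows whose combination of values on `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)


# Each additional characteristic can only split groups further,
# so the number of unique (and therefore riskier) records never falls.
unique_by_age = count_unique_records(records, ["age_group"])
unique_by_age_postcode = count_unique_records(records, ["age_group", "postcode"])
unique_by_all = count_unique_records(
    records, ["age_group", "postcode", "occupation"]
)
```

In this toy data no record is unique on age group alone, one is unique on age group and postcode, and three are unique once occupation is added. A unique record is not automatically re-identifiable, but it is a common starting point for assessing risk.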

Customer information

Businesses collect customer information through registration processes and reward schemes. As a result, they hold databases containing detailed information on user characteristics and behaviour. Knowledge of these characteristics may be combined with information in a released dataset to re-identify an individual or business.

Social media

Because many people are willing to share their private information for social purposes, vast and increasing amounts of personal information are available online. This publicly available information may be combined with information in a released dataset to re-identify an individual or business.

Big Data analytics

While new technologies make it possible to produce and store vast amounts of transactional data, advanced techniques also enable Big Data to be summarised, analysed and presented in new ways. Computer systems are increasingly able to draw together disparate data to discover patterns and trends. Research is being conducted into how new technologies can also create modern data treatment processes that match the scale of Big Data and balance the dual goals of privacy protection and analytical utility.

Data custodians have a responsibility (ethically as well as legally under the Privacy Act and/or other legislation) to actively manage the re-identification risks of their data collections. For further information, see Part 1: What is confidentiality and why is it important?


MANAGING THE RISK OF RE-IDENTIFICATION

Re-identification may occur through a malicious attack (where a user consciously tries to determine the identity of an individual or organisation) or it may occur spontaneously (where a user inadvertently thinks they recognise an individual or organisation without a deliberate attempt to identify them). As the amount of data collected and released by government increases and technologies advance, re-identification risk management should be an iterative process of assessment and evaluation.

Two broad approaches exist for managing re-identification risks:
  • Control the context of the data release.
  • Treat the data.

These approaches are not mutually exclusive; they are complementary. Controlling the release context is increasingly important when managing re-identification risks, as it allows more detailed data to be made available to approved researchers in a safe manner. Decisions about the level of data treatment required can only be made after determining the release context.

The release context comprises:
  • The audience who will have access to the data.
  • The purpose for which the data will be used.
  • The release environment.

The level of data treatment appropriate for authorised access in a controlled environment is unlikely to be sufficient for open and unrestricted public access. If one or more aspects of the context change, the disclosure risks should be reassessed to ensure data subjects remain unlikely to be re-identified.

For further information, see Part 3: Managing the risk of disclosure: the Five Safes Framework.


RE-IDENTIFICATION IN AGGREGATE DATA

There can be a risk of disclosure even though data are aggregated (i.e. grouped into categories or with combined values). This is because publicly or privately held information may be used to identify one or more contributors to a cell in a table.

Established techniques such as cell suppression and data perturbation exist to protect the confidentiality of aggregate (or tabular) data and prevent re-identification. However, with the increased volume of aggregate data available through electronic channels (e.g. machine-to-machine distribution) and at lower levels of geography, the risk of re-identification is increased and poses challenges for data custodians.

Although commonly used by many agencies, the application of cell suppression may be insufficient to prevent re-identification. As a response to this challenge, the ABS' TableBuilder service applies a perturbation algorithm to automatically protect privacy in user-specified tables. This perturbation algorithm leads to some loss of utility, but maintains a very high level of confidentiality.
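A minimal sketch can show why suppression alone may fall short and what perturbation does instead. Everything here is an illustrative assumption: the table, the threshold of 3, and the helpers `suppress_small_cells` and `perturb_counts`. The perturbation shown is simple random rounding to a multiple of 3; it is not the algorithm used by TableBuilder.

```python
import random

SUPPRESSION_THRESHOLD = 3  # hypothetical minimum publishable count


def suppress_small_cells(counts, threshold=SUPPRESSION_THRESHOLD):
    """Replace counts below the threshold with None (i.e. not published)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def perturb_counts(counts, base=3, seed=None):
    """Randomly round each count to an adjacent multiple of `base`."""
    rng = random.Random(seed)
    perturbed = {}
    for cell, n in counts.items():
        lower = (n // base) * base
        remainder = n - lower
        if remainder == 0:
            perturbed[cell] = n
        else:
            # Round up with probability remainder/base, so the
            # expected value of the published count equals the true count.
            round_up = rng.random() < remainder / base
            perturbed[cell] = lower + base if round_up else lower
    return perturbed


# Hypothetical table of counts by region, published with its total.
counts = {"Region A": 41, "Region B": 2, "Region C": 57}
published_total = sum(counts.values())

suppressed = suppress_small_cells(counts)

# Differencing: the total minus the published cells reveals the suppressed cell.
recovered = published_total - sum(n for n in suppressed.values() if n is not None)

perturbed = perturb_counts(counts, seed=1)
```

Here the suppressed count for Region B can be recovered exactly by subtracting the published cells from the total, which is why suppression usually needs secondary suppression or another treatment alongside it. With perturbation every cell is published, but each value is only within 2 of the true count, so small cells cannot be read off with certainty.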

For more about these disclosure risks and mitigation techniques see Part 4: Managing the risk of disclosure: treating aggregate data.


RE-IDENTIFICATION IN MICRODATA

Preventing the re-identification of people or organisations from microdata (i.e. unit record data) requires one or both of the following:
  • Controlling the context
  • Treating the data.

Here the context refers to the manner in which data are released (on a continuum ranging from open data to highly controlled situations such as access in a locked room). At a minimum, data treatment means removing direct identifiers such as name and address. In most cases it should also involve applying further statistical treatment, depending on the release context. For open data, appropriate and sufficient data treatment will eliminate the need to control the context (but this will be at the expense of data utility).
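As a rough illustration of the two levels of treatment described above, the following sketch removes direct identifiers and then applies one simple further treatment, replacing exact age with a ten-year band. The field names, the banding width and the helpers `age_band` and `treat_record` are assumptions made for illustration; the treatment actually required depends on the release context.

```python
# Hypothetical field names for direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "tax_file_number"}


def age_band(age, width=10):
    """Coarsen an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def treat_record(row):
    """Drop direct identifiers, then replace exact age with an age band."""
    treated = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in treated:
        treated["age_band"] = age_band(treated.pop("age"))
    return treated


raw = {
    "name": "Person X",
    "address": "1 Example St",
    "tax_file_number": "000000000",
    "age": 37,
    "occupation": "nurse",
}
treated = treat_record(raw)
```

The treated record keeps only the occupation and an age band of "30-39". Coarsening in this way reduces detail, which is the loss of utility that heavier treatment brings.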

The following factors should be considered when deciding whether and under what contextual controls data will be released.

Private knowledge

In many cases, users looking at a dataset are likely to possess private information about individuals or organisations represented in it (e.g. a neighbour or family member). This private information could enable them to re-identify someone in the dataset.

Strategies to manage this risk include:
  • Releasing a sample, rather than the entire dataset.
  • Providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation.

Public knowledge

Here the user draws on publicly available information (e.g. about a well-known person or business) when examining a dataset. For example, if a dataset containing information on businesses with very high turnover is released (even to a restricted group of researchers) the researchers may be able to re-identify large public companies that hold monopolies in certain industries.

Strategies to manage this risk include:
  • Releasing a sample, rather than the entire dataset.
  • Providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation.
  • Modifying the data to mask high-profile publicly-known individuals or organisations.

List matching

List matching refers to a user linking records in a dataset with information from other datasets. This is done by either matching common identifiers or comparing corresponding characteristics that are common to both datasets. There is a potentially increased risk of re-identification simply because the combined data increases the amount of detail available for each unit record (e.g. a person or organisation).
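Matching on corresponding characteristics can be sketched as follows. Both lists, the choice of matching fields in `MATCH_KEYS` and the helper `link_records` are hypothetical; real linkage work typically also handles inexact matches and data errors, which this sketch ignores.

```python
# Hypothetical released microdata: direct identifiers already removed.
released = [
    {"record_id": "r1", "birth_year": 1975, "postcode": "2600", "sex": "F", "income": 92000},
    {"record_id": "r2", "birth_year": 1982, "postcode": "2601", "sex": "M", "income": 54000},
    {"record_id": "r3", "birth_year": 1982, "postcode": "2601", "sex": "M", "income": 61000},
]

# Hypothetical external list that does carry names.
external = [
    {"name": "Person X", "birth_year": 1975, "postcode": "2600", "sex": "F"},
    {"name": "Person Y", "birth_year": 1982, "postcode": "2601", "sex": "M"},
]

MATCH_KEYS = ("birth_year", "postcode", "sex")  # characteristics common to both lists


def link_records(released_rows, external_rows, keys=MATCH_KEYS):
    """Return, for each named person, the released records sharing their characteristics."""
    index = {}
    for row in released_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    links = {}
    for person in external_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        links[person["name"]] = [r["record_id"] for r in candidates]
    return links


links = link_records(released, external)

# A person matching exactly one released record is at the greatest risk.
unique_matches = {name: ids[0] for name, ids in links.items() if len(ids) == 1}
```

In this toy example "Person X" matches a single released record, so their income would be exposed, while "Person Y" matches two records and cannot be singled out from these characteristics alone.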

Strategies to manage this risk include:
  • Using secure data facilities to control which datasets are available to authorised researchers at any one time.
  • Extracting subsets of the microdata to provide users with only the data they require.
  • Making identifiers placed on records unique to each published dataset wherever possible, particularly with open data.
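One possible way to make record identifiers unique to each published dataset is to derive them with a keyed hash and a separate secret key per release. This is a sketch under stated assumptions, not a documented ABS practice: the function `dataset_specific_id`, the key values and the 16-character truncation are all illustrative.

```python
import hashlib
import hmac


def dataset_specific_id(internal_id: str, dataset_key: bytes) -> str:
    """Derive a published identifier that differs between datasets.

    The same internal identifier gives unrelated values under different
    keys, so published identifiers cannot be used to join two releases.
    """
    digest = hmac.new(dataset_key, internal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical keys: in practice each would be a long random secret
# held by the data custodian and never published.
key_dataset_a = b"secret-key-for-dataset-A"
key_dataset_b = b"secret-key-for-dataset-B"

id_in_a = dataset_specific_id("person-001", key_dataset_a)
id_in_b = dataset_specific_id("person-001", key_dataset_b)
```

The identifier is stable within a dataset, so records for the same unit can still be grouped there, but `id_in_a` and `id_in_b` cannot be matched to each other without the keys. This removes only the common-identifier route to list matching; matching on shared characteristics still needs to be managed separately.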

Matching characteristics that are common to datasets for linking purposes is undertaken legitimately as part of securely managed data integration processes. The ABS, the Australian Institute of Health and Welfare (AIHW) and the Australian Institute of Family Studies (AIFS) are formally accredited Commonwealth Data Integrating Authorities. Bringing data together in this way is an increasingly important method of extending and enhancing research. Accredited Data Integrating Authorities have procedures and controls in place to perform this linking function safely.

For more about microdata disclosure risks and mitigation techniques see Part 5: Managing the risk of disclosure: treating microdata.