Data confidentiality guide

Learn about the Five Safes framework, confidentiality techniques and confidentialising your own data

Released
8/11/2021

Safely releasing valuable data

Data is a strategic and valuable resource "for growing the economy, improving service delivery and transforming policy outcomes for the Nation" (Australian Government Public Data Policy Statement).

It is important that organisations that collect data, including the Australian Bureau of Statistics (ABS), make their data holdings widely available. This includes releasing aggregate and unit record datasets (microdata) in ways that optimise their usefulness while still protecting the secrecy and privacy of those who have provided the information, as required by Australian legislation. By law, the ABS must disseminate official statistics while making sure that information is not released in a way that is likely to enable individuals or organisations to be identified. The Australian Bureau of Statistics has safely and effectively made its data holdings available for over 110 years.

This guide focuses on methods and management techniques for securely releasing data while maintaining the confidentiality of the individuals or organisations to which the information relates. If you have questions or feedback, email microdata.access@abs.gov.au.

What is confidentiality

Confidentiality is protecting the secrecy and privacy of information collected from individuals and organisations.

When information is made available to researchers, it needs to be done in a way that is unlikely to allow individuals or organisations to be identified. Maintaining confidentiality is both a legal and ethical obligation. Failure to maintain confidentiality is called a confidentiality breach, or disclosure.

We focus on managing the risks of two main types of breach:

  • re-identification: where the identity of a person or organisation is determined using other public or privately held information about them
  • attribute disclosure: where a characteristic of an individual or organisation is determined without formally re-identifying them

Maintaining confidentiality (protecting secrecy, privacy and identity) is essential to preserving public trust in data custodians (agencies that collect, manage and release data). All data custodians must carefully consider confidentiality requirements before the release of any data, whether aggregate or microdata.

Obligation to maintain confidentiality

Australian Government agencies collect data from individuals and organisations as a standard part of their activities. There is a legal and ethical responsibility for agencies to respect and maintain the secrecy, privacy and identity of those providing the information. 

In practice, this means implementing policies and procedures that address all aspects of data protection. Agencies should ensure that identifiable information:

  • is not released publicly (except where allowed by legislation)
  • is maintained and accessed securely
  • is available only to approved people and on a need-to-know basis

Legal obligations

There is a public expectation that agencies treat information about individuals and organisations with respect and manage it appropriately.

The obligation to keep confidential the identities and characteristics of people and organisations is primarily reflected in laws governing the collection, use and dissemination of information, such as the Census and Statistics Act 1905 and the Privacy Act 1988.

These and other pieces of legislation have different terminology for the process of making data available in a safe manner. However, they all require reasonable steps to be taken to limit the likelihood of an individual person or organisation being re-identified in any data that is released. Penalties apply if the secrecy provisions set out in these Acts are breached. For example, the Census and Statistics Act stipulates criminal penalties for enabling the likely identification of an individual or organisation.

Organisations may also have policies and principles that outline additional non-legislative requirements for maintaining confidentiality. In the government sector, these documents set standards for employee behaviour and provide advice on the protocols and procedures for managing information safely. For example, the APS Values and Code of Conduct explain the high levels of ethical behaviour required of Commonwealth Government employees. Agencies planning to integrate datasets can find principle-based obligations in the High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes.

Privacy legislation

In Australia, data protections are recognised in the Privacy Act 1988. 

The Act sets out people's rights in relation to the collection, use, sharing and retention of information they provide to the Commonwealth. The Privacy Act also establishes the Australian Privacy Principles (APPs), which outline how most Australian Government agencies, all private sector and not-for-profit organisations with an annual turnover of more than $3 million, all private health service providers and some small businesses must treat personal information. Importantly, APP 6 limits the disclosure of personal information. Personal information is defined in s6(1) of the Privacy Act as:

"information or an opinion about an identified individual, or an individual who is reasonably identifiable: 

  • whether the information or opinion is true or not; and
  • whether the information or opinion is recorded in a material form or not."

Government agencies in the Northern Territory, ACT and most Australian states are bound by privacy legislation specific to their state or territory. Agencies in Western Australia are bound by the confidentiality provisions and privacy principles in the Freedom of Information Act 1992 (WA), while South Australia has an Information Privacy Principles Instruction administered by the Privacy Committee of South Australia. 

Contextual approach to confidentiality

Legislation enables data to be released as long as reasonable steps are taken to prevent re-identification, using a contextual approach to confidentiality. This means that as long as the practical result (of the processes applied) is that the confidentiality of individuals or organisations is not breached, then the legal and ethical requirements are satisfied. The processes used to achieve this are heavily dependent on the surrounding context (or manner) in which the data is released. In order to maintain confidentiality you must consider:

  • the environment into which the data will be released (such as a public website, a secure data laboratory)
  • the method and the degree of data treatment to be applied to prevent re-identification in that environment
  • the balance between adequately treating the data and ensuring its usefulness

The Privacy Act supports this contextual approach to maintaining the confidentiality of the data it protects (ie personal information). The notion of 'identifiability' is central to the operation of the Privacy Act, although there is no formal definition of when an individual is 'identifiable' or 'reasonably identifiable' in a dataset. The Office of the Australian Information Commissioner sets out a number of factors that organisations should consider when determining the identifiability of data they hold (De-identification and the Privacy Act) as well as providing guidance on 'what is personal information'. These resources show that determining whether any data subjects are 'reasonably identifiable' in a dataset requires a contextual consideration of the particular circumstances of the case, including:

  • the nature and amount of information
  • who will hold and have access to the information
  • the other information that is available to researchers (privately held or publicly available)
  • the practicality of using that information to identify an individual

In some cases, this contextual approach may mean that a focus on treating the data will be the only practical option (such as when data is made publicly available on a website). In other cases, controls on the environment in which data is to be accessed, used or released may play a larger role. Understanding this context informs decisions about the level of treatment required for a data release.

Other context controls could include:

  • establishing processes to approve researchers before they are granted access
  • ensuring the purpose for which data is used is appropriate/legal/ethical
  • providing a secure access environment
  • checking outputs to prevent disclosure in publicly released information

For example, the ABS applies this contextual approach to confidentiality using the Five Safes Framework in order to provide researchers with secure access to detailed microdata within the ABS DataLab. A similar approach is taken by the Sax Institute in their Secure Unified Research Environment (SURE).

Understanding re-identification

Re-identification in aggregate data and microdata, managing re-identification risk

Released
8/11/2021

What is re-identification

Re-identification occurs when the identity of a person or organisation is determined even though directly identifying information has been removed. This may be possible using other publicly or privately held information about the individual or organisation. This type of disclosure, or breach of confidentiality, can occur when someone has access to either aggregate data (such as tables) or microdata (unit record data). This section focuses on the risk of re-identification; the other main disclosure risk, attribute disclosure, is closely related because the risk of re-identifying an individual is likely to increase if an attribute about them is revealed, for example a particular level of income that is common to a group of 15-18 year olds.

It is important that data, and especially unit record data, can be accessed securely and used effectively for research and policy making. Providing data in open environments is an important part of the Australian Government Public Data Policy Statement. However, open data may not always be the most  appropriate manner for providing data for research, particularly when the requirements for utility conflict with confidentiality.

For datasets that cannot be made open and accessible, strategies to manage confidentiality and disclosure risks when providing access to the data should consider:

  • how the dataset could be used to re-identify an individual or organisation
  • whether information available elsewhere could be combined with the dataset to re-identify a person or organisation
     

Data sources and methods

Data sources and analytical methods may also increase disclosure risk, and this risk needs to be carefully managed.

Administrative data

  • contains direct identifiers such as name, address and Tax File Number that allow an agency to identify the people accessing a government service or program
  • is usually collected from everyone who accesses a service and may cover a large proportion of the population
  • even when the directly identifying information is removed, people are still at higher risk of being re-identified from other information held about them when they are known to be in a dataset, or when the dataset is large

Integrated datasets

  • multiple information sources about people and organisations can be combined (integrated), forming rich and deep repositories of information and presenting opportunities for detailed analysis
  • re-identification risks similar to administrative datasets (the larger range of information for each record may increase the risk of re-identification)

Customer information

  • businesses collect customer information through registration processes and reward schemes, holding databases containing detailed information on user characteristics and behaviour
  • knowledge of these characteristics may be combined with information in a released dataset to re-identify an individual or business

Social media

  • many people are willing to share their private information for social purposes, with vast and increasing amounts of personal information available online
  • publicly available information may be combined with information in a released dataset to re-identify an individual or business

Big Data analytics

  • while new technologies make it possible to produce and store vast amounts of transactional data, advanced techniques also enable Big Data to be summarised, analysed and presented in new ways
  • computer systems are increasingly able to draw together disparate data to discover patterns and trends
  • research is being conducted into how new technologies can also create modern data treatment processes that match the scale of Big Data and balance the dual goals of privacy protection and analytical utility

Data custodians have ethical and legal responsibilities to actively manage the re-identification risks of their data collections. 

Managing the risk of re-identification

Re-identification may occur through a deliberate attack (where a user consciously tries to determine the identity of an individual or organisation) or it may occur spontaneously (where a user inadvertently thinks they recognise an individual or organisation without a deliberate attempt to identify them). As the amount of data collected and released by government increases and technologies advance, re-identification risk management should be an iterative process of assessment and evaluation.

Two broad complementary approaches exist for managing re-identification risks:

  • control the context of the data release - important when managing re-identification risks as it allows for more detailed data to be made available to approved researchers in a safe manner
  • treat the data - decisions about the level of data treatment required can only be made after determining the release context

The release context includes:

  • the audience who will have access to the data
  • the purpose for which the data will be used
  • the release environment

The level of data treatment appropriate for authorised access in a controlled environment is unlikely to be sufficient for open and unrestricted public access. It should also be noted that if one or more aspects of the context changes, a reassessment of the disclosure risks should be performed in order to ensure data subjects remain unlikely to be re-identified.

Re-identification in aggregate data

There can be a risk of disclosure even though data is aggregated (grouped into categories or with combined values). This is because publicly or privately held information may be used to identify one or more contributors to a cell in a table. 

Established techniques such as cell suppression and data perturbation exist to protect the confidentiality of aggregate (or tabular) data and prevent re-identification. However, with the increased volume of aggregate data available through electronic channels (such as machine-to-machine distribution) and at finer levels of geography, the risk of re-identification is increased and poses challenges for data custodians.

Although commonly used by many agencies, the application of cell suppression may be insufficient to prevent re-identification. As a response to this challenge, the ABS' TableBuilder service applies a perturbation algorithm to automatically protect privacy in user-specified tables. This perturbation algorithm leads to some loss of utility, but maintains a very high level of confidentiality.

Re-identification in microdata

Preventing the re-identification of people or organisations in microdata requires one or both of the following:

  • controlling the context:
      • the manner in which data is released (on a continuum ranging from open data to highly controlled situations such as access in a locked room)
  • treating the data:
      • at a minimum, removing direct identifiers such as name and address
      • in most cases, applying further statistical treatment depending on the release context
      • for open data, appropriate and sufficient data treatment eliminates the need to control the context (but this is at the expense of data utility)

The following factors should be considered when deciding whether and under what contextual controls data will be released.

Private knowledge

Users looking at a dataset are likely to possess private information about individuals or organisations represented in the dataset (such as a neighbour or family member). In these cases, that private information could enable them to re-identify someone in the dataset.

Strategies to manage this risk include:

  • releasing a sample, rather than the entire dataset
  • providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation

Public knowledge

Users may draw on publicly available information (such as a well-known person or business) when examining a dataset. For example, if a dataset containing information on businesses with very high turnover is released (even to a restricted group of researchers) the researchers may be able to re-identify large public companies that hold monopolies in certain industries. 

Strategies to manage this risk include:

  • releasing a sample, rather than the entire dataset
  • providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation
  • modifying the data to mask high-profile publicly-known individuals or organisations

List matching

List matching refers to a user linking records in a dataset with information from other datasets. This is done by matching either identifiers or characteristics that are common to both datasets. There is a potentially increased risk of re-identification simply because the combined data increases the amount of detail available for each unit record (a person or organisation).

Strategies to manage this risk include:

  • using secure data facilities to control which datasets are available to authorised researchers at any one time
  • extracting subsets of the microdata to provide users with only the data they require
  • using unique randomised record identifiers for each published dataset

The process of matching characteristics that are common to datasets for linking purposes is undertaken legitimately as part of securely managed data integration processes. The ABS, Australian Institute for Health and Welfare (AIHW) and the Australian Institute of Family Studies (AIFS) are formally accredited Commonwealth Data Integrating Authorities. Bringing data together in this way is an important method of extending and enhancing research. Accredited Data Integrating Authorities have procedures and controls in place in order to perform this linking function safely.

Five Safes framework

Using safe people, projects, settings, data and output to balance disclosure risk and utility, with ABS Five Safes examples

Released
8/11/2021

Balancing disclosure risk and data utility

A key challenge for data custodians is to provide data with maximum utility for users but still maintain the confidentiality of the information. Every data release carries some risk of disclosure, so the benefits of each release (its utility or usefulness for research and statistical purposes) must substantially outweigh its risks and be clearly understood. This balancing of risk and utility is something everyone does on a daily basis (for example, when they choose to drive a car). Similarly, data custodians need to approach disclosure risk by managing it, rather than trying to eliminate it.

Confidentiality is breached when a person, group or an organisation is re-identified through a data release or when information can be attributed to them. The likelihood of this happening, or risk of disclosure, is not easily determined. Implicit in this is that the consequences of disclosure are always damaging (to some extent) to the individual or organisation. It is difficult to ascertain the degree of damage, mostly because people differ in the importance they place on information. What may be considered highly confidential to one person is of no consequence to another. The ABS assumes all information it collects to be potentially sensitive and manages it accordingly.

Managing disclosure risk becomes a question of assessing not only the data itself, but also the context in which the data is released. Once the context is clearly understood, it is much easier to determine how to protect against the threat of disclosure. The Five Safes framework provides a structure for assessing and managing disclosure risk that is appropriate to the intended data use. 

This framework has been adopted by the ABS and several other Australian government agencies, as well as by national statistical organisations such as the Office for National Statistics (UK) and Statistics New Zealand.

Five Safes framework

The Five Safes framework takes a multi-dimensional approach to managing disclosure risk. Each safe refers to an independent but related aspect of disclosure risk. The framework poses specific questions to help assess and describe each risk aspect (or safe) in a qualitative way. This allows data custodians to place appropriate controls, not just on the data itself, but on the manner in which data is accessed. The framework is designed to facilitate safe data release and prevent over-regulation.

The five elements of the framework are:

  • safe people
  • safe projects
  • safe settings
  • safe data
  • safe outputs

Safe people

Is the researcher appropriately authorised to access and use the data? 

By placing controls on the way data is accessed, the data custodian invests some responsibility in the researcher for preventing re-identification. Usually, as the detail in the data increases, so should the level of user authorisation required. 

Prerequisites for user authorisation usually include:

  • training in confidentiality and the conditions of data use
  • signing a legally binding undertaking to maintain data confidentiality 

By definition, a safe people assessment would not be required for open data (data that is released into the public domain with no restriction on use).

Safe projects

Is the data to be used for an appropriate purpose? 

Users wanting to access detailed microdata should be expected to explain the purpose of their project. For example, in order to access detailed microdata in the ABS DataLab, users must demonstrate to the ABS that their project has a statistical purpose and show it has:

  • a valid research aim
  • a public benefit
  • no capacity to be used for compliance or regulatory purposes

As with safe people, the need for a safe project assessment will depend on the context in which the data is accessed. It would not be required for open data.

Safe settings

Does the access environment prevent unauthorised use? 

The environment here can be considered in terms of both the IT and the physical environment. In data access contexts such as open data, safe settings are not required. At the other end of the spectrum however, sensitive data should only be accessed via secure research centres.

Secure research centres may have features such as:

  • a locked room requiring personal authentication
  • IT monitoring equipment
  • auditing and other supervision

Safe settings ensure that data access and use is occurring in a transparent way.

Safe data

Has appropriate and sufficient protection been applied to the data? 

At a minimum the removal of direct identifiers (such as name and address) must be applied to data before it is released. Further statistical disclosure controls should also be applied, depending on how the data will be released. Table 1 shows some of the statistical factors that should be considered when assessing disclosure risk.

Table 1: Factors to consider when assessing disclosure risk
Factor                           | Effect on disclosure risk
Data age                         | Older data is generally less risky
Sample data (e.g. a survey)      | Decreases risk
Population data (e.g. a census)  | Increases risk
Administrative data              | Increases risk
Longitudinal data                | Increases risk
Hierarchical data                | Increases risk
Sensitive data                   | Increases risk (sensitive data may be a more attractive target)
Data quality                     | Poor quality data may offer some protection
Microdata                        | Main risk: re-identification
Aggregate data                   | Main risks: attribute disclosure and disclosure from differencing
Key variables                    | The variables of most interest to users are usually the most disclosive

Source: UK Anonymisation Decision-making Framework

Safe outputs

Are the statistical results non-disclosive?

This is the final check on the information before it is made public, which aims to reduce the risk of disclosure to a minimum. All data made available outside the data custodian's IT environment must be checked for disclosure. For example in the ABS DataLab, statistical experts check all outputs for inadvertent disclosure before the data leaves the DataLab environment.

Examples from the ABS

The Five Safes framework provides a mechanism for data custodians to take necessary and reasonable steps to manage disclosure risk in their data releases. It broadens the approach to data confidentiality by considering not just the treatment of data, but also the manner and context in which data is released.

The safes are assessed independently, but also considered as a whole. They can be thought of as a series of adjustable levers or controls to effectively manage risk and maximise the usefulness of a data release. The degree to which each safe is controlled is critical to assessing the disclosure risk. Tightly controlling all five will be counterproductive because the restrictions applied will not produce a corresponding benefit (useful data).

In practice, the safe data part of the Five Safes should be addressed after the other four are considered. This is because the degree of data treatment required will become evident once it is clear who will be able to access the data, under what conditions, in what circumstances and how the resulting data will be protected in order to be made public. The process is likely to be iterative, as data treatment with a view to maintaining utility may necessitate reassessing one or more of the other four safes.

This table describes how the ABS applies the Five Safes framework to three different data access channels - open data, basic and detailed microdata files.

Table 2: Three examples of ABS application of the Five Safes framework
Safe people

  • Website or publication table (open data): No control necessary. Anyone may view the data online.
  • Basic microdata file (via direct download): Some control. Users must register to use the data and sign a Declaration of Use. Breaches may be subject to sanctions and/or legal proceedings.
  • Detailed microdata file (via ABS DataLab): High control. Users must undergo training, complete an authorisation process, and sign legally binding confidentiality undertakings and a compliance declaration. Breaches of protocols or disclosure of information may be subject to sanctions and/or legal proceedings.

Safe projects

  • Website or publication table (open data): No control necessary. Anyone can use the data for their own purposes.
  • Basic microdata file (via direct download): Some control. Users sign a declaration regarding the purpose for which they will use the data.
  • Detailed microdata file (via ABS DataLab): High control. Users must detail the purpose for which they will use the data. Purpose can be compared to what is actually produced (see Safe outputs).

Safe settings

  • Website or publication table (open data): No control necessary.
  • Basic microdata file (via direct download): Some control. Users are required to store the data securely and can work on the data in their own physical and IT environment.
  • Detailed microdata file (via ABS DataLab): High control. The DataLab is a secure, closed environment, accessed virtually or on-site, with secure login, auditing and monitoring capabilities. No data can be removed without first being checked by ABS staff.

Safe data

  • Website or publication table (open data): Very high control. The data is highly aggregated.
  • Basic microdata file (via direct download): High control. The data is treated by the ABS to ensure no individual is likely to be identified.
  • Detailed microdata file (via ABS DataLab): Appropriate control. Direct identifiers are removed and the data is further treated where appropriate. Appropriate control of the data optimises its usefulness for statistical and research purposes.

Safe outputs

  • Website or publication table (open data): Very high control. Every table is checked for disclosure before release (in an open data context, the data is the safe output).
  • Basic microdata file (via direct download): Some control. The output is technically controlled by the user, but the ABS provides guidelines or rules about what may be published or shared.
  • Detailed microdata file (via ABS DataLab): High control. All statistical outputs are assessed by the ABS for disclosure before being released to the user. The outputs may also be compared for consistency with the original project proposal.

 

In all three cases, applying any one safe in isolation is unlikely to provide an effective confidentiality solution. However, when all five safes are considered in combination, the overall disclosure risk becomes very low.

Tabular data is most effectively protected through safe data and safe outputs. 

When data is loaded into the user's own environment, some of the safes can be more effectively controlled than others. The data custodian cannot directly monitor how the data is used. However, the data custodian mitigates disclosure risk by directly protecting the data. The downside of this approach is that the data can lose some of its utility. Examples of these types of datasets include:

  • basic microdata files (produced by the ABS)
  • public use files (PUFs)

The treatment of the microdata files in the ABS DataLab effectively uses all five safes. Safe people, safe projects, safe settings, safe data and safe outputs are all controlled to mitigate the risk of disclosure. This allows appropriately approved researchers to work securely with highly detailed microdata.

Treating aggregate data

Disclosure risks for tables, identifying risky cells, and data treatment techniques

Released
8/11/2021

Tables and disclosure risks

Aggregate data is usually presented as tables, though it may also be presented as maps or graphs ('aggregate data' and 'tables' are often used interchangeably). Different types of tables can contain different disclosure risks. There are additional disclosure risks if users can access multiple tables containing common elements. Users could potentially re-identify a person by differencing the outputs of two separate tables to which the person contributes.

Frequency tables

Each cell in a table contains the number of contributors (such as individuals, households or organisations). Disclosures could be made from a table where one or more cells have a low count (a small number of contributors). Assessments of whether a cell's value is too low must be based on the underlying count of its contributors, not its weighted estimate.

Magnitude tables

Each cell contains summarising information about the contributors to that cell. This may be in the form of a total, mean, median, mode or range. An example of a magnitude table is one reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (such as one or two businesses with extremely high turnover) dominate a cell value.

Rules to identify at-risk cells

Some common rules can help identify which cells may constitute a disclosure risk. By using these rules, a data custodian makes an implicit decision that all cells which break a rule are deemed an unacceptable disclosure risk, while all other cells are deemed non-disclosive. Each rule below provides protections against attackers trying to re-identify or disclose an attribute about a contributor. They also mitigate attacks where one contributor tries to discover information about another.

Strict applications of these rules may decrease the data's usefulness. Data custodians need to set appropriate rule values that are informed by legislation and organisational policy, disclosure risk assessment and statistical theory. The advantage of using rules is that they are simple, clear, consistent, transparent and amenable to automation. They are ideally applied in situations where:

  • data is released in a similar manner (for example when the same kind of dataset is regularly released)
  • transparency is important to both data custodian and users
  • limited opportunity or requirement exists for engagement between data custodians and users

However, there may be situations where non-disclosive data is treated unnecessarily, or where untreated data is in fact disclosive.

Frequency rule

Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors to a cell. Application of this rule means that cells with counts below the threshold are defined as posing an unacceptable disclosure risk and need to be protected. However, when using this rule you should also consider:

  • there is no strict statistical basis for choosing one threshold value over another
  • higher values increase protection against re-identification but also reduce the utility of the original data
  • lower threshold values may be appropriate for sampled datasets compared to population datasets
  • by implication, everything greater than or equal to the threshold is defined as an acceptable disclosure risk
  • it may be that a cell below the threshold value is not a disclosure risk, while a cell value greater than the threshold value is disclosive

In Table 1 a frequency rule of 4 is chosen, with any cell of less than 4 contributors considered a disclosure risk. The 25-29 year old age group has 3 contributors in the Low income cell and therefore the cell needs to be protected.

Table 1: Example of the frequency rule in aggregate counts (threshold value = 4)
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-24       | 8          | 10            | 7           | 25
25-29       | 3          | 8             | 11          | 22
30-34       | 4          | 5             | 18          | 27
Total       | 31         | 23            | 36          | 90
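
As an illustration, here is a minimal sketch in Python of applying the frequency rule to the Table 1 counts; the data structure, variable names and threshold are taken from the example above and are illustrative only.

    # Flag cells in Table 1 that fall below the frequency threshold of 4.
    table_1 = {
        "15-19": {"Low": 16, "Medium": 0, "High": 0},
        "20-24": {"Low": 8, "Medium": 10, "High": 7},
        "25-29": {"Low": 3, "Medium": 8, "High": 11},
        "30-34": {"Low": 4, "Medium": 5, "High": 18},
    }
    THRESHOLD = 4

    for age, incomes in table_1.items():
        for income, count in incomes.items():
            # Zero cells are usually considered separately (they may be structural zeros).
            if 0 < count < THRESHOLD:
                print(f"Cell ({age}, {income}) has {count} contributors and needs protection")

Running this flags only the (25-29, Low income) cell, matching the assessment above.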

 

Cell dominance rule

This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell's total value. Also called the cell concentration rule, the cell dominance rule limits the percentage of a cell's total that a given number of units are allowed to contribute.

This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.
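
A minimal sketch of an (n, k) dominance check is shown below; the function name and the example contributions are illustrative, with n = 2 and k = 75 as in the example above.

    def breaks_dominance_rule(contributions, n=2, k=75.0):
        """Return True if the n largest contributors exceed k% of the cell total."""
        total = sum(contributions)
        if total == 0:
            return False
        top_n = sum(sorted(contributions, reverse=True)[:n])
        return 100 * top_n / total > k

    # Example: a cell made up of four business contributions (in $m).
    print(breaks_dominance_rule([150, 93, 40, 19]))  # True: the top 2 contribute about 80% of 302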

P% rule

The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to protect the value of any contributor from being estimated to within P% and is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, a contributor's estimated value can be to its true value. For example, a rule of P% = 20% means that the estimated value of any contributor must differ from its true value by at least 20%.
 

Using the cell dominance and P% rules together

Here is an example of using the dominance and P% rules together where (n, k)=(2, 75) and P%=20%. Table 2a shows profit for Industries A-D.

Table 2a: Example of profit by industry
Industry | Profit ($m)
A        | 267
B        | 302
C        | 212
D        | 34
Total    | 815

 

Although initially the table does not appear to contain a disclosure risk, there is a risk if information about the companies contributing the data to each industry is known.

This information could include the data in Table 2b which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified. If company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T's contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.

Table 2b: Contributors to Industry B
Company | Profit ($m)
S       | 150
T       | 93
U       | 21
V       | 13
W       | 8
X       | 8
Y       | 6
Z       | 3
Total   | 302

 

This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. Therefore the summary data in Table 2a requires treatment.
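
The arithmetic in this worked example can be checked with a short Python sketch; the company values come from Table 2b, and the rule parameters (2, 75) and P% = 20 are those used above.

    profits = {"S": 150, "T": 93, "U": 21, "V": 13, "W": 8, "X": 8, "Y": 6, "Z": 3}
    total = sum(profits.values())                       # 302

    # Dominance rule (n, k) = (2, 75): share of the two largest contributors.
    top_two = sorted(profits.values(), reverse=True)[:2]
    share = 100 * sum(top_two) / total
    print(f"Top 2 share: {share:.1f}%")                 # 80.5%, which exceeds 75%

    # P% = 20 check: company S estimates company T's value by subtracting its own.
    estimate_of_t = total - profits["S"]                # 152
    error = 100 * abs(estimate_of_t - profits["T"]) / profits["T"]
    print(f"Estimate differs from T's true value by {error:.0f}%")  # about 63%, so P% alone is not broken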

Tabular data treatment techniques

Due to the diverse nature of data there is no one solution for managing all re-identification risks through the treatment of aggregate data. However, the key is to only treat the cells that have been assessed as posing an unacceptable disclosure risk.

The most common techniques are:

  • data reduction, which decreases the detail available to the user
  • data modification, which makes small changes to the data

How effective these techniques are depends on:

  • the structure of the dataset
  • the requirements of the data users
  • legislative or policy requirements
  • the available infrastructure to treat and disseminate the data

When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from the data. While it won't be possible to guarantee confidentiality, this effort must satisfy legislative requirements. For example, ABS legislation states that data must not be released in a manner that is likely to enable the identification of a particular person or organisation. This is important because once a table is made public, there are no further opportunities to control how the data will be used or to apply other confidentiality controls using the Five Safes framework.

Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive (Table 3). These methods can be used to allow data custodians to release data that would otherwise remain inaccessible. 

It is recommended that data custodians start by using simple techniques (data reduction) and only proceed to more complex ones (data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder's perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables.

Table 3: Comparison of techniques for treating tabular data
Data reduction

  Advantages:
  • relatively easy to implement
  • requires minimal education of users

  Disadvantages:
  • does not reliably protect individuals from differencing between multiple overlapping tables
  • may reduce the data's usefulness
  • the data custodian chooses what data to remove without necessarily knowing what is most important to the data users
  • requires secondary suppression to protect the original primary suppressed cells
  • even with secondary suppression, some suppressed cells may still be estimated

Data modification

  Advantages:
  • generally does not affect the data's overall utility
  • generally protects against differencing, zero cells and 100% cells
  • may be automated, requiring minimal human input

  Disadvantages:
  • does not provide additivity within tables unless secondary modifications are applied
  • requires some education of users
  • may require significant setup time and costs
  • may reduce the data's usefulness, particularly when analysing small areas/populations

 

Data reduction

Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.

Data reduction involves:

  • combining variable categories
  • suppressing counts with a small number of contributors (as per the frequency rule), and considering suppression of higher counts for sensitive items
  • suppressing cells with dominant contributors (as per the cell dominance rule)

Combining categories

This approach can be applied to Table 4a, where the value 3 does not meet a frequency threshold of 4. This cell can be protected in two ways:

  • combine the 20-24 and the 25-29 year old age groups to create a 20-29 year old range (Table 4b)
  • combine the Low and Medium income categories to create a single Low–medium category (Table 4c)
Table 4a: Unprotected income and age data (threshold value = 4)
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-24       | 8          | 10            | 7           | 25
25-29       | 3          | 8             | 11          | 22
30-34       | 4          | 5             | 18          | 27
Total       | 31         | 23            | 36          | 90

 

Table 4b: Treatment applied - age groups combined (20-29 years)
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-29       | 11         | 18            | 18          | 47
30-34       | 4          | 5             | 18          | 27
Total       | 31         | 23            | 36          | 90

 

Table 4c: Treatment applied - income categories combined (Low-medium income)
Age (years) | Low - medium income | High income | Total
15-19       | 16                  | 0           | 16
20-24       | 18                  | 7           | 25
25-29       | 11                  | 11          | 22
30-34       | 9                   | 18          | 27
Total       | 54                  | 36          | 90
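
A minimal sketch of combining categories is shown below; it reproduces the Table 4b treatment by adding the 20-24 and 25-29 rows together, and the data structure is illustrative only.

    table_4a = {
        "15-19": [16, 0, 0],     # counts for Low, Medium and High income
        "20-24": [8, 10, 7],
        "25-29": [3, 8, 11],
        "30-34": [4, 5, 18],
    }

    # Combine the two age categories by adding their counts element-wise.
    table_4b = {
        "15-19": table_4a["15-19"],
        "20-29": [x + y for x, y in zip(table_4a["20-24"], table_4a["25-29"])],
        "30-34": table_4a["30-34"],
    }
    print(table_4b["20-29"])     # [11, 18, 18], matching Table 4b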

 

The choice of which categories to combine depends on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the 'Low' income range, collapsing it into a 'Low-Medium' range prevents this. There is no one answer when choosing which categories to combine: in either case, the data's utility may be affected.

Combining categories is often appropriate, but doesn't work in all situations. For example, if another table were produced containing the 20-24 year old row, the 25-29 year old values could be determined by subtracting the 20-24 row (in the new table) from the 20-29 row (in Table 4b). 

This can be a complex process. Data custodians must carefully consider any other tables that include the same contributors and that:

  • have already been released
  • are being released at the same time
  • are likely to be released in the future

Suppression

Suppression involves removing cells considered to be a disclosure risk from a table. 

In Table 4a, for example, the 3 could be replaced with 'not provided' or 'np'. This is called primary suppression. Sometimes secondary, or consequential, suppression is also required. For example, in addition to suppressing the 3 cell, other cells need to be suppressed to prevent the primary suppressed cell from being calculated. The data in Table 4a could be treated by:

  • using primary and consequential suppression to withhold data from other rows and columns (Table 5a)
  • suppressing cells in the totals (Table 5b)

Other cell suppression combinations can be employed. The key is to ensure that the suppressed cells cannot be derived from the remaining information.

Table 5a: Primary and secondary suppression to protect tabular data
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-24       | 8          | 10            | 7           | 25
25-29       | np         | np            | 11          | 22
30-34       | np         | np            | 18          | 27
Total       | 31         | 23            | 36          | 90

 

Table 5b: Suppression of totals to protect tabular data
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-24       | 8          | 10            | 7           | 25
25-29       | np         | 8             | 11          | >19
30-34       | 4          | 5             | 18          | 27
Total       | >28        | 23            | 36          | >87
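
A minimal sketch of primary suppression using the frequency rule is shown below; choosing the best cells for secondary suppression is a harder optimisation problem and is not attempted here. The function and variable names are illustrative.

    def primary_suppress(row, threshold=4, marker="np"):
        """Replace any non-zero count below the threshold with a suppression marker."""
        return [marker if 0 < count < threshold else count for count in row]

    row_25_29 = [3, 8, 11]                  # the 25-29 year old row from Table 4a
    print(primary_suppress(row_25_29))      # ['np', 8, 11]; secondary suppression is still required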

 

It is not usually recommended to suppress cells that contain a zero. For example, if the two zeros in Table 5a were suppressed, it would still be apparent from the row total and the Low income 15-19 year old cell that they are zeros. In addition, it may be the case that a value must be zero by definition (often called structural zeros, for example the number of pregnant males in a population).

There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.

Limitations of data reduction

As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if Tables 5a and 5b were both publicly available, the suppressed 'Medium' income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.

Table 6a: Original data - income by age
Age (years) | Low income | Medium income | High income | Very high income | Total
15-19       | 1          | 2             | 3           | 5                | 11
20-24       | 6          | 3             | 2           | 7                | 18
25-29       | 2          | 7             | 8           | 4                | 21
30-34       | 4          | 11            | 15          | 4                | 34
Total       | 13         | 23            | 28          | 20               | 84

 

Table 6b: Suppressed data - income by age after applying threshold = 4
Age (years) | Low income | Medium income | High income | Very high income | Total
15-19       | np         | np            | np          | 5                | 11
20-24       | 6          | np            | np          | 7                | 18
25-29       | np         | 7             | 8           | np               | 21
30-34       | np         | 11            | 15          | np               | 34
Total       | 13         | 23            | 28          | 20               | 84

 

Table 6c: Income by age with variables assigned
Age (years) | Low income | Medium income | High income | Very high income | Total
15-19       | a          | b             | c           | 5                | 11
20-24       | 6          | d             | e           | 7                | 18
25-29       | f          | 7             | 8           | g                | 21
30-34       | h          | 11            | 15          | i                | 34
Total       | 13         | 23            | 28          | 20               | 84

 

In some cases, the results of data suppression may appear safer than they truly are. Tables 6a–c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. While the data in Table 6b appears safe, someone could assign a variable to each of the 'np's to create Table 6c.

With variables assigned to Table 6c, the values in rows 1-2 and columns 2-3 can be used to generate the following equation:

(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28

The variables b, c, d and e cancel out in this equation to give:

a – 23 = –22

Therefore, a = 1

This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear-programming can be used to calculate or make reasonable estimates of the missing values in tables. In fact, the larger the table (or set of related tables available) the more accurately the results can be estimated. As computing power increases and more data is released, these sorts of attacks on data become easier.
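
This attack can be reproduced with a short sketch, assuming the sympy library is available: assign a symbol to each suppressed cell in Table 6c and solve the linear system implied by the published row and column totals.

    from sympy import symbols, linsolve

    a, b, c, d, e, f, g, h, i = symbols("a b c d e f g h i")

    equations = [
        a + b + c + 5 - 11,      # 15-19 row total
        6 + d + e + 7 - 18,      # 20-24 row total
        f + 7 + 8 + g - 21,      # 25-29 row total
        h + 11 + 15 + i - 34,    # 30-34 row total
        a + 6 + f + h - 13,      # Low income column total
        b + d + 7 + 11 - 23,     # Medium income column total
        c + e + 8 + 15 - 28,     # High income column total
        5 + 7 + g + i - 20,      # Very high income column total
    ]

    # Solve for the unknowns; cells that are not fully determined stay expressed in terms of free symbols.
    solution = linsolve(equations, (a, b, c, d, e, f, g, h, i))
    print(solution)   # the first component is always 1, so cell 'a' is fully disclosed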

Problems with suppression: 

  • The usefulness of the data is reduced (for example, Table 5a has lost one third of its data). There are cells that aren't disclosive, but that have been suppressed nonetheless. 
  • It can be difficult and time-consuming to select the best cells for secondary suppression, especially for large tables. Software packages are available that optimise the suppression pattern in the table or set of tables.
  • The data is no longer machine readable because the table now includes symbols ('>') or letters ('np').

Although it can be relatively simple to suppress cells or combine categories, data custodians must be confident their outputs are not disclosive.

Data modification

In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table - not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table's overall usefulness for most purposes.

The two methods discussed below are:

  • rounding
  • perturbation (global or targeted)

Rounding

The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7b shows how original data values (Table 7a) would look with its values rounded to base 3.
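
A minimal sketch of rounding counts to base 3 is shown below. This is deterministic rounding to the nearest multiple of the base; in practice agencies may prefer random or controlled rounding, which are not shown here.

    def round_to_base(value, base=3):
        """Round a count to the nearest multiple of the given base."""
        return base * round(value / base)

    original = [16, 0, 0, 16]                        # the 15-19 year old row from Table 7a
    print([round_to_base(v) for v in original])      # [15, 0, 0, 15], as in Table 7b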

Table 7a: Income by age original values (from Table 1)
Age (years) | Low income | Medium income | High income | Total
15-19       | 16         | 0             | 0           | 16
20-24       | 8          | 10            | 7           | 25
25-29       | 3          | 8             | 11          | 22
30-34       | 4          | 5             | 18          | 27
Total       | 31         | 23            | 36          | 90

 

Table 7b: Income by age with rounding to base 3
Age (years) | Low income | Medium income | High income | Total
15-19       | 15         | 0             | 0           | 15
20-24       | 9          | 9             | 6           | 24
25-29       | 3          | 9             | 12          | 21
30-34       | 3          | 6             | 18          | 27
Total       | 30         | 24            | 36          | 90

 

Using this technique, the data is still numerical (containing no symbols or letters) which is a practical advantage for users requiring machine readability.

Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:

  • Users won't know whether the rounded value of '3' in Table 7b is actually a 2, 3 or 4. 
  • Users won't know whether the zeros are true zeros. This mitigates the problem of group disclosure: the original values showed that all 15-19 year olds were on a low income. 
  • Even if the true grand total or marginal totals were known from other sources, the user is still unable to calculate the true values of the internal cells. 

These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15-19 year olds in Low income was 14 in Table 7a, then the rounded counts for this row would still be 15, 0, 0 (total = 15). Now, if the true total was somehow known from other sources to be 14, then the true value of the low income category must also be 14 (it can't be 15 or 16 since this would necessitate a negative number in one or both of the other two income categories).

The main disadvantage to rounding is that there can be inconsistency within the table. For example in Table 7b, the internal cells of the 25-29 year-olds row sum to 24, whereas the total for that row is 21. Although controlled rounding can be used to ensure additivity within the table (where the totals add up), it may not provide consistency across the same cells in different tables. 

Graduated rounding can also be used to round magnitude tables, which means the rounding base varies by the size of the cell. Table 8 shows how data could be protected by rounding the original values to base 100 (Industry A, B, C and Total) or 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.

Table 8: Example of profit by industry with graduated rounding
Industry | Original values (Profit $m) | Rounded values (Profit $m)
A        | 267                         | 300
B        | 302                         | 300
C        | 212                         | 200
D        | 34                          | 30
Total    | 815                         | 800

 

Perturbation

Perturbation is another data modification method. This is where a change (often with a random component) is made to some or all non-zero cells in a table. 

  • For count data (frequency tables), a randomised number is added to the original values. This is called additive perturbation.
  • For magnitude tables, the original values are multiplied by a randomised number. This is called multiplicative perturbation.

For both table types, this can be further broken down into targeted or global approaches. 

Targeted perturbation is the approach taken when only those cells that are considered to be a disclosure risk are treated. Often this requires the application of secondary perturbation in order to maintain additivity within a table. 

This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to Industries A and C.

Table 9: Example of profit by industry with targeted perturbation
Industry | Original values (Profit $m) | Perturbed values (Profit $m)
A        | 267                         | 292
B        | 302                         | 252
C        | 212                         | 237
D        | 34                          | 34
Total    | 815                         | 815

 

There are two key advantages of this approach:

  • the total does not change (this is an important feature when the ABS releases economic data which then feed into National Accounts)
  • generally, there is minimal loss of information

A disadvantage is that some data that is not disclosive per se is altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to place the $50m that was taken from Industry B into a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), there is often significant manual effort required (more so when it is considered that all other tables produced need to have matching perturbed values). 

This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data is not economically sensitive, then a global perturbation approach may be more appropriate, as it can be automated. An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table. The marginal totals may not be the same as the sum of their constituent cells. This methodology is applied in TableBuilder, an ABS product used for safely releasing both Census and survey data.
 
Tables 10a and 10b show how data might look before and after additive perturbation is applied.

Here the user has no chance of determining that the count of low income 15-19 year olds is '1'. They can still make reasonable estimates of the true value, but they are unable to confirm their guesses. Because perturbation only applies small changes and applies changes to every cell, the results are unbiased and for most purposes, the overall value of the table is retained.

Table 10a: Income data by age (original)
Age (years) | Low income | Medium income | High income | Very high income | Total
15-19       | 1          | 2             | 3           | 5                | 11
20-24       | 6          | 3             | 2           | 7                | 18
25-29       | 2          | 7             | 8           | 4                | 21
30-34       | 4          | 11            | 15          | 4                | 34
Total       | 13         | 23            | 28          | 20               | 84

 

Table 10b: Income data by age with additive perturbation
Age (years) | Low income | Medium income | High income | Very high income | Total
15-19       | 0          | 4             | 3           | 7                | 10
20-24       | 4          | 6             | 0           | 4                | 21
25-29       | 0          | 7             | 9           | 5                | 21
30-34       | 7          | 10            | 16          | 4                | 32
Total       | 12         | 25            | 25          | 21               | 83
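
A toy sketch of additive perturbation for a frequency table is shown below. It simply adds bounded random noise to non-zero cells and is an illustration only, not the algorithm used by the ABS in TableBuilder; in a global approach the marginal totals would also be perturbed independently rather than recalculated.

    import numpy as np

    rng = np.random.default_rng(2021)   # seeded only so the illustration is repeatable

    def perturb_counts(table, max_change=3):
        """Add small random integers to every non-zero cell, never going below zero."""
        table = np.asarray(table)
        noise = rng.integers(-max_change, max_change + 1, size=table.shape)
        perturbed = np.clip(table + noise, 0, None)
        return np.where(table == 0, 0, perturbed)    # leave true zero cells at zero

    original = np.array([[1, 2, 3, 5],
                         [6, 3, 2, 7],
                         [2, 7, 8, 4],
                         [4, 11, 15, 4]])            # internal cells of Table 10a
    print(perturb_counts(original))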

 

Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company's income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total. 
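
A toy sketch of multiplicative perturbation of a single magnitude cell is shown below; the choice to perturb only the two largest contributors and the 20% bound are assumptions made for illustration, not ABS parameters.

    import numpy as np

    rng = np.random.default_rng(2021)    # seeded only so the illustration is repeatable

    def perturb_magnitude(contributions, n_largest=2, max_pct=0.20):
        """Scale the n largest contributions by a random percentage, then re-total the cell."""
        values = sorted(contributions, reverse=True)
        factors = 1 + rng.uniform(-max_pct, max_pct, size=n_largest)
        values[:n_largest] = [v * f for v, f in zip(values[:n_largest], factors)]
        return round(sum(values))        # the perturbed total becomes the published cell value

    industry_b = [150, 93, 21, 13, 8, 8, 6, 3]   # company profits contributing to Industry B
    print(perturb_magnitude(industry_b))         # a total near, but not equal to, 302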

Table 11 shows how data might look before and after multiplicative perturbation is applied. 

The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker does not know the extent to which each contributor's profit has been perturbed or therefore how close the perturbed total is to the true total.

Table 11: Top 3 contributing companies to an industry (Industry B) with multiplicative perturbation
Company | Original values (Profit $m) | Perturbed values (Profit $m)
S       | 150                         | 123
T       | 93                          | 104
U       | 21                          | 18
V       | 13                          | 13
W       | 8                           | 8
X       | 8                           | 8
Y       | 6                           | 6
Z       | 3                           | 3
Total   | 302                         | 283

 

Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but there is robust protection of its individual contributors. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.

Table 12: Example of profit by industry with multiplicative perturbation
Industry | Original values (Profit $m) | Perturbed values (Profit $m)
A        | 267                         | 296
B        | 302                         | 283
C        | 212                         | 185
D        | 34                          | 38
Total    | 815                         | 821

 

An important issue to keep in mind with all perturbation is that the perturbation itself may adversely impact the usefulness of the data. Tables 10 and 12 show perturbation where the totals are quite close to the true value, but individual cells can be changed by 100%. When data custodians are releasing data, they need to clearly communicate the process by which they have perturbed the data (although they shouldn't provide exact details on key parameters of the perturbation that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Further analysis done on the unperturbed data should be carefully checked before being released into the public domain, to ensure the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable in any case, either due to sampling error or because responses are not always accurate. 

Hierarchical data treatment techniques

All of the methods described above are limited in how they deal with hierarchical datasets (datasets with information at different levels). For example, a dataset may contain records for each family at one level and separate records for individual family members at another level. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.

In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.

The following example shows a summary table (Table 13), which is derived from detailed information in Table 14.

On the surface, Table 13 appears non-disclosive; for example, it doesn't violate a frequency rule of 4. However, a closer look at the source data reveals the following disclosure risks:

  • The summary count of 'Pathology' in the 'Private' sector in the 'East' location (61) is based on only 2 providers (Lisa and Stu). Both Lisa and Stu could therefore subtract their own contribution to determine the contribution, and hence estimate the income, of the other.
  • Only one clinic (Clinic D) is represented by this same cell, so the cell also reveals that clinic's total.
  • The summary count of 'Surgery' in the 'Public' sector in the 'West' location (5) comes from only 2 patients and 1 provider (Pru), so other companies could use known service fees to estimate Clinic E's income from 'Surgery'.

All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.

Table 13: Summary of health care service counts (based on Table 14 data)
Service type | East (Public) | West (Public) | East (Private) | West (Private) | Total
Treatment | 0 | 47 | 95 | 0 | 142
Surgery | 0 | 5 | 0 | 209 | 214
Pathology | 0 | 10 | 61 | 7 | 78
Total | 0 | 62 | 156 | 216 | 434

 

Table 14: Counts of health care patients and services
Sector | Location | Company | Clinic | Service | Provider | Patients | Services
Private | East | Q | D | Pathology | Lisa | 15 | 29
Private | East | Q | D | Pathology | Stu | 18 | 32
Private | East | Q | B | Treatment | Joe | 3 | 5
Private | East | Q | B | Treatment | Jan | 8 | 31
Private | East | Q | B | Treatment | Deb | 6 | 22
Private | East | Q | B | Treatment | Em | 5 | 31
Private | East | Q | B | Treatment | Fred | 3 | 6
Private | West | Q | C | Pathology | Ian | 7 | 7
Private | West | Q | C | Surgery | Bill | 3 | 8
Private | West | R | E | Surgery | Tess | 36 | 201
Public | West | P | A | Pathology | Meg | 4 | 10
Public | West | P | E | Surgery | Pru | 2 | 5
Public | West | P | A | Treatment | Rob | 3 | 7
Public | West | P | A | Treatment | Al | 14 | 40
Total | 2 | 3 | 5 | | 14 | 127 | 433

 

The example above shows the need to protect information at all levels. With hierarchical data, treatment must be applied at every level, and the consequences of any changes need to be followed through to the other levels. For example, if the number of providers for a cell in Table 14 is reduced to zero, then the corresponding count of services (in Tables 13 and 14) must also be set to zero.
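
As a sketch of how such a multi-level check might be automated, the snippet below applies a frequency rule at two levels of the hierarchy (providers and clinics) using pandas. The column names mirror Table 14 and the threshold of 4 follows the example above; both are illustrative.

```python
import pandas as pd

# A few illustrative rows with the same structure as Table 14.
df = pd.DataFrame({
    "sector":   ["Private", "Private", "Private", "Public"],
    "location": ["East", "East", "East", "West"],
    "clinic":   ["D", "D", "B", "E"],
    "service":  ["Pathology", "Pathology", "Treatment", "Surgery"],
    "provider": ["Lisa", "Stu", "Joe", "Pru"],
})

THRESHOLD = 4  # frequency rule used in the example above

# For every cell of the summary table, count distinct contributors at each level.
cells = df.groupby(["sector", "location", "service"]).agg(
    providers=("provider", "nunique"),
    clinics=("clinic", "nunique"),
)

# A cell is risky if it fails the rule at any level, not just the lowest one.
risky = cells[(cells["providers"] < THRESHOLD) | (cells["clinics"] < THRESHOLD)]
print(risky)
```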

Treating microdata

Assessing and treating microdata disclosure risks

Released
8/11/2021

Microdata and disclosure risks

Microdata files are datasets of unit records, where each record contains information about a person, organisation or other type of unit. This information can include individual responses to questions on surveys, censuses or administrative forms. Microdata files are potentially valuable resources for researchers and policy makers because they contain detailed information about each record. The challenge for data custodians is to strike the right balance between maximising the availability of information for statistical and research purposes and fulfilling their obligations to maintain confidentiality by:

  • assessing the context in which the data will be released
  • treating the data appropriately for that context

Assessing disclosure risk

The two key risks when releasing microdata are when disclosure occurs through:

  • spontaneous recognition - where, in the normal course of their research analysis, a data user recognises an individual or organisation without deliberately attempting to identify them (for example, when checking for outliers in a population)
  • deliberate attempts at re-identification - looking for a specific individual in the data, or using other research to confirm the identity of an individual who stands out because of their characteristics

As with aggregate data, there is also a risk with any published analysis from the microdata output.

Several methods for assessing microdata disclosure risk can be used:

  • cross-tabulate the variables (e.g. look at age by income or marital status) to identify records with unique or remarkable characteristics (see the sketch after this list)
  • compare sample data with population data to determine whether records with unique characteristics in the sample are in fact unique in the population
  • compare potentially risky records to see how similar they are to other records that may provide some protection (a unique 30 year old with certain characteristics may be considered similar to a 31 year old with the same characteristics)
  • identify high-profile individuals or organisations known to be in the dataset and who may be easily recognisable
  • consider other datasets and publicly available information that could be used to re-identify records, such as through list matching
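
The first two checks in the list above might be sketched as follows. The file names, the choice of key variables and the population cut-off are hypothetical; real choices depend on the dataset and the available population benchmarks.

```python
import pandas as pd

# Hypothetical quasi-identifiers; the right set depends on the dataset.
key_vars = ["age_group", "sex", "marital_status", "occupation", "region"]

sample = pd.read_csv("survey_sample.csv")            # hypothetical file name

# Cross-tabulate the key variables and flag combinations occurring only once
# in the sample (sample uniques).
counts = sample.groupby(key_vars).size().rename("sample_count")
sample_uniques = counts[counts == 1]
print(len(sample_uniques), "sample-unique combinations")

# If population counts are available (e.g. from census tables), check whether
# the sample uniques are also rare in the population.
population = pd.read_csv("population_counts.csv")    # hypothetical file name
pop_counts = population.set_index(key_vars)["population_count"]
risky = sample_uniques.to_frame().join(pop_counts, how="left")
risky = risky[risky["population_count"].fillna(0) <= 3]   # illustrative cut-off
print(risky)
```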

Factors contributing to the risk of disclosure should also be considered. These factors have different bearings in different contexts. For example, if releasing microdata publicly, data custodians should carefully consider each of the factors below. If the microdata is only released in a secure data facility to authorised researchers, some factors may not be applicable.

Level of detail

The more detailed a unit record, the more likely re-identification becomes. Microdata files containing detailed categories or many data items could, through unique combinations of characteristics, reveal enough to enable re-identification. 

With microdata output (or aggregate data), the main risk is attribute disclosure which may in turn increase risks of re-identification. In addition, with detailed tables, there is increased risk of disclosure due to differencing attacks or mathematical techniques that undo some or all of the data protections.

Data sensitivity

Some variables may require additional treatment if they are sensitive, such as health, ancestry or criminal information. This treatment may be dictated by legislation and policy as well as confidentiality obligations. This can be a significant balancing act as often the variables that are of most interest to researchers are also sensitive.

Rare characteristics

A disclosure risk may exist if the data contains a rare and remarkable characteristic (or combination of characteristics). This can happen even if there are few data items or categories. This risk depends on how remarkable the characteristic is. For example, a widow aged 19 years is more likely to be identifiable than one aged 79 years. In addition, it is important to consider the rarity of a record from a population perspective. For example, there may only be one 79 year old widow in a sample, but they are not unique in the entire population. The sampling process is a significant contributor to protecting the confidentiality of that individual (a user is unlikely to know which 79 year old widow was selected). It is advisable, however, to protect that single individual in any subsequent outputs that may be publicly released.

Data accuracy

Data accuracy can increase the risk of disclosure. While it is not recommended to produce data with low accuracy as a method to manage this risk, data custodians should be aware that datasets subject to reporting errors or containing out-of-date information may present a lower disclosure risk.

Data age

As a general rule, accessing older data is less likely to enable re-identification of an individual or organisation than accessing up-to-date information. This is particularly true for variables that change over time, such as area of residence or marital status.

Data coverage (completeness)

Individuals or organisations are more easily identifiable if they are known to be in the dataset. Datasets that cover the complete population increase the risk of disclosure because a user knows that all individuals are represented in the dataset. This risk applies to administrative data and population censuses.

Sample data

Data based on surveys or samples taken from a population generally pose a lower disclosure risk than full population datasets, because there is inherent uncertainty about whether any particular individual or organisation is represented in the data. The risk is not reduced to zero, particularly for records with rare characteristics that may be re-identifiable in the sample as well as in the population, so sampling should not be the only method of protection.

Data structure

In some cases, how the dataset is structured increases the disclosure risk. Longitudinal datasets (those where the same individuals are tracked over time, as opposed to datasets that are snapshots of different samples of the population at different times) may have significant disclosure issues. Individuals or organisations whose characteristics change over time are much more likely to be re-identified than those whose characteristics don't (and in reality very few individuals or organisations are unchanged over time). For example, a business that has a relatively constant income over five years but then triples its income for the next three years is more likely to be re-identified than a business with constant income over the same period.

Another structural aspect of datasets is their hierarchical nature, where datasets have information at more than one level, such as a person level as well as a family level. The information may be non-disclosive at one level but disclosive at a higher level. For example, a count of people with household income of $801-$1,000 per week may be 6. However, the 6 may refer to a single household (2 parents and 4 children), which effectively discloses information about all the people in that household.

Incentive

The more an individual or organisation is likely to gain from re-identifying a record, the greater the risk of disclosure. Conversely, the risk of attack is lower when the gains are lower. This is the fundamental principle of trusted access, where researchers share accountability for protecting data confidentiality and where the incentive for them is ongoing authorisation to access information.

Software for assessing disclosure risks

Various software packages can help data custodians assess, detect and treat disclosure risks in microdata. These include:

  • Mu-ARGUS: Developed by Statistics Netherlands to protect against spontaneous recognition only (not against list matching).
  • SDC-Micro: An R-based open source package developed by the International Household Survey Network. It calculates disclosure risk for whole datasets and for individual records, and applies treatments.
  • SUDA (Special Uniques Detection Algorithm): Developed by the University of Manchester to identify unit records which, due to rare or unique combinations of characteristics, pose a re-identification risk. SUDA looks for uniqueness in the dataset but does not consider whether a particular record is unique in the population as a whole.

Microdata treatment methods

Once a re-identification or disclosure risk is known, it can be addressed through a number of data modification and reduction techniques. These techniques should be applied to only those records or variables judged to be a risk - a judgement that should consider the specific release context:

  • detailed microdata may require few of these treatments because it is accessed in a context where people, projects, settings and outputs are controlled (see the Five Safes framework)
  • publicly available files may require many or all of these treatments.

Usually the minimum level of protection for any microdata to be used for statistical or research purposes is removal of direct identifiers such as name and address. Depending on the legal obligations of data custodians and the controls on access (such as user authorisation, project assessment, security of the access environment and output checking), the removal of direct identifiers alone may be sufficient to protect confidentiality. Further disclosure controls may be required depending on the data release context, especially for publicly accessible open data. Data custodians must carefully assess the microdata to identify records posing a disclosure risk and treat them to prevent re-identification.

Limit the number of variables

This means reducing the number of variables in the dataset. For example, you could remove detailed geographic variables.

Modify cell values

This can be done through rounding or perturbation. The amount of rounding should be relative to the magnitude of the original value: for example, personal income might be rounded to the nearest $1,000 and business income to the nearest $1 million. Perturbation in the context of microdata means adding 'noise' to the values of individual records. For example, someone's true income of $1,000,000 might be perturbed to $1,237,000. In order to maintain totals, that $237,000 could be removed from one or more other records.

Rounding or perturbation may also be applied to exact dollar amounts that might otherwise be at risk of list matching. If a record is the only one in a population with an exact income value of, say, $97,899.21, then that record may be at risk if a user also has access to another dataset containing income variables: the user could match the two datasets on exact income values and learn new information about the records. Adjusting all dollar amounts on a dataset by a small amount provides protection. This can be done by grouping the records into clusters and adjusting records within each cluster so that the mean for each cluster remains the same.
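
A minimal sketch of the clustering approach just described: records are grouped into small clusters of similar values and zero-sum noise is added within each cluster, so the cluster means (and the overall total) are unchanged. The cluster size and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_keeping_cluster_means(amounts, cluster_size=5, noise_scale=500.0):
    """Add noise to dollar amounts while keeping each cluster's mean unchanged."""
    amounts = np.asarray(amounts, dtype=float)
    order = np.argsort(amounts)                 # cluster records of similar size
    perturbed = amounts.copy()
    for start in range(0, len(order), cluster_size):
        idx = order[start:start + cluster_size]
        noise = rng.normal(0.0, noise_scale, size=len(idx))
        noise -= noise.mean()                   # zero-sum noise preserves the mean
        perturbed[idx] = amounts[idx] + noise
    return np.round(perturbed, 2)

incomes = [97899.21, 52310.00, 51875.50, 48200.10, 101450.75,
           49990.00, 53020.40, 98750.00, 50125.30, 47680.90]
print(perturb_keeping_cluster_means(incomes))
```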

Combine categories

Combine categories that are likely to enable re-identification, such as: 

  • using age ranges rather than single years
  • collapsing hierarchical classifications such as industry at higher levels (e.g. mining rather than the more detailed coal mining or nickel ore mining)
  • combining small territories with larger ones (e.g. ACT into NSW)

You can combine categories containing a small number of records so that the identities of individuals in those groups remain protected (e.g. combine use of electric wheelchairs and use of manual wheelchairs). See also Treating aggregate data section.

Top/bottom coding

Collapse top or bottom categories containing small populations. For example, survey respondents aged 85 years and over could be coded to a single '85+' category rather than having separate categories for 85-89, 90-94 and 95-100 years, which might be very sparse.
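
A short sketch of combining age categories and top-coding at 85+, using pandas. The break points and labels are illustrative.

```python
import pandas as pd

ages = pd.Series([17, 23, 45, 61, 87, 93, 102])   # illustrative single years of age

# Recode single years into 5-year ranges, collapsing everything from 85
# upwards into a single '85+' category.
bins = list(range(0, 90, 5)) + [float("inf")]
labels = [f"{b}-{b + 4}" for b in range(0, 85, 5)] + ["85+"]
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_groups.value_counts().sort_index())
```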

Data swapping

To hide a record that may be identifiable by its unique combination of characteristics, swap it for another record where some other characteristics are shared. For example, someone in NSW who speaks an uncommon language could have their record moved to Victoria, where the language may be more commonly spoken. This allows characteristics to be reflected in the data without the risk of re-identification.

As a consequence of applying this method, additional changes may also be required. In the previous example, after moving the record from NSW to Victoria, family-related information would also need to be adjusted so that both the original record and the records of family members remain consistent.
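
A hypothetical sketch of a single swap is shown below: the risky record exchanges its geography with a donor record that matches on the other listed variables. The variable names and matching rule are assumptions for illustration; in practice swapping is applied systematically and the follow-up adjustments described above are also needed.

```python
import pandas as pd

# Illustrative microdata: the combination of state and language makes record 0 risky.
df = pd.DataFrame({
    "state":     ["NSW", "VIC", "VIC", "NSW"],
    "language":  ["Lang-X", "Lang-X", "Lang-Y", "Lang-Y"],
    "age_group": ["30-34", "30-34", "45-49", "45-49"],
})

def swap_geography(data, risky_idx, match_vars=("language", "age_group")):
    """Swap the 'state' of a risky record with a matching donor in another state."""
    record = data.loc[risky_idx]
    donors = data[data["state"] != record["state"]]
    for var in match_vars:
        donors = donors[donors[var] == record[var]]
    if donors.empty:
        return data                      # no suitable donor; leave unchanged
    donor_idx = donors.index[0]
    swapped = data.copy()
    swapped.loc[risky_idx, "state"], swapped.loc[donor_idx, "state"] = (
        data.loc[donor_idx, "state"], data.loc[risky_idx, "state"])
    return swapped

print(swap_geography(df, risky_idx=0))
```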

Suppression

If the above methods are insufficient, suppress particular values or remove records that cannot otherwise be protected from the risk of re-identification.

Understand the relationship between microdata and aggregate data

To ensure that hierarchical data cannot be used to identify higher level contributors (i.e. in aggregate data), the above methods may need to be applied to a greater degree. Alternatively, removing variables or other information relating to the higher levels may be effective. See also Treating aggregate data.

Use of ABS microdata and impact on research quality

ABS microdata products in the context of confidentiality controls, and impact of data treatments on analysis

Released
8/11/2021

ABS microdata products

ABS releases microdata products that can be accessed for statistical and research purposes:

  • basic microdata which can be downloaded into a user's own computing environment
  • detailed microdata files, available through the ABS DataLab

ABS also releases TableBuilder, which uses underlying microdata to allow researchers to create their own automatically confidentialised tables, graphs and maps (aggregate output).

Each product is designed to meet a different research requirement. A comparison of these products is shown in Table 1. You can also Compare data services for more detail about a wider range of ABS products.

Table 1: ABS microdata products

Access via
  • Basic microdata: files downloaded into the user's own computing environment
  • Detailed microdata: the ABS DataLab
  • TableBuilder: the TableBuilder system

Utility and suitability
  • Basic microdata (basic utility): surveys; census samples
  • Detailed microdata (very high utility): surveys; census data; administrative data; complex integrated data
  • TableBuilder (high utility): surveys; census data; administrative data; complex integrated data

Purpose
  • Basic microdata: simple modelling; multivariate analysis
  • Detailed microdata: exploratory analysis; complex modelling; detailed analysis
  • TableBuilder: small to very large tables; graphs and maps

Confidentiality controls applied: data treatment
  • Basic microdata: direct identifiers removed; data available at broad levels only (e.g. state); variables aggregated (e.g. 5 or 10 year age groupings); at-risk records suppressed/removed
  • Detailed microdata: direct identifiers removed; at-risk records suppressed/removed
  • TableBuilder: direct identifiers removed; at-risk records suppressed/removed

Confidentiality controls applied: context controls
  • Basic microdata: user registration; legal undertakings; secure storage of microdata within the user's environment
  • Detailed microdata: approval of people; approval of projects; secure IT/physical access environment; clearance process applied to output to be removed from the system
  • TableBuilder: approval of people; approval of projects; secure IT/physical access environment; clearance process applied to output to be removed from the system

 

Impact of data treatment on analysis

Treating the data itself may restrict the ability of a researcher to answer a particular question. For example, a major difference between basic microdata and detailed microdata is that data item categories in the former have been collapsed or aggregated to a greater degree, which reduces the level of detail available. In some instances it may therefore be more appropriate to use a more detailed product, such as an Expanded CURF. Similarly, it may be better to conduct research within the ABS DataLab using detailed microdata files if an Expanded CURF does not contain enough detail to answer the researcher's question.

There are, however, situations where data treatment may not adversely affect the quality of the data or the reliability of the research. For example, a 2010 study used data from the ABS's Survey of Mental Health and Wellbeing (2007) to compare results obtained from the treated Expanded CURF with results obtained from the untreated main unit record file (MURF). As Table 2 shows, the results were almost identical.

Table 2: Hazard ratios* for smoking cessation and incidence of anxiety disorders
Disorder | Hazard ratio (from treated microdata) | Hazard ratio (from original microdata) | Standard error of hazard ratio | Difference between hazard ratios, as proportion of standard error
No lifetime mental disorder | 1 | 1 | (reference category) |
Anxiety disorder (type): | | | |
Panic disorder | 0.59678 | 0.59615 | 0.10335 | 0.0102
Agoraphobia | 0.44936 | 0.44936 | 0.0842 | 0
Social phobia | 0.55374 | 0.55374 | 0.07973 | 0
Generalised anxiety disorder | 0.33631 | 0.33631 | 0.07846 | 0
Obsessive-compulsive disorder | 0.47782 | 0.47782 | 0.10674 | 0
Post-traumatic stress disorder | 0.63221 | 0.63221 | 0.07865 | 0
Anxiety disorder (severity): | | | |
Mild | 0.74901 | 0.74882 | 0.11495 | 0.00218
Moderate | 0.58942 | 0.5892 | 0.08275 | 0.00447
Severe | 0.39196 | 0.39202 | 0.06016 | 0.00249

 

* Hazard ratios compare the incidence of an event in one group to another group over time

Source: Lawrence, D., Considine, J., Mitrou, F. & Zubrick, S.R. (2010) ‘Anxiety disorders and cigarette smoking: Results from the Australian Survey of Mental Health and Wellbeing’, Australian and New Zealand Journal of Psychiatry. Vol. 44, pp. 521-528.

Glossary

Explanation of statistical and confidentiality terms used in this guide

Released
8/11/2021

Administrative data

  • information (including personal information) collected by agencies for the administration of programs, policies or services
  • can be microdata (unit-record data) or macrodata (aggregate data)
  • may be used for statistical or research purposes

Aggregate data

  • produced by grouping information into categories and combining values within these categories
  • example: a count of the number of people of a particular age (obtained from the question 'In what year were you born?'). 
  • also known as tabular data or macrodata
  • aggregate data is often presented in tables

Attribute disclosure

  • occurs when previously unknown information is revealed about an individual, group or organisation (without necessarily formally re-identifying them)

Big Data

  • extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions 

Cell concentration rule

  • used to assess whether a table cell may enable re-identification or attribute disclosure
  • finds cells where a small number of data providers contribute a large percentage of the cell total. If a cell fails this rule, further investigation or data treatment is needed to ensure the attributes of predominant data providers are not disclosed
  • aggregate data treatment method

Confidentiality

  • protecting the secrecy and privacy of information collected from individuals and organisations, and ensuring that no data is released in a manner likely to enable their identification

Confidentiality rules

  • rules used to assess disclosure risk in aggregate data: see cell concentration rule, frequency rule and P% rule

Data access context

  • the environment and manner in which data is released
  • data custodians need to consider who will have access to the data, the purpose for which the data will be used and the release environment itself (whether physical, IT or legal)

Data custodian

  • organisation or agency responsible for the collection, management and release of data
  • they have legal and ethical obligations to keep the information they are entrusted with confidential

Data laboratory

  • secure data environment where researchers can perform detailed analysis of microdata
  • also known as secure research centres
  • can be accessed virtually (remotely) or on-site
  • ABS data laboratory is called the DataLab

Data modification

  • technique used to treat data to limit re-identification or other disclosure
  • changes all non-zero cells by a small amount while aiming to maintain the table's overall usefulness
  • examples include rounding and perturbation

Data provider

  • an individual, household, business or other entity that supplies data for statistical or administrative purposes
  • also known as a respondent

Data reduction

  • a technique for statistical disclosure control
  • methods to control or limit the amount of detail available in a table to prevent individuals or organisations from being re-identified
  • methods include combining variables or categories, or suppressing (removing) information in unsafe cells
  • can be applied to aggregate data or microdata

Data rounding

  • slightly altering cells in a table to make them all divisible by the same number
  • common numbers used for rounding are 3, 5 or 10
  • may be random or controlled
  • prevents the original data values from being known with certainty while ensuring the usefulness of the data is not significantly affected
  • aggregate data treatment method

Data swapping

  • process of moving the values of one or more variables from one microdata record to another record, so it no longer poses a disclosure risk
  • microdata treatment method

Differencing or differencing attack

  • where someone with access to multiple tables can deduce the true values of cells that had been modified or suppressed
  • individual tables may be non-disclosive, but when the tables are compared, the difference between cells across the tables may be disclosive
  • example: if a user accessed a table with information on 20-25 year olds and then accessed a subsequent table with information on 20-24 year olds, the difference between the two tables will reveal information about 25 year olds only

Direct identification

  • when the data includes an identifier (such as name or address) that can be used, without any additional information, to establish the identity of a person, group or organisation

Disclosure or disclosive

  • a breach of confidentiality, where a person, group or organisation is identified or has previously unknown characteristics (attributes) associated to them as a result of releasing data

Disclosure control

  • the process of limiting the risk of an individual or organisation being directly or indirectly identified
  • can be via statistical (data focused) or non-statistical (data context-focused) techniques or processes

Disclosure risk management

  • In the context of confidentiality, determining whether released datasets (or sections of released datasets) constitute a risk of disclosure or re-identification, and then putting in place controlling mechanisms to mitigate those risks
  • the Five Safes framework provides a way of assessing risk within the constraints provided by policies and legislation

Five Safes framework

  • multi-dimensional approach to managing disclosure risk, consisting of safe people, safe projects, safe settings, safe data and safe outputs
  • each safe is considered both individually and in combination to determine disclosure risks and to put in place mitigation strategies for releasing and accessing data

Frequency rule

  • sets a particular value for the minimum number of unweighted contributors (such as people, households or businesses) to any cell in the table
  • cells with very few contributors (small cells) may pose a disclosure risk
  • common threshold values are 3, 5 or 10
  • if a cell fails this rule, further investigation or action is needed to ensure the cell is adequately protected
  • also called the threshold rule

Hierarchical data

  • datasets that contain more than one level
  • example: a dataset containing unit records with information about individual people (such as personal income) may also contain information about the families these people are part of (such as household income)

Identified data

  • data that includes information that refers directly to an individual or organisation, such as name or address, ABN, Medicare number

Identifier(s)

  • information that directly establishes the identity of an individual or organisation
  • examples include name, address, driver's licence number, Medicare number and ABN
  • also known as direct identifiers

Indirect identification

  • occurs when the identity of an individual, group or organisation is disclosed due to a unique combination of characteristics (that are not direct identifiers) in a dataset
  • example: a famous individual may be identifiable on the basis of their age, sex, occupation, geography and income

List matching

  • where a user compares records from one dataset with records from another in an attempt to find records that have corresponding information, so that it may be concluded that the two records belong to the same individual
  • where this is done in an attempt to re-identify an individual, it is a clear breach of the Privacy Act and other legislation governing data access

Macrodata

  • see aggregate data

Microdata

  • datasets of unit records where each record contains information about a person, organisation or other type of unit
  • can include individual responses to a census, survey or administrative form

Open data

  • Data that is made available with no restriction on access or use (excluding possible copyright or licensing requirements). In terms of the Five Safes framework, the only control is on safe data. 
  • Data on data.gov.au is open data as any researcher can download files
  • Data underlying ABS TableBuilder is not considered open data as there is a safe setting control - users cannot directly access the underlying microdata
  • Aggregate outputs (tables, graphs or maps) from TableBuilder are open data

Outlier

  • an unusual record that stands out from the rest of the population or sample because it has an extreme value for one or more data items
  • outliers are potentially risky for confidentiality

P% rule

  • statistical disclosure control rule that prevents any user from estimating the value of a cell contributor to within P% (where P is defined by the data custodian)
  • aggregate data treatment method

Personal information

  • information that identifies, or could identify, a person
  • can include not only names and addresses, but also medical records, bank account details, photos, videos, and even information about what a person likes or where they work
  • information can still be personal without having a name attached to it
  • example: date of birth and postcode may be enough to identify someone
  • see also Sensitive information

In the Privacy Act 1988, personal information is "information or an opinion about an identified individual, or an individual who is reasonably identifiable: 

  1. whether the information or opinion is true or not true; and 
  2. whether the information or opinion is recorded in a material form or not." 

Perturbation

  • a statistical disclosure control technique used for count or magnitude data (aggregate data) or for microdata
  • data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much data content and structure as possible
  • data rounding is a type of perturbation

Privacy

  • not specifically defined in the Privacy Act
  • an individual's right to have their personal information kept confidential unless informed consent has been given to release the information, or a legal authority exists - this is in accordance with the requirements of the Privacy Act 1988

Re-identification

  • the act of determining the identity of a person or organisation using publicly or privately held information about that individual or organisation

Remote analysis facility

  • remote access facilities are used by agencies around the world
  • enables approved researchers to submit data queries from their desktops through a secure online interface
  • requests are run against microdata that is securely stored within the data custodian's control

Remarkable characteristics

  • rare characteristics or attributes in the data that can pose an identification risk, depending on how extraordinary or noticeable they are
  • may include unusual jobs, very large families or very high income
  • remarkable characteristics (or remarkable combinations of characteristics) can lead to re-identification of individuals, households or organisations

Respondent

  • see data provider

Response knowledge

  • information that is publicly or privately known about a respondent
  • may be used to breach confidentiality

Rounding

  • see data rounding

Safe data

  • one of the Five Safes, safe data poses the question: has appropriate and sufficient protection been applied to the data? 
  • at a minimum, direct identifiers such as name and address must be removed or encrypted
  • further statistical disclosure control may be needed depending on the context in which data is released

Safe outputs

  • one of the Five Safes, safe outputs poses the question: are the statistical results non-disclosive? 
  • the final check, aiming for negligible risk of disclosure
  • all data made available outside of the data custodian's IT environment must be checked for disclosure
  • example: statistical experts may check all outputs for inadvertent disclosure before the data leave a secure data centre

Safe people

  • one of the Five Safes, safe people poses the question: is the researcher appropriately authorised to access and use the data? 
  • by placing controls on the way data is accessed, the data custodian requires the researcher to take some responsibility for preventing re-identification
  • as the detail in the data increases, so should the level of user authorisation required

Safe projects

  • one of the Five Safes, safe projects poses the question: is the data to be used for an appropriate purpose? 
  • before users can access detailed microdata, they may need to demonstrate to the data custodian that their project has a valid research aim, public benefit and/or statistical purpose
  • depends on the context in which the data is accessed

Safe settings

  • one of the Five Safes, safe settings poses the question: does the access environment prevent unauthorised use? 
  • can be considered in terms of both the IT and physical environment
  • in some data access contexts, such as open data, safe settings are not applicable
  • at the other end of the spectrum, sensitive information is accessed through secure research centres

Secure Research Centre

  • see data laboratory

Security

  • safe storage of, and access to, data held by organisations or individuals
  • covers both IT security and the physical security of buildings

Sensitive information (data)

  • sensitive information is considered a subset of personal information
  • under the Privacy Act, it is of greater importance in terms of confidentiality (in particular where disclosure leads to worse consequences for a re-identified individual)
  • the Office of the Australian Information Commissioner lists a number of characteristics about an individual that are defined as sensitive [link]
  • community and ethical expectations may not consider this list to be exhaustive (example: financial information is not present)
  • all personal information can be potentially sensitive depending on the context and the individual concerned
  • businesses may consider much of their information to be sensitive, but only personal information about individuals is covered by the Privacy Act

Statistical or research purposes

  • purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, the dissemination of those outputs and the information describing them
  • statistical or research purposes may be distinguished from administrative, regulatory, compliance, law enforcement or other purposes that affect the rights, privileges or benefits of particular individuals or organisations

Suppression

  • not releasing information that is considered a disclosure risk 
  • aggregate data:
    1. removing specific values from a table so that people and organisations cannot be re-identified from the released data
    2. initial suppression is known as primary suppression
    3. additional cells needing suppression are known as consequential or secondary suppression
  • microdata:
    1. removing specific records from the microdata file
    2. removing specific data items for all records on the microdata file

Tabular data

  • see aggregate data

Threshold rule

  • see frequency rule

Uniqueness

  • where an individual has a characteristic or combination of characteristics that are different to all other members in a population or sample
  • determined by the size of the population or sample, the degree to which it is segmented (for example by geographic information), and the number and detail of characteristics provided for each unit in the dataset 
  • records that are unique are not necessarily re-identifiable, as this also depends on the remarkability of the characteristics and the availability of other information or knowledge held by the researcher (response knowledge)

Unit record data

  • see microdata