MANAGING THE RISK OF DISCLOSURE: TREATING MICRODATA
For more information about re-identification risks and how to assess them, see:
ASSESSING DISCLOSURE RISK
Microdata brings with it two key risks:
The second risk is often broken down into spontaneous recognition and malicious attempts at re-identification. Spontaneous recognition is where, in the normal course of analysis, a data user recognises an individual or organisation without deliberately attempting to identify them (for example, when checking for outliers in a population). A malicious attempt could involve looking for a specific individual in the data, or using other research to confirm the identity of an individual who stands out because of their characteristics.
Methods for assessing microdata disclosure risk
When assessing the risk of disclosure in microdata files, several methods can be used:
Factors contributing to disclosure risk
Factors contributing to the risk of disclosure should also be considered. These factors will have different bearings in different contexts. For example, if releasing microdata publicly, data custodians should carefully consider each of the factors below. But if these microdata will only be released in a secure data facility to authorised researchers, some factors may not be applicable.
The more an individual or organisation is likely to gain from re-identifying a record, the greater the risk of disclosure. Conversely, the risk of attack is lower when the gains are lower. This is the fundamental principle of trusted access, where researchers share accountability for protecting data confidentiality and where the incentive for them is ongoing authorisation to access information.
Level of detail
The more detailed a unit record, the more likely re-identification becomes. Microdata files containing detailed categories or many data items could, through unique combinations of characteristics, reveal enough to enable re-identification.
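One simple way to see how unique combinations of characteristics arise is to count how many records share each combination of potentially identifying variables. The sketch below is illustrative only: the records, the variable names and the threshold of 2 are assumptions made for the example, not values prescribed by this guide.

```python
from collections import Counter

# Hypothetical unit records; variable names and values are illustrative only.
records = [
    {"age_group": "15-24", "marital_status": "widowed", "state": "NSW"},
    {"age_group": "75-84", "marital_status": "widowed", "state": "NSW"},
    {"age_group": "75-84", "marital_status": "widowed", "state": "NSW"},
    {"age_group": "25-34", "marital_status": "married", "state": "VIC"},
    {"age_group": "25-34", "marital_status": "married", "state": "VIC"},
]


def combination_counts(rows, keys):
    """Count how many records share each combination of the given variables."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def risky_records(rows, keys, threshold=2):
    """Return records whose combination occurs fewer than `threshold` times."""
    counts = combination_counts(rows, keys)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] < threshold]


keys = ["age_group", "marital_status", "state"]
flagged = risky_records(records, keys)
print(flagged)  # the single young widowed record is the only unique combination
```

Adding more variables or finer categories to `keys` can only split groups further, which is why more detailed files tend to contain more unique records.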
With aggregate data, the main risk is attribute disclosure, which may in turn increase the risk of re-identification. In addition, detailed tables carry an increased risk of disclosure through differencing attacks or mathematical techniques that undo some or all of the data protections.
Some variables may require additional treatment if they are deemed sensitive (e.g. health, ancestry or criminal information). This treatment may be dictated by legislation and policy as well as confidentiality obligations. This can be a significant balancing act as often the variables that are of most interest to researchers are also sensitive.
Even if a file contains few data items and categories, a disclosure risk may exist if the data contain a rare and remarkable characteristic (or combination of characteristics). This risk depends on how remarkable the characteristic is. For example, a widow aged 19 years is more likely to be identifiable than one aged 79 years. In addition, it is important to consider the rarity of a record from a population perspective. For example, there may be only one 79-year-old widow in a sample, but they will not be unique in the entire population. The sampling process contributes significantly to protecting the confidentiality of that individual (a user is unlikely to know which 79-year-old widow was selected). It is advisable, however, to protect that single individual in any subsequent outputs that may be publicly released.
Interestingly, data accuracy can increase the risk of disclosure. This is not to suggest that deliberately producing data with low accuracy is a way to manage this risk, but data custodians should be aware that datasets subject to reporting errors or containing out-of-date information may present a lower disclosure risk.
As a general rule, accessing data from 10 years ago (e.g. area of residence or marital status) is less likely to enable re-identification of an individual or organisation than accessing up-to-date information.
Data coverage (completeness)
Individuals or organisations are more easily identifiable if they are known to be in the dataset. Datasets that cover the complete population therefore increase the risk of disclosure because a user knows that all individuals will be represented somewhere in the dataset. This risk applies to administrative data and population censuses. For surveys, however, the sampling process provides some protection against re-identification (especially if the specific sample selection methodology is not made public).
Data based on surveys or samples taken from a population generally carry a lower disclosure risk than full population datasets. This is because it is inherently uncertain whether a record belongs to a particular individual or organisation. The risk is not reduced to zero, however, particularly for records with rare characteristics, which may be re-identifiable in a sample as well as in the population.
In some cases, how the dataset is structured will increase the disclosure risk. Longitudinal datasets (those where individuals are tracked over time, as opposed to datasets that are time-based snapshots of different samples of the population) have significant disclosure issues. Individuals or organisations whose characteristics change over time are much more likely to be re-identified than those whose characteristics don’t (and in reality very few individuals or organisations don’t change characteristics over time). For example, a business that has a relatively constant income over five years but then triples its income for the next three years is more likely to be re-identified than a business with a constant income over the same time frame.
Another structural aspect of datasets is their hierarchical nature. This is where datasets have information at more than one level (e.g. a person level as well as a family level). The disclosure risk here is that information may be non-disclosive at one level but disclosive at a higher level. For example, a count of people with household income of $801-$1,000 per week may be 6; however, the 6 may refer to a single household (2 parents and 4 children), which effectively discloses information about all the people in the household.
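The household example can be checked mechanically by counting, for each cell, both the number of people and the number of distinct higher-level units that contribute to it. The following is a minimal sketch under assumed data: the person and household identifiers and the second income band are hypothetical.

```python
from collections import defaultdict

# Hypothetical person-level rows: (person_id, household_id, household_income_band)
persons = [
    (1, "H1", "$801-$1,000"),
    (2, "H1", "$801-$1,000"),
    (3, "H1", "$801-$1,000"),
    (4, "H1", "$801-$1,000"),
    (5, "H1", "$801-$1,000"),
    (6, "H1", "$801-$1,000"),
    (7, "H2", "$1,001-$1,250"),
    (8, "H3", "$1,001-$1,250"),
    (9, "H3", "$1,001-$1,250"),
]


def cell_summary(rows):
    """For each income band, return (number of people, number of households)."""
    people = defaultdict(int)
    households = defaultdict(set)
    for _person_id, household_id, band in rows:
        people[band] += 1
        households[band].add(household_id)
    return {band: (people[band], len(households[band])) for band in people}


summary = cell_summary(persons)
for band, (n_people, n_households) in summary.items():
    print(band, n_people, n_households)
```

A cell that looks safe at the person level (6 people) but rests on a single household is the situation described above, so the household count is the figure to examine.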
Software for assessing disclosure risks
Various software packages can help data custodians assess, detect and treat disclosure risks in microdata. These include:
MICRODATA TREATMENT METHODS
Once a re-identification or disclosure risk is known, it can be addressed through a number of data modification and reduction techniques. These techniques should be applied to only those records or variables judged to be a risk—a judgement that should consider the specific release context. For example, detailed microdata accessed in a context where people, projects, settings and outputs are controlled (see the Five Safes Framework) may require few of these treatments, whereas a publicly available file may require all of them.
Usually the minimum level of protection for any microdata to be used for statistical purposes is removal of direct identifiers such as name and address. Depending on the legal obligations of data custodians and controls on access (e.g. user authorisation, project assessment, security of access environment, output checking), the removal of direct identifiers alone may be sufficient to protect confidentiality. Further disclosure controls may be required depending on the data release context, especially for public access of open data. Data custodians must carefully assess the microdata to identify records posing a disclosure risk and treat them to prevent re-identification.
Techniques for treating microdata
To limit disclosure risk in microdata, data custodians can do the following:
Limit the number of variables
This means reducing the number of variables in the dataset by removing, for example, geographic variables.
Modify cell values
This can be done through rounding or perturbation. The amount of rounding should be relative to the magnitude of the original value. For example, rounding could vary from $1,000 (for personal income) to $1 million (for business income). Perturbation in the context of microdata means adding 'noise' to the values for individual records. For example, someone’s true income of $1,000,000 might be perturbed to $1,237,000. In order to maintain totals, that $237,000 could be removed from one or more other records.
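Both ideas can be illustrated in a few lines. The sketch below assumes whole-dollar values; the function names, the rounding bases and the way the offsetting amount is spread evenly across the other records are choices made for the example, not a prescribed method. In practice the noise would normally be drawn at random rather than fixed.

```python
def round_to_base(value, base):
    """Round a whole-dollar value to the nearest multiple of `base` (halves round up)."""
    return base * ((value + base // 2) // base)


def perturb_preserving_total(values, index, noise):
    """Add `noise` to one record and remove the same amount from the other records.

    The offset is spread as evenly as possible, so the overall total is unchanged.
    """
    others = [i for i in range(len(values)) if i != index]
    result = list(values)
    result[index] += noise
    share, remainder = divmod(noise, len(others))
    for position, i in enumerate(others):
        result[i] -= share + (1 if position < remainder else 0)
    return result


# Rounding relative to magnitude: personal income to $1,000, business income to $1 million.
personal = round_to_base(52_499, 1_000)
business = round_to_base(3_400_000, 1_000_000)

# Perturbation: a true income of $1,000,000 becomes $1,237,000,
# and the $237,000 is removed from the other records.
incomes = [1_000_000, 400_000, 650_000, 820_000]
perturbed = perturb_preserving_total(incomes, 0, 237_000)
print(personal, business, perturbed)
```

Spreading the offset over several records keeps each individual adjustment small; taking it from a single record would also maintain the total but would distort that record more heavily.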
Combine categories that are likely to enable re-identification. Examples include
Wherever possible, start by combining categories containing a small number of records so that the identities of individuals in those groups remain protected (e.g. combine use of electric wheelchairs and use of manual wheelchairs). For more information on combining categories see Part 4: Managing the risk of disclosure: Treating aggregate data.
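Combining categories amounts to recoding detailed values to a broader label. A minimal sketch using the wheelchair example follows; the data values and the combined label "wheelchair" are hypothetical.

```python
from collections import Counter

# Hypothetical responses to a mobility-aid question.
mobility_aid = [
    "electric wheelchair",
    "manual wheelchair", "manual wheelchair",
    "walking frame", "walking frame", "walking frame",
    "none", "none", "none", "none",
]

# Assumed recode map: both wheelchair types collapse into one broader category.
COMBINE = {
    "electric wheelchair": "wheelchair",
    "manual wheelchair": "wheelchair",
}


def combine_categories(values, mapping):
    """Recode values using `mapping`; values not in the mapping are kept as-is."""
    return [mapping.get(value, value) for value in values]


before = Counter(mobility_aid)
after = Counter(combine_categories(mobility_aid, COMBINE))
print(before)
print(after)
```

Comparing the counts before and after shows whether the combined category is large enough, or whether further combining is needed.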
Apply top/bottom coding
Collapse top or bottom categories containing small populations. For example, survey respondents aged over 85 could be coded to an 85-and-over category (as opposed to having individual categories for 85–89, 90–94 and 95–100 years).
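Top coding can be expressed as a small recoding rule. The sketch below assumes five-year age bands and a cap of 85, following the example above; the function name and label format are illustrative.

```python
def top_code_age(age, cap=85):
    """Return a five-year age band, collapsing all ages at or above `cap`."""
    if age >= cap:
        return f"{cap} and over"
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


ages = [67, 84, 85, 92, 101]
coded = [top_code_age(age) for age in ages]
print(coded)
```

Bottom coding works the same way in the other direction, collapsing all values at or below a chosen floor into a single category.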
Apply data swapping
To hide a record that may be identifiable by its unique characteristics, swap it for another record where some other characteristics are shared. For example, someone in NSW who speaks an uncommon language could have their record moved to Victoria, where the language is more commonly spoken. This allows characteristics to be reflected in the data without the risk of re-identification.
As a consequence of applying this method, additional changes may also be required. In the previous example, after moving the record from NSW to Victoria, family-related information would also need to be adjusted so that both the original record and the records of family members remain consistent.
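A simplified sketch of swapping a single variable between two records is shown below. It assumes that a swap partner is any record matching on a chosen set of other variables; the records, variable names and matching rule are hypothetical, and the follow-up adjustments to family members' records described above are not handled.

```python
def swap_variable(records, target_index, swap_var, match_vars):
    """Swap `swap_var` between the target record and a matching partner.

    The partner is the first other record that has the same values on
    `match_vars` but a different value of `swap_var`. Returns a new list of
    records, or None if no suitable partner exists.
    """
    target = records[target_index]
    for i, candidate in enumerate(records):
        if i == target_index:
            continue
        same_profile = all(candidate[v] == target[v] for v in match_vars)
        if same_profile and candidate[swap_var] != target[swap_var]:
            swapped = [dict(record) for record in records]  # leave the input untouched
            swapped[target_index][swap_var] = candidate[swap_var]
            swapped[i][swap_var] = target[swap_var]
            return swapped
    return None


# Hypothetical records: record 1 speaks a language that is uncommon in NSW.
people = [
    {"id": 1, "state": "NSW", "language": "uncommon language", "age_group": "35-44", "sex": "F"},
    {"id": 2, "state": "VIC", "language": "English", "age_group": "35-44", "sex": "F"},
    {"id": 3, "state": "NSW", "language": "English", "age_group": "55-64", "sex": "M"},
]

swapped = swap_variable(people, 0, "state", ["age_group", "sex"])
print(swapped)
```

Because the two records exchange states, the number of records in each state is unchanged, while the uncommon language now appears in Victoria rather than NSW.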
If the above methods are insufficient, suppress particular values or remove records that cannot otherwise be protected from the risk of re-identification.
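Suppressing particular values can be sketched as blanking selected variables on the records judged to be at risk. The rule used to pick those records below (a widowed person aged under 25) is purely illustrative, as are the variable names.

```python
def suppress(records, should_suppress, variables):
    """Return copies of the records with `variables` set to None
    wherever `should_suppress(record)` is true."""
    treated = []
    for record in records:
        copy = dict(record)
        if should_suppress(record):
            for variable in variables:
                copy[variable] = None
        treated.append(copy)
    return treated


# Hypothetical records and risk rule.
respondents = [
    {"age": 19, "marital_status": "widowed", "occupation": "student"},
    {"age": 79, "marital_status": "widowed", "occupation": "retired"},
]


def is_at_risk(record):
    return record["age"] < 25 and record["marital_status"] == "widowed"


treated = suppress(respondents, is_at_risk, ["marital_status"])
print(treated)
```

Removing a record entirely is the stronger form of the same treatment: the at-risk records would be filtered out of the list rather than having individual values blanked.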
Understand the relationship between microdata and aggregate data
To ensure that hierarchical data cannot be used to identify higher level contributors (i.e. in aggregate data), the above methods may need to be applied to a greater degree. Alternatively, removing variables or other information relating to the higher levels may be effective. For more information, see Part 4: Managing the risk of disclosure: Treating aggregate data.