Learn about the Five Safes framework, confidentiality techniques and confidentialising your own data
Re-identification in aggregate data and microdata, managing re-identification risk
What is re-identification?
Re-identification occurs when the identity of a person or organisation is determined even though directly identifying information has been removed. This may be possible using other publicly or privately held information about the individual or organisation. This type of disclosure, or breach of confidentiality, can occur when someone has access to either aggregate data (such as tables) or microdata (unit record data). This section considers the risk of re-identification alongside the other main disclosure risk, attribute disclosure. The risk of re-identifying an individual is likely to increase if an attribute about them is revealed, for example a particular level of income common to a group of 15-18 year olds.
It is important that data, and especially unit record data, can be accessed securely and used effectively for research and policy making. Providing data in open environments is an important part of the Australian Government Public Data Policy Statement. However, open data may not always be the most appropriate manner for providing data for research, particularly when the requirements for utility conflict with confidentiality.
For datasets that cannot be made open and accessible, strategies to manage confidentiality and disclosure risks when providing access to the data should consider:
- how the dataset could be used to re-identify an individual or organisation
- whether information available elsewhere could be combined with the dataset to re-identify a person or organisation
Data sources and methods
Data sources and analytical methods may also increase disclosure risk, and this risk needs to be carefully managed.
Administrative data
- contains direct identifiers such as name, address and Tax File Number that allow an agency to identify the people accessing a government service or program
- is usually collected from everyone who accesses a service and may cover a large proportion of the population
- even when the directly identifying information is removed, people are still at higher risk of being re-identified from other information held about them when they are known to be in a dataset, or when the dataset is large
Integrated data
- multiple information sources about people and organisations can be combined (integrated), forming rich and deep repositories of information and presenting opportunities for detailed analysis
- carries re-identification risks similar to administrative datasets (the larger range of information for each record may increase the risk of re-identification)
Privately held information
- businesses collect customer information through registration processes and reward schemes, holding databases containing detailed information on user characteristics and behaviour
- knowledge of these characteristics may be combined with information in a released dataset to re-identify an individual or business
Publicly available information
- many people are willing to share their private information for social purposes, with vast and increasing amounts of personal information available online
- publicly available information may be combined with information in a released dataset to re-identify an individual or business
Big Data analytics
- while new technologies make it possible to produce and store vast amounts of transactional data, advanced techniques also enable Big Data to be summarised, analysed and presented in new ways
- computer systems are increasingly able to draw together disparate data to discover patterns and trends
- research is being conducted into how new technologies can also create modern data treatment processes that match the scale of Big Data and balance the dual goals of privacy protection and analytical utility
Data custodians have ethical and legal responsibilities to actively manage the re-identification risks of their data collections.
Managing the risk of re-identification
Re-identification may occur through a deliberate attack (where a user consciously tries to determine the identity of an individual or organisation) or it may occur spontaneously (where a user inadvertently thinks they recognise an individual or organisation without a deliberate attempt to identify them). As the amount of data collected and released by government increases and technologies advance, re-identification risk management should be an iterative process of assessment and evaluation.
Two broad complementary approaches exist for managing re-identification risks:
- control the context of the data release - important when managing re-identification risks as it allows for more detailed data to be made available to approved researchers in a safe manner
- treat the data - decisions about the level of data treatment required can only be made after determining the release context
The release context includes:
- the audience who will have access to the data
- the purpose for which the data will be used
- the release environment
The level of data treatment appropriate for authorised access in a controlled environment is unlikely to be sufficient for open and unrestricted public access. It should also be noted that if one or more aspects of the context changes, a reassessment of the disclosure risks should be performed in order to ensure data subjects remain unlikely to be re-identified.
Re-identification in aggregate data
There can be a risk of disclosure even though data is aggregated (grouped into categories or with combined values). This is because publicly or privately held information may be used to identify one or more contributors to a cell in a table.
Established techniques such as cell suppression and data perturbation exist to protect the confidentiality of aggregate (or tabular) data and prevent re-identification. However, with the increased volume of aggregate data available through electronic channels (such as machine-to-machine distribution) and at finer levels of geography, the risk of re-identification is increased and poses challenges for data custodians.
Although commonly used by many agencies, the application of cell suppression may be insufficient to prevent re-identification. As a response to this challenge, the ABS' TableBuilder service applies a perturbation algorithm to automatically protect privacy in user-specified tables. This perturbation algorithm leads to some loss of utility, but maintains a very high level of confidentiality.
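To illustrate the idea behind perturbation, here is a deliberately toy sketch (this is not the TableBuilder algorithm, and the cell values are hypothetical): each cell count is shifted by a small random amount before release, so exact counts can no longer be read off the table.

```python
import random

def perturb_counts(cells, max_shift=2, seed=0):
    """Toy perturbation for illustration only: shift each cell count by a
    small random amount, clamping at zero. A production algorithm (such as
    the one used by TableBuilder) must also keep the noise consistent, so
    the same cell is perturbed the same way in every query, preventing
    averaging attacks."""
    rng = random.Random(seed)
    return {key: max(0, count + rng.randint(-max_shift, max_shift))
            for key, count in cells.items()}

# Hypothetical (age group, income band) counts.
cells = {("15-19", "Low"): 12, ("20-24", "Low"): 3, ("25-29", "Low"): 47}
print(perturb_counts(cells))
```

The small shifts cost some utility (published counts may no longer sum exactly to their totals), which is the trade-off noted above.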
Re-identification in microdata
Preventing re-identification of people or organisations in microdata requires one or both of:
- controlling the context:
- the manner in which data is released (on a continuum ranging from open data to highly controlled situations such as access in a locked room)
- treating the data:
- at a minimum, removing direct identifiers such as name and address
- in most cases applying further statistical treatment depending on the release context.
- for open data, appropriate and sufficient data treatment eliminates the need to control the context (but this is at the expense of data utility)
The following factors should be considered when deciding whether and under what contextual controls data will be released.
Users looking at a dataset are likely to possess private information about individuals or organisations represented in the dataset (such as a neighbour or family member). In these cases, the private information could enable them to re-identify someone in the dataset.
Strategies to manage this risk include:
- releasing a sample, rather than the entire dataset
- providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation
Users may draw on publicly available information (such as a well-known person or business) when examining a dataset. For example, if a dataset containing information on businesses with very high turnover is released (even to a restricted group of researchers) the researchers may be able to re-identify large public companies that hold monopolies in certain industries.
Strategies to manage this risk include:
- releasing a sample, rather than the entire dataset
- providing access to authorised users only who give a binding undertaking not to re-identify any individual or organisation
- modifying the data to mask high-profile publicly-known individuals or organisations
List matching refers to a user linking records in a dataset with information from other datasets. This is done by either matching common identifiers or characteristics that are common to both datasets. There is a potentially increased risk of re-identification simply because the combined data increases the amount of detail available for each unit record (a person or organisation).
Strategies to manage this risk include:
- using secure data facilities to control which datasets are available to authorised researchers at any one time
- extracting subsets of the microdata to provide users with only the data they require
- using unique randomised record identifiers for each published dataset
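A minimal sketch of how list matching leads to re-identification, using entirely hypothetical records and quasi-identifiers: if an outside list with names matches exactly one record in a "de-identified" release on shared characteristics, that record is re-identified.

```python
# All names, values and quasi-identifiers here are hypothetical.
released = [  # a "de-identified" release: direct identifiers removed
    {"record": "r1", "age": 34, "postcode": "2600", "income": 91000},
    {"record": "r2", "age": 34, "postcode": "2617", "income": 55000},
    {"record": "r3", "age": 71, "postcode": "2600", "income": 40000},
]
outside = [  # an outside dataset that still carries names
    {"name": "Alex Example", "age": 34, "postcode": "2600"},
]

def list_match(released, outside, keys=("age", "postcode")):
    """Return (name, record) pairs where an outside record matches exactly
    one released record on the shared characteristics -- a successful
    re-identification that reveals the released record's other attributes."""
    hits = []
    for person in outside:
        matches = [r for r in released
                   if all(r[k] == person[k] for k in keys)]
        if len(matches) == 1:
            hits.append((person["name"], matches[0]))
    return hits

print(list_match(released, outside))
```

Here age and postcode together single out one record, so the matched person's income is disclosed; releasing only a sample, or randomising record identifiers per release, makes such unique matches far less certain.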
The process of matching characteristics that are common to datasets for linking purposes is undertaken legitimately as part of securely managed data integration processes. The ABS, Australian Institute for Health and Welfare (AIHW) and the Australian Institute of Family Studies (AIFS) are formally accredited Commonwealth Data Integrating Authorities. Bringing data together in this way is an important method of extending and enhancing research. Accredited Data Integrating Authorities have procedures and controls in place in order to perform this linking function safely.
Five Safes framework
Using safe people, projects, settings, data and output to balance disclosure risk and utility, ABS Five Safes examples
Balancing disclosure risk and data utility
A key challenge for data custodians is to provide data with maximum utility for users but still maintain the confidentiality of the information. Every data release carries some risk of disclosure, so the benefits of each release (its utility or usefulness for research and statistical purposes) must substantially outweigh its risks and be clearly understood. This balancing of risk and utility is something everyone does on a daily basis (for example, when they choose to drive a car). Similarly, data custodians need to approach disclosure risk by managing it, rather than trying to eliminate it.
Confidentiality is breached when a person, group or an organisation is re-identified through a data release or when information can be attributed to them. The likelihood of this happening, or risk of disclosure, is not easily determined. Implicit in this is that the consequences of disclosure are always damaging (to some extent) to the individual or organisation. It is difficult to ascertain the degree of damage, mostly because people differ in the importance they place on information. What may be considered highly confidential to one person is of no consequence to another. The ABS assumes all information it collects to be potentially sensitive and manages it accordingly.
Managing disclosure risk becomes a question of assessing not only the data itself, but also the context in which the data is released. Once the context is clearly understood, it is much easier to determine how to protect against the threat of disclosure. The Five Safes framework provides a structure for assessing and managing disclosure risk that is appropriate to the intended data use.
This framework has been adopted by the ABS and several other Australian government agencies, as well as national statistical organisations such as the Office for National Statistics (UK) and Statistics New Zealand.
Five Safes framework
The Five Safes framework takes a multi-dimensional approach to managing disclosure risk. Each safe refers to an independent but related aspect of disclosure risk. The framework poses specific questions to help assess and describe each risk aspect (or safe) in a qualitative way. This allows data custodians to place appropriate controls, not just on the data itself, but on the manner in which data is accessed. The framework is designed to facilitate safe data release and prevent over-regulation.
The five elements of the framework are:
- safe people
- safe projects
- safe settings
- safe data
- safe outputs
Safe people
Is the researcher appropriately authorised to access and use the data?
By placing controls on the way data is accessed, the data custodian invests some responsibility in the researcher for preventing re-identification. Usually, as the detail in the data increases, so should the level of user authorisation required.
Prerequisites for user authorisation usually include:
- training in confidentiality and the conditions of data use
- signing a legally binding undertaking to maintain data confidentiality
By definition, a safe people assessment would not be required for open data (data that is released into the public domain with no restriction on use).
Safe projects
Is the data to be used for an appropriate purpose?
Users wanting to access detailed microdata should be expected to explain the purpose of their project. For example, in order to access detailed microdata in the ABS DataLab, users must demonstrate to the ABS that their project has a statistical purpose and show it has:
- a valid research aim
- a public benefit
- no capacity to be used for compliance or regulatory purposes
As with safe people, the need for a safe project assessment will depend on the context in which the data is accessed. It would not be required for open data.
Safe settings
Does the access environment prevent unauthorised use?
The environment here can be considered in terms of both the IT and the physical environment. In data access contexts such as open data, safe settings are not required. At the other end of the spectrum however, sensitive data should only be accessed via secure research centres.
Secure research centres may have features such as:
- a locked room requiring personal authentication
- IT monitoring equipment
- auditing and other supervision
Safe settings ensure that data access and use is occurring in a transparent way.
Safe data
Has appropriate and sufficient protection been applied to the data?
At a minimum the removal of direct identifiers (such as name and address) must be applied to data before it is released. Further statistical disclosure controls should also be applied, depending on how the data will be released. Table 1 shows some of the statistical factors that should be considered when assessing disclosure risk.
| Factor | Effect on disclosure risk |
| --- | --- |
| Data age | Older data is generally less risky |
| Sample data (e.g. a survey) | Decreases risk |
| Population data (e.g. a census) | Increases risk |
| Administrative data | Increases risk |
| Longitudinal data | Increases risk |
| Hierarchical data | Increases risk |
| Sensitive data | Increases risk (sensitive data may be a more attractive target) |
| Data quality | Poor quality data may offer some protection |
| Microdata | Main risk: re-identification |
| Aggregate data | Main risks: attribute disclosure and disclosure from differencing |
| Key variables | The variables of most interest to users are usually the most disclosive |
Source: UK Anonymisation Decision-making Framework
Safe outputs
Are the statistical results non-disclosive?
This is the final check on the information before it is made public, which aims to reduce the risk of disclosure to a minimum. All data made available outside the data custodian's IT environment must be checked for disclosure. For example, in the ABS DataLab, statistical experts check all outputs for inadvertent disclosure before the data leaves the DataLab environment.
Examples from the ABS
The Five Safes framework provides a mechanism for data custodians to take necessary and reasonable steps to manage disclosure risk in their data releases. It broadens the approach to data confidentiality by considering not just the treatment of data, but also the manner and context in which data is released.
The safes are assessed independently, but also considered as a whole. They can be thought of as a series of adjustable levers or controls to effectively manage risk and maximise the usefulness of a data release. The degree to which each safe is controlled is critical to assessing the disclosure risk. Tightly controlling all five will be counterproductive because the restrictions applied will not produce a corresponding benefit (useful data).
In practice, the safe data part of the Five Safes should be addressed after the other four are considered. This is because the degree of data treatment required will become evident once it is clear who will be able to access the data, under what conditions, in what circumstances and how the resulting data will be protected in order to be made public. The process is likely to be iterative, as data treatment with a view to maintaining utility may necessitate reassessing one or more of the other four safes.
This table describes how the ABS applies the Five Safes framework to three different data access channels - open data, basic and detailed microdata files.
|  | Website or publication table (open data) | Basic microdata file (via direct download) | Detailed microdata file (via ABS DataLab) |
| --- | --- | --- | --- |
| Safe people | No control necessary. Anyone may view the data online | Users must register to use the data and sign a Declaration of Use. Breaches may be subject to sanctions and/or legal proceedings | Users must undergo training, complete an authorisation process, sign legally binding confidentiality undertakings and a compliance declaration. Breaches of protocols or disclosure of information may be subject to sanctions and/or legal proceedings |
| Safe projects | No control necessary. Anyone can use the data for their own purposes | Users sign a declaration regarding the purpose for which they will use the data | Users must detail the purpose for which they will use the data. Purpose can be compared to what is actually produced (see Safe Outputs) |
| Safe settings | No control necessary | Some control. Users are required to store the data securely and can work on the data in their own physical and IT environment | The DataLab is a secure, closed environment, accessed virtually or on-site, with secure login, auditing and monitoring capabilities. No data can be removed without first being checked by ABS staff |
| Safe data | Very high control. The data is highly aggregated | The data is treated by the ABS to ensure no individual is likely to be identified | Direct identifiers are removed and the data is further treated where appropriate. Appropriate control of the data optimises its usefulness for statistical and research purposes |
| Safe outputs | Very high control. Every table is checked for disclosure before release (in an open data context, the data is the safe output) | The output is technically controlled by the user, but the ABS provides guidelines or rules about what may be published or shared | All statistical outputs are assessed by the ABS for disclosure before being released to the user. The outputs may also be compared for consistency with the original project proposal |
In all three cases, applying any one safe in isolation is unlikely to provide an effective confidentiality solution. However, when all five safes are considered in combination, the overall disclosure risk becomes very low.
Tabular data is most effectively protected through safe data and safe outputs.
When data is loaded into the user's own environment, some of the safes can be more effectively controlled than others. The data custodian cannot directly monitor how the data is used. However, the data custodian mitigates disclosure risk by directly protecting the data. The downside of this approach is that the data can lose some of its utility. Examples of these types of datasets include:
- basic microdata files (produced by the ABS)
- public use files (PUFs)
The treatment of the microdata files in the ABS DataLab effectively uses all five safes. Safe people, safe projects, safe settings, safe data and safe outputs are all controlled to mitigate the risk of disclosure. This allows appropriately approved researchers to work securely with highly detailed microdata.
Treating aggregate data
Disclosure risks for tables, identifying risky cells, and data treatment techniques
Tables and disclosure risks
Aggregate data is usually presented as tables, though it may also be in map or graph form. The terms 'aggregate data' and 'tables' are often used interchangeably. Different types of tables carry different disclosure risks. There are additional disclosure risks if users can access multiple tables containing common elements. Users could potentially re-identify a person by differencing the outputs of two separate tables to which the person contributes.
Frequency tables
Each cell in a frequency table contains the number of contributors (such as individuals, households or organisations). Disclosures could be made from a table where one or more cells have a low count (a small number of contributors). Assessments of whether a cell's value is too low must be based on the underlying count of its contributors, not its weighted estimate.
Magnitude tables
Each cell in a magnitude table contains summarising information about the contributors to that cell. This may be in the form of a total, mean, median, mode or range. An example of a magnitude table is one reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (such as one or two businesses with extremely high turnover) dominate a cell value.
Rules to identify at-risk cells
Some common rules can help identify which cells may constitute a disclosure risk. By using these rules, a data custodian makes an implicit decision that all cells which break a rule are deemed an unacceptable disclosure risk, while all other cells are deemed non-disclosive. Each rule below provides protections against attackers trying to re-identify or disclose an attribute about a contributor. They also mitigate attacks where one contributor tries to discover information about another.
Strict applications of these rules may decrease the data's usefulness. Data custodians need to set appropriate rule values that are informed by legislation and organisational policy, disclosure risk assessment and statistical theory. The advantage of using rules is that they are simple, clear, consistent, transparent and amenable to automation. They are ideally applied in situations where:
- data is released in a similar manner (for example when the same kind of dataset is regularly released)
- transparency is important to both data custodian and users
- limited opportunity or requirement exists for engagement between data custodians and users
However, there may be situations where non-disclosive data is treated unnecessarily, or where untreated data is in fact disclosive.
Frequency rule
Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors to a cell. Application of this rule means that cells with counts below the threshold are defined as posing an unacceptable disclosure risk and need to be protected. However, when using this rule you should also consider:
- there is no strict statistical basis for choosing one threshold value over another
- higher values increase protection against re-identification but also reduce the utility of the original data
- lower threshold values may be appropriate for sampled datasets compared to population datasets
- by implication, everything greater than or equal to the threshold is defined as an acceptable disclosure risk
- it may be that a cell below the threshold value is not a disclosure risk, while a cell value greater than the threshold value is disclosive
In Table 1 a frequency rule of 4 is chosen, with any cell of less than 4 contributors considered a disclosure risk. The 25-29 year old age group has 3 contributors in the Low income cell and therefore the cell needs to be protected.
Table 1
| Age (years) | Low income | Medium income | High income | Total |
| --- | --- | --- | --- | --- |
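A minimal sketch of applying the frequency rule with a threshold of 4. The 25-29 Low income count of 3 comes from the example above; the other counts are hypothetical.

```python
def apply_frequency_rule(table, threshold=4):
    """Suppress (set to None) any cell whose unweighted contributor count
    falls below the threshold."""
    return {cell: (count if count >= threshold else None)
            for cell, count in table.items()}

# (age group, income band) -> unweighted contributor count
table = {("25-29", "Low"): 3, ("25-29", "Medium"): 12, ("25-29", "High"): 9}
print(apply_frequency_rule(table))
```

Only the cell with 3 contributors is flagged; every cell at or above the threshold passes through unchanged, which is exactly the rule's implicit acceptable/unacceptable split discussed above.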
Cell dominance rule
This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell's total value. Also called the cell concentration rule, the cell dominance rule defines the number of units that are allowed to contribute a defined percentage of the total.
This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.
The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to protect the value of any contributor from being estimated to within P% and it is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, a contributor's estimated value can be from its true value. For example, a rule of P%=20%, means that the estimated value of any contributor must differ from its true value by at least 20%.
Using the cell dominance and P% rules together
Here is an example of using the dominance and P% rules together where (n, k)=(2, 75) and P%=20%. Table 2a shows profit for Industries A-D.
Although initially the table does not appear to contain a disclosure risk, there is a risk if information about the companies contributing the data to each industry is known.
This information could include the data in Table 2b which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified. If company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T's contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.
This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. Therefore the summary data in Table 2a requires treatment.
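The two checks can be sketched in code. The figures reproduce the Industry B example (companies S at $150m and T at $93m against a $302m total); the remaining contributors are aggregated here into a single $59m value, which is an assumption made only so the total matches.

```python
def breaks_dominance(contributions, n=2, k=75.0):
    """(n, k) dominance rule: the cell is at risk if the largest n
    contributors together exceed k% of the cell total."""
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / sum(contributions) > k

def breaks_p_percent(contributions, p=20.0):
    """P% rule: the cell is at risk if the largest contributor, by
    subtracting its own value from the published total, can estimate the
    second-largest contributor to within p% of its true value."""
    ordered = sorted(contributions, reverse=True)
    largest, second = ordered[0], ordered[1]
    estimate = sum(contributions) - largest
    return 100.0 * abs(estimate - second) / second < p

# Industry B (Table 2b): S ($150m), T ($93m), all others combined ($59m).
industry_b = [150, 93, 59]
print(breaks_dominance(industry_b))  # True: top two contribute 80.5% > 75%
print(breaks_p_percent(industry_b))  # False: S's estimate of T is 63% out
```

Run together, the dominance rule flags the cell while the P% rule alone does not, mirroring the conclusion that a combination of approaches may be needed.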
Tabular data treatment techniques
Due to the diverse nature of data there is no one solution for managing all re-identification risks through the treatment of aggregate data. However, the key is to treat only the cells that have been assessed as posing an unacceptable disclosure risk.
The most common techniques are:
- data reduction, which decreases the detail available to the user
- data modification, which makes small changes to the data
How effective these techniques are depends on:
- the structure of the dataset
- the requirements of the data users
- legislative or policy requirements
- the available infrastructure to treat and disseminate the data
When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from the data. While it won't be possible to guarantee confidentiality, this effort must satisfy legislative requirements. For example, ABS legislation states that data must not be released in a manner that is likely to enable the identification of a particular person or organisation. This is important because once a table is made public, there are no further opportunities to control how the data will be used or to apply other confidentiality controls using the Five Safes framework.
Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive (Table 3).
It is recommended that data custodians start by using simple techniques (data reduction) and only proceed to more complex ones (data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder's perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables. Though potentially time-consuming, these methods can be used to allow data custodians to release data that would otherwise remain inaccessible.
Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.
Data reduction involves:
- combining variable categories
- suppressing counts with a small number of contributors (as per the frequency rule), and considering suppression of higher counts for sensitive items
- suppressing cells with dominant contributors (as per the cell dominance rule)
This approach can be applied to Table 4a, where the value 3 does not meet a frequency threshold of 4. This cell can be protected in two ways:
- combine the 20-24 and the 25-29 year old age groups to create a 20-29 year old range (Table 4b)
- combine the Low and Medium income categories to create a single Low–medium category (Table 4c)
Table 4a
| Age (years) | Low income | Medium income | High income | Total |
| --- | --- | --- | --- | --- |

Table 4b
| Age (years) | Low income | Medium income | High income | Total |
| --- | --- | --- | --- | --- |

Table 4c
| Age (years) | Low - medium income | High income | Total |
| --- | --- | --- | --- |
The choice of which categories to combine depends on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the 'Low' income range, collapsing it into a 'Low-Medium' range prevents this. There is no one answer when choosing which categories to combine: in either case, the data's utility may be affected.
Combining categories is often appropriate, but doesn't work in all situations. For example, if another table were produced containing the 20-24 year old row, the 25-29 year old values could be determined by subtracting the 20-24 row (in the new table) from the 20-29 row (in Table 4b).
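This differencing attack can be sketched directly. All counts except the at-risk value of 3 are hypothetical; the point is only that row-by-row subtraction recovers the combined-away detail.

```python
# Table 4b publishes a combined 20-29 row; suppose another release
# publishes the 20-24 age group on its own (hypothetical counts).
combined_20_29 = {"Low": 9, "Medium": 14, "High": 6}
published_20_24 = {"Low": 6, "Medium": 8, "High": 2}

# Subtracting the 20-24 row from the 20-29 row recovers the protected
# 25-29 values, including the at-risk Low income count of 3.
recovered_25_29 = {income: combined_20_29[income] - published_20_24[income]
                   for income in combined_20_29}
print(recovered_25_29)  # {'Low': 3, 'Medium': 6, 'High': 4}
```

This is why custodians must track past, simultaneous and likely future releases that share contributors, as listed below.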
This can be a complex process. Data custodians must carefully consider any other tables that include the same contributors, that:
- have already been released
- are being released at the same time
- are likely to be released in the future
Suppression involves removing cells considered to be a disclosure risk from a table.
In Table 4a, for example, the 3 could be replaced with not provided or np. This is called primary suppression. Sometimes secondary, or consequential, suppression is also required. For example, in addition to suppressing the 3 cell, other cells need to be suppressed to prevent the primary suppressed cell from being calculated. The data in Table 4a could be treated by:
- suppressing cells in other rows and columns as well (consequential suppression, Table 5a)
- suppressing cells in the totals (Table 5b)
Other cell suppression combinations can be employed. The key is to ensure that the suppressed cells cannot be derived from the remaining information.
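A minimal sketch of why consequential suppression is needed (row values are hypothetical): a lone primary-suppressed cell in a row with a published total is trivially recoverable.

```python
def recoverable(row, total):
    """Return the value of a suppressed cell (None) if it can be derived
    from the published row total; return None if it cannot be derived."""
    suppressed = [v for v in row.values() if v is None]
    if len(suppressed) != 1:
        return None
    return total - sum(v for v in row.values() if v is not None)

# Primary suppression of the Low cell alone fails: the total gives it away.
row = {"Low": None, "Medium": 12, "High": 9}
print(recoverable(row, total=24))   # 3

# Consequentially suppressing a second cell blocks this simple derivation.
row2 = {"Low": None, "Medium": None, "High": 9}
print(recoverable(row2, total=24))  # None
```

Note this only checks single-row arithmetic; as the linear algebra example further below shows, combinations of rows and columns can still leak values, which is why optimal suppression patterns are an active research and software topic.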
Table 5a
| Age (years) | Low income | Medium income | High income | Total |
| --- | --- | --- | --- | --- |

Table 5b
| Age (years) | Low income | Medium income | High income | Total |
| --- | --- | --- | --- | --- |
It is not usually recommended to suppress cells that contain a zero. For example, if the two zeros in Table 5a were suppressed, it would still be apparent from the row total and the Low income 15-19 year old cell that they are zeros. In addition, a value may be zero by definition (often called a structural zero, for example the number of pregnant males in a population).
There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.
Limitations of data reduction
As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if tables 5a and 5b were both publicly available, the suppressed 'Medium' income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.
In some cases, the results of data suppression may appear safer than they truly are. Tables 6a–c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. While the data in Table 6b appears safe, someone could assign a variable to each of the 'np's to create Table 6c.
With variables assigned to Table 6c, the values in rows 1-2 and columns 2-3 can be used to generate the following equation:
(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28
The variables b, c, d and e cancel out in this equation to give:
a – 23 = –22
Therefore, a = 1
This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear programming can be used to calculate or make reasonable estimates of the missing values in tables. In fact, the larger the table (or set of related tables available), the more accurately the results can be estimated. As computing power increases and more data is released, these sorts of attacks on data become easier.
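This recovery can be checked mechanically. The sketch below uses only the published margins and visible cells from Table 6b, and reproduces the cancellation above:

```python
# Published marginal totals from Table 6b
row1_total, row2_total = 11, 18   # the rows containing a, b, c and d, e
col2_total, col3_total = 23, 28   # the columns containing b, d and c, e

# Visible (unsuppressed) cells from Table 6b
visible = {"r1c4": 5, "r2c1": 6, "r2c4": 7,
           "c2_rest": 7 + 11, "c3_rest": 8 + 15}

# (a+b+c+5) + (6+d+e+7) - (b+d+7+11) - (c+e+8+15) = 11 + 18 - 23 - 28
# b, c, d and e cancel, leaving a alone:
a = (row1_total + row2_total - col2_total - col3_total
     - visible["r1c4"] - visible["r2c1"] - visible["r2c4"]
     + visible["c2_rest"] + visible["c3_rest"])
print(a)  # 1 — the suppressed value is recovered exactly
```

For larger tables the same idea generalises to solving a system of linear equations over the suppressed cells.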
Problems with suppression:
- The usefulness of the data is reduced (for example, Table 5a has lost one third of its data). There are cells that aren't disclosive, but that have been suppressed nonetheless.
- It can be difficult and time consuming to select the best cells for secondary suppression for large tables especially. Software packages are available that optimise the suppression pattern in the table or set of tables.
- The data is no longer machine readable because the table now includes symbols (such as '>') or letters (such as 'np').
Although it can be relatively simple to suppress cells or combine categories, data custodians must be confident their outputs are not disclosive.
In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table - not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table's overall usefulness for most purposes.
The two methods discussed below are:
- rounding
- perturbation (global or targeted)
The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7b shows how original data values (Table 7a) would look with its values rounded to base 3.
Using this technique, the data is still numerical (containing no symbols or letters) which is a practical advantage for users requiring machine readability.
Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:
- Users won't know whether the rounded value of '3' in Table 7 is actually a 2, 3 or 4.
- Users won't know whether the zeros are true zeros. This mitigates the problem of group disclosure: the original values showed that all 15-19 year olds were on a low income.
- Even if the true grand total or marginal totals were known from other sources, the user is still unable to calculate the true values of the internal cells.
These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15-19 year olds in Low Income was 14 in Table 7, then the rounded counts for this row would still be 15, 0, 0 (total = 15). Now, if the true total was somehow known from other sources to be 14, then the true value of the low income category must also be 14 (it can't be 15 or 16 since this would necessitate a negative number in one or both of the other two income categories).
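Rounding to a base can be sketched as follows. This is a minimal nearest-multiple version; in practice random or controlled rounding may be used instead:

```python
def round_to_base(value, base=3):
    """Round a count to the nearest multiple of `base`.
    Ties follow Python's round-half-to-even behaviour."""
    return base * round(value / base)

# The 15-19 year old row from the example above, with a true low income count of 14
row = [14, 0, 0]
print([round_to_base(v) for v in row])  # [15, 0, 0]
```

Note that the rounded row no longer sums to its own rounded total in general, which is the inconsistency discussed below.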
The main disadvantage to rounding is that there can be inconsistency within the table. For example in Table 7b, the internal cells of the 25-29 year-olds row sum to 24, whereas the total for that row is 21. Although controlled rounding can be used to ensure additivity within the table (where the totals add up), it may not provide consistency across the same cells in different tables.
Graduated rounding can also be used to round magnitude tables, which means the rounding base varies by the size of the cell. Table 8 shows how data could be protected by rounding the original values to base 100 (Industry A, B, C and Total) or 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.
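Graduated rounding can be sketched in the same way. The size threshold below is an illustrative assumption, not an official scheme:

```python
def graduated_round(value):
    """Graduated rounding: the rounding base grows with the size of the value.
    The $500m threshold here is illustrative only."""
    base = 100 if abs(value) >= 500 else 10
    return base * round(value / base)

# e.g. a $1,234m industry total rounds to base 100, a $47m one to base 10
print(graduated_round(1234), graduated_round(47))  # 1200 50
```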
| Industry | Original values: Profit ($m) | Rounded values: Profit ($m) |
Perturbation is another data modification method. This is where a change (often with a random component) is made to some or all non-zero cells in a table.
- For count data (frequency tables), a randomised number is added to the original values. This is called additive perturbation.
- For magnitude tables, the original values are multiplied by a randomised number. This is called multiplicative perturbation.
For both table types, this can be further broken down into targeted or global approaches.
Targeted perturbation is the approach taken when only those cells that are considered to be a disclosure risk are treated. Often this requires the application of secondary perturbation in order to maintain additivity within a table.
This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to each of Industries A and C.
| Industry | Original values: Profit ($m) | Perturbed values: Profit ($m) |
There are two key advantages of this approach:
- the total does not change (this is an important feature when the ABS releases economic data which then feed into National Accounts)
- generally, there is minimal loss of information
A disadvantage is that some data that is not disclosive per se is altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to take the $50m removed from Industry B and place it in a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), significant manual effort is often required, especially given that all other tables produced need to have matching perturbed values.
This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data is not economically sensitive, then a global perturbation approach may be more appropriate, as it can be automated. An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table. The marginal totals may not be the same as the sum of their constituent cells. This methodology is applied in TableBuilder, an ABS product used for safely releasing both Census and survey data.
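One way to achieve that between-table consistency is to derive each cell's noise deterministically from a key describing the cell, so the same underlying cell always receives the same adjustment wherever it appears. The sketch below illustrates the idea only; it is not the actual TableBuilder algorithm, and the key format and noise range are assumptions:

```python
import hashlib

def perturb_cell(count: int, cell_key: str, max_noise: int = 2) -> int:
    """Additive perturbation with a deterministic, keyed noise draw.
    Illustrative sketch only, not the ABS TableBuilder method."""
    if count == 0:
        return 0                                          # zeros left untouched
    digest = hashlib.sha256(cell_key.encode()).digest()
    noise = digest[0] % (2 * max_noise + 1) - max_noise   # pseudo-uniform in [-2, 2]
    return max(0, count + noise)

# The same cell key always yields the same perturbed value (between-table consistency)
assert perturb_cell(7, "age=15-19|income=low") == perturb_cell(7, "age=15-19|income=low")
```

Because each cell (including totals) is perturbed independently, marginal totals need not equal the sum of their constituent cells, exactly as described above.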
Tables 10a and 10b show how data might look before and after additive perturbation is applied.
Here the user has no chance of determining that the count of low income 15-19 year olds is '1'. They can still make reasonable estimates of the true value, but they are unable to confirm their guesses. Because perturbation applies only small changes, and applies them to every cell, the results are unbiased and, for most purposes, the overall value of the table is retained.
| Age (years) | Low | Medium | High | Very High | Total |
| Age (years) | Low | Medium | High | Very High | Total |
Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company's income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total.
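A minimal sketch of multiplicative perturbation follows. The percentage bound and contributor values are illustrative, and this version perturbs every contributor, whereas in practice treatment may focus on the largest:

```python
import random

def multiplicative_perturb(values, max_pct=0.10, seed=42):
    """Multiply each contributor's value by a random factor in
    [1 - max_pct, 1 + max_pct]; the sum of the perturbed values
    becomes the new published cell total. Illustrative parameters."""
    rng = random.Random(seed)
    perturbed = [v * (1 + rng.uniform(-max_pct, max_pct)) for v in values]
    return perturbed, sum(perturbed)

contributors = [510.0, 480.0, 35.0]      # profits ($m), values illustrative
_, published_total = multiplicative_perturb(contributors)
```

An attacker seeing only the published total cannot tell how far it sits from the true total, nor how much any single contributor was changed.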
Table 11 shows how data might look before and after multiplicative perturbation is applied.
The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker does not know the extent to which each contributor's profit has been perturbed or therefore how close the perturbed total is to the true total.
| Company | Original values: Profit ($m) | Perturbed values: Profit ($m) |
Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but there is robust protection of its individual contributors. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.
| Industry | Original values: Profit ($m) | Perturbed values: Profit ($m) |
An important issue to keep in mind with all perturbation is that the perturbation itself may adversely impact the usefulness of the data. Tables 10 and 12 show perturbation where the totals are quite close to the true value, but individual cells can be changed by 100%. When data custodians release data, they need to clearly communicate the process by which they have perturbed the data (although they shouldn't provide exact details of key perturbation parameters that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Further analysis done on the unperturbed data should be carefully checked before being released into the public domain, to ensure the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable in any case, due to sampling or to responses that are not always accurate.
Hierarchical data treatment techniques
All of the methods described above are limited in how they deal with hierarchical datasets (datasets with information at different levels). For example, a file may contain records for each family at one level and in another file separate records for individual family members. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.
In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.
The following example shows a summary table (Table 13), which is derived from detailed information in Table 14.
On the surface, Table 13 appears non-disclosive, for example, it doesn't violate a frequency rule of 4. However, a closer look at the source data reveals the following disclosure risks:
- The summary count of 'Pathology' in the 'Private' sector in 'East' location (61) is based on only 2 providers (Lisa and Stu). Thus, both Lisa and Stu could subtract their own contribution to determine the income of the other.
- Only one company (Clinic D) is represented by this same cell.
- The summary count of 'Surgery' in the 'Public' sector in 'West' location (5) is from only 2 patients and 1 provider (Pru). So other companies can use the service fee value to estimate Clinic E's income for 'Surgery'.
All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.
| Sector | Location | Company | Clinic | Service | Provider | Patients | Service | Bulk bill service | Service fee ($) |
The example above shows the need to protect information at all levels. With hierarchical data, data treatment must be applied at every level, and any consequences of these changes need to be followed through to other levels. For example, if the number of providers in Table 14 is reduced to zero, then the count of services (in Tables 13 and 14) also needs to be zero.
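A cell frequency rule check of this kind can be run at each level of the hierarchy. The sketch below uses toy records echoing the Pathology example (the names and counts are illustrative):

```python
from collections import defaultdict

def failing_cells(records, group_keys, unit_key, threshold=4):
    """Return table cells whose count of distinct contributing units at the
    given level falls below `threshold` (a simple cell frequency rule check)."""
    units = defaultdict(set)
    for r in records:
        cell = tuple(r[k] for k in group_keys)
        units[cell].add(r[unit_key])
    return {cell: len(u) for cell, u in units.items() if len(u) < threshold}

# Toy records: 61 patients, but only 2 providers behind the same cell
records = [{"sector": "Private", "location": "East", "provider": "Lisa", "patient": p}
           for p in range(30)]
records += [{"sector": "Private", "location": "East", "provider": "Stu", "patient": p}
            for p in range(30, 61)]

print(failing_cells(records, ["sector", "location"], "patient"))   # {} — passes
print(failing_cells(records, ["sector", "location"], "provider"))  # {('Private', 'East'): 2}
```

The same cell passes the rule at the patient level but fails it at the provider level, which is exactly the hierarchical risk described above.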
Assessing and treating microdata disclosure risks
Microdata and disclosure risks
Microdata files are datasets of unit records, where each record contains information about a person, organisation or other unit. This information can include individual responses to questions on surveys, censuses or administrative forms. Microdata files are potentially valuable resources for researchers and policy makers because they contain detailed information about each record. The challenge for data custodians is to strike the right balance between maximising the availability of information for statistical and research purposes and fulfilling their obligations to maintain confidentiality by:
- assessing the context in which the data will be released
- treating the data appropriately for that context
Assessing disclosure risk
The two key risks when releasing microdata are when disclosure occurs through:
- spontaneous recognition - where, in the normal course of their research analysis, a data user recognises an individual or organisation without deliberately attempting to identify them (for example, when checking for outliers in a population)
- deliberate attempts at re-identification - looking for a specific individual in the data, or using other research to confirm the identity of an individual who stands out because of their characteristics
As with aggregate data, there is also a risk with any published analysis from the microdata output.
Several methods for assessing microdata disclosure risk can be used:
- cross-tabulate the variables (e.g. look at age by income or marital status) to identify records with unique or remarkable characteristics
- compare sample data with population data to determine whether records with unique characteristics in the sample are in fact unique in the population
- compare potentially risky records to see how similar they are to other records that may provide some protection (a unique 30 year old with certain characteristics may be considered similar to a 31 year old with the same characteristics)
- identify high-profile individuals or organisations known to be in the dataset and who may be easily recognisable
- consider other datasets and publicly available information that could be used to re-identify records, such as through list matching
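The first of these checks, cross-tabulating variables to find records with unique combinations of characteristics, can be sketched as (records and variable names illustrative):

```python
from collections import Counter

def sample_uniques(records, keys):
    """Cross-tabulate the chosen variables and return the records whose
    combination of characteristics is unique in the file (candidates
    for further disclosure-risk assessment)."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]

people = [
    {"age": "15-19", "marital": "widowed", "income": "low"},
    {"age": "75-79", "marital": "widowed", "income": "low"},
    {"age": "75-79", "marital": "widowed", "income": "low"},
]
print(sample_uniques(people, ["age", "marital"]))
# only the young widow is unique on age by marital status
```

Sample uniqueness is only a starting point: as noted below, a record unique in the sample may still be common in the population.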
Factors contributing to the risk of disclosure should also be considered. These factors have different bearings in different contexts. For example, if releasing microdata publicly, data custodians should carefully consider each of the factors below. If the microdata is only released in a secure data facility to authorised researchers, some factors may not be applicable.
Level of detail
The more detailed a unit record, the more likely re-identification becomes. Microdata files containing detailed categories or many data items could, through unique combinations of characteristics, reveal enough to enable re-identification.
With microdata output (or aggregate data), the main risk is attribute disclosure which may in turn increase risks of re-identification. In addition, with detailed tables, there is increased risk of disclosure due to differencing attacks or mathematical techniques that undo some or all of the data protections.
Some variables may require additional treatment if they are sensitive, such as health, ancestry or criminal information. This treatment may be dictated by legislation and policy as well as confidentiality obligations. This can be a significant balancing act as often the variables that are of most interest to researchers are also sensitive.
A disclosure risk may exist if the data contains a rare and remarkable characteristic (or combination of characteristics). This can happen even if there are few data items or categories. This risk depends on how remarkable the characteristic is. For example, a widow aged 19 years is more likely to be identifiable than one aged 79 years. In addition, it is important to consider the rarity of a record from a population perspective. For example, there may only be one 79 year old widow in a sample, but she is not unique in the entire population. The sampling process is a significant contributor to protecting the confidentiality of that individual (a user is unlikely to know which 79 year old widow was selected). It is advisable, however, to protect that single individual in any subsequent outputs that may be publicly released.
Data accuracy can increase the risk of disclosure. While it is not recommended to produce data with low accuracy as a method to manage this risk, data custodians should be aware that datasets subject to reporting errors or containing out-of-date information may present a lower disclosure risk.
As a general rule accessing older data is less likely to enable re-identification of an individual or organisation than accessing up to date information. This is particularly true for variables that change over time, such as area of residence or marital status.
Data coverage (completeness)
Individuals or organisations are more easily identifiable if they are known to be in the dataset. Datasets that cover the complete population increase the risk of disclosure because a user knows that all individuals are represented in the dataset. This risk applies to administrative data and population censuses.
Data based on surveys or samples taken from a population are generally of lower disclosure risk than full population datasets. This is because there is inherent uncertainty about whether a record belongs to a particular individual or organisation. The risk is not reduced to zero, particularly for records with rare characteristics, which may be re-identifiable in a sample as well as in the population, so sampling should not be the only method of protection.
In some cases, how the dataset is structured increases the disclosure risk. Longitudinal datasets (those where individuals are tracked over time, as opposed to datasets that are time-based snapshots of different samples of the population) may have significant disclosure issues. Individuals or organisations whose characteristics change over time are much more likely to be re-identified than those whose characteristics don't (and in reality very few individuals or organisations are unchanged over time). For example, a business that has a relatively constant income over 5 years but then triples its income for the next 3 years is more likely to be re-identified than a business with a constant income over the same time frame.
Another structural aspect of datasets is their hierarchical nature. This is where datasets have information at more than one level such as a person level as well as a family level. The information may be non-disclosive at one level, but be disclosive at a higher level. For example a count of people with household income of $801-$1,000 per week may be 6. However, the 6 may refer to a single household (2 parents and 4 children), which has effectively disclosed information about all the people in the household.
The more an individual or organisation is likely to gain from re-identifying a record, the greater the risk of disclosure. Conversely, the risk of attack is lower when the gains are lower. This is the fundamental principle of trusted access, where researchers share accountability for protecting data confidentiality and where the incentive for them is ongoing authorisation to access information.
Software for assessing disclosure risks
Various software packages can help data custodians assess, detect and treat disclosure risks in microdata. These include:
- Mu-ARGUS: Developed by Statistics Netherlands to protect against spontaneous recognition only (not against list matching).
- SDC-Micro: An R-based open source package developed by the International Household Survey Network. This program calculates disclosure risk for whole datasets and individual records, and applies treatments.
- SUDA (Special Uniques Detection Algorithm): Developed by the University of Manchester to identify unit records which, due to rare or unique combinations of characteristics, pose a re-identification risk. SUDA looks for uniqueness in the dataset but does not consider whether a particular record is unique in the population as a whole.
Microdata treatment methods
Once a re-identification or disclosure risk is known, it can be addressed through a number of data modification and reduction techniques. These techniques should be applied to only those records or variables judged to be a risk - a judgement that should consider the specific release context:
- detailed microdata may require few of these treatments because it is accessed in a context where people, projects, settings and outputs are controlled (see the Five Safes framework)
- publicly available files may require many or all of these treatments.
Usually the minimum level of protection for any microdata to be used for statistical or research purposes is removal of direct identifiers such as name and address. Depending on the legal obligations of data custodians and the controls on access (such as user authorisation, project assessment, security of the access environment and output checking), the removal of direct identifiers alone may be sufficient to protect confidentiality. Further disclosure controls may be required depending on the data release context, especially for public release as open data. Data custodians must carefully assess the microdata to identify records posing a disclosure risk and treat them to prevent re-identification.
Limit the number of variables
This means reducing the number of variables in the dataset. For example, you could remove detailed geographic variables.
Modify cell values
This can be done through rounding or perturbation. The amount of rounding should be relative to the magnitude of the original value. For example, rounding could vary from $1,000 (for personal income) to $1 million (for business income). Perturbation in the context of microdata means adding 'noise' to the values for individual records. For example, someone's true income of $1,000,000 might be perturbed to $1,237,000. In order to maintain totals, that $237,000 could be removed from one or more other records.
Rounding or perturbation may also be applied to exact dollar amounts that might otherwise be at risk of list matching. If a record in a population is the only one with an exact income value of $97,899.21, then this record may be at risk if a user also has access to another dataset with income variables. The user could match both datasets on exact income values and learn new information about the records. Perturbing all dollar amounts in a dataset by a small amount provides protection. This can be done by grouping the records into clusters and adjusting records within each cluster so that the mean for each cluster remains the same.
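The clustering approach in the last sentence can be sketched as follows; the cluster size, noise level and income values are illustrative assumptions:

```python
import random

def cluster_perturb(values, cluster_size=3, max_noise=500.0, seed=1):
    """Add noise to each value while keeping every cluster's mean unchanged:
    the noise draws within a cluster are re-centred so they sum to zero.
    Illustrative parameters only."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(values), cluster_size):
        cluster = values[i:i + cluster_size]
        noise = [rng.uniform(-max_noise, max_noise) for _ in cluster]
        shift = sum(noise) / len(noise)          # re-centre: noise sums to zero
        out.extend(v + n - shift for v, n in zip(cluster, noise))
    return out

incomes = [97899.21, 64210.00, 88102.50, 45300.00, 52750.75, 61000.00]
treated = cluster_perturb(incomes)
# each cluster of 3 keeps its original mean, but no exact value survives for matching
assert abs(sum(treated[:3]) / 3 - sum(incomes[:3]) / 3) < 1e-6
```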
Combine categories that are likely to enable re-identification, such as:
- using age ranges rather than single years
- collapsing hierarchical classifications such as industry at higher levels (e.g. mining rather than the more detailed coal mining or nickel ore mining)
- combining small territories with larger ones (e.g. ACT into NSW)
You can combine categories containing a small number of records so that the identities of individuals in those groups remain protected (e.g. combine use of electric wheelchairs and use of manual wheelchairs). See also Treating aggregate data section.
Collapse top or bottom categories containing small populations. For example, survey respondents aged over 85 could be coded to an 85+ category as opposed to having individual categories for 85-89, 90-94 and 95-100 years which might be very sparse.
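Top coding of this kind can be sketched as follows; the 5-year ranges and the 85+ cap follow the example above, but are otherwise arbitrary choices:

```python
def top_code_age(age, cap=85):
    """Collapse sparse upper ages into a single open-ended category,
    otherwise report a 5-year age range (illustrative scheme)."""
    if age >= cap:
        return f"{cap}+"
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

print([top_code_age(a) for a in (17, 84, 91, 99)])  # ['15-19', '80-84', '85+', '85+']
```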
To hide a record that may be identifiable by its unique combination of characteristics, swap it for another record where some other characteristics are shared. For example, someone in NSW who speaks an uncommon language could have their record moved to Victoria, where the language may be more commonly spoken. This allows characteristics to be reflected in the data without the risk of re-identification.
As a consequence of applying this method, additional changes may also be required. In the previous example, after moving the record from NSW to Victoria, family-related information would also need to be adjusted so that both the original record and the records of family members remain consistent.
If the above methods are insufficient, suppress particular values or remove records that cannot otherwise be protected from the risk of re-identification.
Understand the relationship between microdata and aggregate data
To ensure that hierarchical data cannot be used to identify higher level contributors (i.e. in aggregate data), the above methods may need to be applied to a greater degree. Alternatively, removing variables or other information relating to the higher levels may be effective. See also Treating aggregate data.
Use of ABS microdata and impact on research quality
ABS microdata products in the context of confidentiality controls, and impact of data treatments on analysis
ABS microdata products
ABS releases microdata products that can be accessed for statistical and research purposes:
- basic microdata which can be downloaded into a user's own computing environment
- detailed microdata files, available through the ABS DataLab
ABS also releases TableBuilder, which uses underlying microdata to allow researchers to create their own automatically confidentialised tables, graphs and maps (aggregate output).
Each product is designed to meet a different research requirement. A comparison of these products is shown in Table 1. See also Compare data services for more detail about a wider range of ABS products.
Table 1. Comparison of microdata products (basic microdata, detailed microdata, TableBuilder) by utility and suitability, and by the confidentiality controls applied.
Impact of data treatment on analysis
Treating the data itself may restrict the ability of a researcher to answer a particular question. For example, a major difference between basic microdata and detailed microdata is that data item categories in the former have been collapsed or aggregated to a greater degree, which reduces the level of detail available. In some instances it may therefore be more appropriate to use an Expanded CURF. Similarly, it may be better to conduct research (within the ABS DataLab) using detailed microdata files if the Expanded CURF does not contain enough detail to answer the researcher's question.
There are, however, situations where data treatment may not adversely affect the quality of the data or the reliability of the research. For example, a 2010 study used data from the ABS's Survey of Mental Health (2007) to compare results attained from using the Expanded CURF with results attained from using the untreated main-unit-record file (MURF). As Table 2 shows, the results were almost identical.
| | Hazard ratio* (from treated microdata) | Hazard ratio* (from original microdata) | Standard error of hazard ratio | Difference between hazard ratios, as proportion of standard error |
| No lifetime mental disorder | 1 | 1 | (reference category) | |
| Anxiety disorder (type): | | | | |
| Generalised anxiety disorder | 0.33631 | 0.33631 | 0.07846 | 0 |
| Post-traumatic stress disorder | 0.63221 | 0.63221 | 0.07865 | 0 |
| Anxiety disorder (severity): | | | | |
* Hazard ratios compare the incidence of an event in one group to another group over time
Source: Lawrence, D., Considine, J., Mitrou, F. & Zubrick, S.R. (2010) ‘Anxiety disorders and cigarette smoking: Results from the Australian Survey of Mental Health and Wellbeing’, Australian and New Zealand Journal of Psychiatry. Vol. 44, pp. 521-528.
Explanation of statistical and confidentiality terms used in this guide
- information (including personal information) collected by agencies for the administration of programs, policies or services
- can be microdata (unit-record data) or macrodata (aggregate data)
- may be used for statistical or research purposes
- produced by grouping information into categories and combining values within these categories
- example: a count of the number of people of a particular age (obtained from the question 'In what year were you born?').
- also known as tabular data or macrodata
- aggregate data is often presented in tables
- occurs when previously unknown information is revealed about an individual, group or organisation (without necessarily formally re-identifying them)
- extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions
Cell concentration rule
- used to assess whether a table cell may enable re-identification or attribute disclosure
- also called the cell dominance rule
- finds cells where a small number of data providers contribute a large percentage to the cell. If a cell fails this rule, further investigation or data treatment is needed to ensure the attributes of predominant data providers are not disclosed
- aggregate data treatment method
- protecting the secrecy and privacy of information collected from individuals and organisations, and ensuring that no data is released in a manner likely to enable their identification
Data access context
- the environment and manner in which data is released
- data custodians need to consider who will have access to the data, the purpose for which the data will be used and the release environment itself (whether physical, IT or legal)
- organisation or agency responsible for the collection, management and release of data
- they have legal and ethical obligations to keep the information they are entrusted with confidential
- secure data environment where researchers can perform detailed analysis of microdata
- also known as secure research centres
- can be accessed virtually (remotely) or on-site
- ABS data laboratory is called the DataLab
- technique used to treat data to limit re-identification or other disclosure
- changes all non-zero cells by a small amount while aiming to maintain the table's overall usefulness
- examples include rounding and perturbation
- an individual, household, business or other entity that supplies data for statistical or administrative purposes
- also known as a respondent
- a technique for statistical disclosure control
- methods to control or limit the amount of detail available in a table to prevent individuals or organisations from being re-identified
- methods include combining variables or categories, or suppressing (removing) information in unsafe cells
- can be applied to aggregate data or microdata
Data rounding
- slightly altering cells in a table to make them all divisible by the same number
- common numbers used for rounding are 3, 5 or 10
- may be random or controlled
- prevents the original data values from being known with certainty while ensuring the usefulness of the data is not significantly affected
- aggregate data treatment method
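One common variant is unbiased random rounding, where each count is rounded up or down to a multiple of the base with probabilities chosen so the expected value equals the true count. A sketch (base 3, with an illustrative fixed seed for reproducibility):

```python
import random

def random_round(value, base=3, rng=random.Random(0)):
    """Randomly round a count to a multiple of `base`, with the probability of
    rounding up proportional to the remainder, so the expected value is the
    original count."""
    remainder = value % base
    if remainder == 0:
        return value
    if rng.random() < remainder / base:
        return value + (base - remainder)  # round up
    return value - remainder               # round down

# Every treated cell is a multiple of 3 and within 2 of the true count.
table = [17, 6, 2, 41]
print([random_round(v) for v in table])
```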
Data swapping
- process of moving the values of one or more variables from one microdata record to another record, so that the record no longer poses a disclosure risk
- microdata treatment method
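A simplified sketch of random swapping for a single variable, with illustrative record and field names. Pairing records at random and exchanging one value leaves univariate totals unchanged while breaking up risky combinations of attributes:

```python
import random

def swap_variable(records, key, rng=random.Random(42)):
    """Exchange the value of one variable between randomly paired records
    (a simplified sketch of random data swapping)."""
    idx = list(range(len(records)))
    rng.shuffle(idx)
    swapped = [dict(r) for r in records]  # leave the input untouched
    for a, b in zip(idx[::2], idx[1::2]):
        swapped[a][key], swapped[b][key] = records[b][key], records[a][key]
    return swapped

people = [{"age": 24, "postcode": p} for p in ("2000", "2600", "3000", "7250")]
print(swap_variable(people, "postcode"))
```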
Differencing or differencing attack
- where someone with access to multiple tables can deduce the true values of cells that had been modified or suppressed
- individual tables may be non-disclosive, but when the tables are compared, the difference between cells across the tables may be disclosive
- example: if a user accessed a table with information on 20-25 year olds and then accessed a subsequent table with information on 20-24 year olds, the difference between the two tables will reveal information about 25 year olds only
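The age example can be made concrete with hypothetical counts by single year of age:

```python
# Hypothetical counts of people by single year of age.
ages = {20: 7, 21: 5, 22: 9, 23: 4, 24: 6, 25: 2}

table_20_25 = sum(ages[a] for a in range(20, 26))  # released total: 33
table_20_24 = sum(ages[a] for a in range(20, 25))  # released total: 31

# Neither released total is disclosive on its own, but their difference
# isolates the small group of 25 year olds.
print(table_20_25 - table_20_24)  # 2
```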
Direct identification
- when the data includes an identifier (such as name or address) that can be used, without any additional information, to establish the identity of a person, group or organisation
Disclosure or disclosive
- a breach of confidentiality, where a person, group or organisation is identified or has previously unknown characteristics (attributes) associated with them as a result of releasing data
Disclosure control
- the process of limiting the risk of an individual or organisation being directly or indirectly identified
- can be via statistical (data focused) or non-statistical (data context-focused) techniques or processes
Disclosure risk management
- in the context of confidentiality, determining whether released datasets (or sections of released datasets) constitute a risk of disclosure or re-identification, and then putting in place controlling mechanisms to mitigate those risks
- the Five Safes framework provides a way of assessing risk within the constraints provided by policies and legislation
Five Safes framework
- multi-dimensional approach to managing disclosure risk, consisting of safe people, safe projects, safe settings, safe data and safe outputs
- each safe is considered both individually and in combination to determine disclosure risks and to put in place mitigation strategies for releasing and accessing data
Frequency rule
- sets a particular value for the minimum number of unweighted contributors (such as people, households or businesses) to any cell in the table
- cells with very few contributors (small cells) may pose a disclosure risk
- common threshold values are 3, 5 or 10
- if a cell fails this rule, further investigation or action is needed to ensure the cell is adequately protected
- also called the threshold rule
Hierarchical dataset
- datasets that contain more than one level
- example: a dataset containing unit records with information about individual people (such as personal income) may also contain information about the families these people are part of (such as household income)
Identifiable data
- data that includes information that refers directly to an individual or organisation, such as name or address, ABN, Medicare number
Identifiers
- information that directly establishes the identity of an individual or organisation
- examples include name, address, driver's licence number, Medicare number and ABN
- also known as direct identifiers
Indirect identification
- occurs when the identity of an individual, group or organisation is disclosed due to a unique combination of characteristics (that are not direct identifiers) in a dataset
- example: a famous individual may be identifiable on the basis of their age, sex, occupation, geography and income
List matching
- where a user compares records from one dataset with records from another in an attempt to find records that have corresponding information, so that it may be concluded that the two records belong to the same individual
- where this is done in an attempt to re-identify an individual, it is a clear breach of the Privacy Act and other legislation governing data access
Macrodata
- see aggregate data
Microdata
- datasets of unit records where each record contains information about a person, organisation or other type of unit
- can include individual responses to a census, survey or administrative form
Open data
- data that is made available with no restriction on access or use (excluding possible copyright or licensing requirements); in terms of the Five Safes framework, the only control is on safe data
- data on data.gov.au is open data, as any researcher can download the files
- data underlying ABS TableBuilder is not considered open data because there is a safe setting control: users cannot directly access the underlying microdata
- aggregate output (tables, graphs or maps) from TableBuilder is open data
Outlier
- an unusual record that stands out from the rest of the population or sample because it has an extreme value for one or more data items
- outliers are potentially risky for confidentiality
P% rule
- statistical disclosure control rule that prevents any user from estimating the value of a cell contributor to within P% (where P is defined by the data custodian)
- aggregate data treatment method
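A sketch of the worst-case check behind this rule: the second-largest contributor can estimate the largest by subtracting its own value from the published cell total, and the error in that estimate equals the sum of all remaining contributions. The cell fails if the error falls within P% of the largest contribution (P = 10 here is illustrative):

```python
def fails_p_percent_rule(contributions, p=10.0):
    """Return True if the second-largest contributor could estimate the
    largest contribution to within p% (a sketch of the p% rule)."""
    ordered = sorted(contributions, reverse=True)
    if len(ordered) < 2:
        return True  # a single contributor is fully disclosed
    remainder = sum(ordered[2:])  # estimation error for the second-largest
    return remainder < (p / 100.0) * ordered[0]

print(fails_p_percent_rule([900, 40, 30, 30]))     # True: error 60 < 90
print(fails_p_percent_rule([250, 250, 250, 250]))  # False: error 500 >= 25
```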
Personal information
- information that identifies, or could identify, a person
- can include not only names and addresses, but also medical records, bank account details, photos, videos, and even information about what a person likes or where they work
- information can still be personal without having a name attached to it
- example: date of birth and postcode may be enough to identify someone
- see also Sensitive information
In the Privacy Act 1988, personal information is "information or an opinion about an identified individual, or an individual who is reasonably identifiable:
- whether the information or opinion is true or not true; and
- whether the information or opinion is recorded in a material form or not."
Perturbation
- a statistical disclosure control technique used for count or magnitude data (aggregate data) or for microdata
- data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much data content and structure as possible
- data rounding is a type of perturbation
Privacy
- not specifically defined in the Privacy Act
- an individual's right to have their personal information kept confidential unless informed consent has been given to release the information, or a legal authority exists - this is in accordance with the requirements of the Privacy Act 1988
Re-identification
- the act of determining the identity of a person or organisation using publicly or privately held information about that individual or organisation
Remote analysis facility
- remote access facilities are used by agencies around the world
- enables approved researchers to submit data queries from their desktops through a secure online interface
- requests are run against microdata that is securely stored within the data custodian's control
Remarkable characteristics
- rare characteristics or attributes in the data that can pose an identification risk, depending on how extraordinary or noticeable they are
- may include unusual jobs, very large families or very high income
- remarkable characteristics (or remarkable combinations of characteristics) can lead to re-identification of individuals, households or organisations
Respondent
- see data provider
Response knowledge
- information that is publicly or privately known about a respondent
- may be used to breach confidentiality
Rounding
- see data rounding
Safe data
- one of the Five Safes, safe data poses the question: has appropriate and sufficient protection been applied to the data?
- at a minimum, direct identifiers such as name and address must be removed or encrypted
- further statistical disclosure control may be needed depending on the context in which data is released
Safe outputs
- one of the Five Safes, safe outputs poses the question: are the statistical results non-disclosive?
- the final check, aiming for negligible risk of disclosure
- all data made available outside of the data custodian's IT environment must be checked for disclosure
- example: statistical experts may check all outputs for inadvertent disclosure before the data leave a secure data centre
Safe people
- one of the Five Safes, safe people poses the question: is the researcher appropriately authorised to access and use the data?
- by placing controls on the way data is accessed, the data custodian requires the researcher to take some responsibility for preventing re-identification
- as the detail in the data increases, so should the level of user authorisation required
Safe projects
- one of the Five Safes, safe projects poses the question: is the data to be used for an appropriate purpose?
- before users can access detailed microdata, they may need to demonstrate to the data custodian that their project has a valid research aim, public benefit and/or statistical purpose
- depends on the context in which the data is accessed
Safe settings
- one of the Five Safes, safe settings poses the question: does the access environment prevent unauthorised use?
- can be considered in terms of both the IT and physical environment
- in some data access contexts, such as open data, safe settings are not applicable
- at the other end of the spectrum, sensitive information is accessed through secure research centres
Secure Research Centre
- see data laboratory
Security
- safe storage of, and access to, data held by organisations or individuals
- covers both IT security and the physical security of buildings
Sensitive information (data)
- sensitive information is considered a subset of personal information
- under the Privacy Act, it is of greater importance in terms of confidentiality (in particular where disclosure leads to worse consequences for a re-identified individual)
- the Office of the Australian Information Commissioner lists a number of characteristics about an individual that are defined as sensitive [link]
- community and ethical expectations may not consider this list to be exhaustive (example: financial information is not present)
- all personal information can be potentially sensitive depending on the context and the individual concerned
- businesses may consider much of their information to be sensitive, but only personal information is covered by the Privacy Act
Statistical or research purposes
- purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, the dissemination of those outputs and the information describing them
- statistical or research purposes may be distinguished from administrative, regulatory, compliance, law enforcement or other purposes that affect the rights, privileges or benefits of particular individuals or organisations
Suppression
- not releasing information that is considered a disclosure risk
- for aggregate data:
  - removing specific values from a table so that people and organisations cannot be re-identified from the released data
  - initial suppression is known as primary suppression
  - additional cells needing suppression are known as consequential or secondary suppression
- for microdata:
  - removing specific records from the microdata file
  - removing specific data items for all records on the microdata file
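A sketch of primary and consequential suppression for a single table row, using an illustrative threshold of 5. If only one cell in a row were suppressed, its value could be recovered by subtracting the other cells from the row total, so a second cell must also be suppressed:

```python
def suppress_row(cells, threshold=5):
    """Primary-suppress small non-zero cells; if exactly one cell ends up
    suppressed, also suppress the smallest remaining non-zero cell so the
    value cannot be recovered from the row total (consequential suppression)."""
    out = [None if 0 < c < threshold else c for c in cells]
    if sum(v is None for v in out) == 1:
        candidates = [i for i, v in enumerate(out) if v not in (None, 0)]
        if candidates:
            out[min(candidates, key=lambda i: out[i])] = None
    return out

# The cell of 3 is primary-suppressed; the cell of 9 is then
# consequentially suppressed to protect it.
print(suppress_row([12, 3, 20, 9]))  # [12, None, 20, None]
```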
Tabular data
- see aggregate data
Threshold rule
- see frequency rule
Uniqueness
- where an individual has a characteristic or combination of characteristics that are different to all other members in a population or sample
- determined by the size of the population or sample, the degree to which it is segmented (for example by geographic information), and the number and detail of characteristics provided for each unit in the dataset
- records that are unique are not necessarily re-identifiable, as this also depends on the remarkability of the characteristics and the availability of other information or knowledge held by the researcher (response knowledge)
Unit record data
- see microdata