MANAGING THE RISK OF DISCLOSURE: TREATING AGGREGATE DATA
However, strict application means that there may be situations where non-disclosive data are treated unnecessarily, or where non-treated data are in fact disclosive. The ABS is able to provide advice on these rules.
The frequency rule, also called the threshold rule, sets a particular value for the minimum number of unweighted contributors to a cell. Under this rule, cells with counts below the threshold are defined to pose an unacceptable disclosure risk and need to be protected. However, some things need to be kept in mind when using this rule:
Judgement needs to be exercised when considering cells with no contributors (i.e. ‘zero cells’) or where all contributors to a row or column are concentrated in one cell (i.e. ‘100% cells’), as these may also pose a disclosure risk.
A hypothetical application of the frequency rule is presented in Table 1. Because a frequency rule of 4 is chosen for this example, any cell with fewer than 4 contributors is deemed a disclosure risk. In this example, the 25–29 year old age group has only 3 contributors in the Low Income cell, and therefore the cell needs to be protected.
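The check described above can be sketched in a few lines. This is a minimal illustration, not an ABS implementation: the counts are invented (only the 3 contributors in the 25-29 Low Income cell come from the text), and the function names are hypothetical. Zero cells are listed separately because, as noted above, they call for judgement rather than a mechanical test.

```python
# Hypothetical unweighted counts by age group and income band.
# Only the 3 in the 25-29 / Low cell is taken from the text; the rest are invented.
counts = {
    ("15-19", "Low"): 12, ("15-19", "Medium"): 5, ("15-19", "High"): 0,
    ("20-24", "Low"): 9, ("20-24", "Medium"): 14, ("20-24", "High"): 6,
    ("25-29", "Low"): 3, ("25-29", "Medium"): 11, ("25-29", "High"): 8,
}

def cells_below_threshold(table, threshold=4):
    """Return non-zero cells with fewer contributors than the threshold."""
    return sorted(cell for cell, n in table.items() if 0 < n < threshold)

def zero_cells(table):
    """Return cells with no contributors, which need a separate judgement."""
    return sorted(cell for cell, n in table.items() if n == 0)

print(cells_below_threshold(counts))  # cells that need protection
print(zero_cells(counts))             # cells that need review
```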
Cell dominance rule
This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell’s total value. Also called the cell concentration rule, the cell dominance rule defines the number of units that are allowed to contribute a defined percentage of the total.
This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.
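A minimal sketch of an (n, k) check follows. The function name and the contributor values are hypothetical, and the handling of empty cells is an assumption made only to keep the example runnable.

```python
def breaches_dominance(contributions, n=2, k=75.0):
    """Return True if the n largest contributions exceed k% of the cell total."""
    total = sum(contributions)
    if total <= 0:
        # Assumption: empty or non-positive cells are assessed by other means.
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k

# Hypothetical cells: the top 2 units hold 70% and 85% of the total respectively.
print(breaches_dominance([40, 30, 20, 10]))  # False
print(breaches_dominance([60, 25, 10, 5]))   # True
```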
The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to protect the value of any contributor from being estimated to within P%, and it is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, a contributor’s estimated value can be to its true value. For example, a rule of P%=20% means that the estimated value of any contributor must differ from its true value by at least 20%.
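One way to express this rule in code is sketched below. It assumes the attacker is another contributor to the same cell who subtracts their own value from the published total, which is only one possible attack model; the function name and the sample values are hypothetical.

```python
def breaches_p_percent(contributions, p=20.0):
    """Return True if any contributor can estimate another to within p%.

    Each contributor's estimate of another's value is the cell total
    minus their own contribution.
    """
    total = sum(contributions)
    for j, own in enumerate(contributions):
        estimate = total - own
        for i, target in enumerate(contributions):
            if i == j or target <= 0:
                continue
            error_pct = 100.0 * abs(estimate - target) / target
            if error_pct < p:
                return True
    return False

# Hypothetical cells. In the first, the unit contributing 10 estimates the
# largest unit as 105 against a true value of 100, an error of only 5%.
print(breaches_p_percent([100, 10, 5]))   # True
print(breaches_p_percent([50, 40, 30]))   # False
```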
Using the Cell dominance and P% rules together
Below is an example of the dominance and P% rules used together where (n, k)=(2, 75) and P%=20%. Consider the data in Table 2a, which shows profit for Industries A–D.
Although at first glance the table does not appear to contain a disclosure risk, one does exist if information about the companies contributing the data to each industry is known.
This information could include the data in Table 2b which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified. If company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T’s contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.
This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. The summary data in Table 2a would therefore require treatment, even if at first glance it appears safe.
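The arithmetic in this example can be checked with a short sketch. It uses only the figures quoted above ($302m total, $150m for company S and $93m for company T); the variable names are illustrative.

```python
total = 302.0        # Industry B total, $m
s, t = 150.0, 93.0   # contributions of companies S and T, $m

# Dominance rule with (n, k) = (2, 75): share held by the top 2 units.
top2_share = 100 * (s + t) / total
dominance_breached = top2_share > 75

# P% rule with P = 20: each company subtracts its own value from the total.
s_error = 100 * ((total - s) - t) / t   # S estimating T
t_error = 100 * ((total - t) - s) / s   # T estimating S
p_breached = min(s_error, t_error) < 20

print(round(top2_share, 1), dominance_breached)
print(round(s_error), round(t_error), p_breached)
```

Both estimates miss by well over 20%, so the P% rule alone would pass this cell, while the dominance rule flags it.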
DATA TREATMENT TECHNIQUES: TABULAR DATA
Due to the diverse nature of data, there is no ‘one size fits all’ solution for managing re-identification risks through the treatment of aggregate data. The key, however, is to treat only the cells that have been deemed to be an unacceptable disclosure risk.
The most common techniques are data reduction and data modification.
How effective these techniques are depends on:
When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from those data. While it won’t be possible to guarantee confidentiality, this effort must satisfy legislative requirements (for example, ABS legislation states that data may not be released “in a manner likely to identify”). This is important because once a table is made public, there are no further opportunities to control how the data will be used (e.g. to apply other confidentiality controls using the Five Safes Framework).
Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive. These issues are highlighted in Table 3.
It is recommended that data custodians start by using simple techniques (i.e. data reduction) and only proceed to more difficult ones (i.e. data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder’s perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables.
Methods for applying these different disclosure control methods are outlined below. Though potentially time-consuming, these methods can be used to allow data custodians to release data that would otherwise remain inaccessible.
Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.
Data reduction involves:
Combining categories can be applied to Table 4a, where the value ‘3’ does not meet a frequency threshold of 4. This cell can be protected in either of two ways:
The choice of which categories to combine will depend on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the ‘Low’ income range, collapsing it into a ‘Low–Medium’ range will prevent this. There is therefore no ‘right answer’ when choosing which categories to combine: in either case, the data’s utility may be affected.
Combining categories is often appropriate, but it is not foolproof. For example, if another table were produced containing the 20–24 year old row, the 25–29 year old values could be determined by subtracting the 20–24 row (in the new table) from the 20–29 row (in Table 4a).
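The subtraction described above is simple to carry out, as the sketch below shows. All counts are invented for illustration, apart from the Low Income value of 3 that the combined row was meant to hide.

```python
# Hypothetical published rows. The 20-29 row is the combined row from the
# treated table; the 20-24 row comes from a second, separately released table.
combined_20_29 = {"Low": 12, "Medium": 17, "High": 21}
other_table_20_24 = {"Low": 9, "Medium": 8, "High": 10}

# Differencing the two rows recovers the 25-29 row that was meant to be hidden.
recovered_25_29 = {
    band: combined_20_29[band] - other_table_20_24[band]
    for band in combined_20_29
}

print(recovered_25_29)  # the Low value of 3 is exposed again
```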
Data custodians must carefully consider other tables, about the same contributors, that:
This is not a trivial process and the ABS can provide advice on how these techniques can best be applied.
Suppression involves removing cells deemed to be a disclosure risk from a table.
In Table 4a, for example, the ‘3’ could be replaced with ‘not provided’ or ‘np’. This is called primary suppression. Sometimes secondary, or consequential, suppression will also be required. For example, in addition to suppressing the ‘3’ cell, other cells will need to be suppressed to prevent the primary suppressed cell from being calculated. Table 5a provides an example of how primary and consequential suppression can be used to treat Table 4a data.
Other cell suppression combinations could be employed. Table 5b, for example, suppresses cells in the totals. The key is to ensure that the suppressed cells cannot be derived from the remaining information.
Care needs to be taken when choosing the pattern of secondary suppression. In particular, the suppression of cells that contain a zero is generally not recommended. For example, if the two zeros in Table 5a were suppressed, it would be quite apparent from the row total and the Low Income 15–19 year old cell that they are in fact zeros. In addition, a value may have to be zero by definition (often called a structural zero; for example, the number of pregnant males in a population).
There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.
Limitations of data reduction
As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if Tables 5a and 5b were both publicly available, the suppressed ‘Medium’ income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.
In some cases, the results of data suppression may appear safer than they truly are. Tables 6a–c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. That is, while the data in Table 6b appear safe, someone could assign a variable to each of the ‘np’s to create Table 6c—the consequences of which are explained below.
With variables assigned to Table 6c, the values in rows 1–2 and columns 2–3 can be used to generate the following equation:
(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28
The variables b, c, d and e cancel out in this equation to give:
a – 23 = –22
Therefore, a = 1
This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear-programming can be used to calculate or make reasonable estimates of the missing values in tables. In fact, the larger the table (or set of related tables available) the more accurately the results can be estimated. As computing power increases and more data are released, these sorts of attacks on data will become easier.
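The cancellation above can be reproduced mechanically. The sketch below encodes the four row and column equations from the worked example, with the known cell values moved to the right-hand side, and combines them with weights +1, +1, -1, -1. The data structure and variable names are illustrative only.

```python
# Each entry: (coefficients of the unknowns, total minus the known cells).
equations = {
    "row1": ({"a": 1, "b": 1, "c": 1}, 11 - 5),
    "row2": ({"d": 1, "e": 1}, 18 - 6 - 7),
    "col2": ({"b": 1, "d": 1}, 23 - 7 - 11),
    "col3": ({"c": 1, "e": 1}, 28 - 8 - 15),
}
weights = {"row1": 1, "row2": 1, "col2": -1, "col3": -1}

combined = {}
rhs = 0
for name, weight in weights.items():
    coefficients, value = equations[name]
    rhs += weight * value
    for variable, coefficient in coefficients.items():
        combined[variable] = combined.get(variable, 0) + weight * coefficient

# Drop the unknowns whose coefficients cancelled to zero.
remaining = {v: c for v, c in combined.items() if c != 0}

print(remaining, rhs)  # only 'a' survives, with the value 1
```

An attacker does not need to guess the weights: a linear-programming or least-squares solver applied to the full set of equations would find any cell that is uniquely determined.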
Three other practical problems arise with suppression:
Although it can be relatively simple to suppress cells or combine categories, data custodians must still take care to ensure that their outputs are not disclosive.
In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table—not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table’s overall usefulness for most purposes.
The two methods discussed below are rounding and perturbation.
The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7 shows how the original data values (from Table 1) would look when rounded to base 3.
It should be apparent that the data are still numerical (i.e. containing no symbols or letters) which is a practical advantage for users requiring machine readability.
Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:
These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15–19 year olds in Low Income were in fact 14 in Table 7, then the rounded counts for this row would still be 15, 0, 0 (total = 15). If the true total were somehow known from other sources to be 14, then the true value of the Low Income category must also be 14 (it can’t be 15 or 16, since this would necessitate a negative number in one or both of the other two income categories).
The main disadvantage to rounding is that there can be inconsistency within the table (e.g. in Table 7, the internal cells of the 25–29 year-olds row sum to 24, whereas the total for that row is 21). Although controlled rounding can be used to ensure additivity within the table (i.e. that the totals add up), it may not provide consistency across the same cells in different tables.
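Rounding to a base, and the inconsistency it can introduce, can be sketched as follows. The function rounds to the nearest multiple deterministically, which is an assumption; other schemes (such as random rounding) exist. The unrounded row values are hypothetical, chosen only so that the rounded cells sum to 24 while the rounded total is 21, as in the example above.

```python
def round_to_base(value, base=3):
    """Round a non-negative integer count to the nearest multiple of base."""
    return base * ((value + base // 2) // base)

# Hypothetical unrounded counts for one row: Low, Medium, High.
row = [3, 8, 11]
true_total = sum(row)                       # 22

rounded_cells = [round_to_base(v) for v in row]
rounded_total = round_to_base(true_total)   # rounded independently of the cells

print(rounded_cells, sum(rounded_cells), rounded_total)
```

Because each cell and the total are rounded independently, the published cells need not sum to the published total.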
Graduated rounding can also be used to round magnitude tables, which means the rounding base varies by the size of the cell. Table 8 shows how data could be protected by rounding the original values to base 100 (Industry A, B, C and Total) or 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.
A second approach to data modification is perturbation. This is where a change (often with a random component) is made to some or all non-zero cells in a table.
For both frequency and magnitude tables, this can be further broken down into targeted or global approaches.
Targeted perturbation is the approach taken when only those cells that are deemed a disclosure risk are treated. Often this will require the application of secondary perturbation in order to maintain additivity within a table.
This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to each of Industries A and C.
There are two key advantages of this approach:
A disadvantage is that some data that are not disclosive per se are altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to take the $50m removed from Industry B and place it in a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), significant manual effort is often required, particularly because all other tables produced need to have matching perturbed values.
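The adjustment in this example can be sketched as below. The $302m figure for Industry B is the total quoted earlier; the values for Industries A, C and D are hypothetical, as are the variable names. The point of the sketch is that the offsetting adjustments leave the table total unchanged.

```python
# Profit by industry, $m. Only Industry B's 302 is taken from the text;
# the other values are invented for illustration.
profits = {"A": 200, "B": 302, "C": 180, "D": 95}

# Targeted perturbation of B, with secondary perturbation of A and C.
adjustments = {"A": +25, "B": -50, "C": +25}

perturbed = {
    industry: value + adjustments.get(industry, 0)
    for industry, value in profits.items()
}

print(perturbed)
print(sum(perturbed.values()) == sum(profits.values()))  # additivity is kept
```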
This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data are not economically sensitive, then a global perturbation approach may be more appropriate (as it can be automated). An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table (i.e. the marginal totals may not be the same as the sum of their constituent cells). This methodology is applied to TableBuilder, an ABS product used for safely releasing both Census and survey data.
Table 10 shows how data might look before and after additive perturbation is applied.
Here the user has no chance of determining that the count of low income 15–19 year olds is ‘1’. They can, of course, still make reasonable estimates of the true value, but they will be unable to have confidence in their guesses. Because perturbation only applies small changes and applies changes to every cell, the results will be unbiased and for most purposes, the overall value of the table is retained.
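The sketch below shows one way additive perturbation could be made repeatable, so that the same cell receives the same change in every table it appears in. It is an illustration under stated assumptions and is not the TableBuilder algorithm: the keyed-hash approach, the secret key, the cell identifier format and the maximum shift of 2 are all hypothetical choices.

```python
import hashlib
import hmac

# Hypothetical secret; a custodian would keep the real value confidential.
SECRET_KEY = b"hypothetical-secret"

def perturb_count(cell_id, count, max_shift=2):
    """Shift a non-zero count by a repeatable amount in [-max_shift, max_shift]."""
    if count == 0:
        return 0  # zero cells are left unchanged in this sketch
    digest = hmac.new(SECRET_KEY, cell_id.encode("utf-8"), hashlib.sha256).digest()
    # Map the keyed hash to an approximately uniform integer shift.
    shift = int.from_bytes(digest[:8], "big") % (2 * max_shift + 1) - max_shift
    return max(0, count + shift)  # never publish a negative count

# The same cell identifier always yields the same perturbed value.
print(perturb_count("15-19|Low", 1))
print(perturb_count("15-19|Low", 1))
```

Because each cell (including each total) is perturbed independently, nothing in this sketch forces the perturbed cells to sum to the perturbed totals.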
Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company’s income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total.
Table 11 shows how data might look before and after multiplicative perturbation is applied.
The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker will not know the extent to which each contributor’s profit has been perturbed or therefore how close the perturbed total is to the true total.
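Multiplicative perturbation of a cell total might look like the sketch below. The contributions of S ($150m) and T ($93m) and the $302m total are taken from the earlier example; treating the remaining $59m as a single lump, the choice of the two largest units, and the 20% maximum change are assumptions made for illustration.

```python
import random

def perturb_magnitude(contributions, n_largest=2, max_pct=20.0, rng=None):
    """Return a cell total after scaling the n largest contributions.

    Each selected contribution is changed by a random percentage drawn
    uniformly from [-max_pct, max_pct]; the rest are left as they are.
    """
    rng = rng or random.Random()
    order = sorted(range(len(contributions)),
                   key=lambda i: contributions[i], reverse=True)
    perturbed = list(contributions)
    for i in order[:n_largest]:
        pct = rng.uniform(-max_pct, max_pct)
        perturbed[i] = contributions[i] * (1 + pct / 100)
    return sum(perturbed)

# S, T, and the remainder of Industry B treated as one lump (an assumption).
industry_b = [150, 93, 59]
published_total = perturb_magnitude(industry_b, rng=random.Random(1))
print(round(published_total, 1))
```

The seeded generator is used only to make the example repeatable; the perturbation parameters themselves would not be published.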
Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but there is robust protection of its individual contributors. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.
An important issue to keep in mind with all perturbation is that the perturbation itself may adversely affect the usefulness of the data. Tables 10 and 12 show perturbation where the totals are quite close to the true value, but individual cells can be changed by 100%. When data custodians are releasing data, they need to clearly communicate the process by which they have perturbed the data (although they shouldn’t provide exact details on key parameters of perturbation that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Further analysis done on the unperturbed data should, of course, be carefully checked before being released into the public domain, to ensure that the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable (either due to sampling error or because responses are not always accurate).
DATA TREATMENT TECHNIQUES: HIERARCHICAL DATA
All of the methods described above are limited in how they deal with hierarchical datasets (these are datasets comprising information at different levels). For example, a file may comprise records for each family at one level and below that separate records for individual family members. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.
In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.
The following hypothetical situation shows a summary table (Table 13), which is derived from detailed information in Table 14.
On the surface, Table 13 appears non-disclosive (for example, it doesn’t violate a frequency rule of 4). However, a closer look at the source data reveals the following disclosure risks:
All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.
The example above shows the need to protect information at all levels. With hierarchical data, data treatment must be applied at every level, and any consequences of these changes need to be followed through to other levels. For example, if the number of providers in Table 14 is reduced to zero, then the count of services (in Tables 13 and 14) also needs to be zero.
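The need to apply the frequency rule at every level can be sketched with the family and person structure mentioned earlier. The records, identifiers and function names below are hypothetical. A cell is flagged if it falls below the threshold at either level.

```python
from collections import defaultdict

# Hypothetical records: (family_id, person_id, table cell).
records = [
    ("F1", "P1", "Low"), ("F1", "P2", "Low"), ("F1", "P3", "Low"),
    ("F1", "P4", "Low"), ("F2", "P5", "Low"),
    ("F3", "P6", "High"), ("F4", "P7", "High"),
    ("F5", "P8", "High"), ("F6", "P9", "High"),
]

def counts_by_level(rows):
    """Return {cell: (distinct persons, distinct families)}."""
    persons, families = defaultdict(set), defaultdict(set)
    for family_id, person_id, cell in rows:
        persons[cell].add(person_id)
        families[cell].add(family_id)
    return {cell: (len(persons[cell]), len(families[cell])) for cell in persons}

def failing_cells(rows, threshold=4):
    """Return cells below the threshold at the person or the family level."""
    return sorted(
        cell for cell, (n_persons, n_families) in counts_by_level(rows).items()
        if n_persons < threshold or n_families < threshold
    )

print(counts_by_level(records))
print(failing_cells(records))
```

Here the Low cell has 5 people and so passes a person-level threshold of 4, but those people come from only 2 families, so the cell fails at the family level.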