Treating aggregate data

Data confidentiality guide

Disclosure risks for tables, identifying risky cells, and data treatment techniques

Released

8/11/2021

Tables and disclosure risks

Aggregate data is usually presented as tables, though it may also take the form of maps or graphs. The terms 'aggregate data' and 'tables' are often used interchangeably. Different types of tables carry different disclosure risks. There are additional disclosure risks if users can access multiple tables containing common elements. Users could potentially re-identify a person by differencing the outputs of two separate tables to which the person contributes.

Frequency tables

Each cell in a table contains the number of contributors (such as individuals, households or organisations). Disclosures could be made from a table where one or more cells have a low count (a small number of contributors). Assessments of whether a cell's value is too low must be based on the underlying count of its contributors, not its weighted estimate.

Magnitude tables

Each cell contains summarising information about the contributors to that cell. This may be in the form of a total, mean, median, mode or range. An example of a magnitude table is one reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (such as one or two businesses with extremely high turnover) dominate a cell value.

Rules to identify at-risk cells

Some common rules can help identify which cells may constitute a disclosure risk. By using these rules, a data custodian makes an implicit decision that all cells which break a rule are deemed an unacceptable disclosure risk, while all other cells are deemed non-disclosive. Each rule below provides protection against attackers trying to re-identify a contributor or disclose an attribute about one. The rules also mitigate attacks where one contributor tries to discover information about another.

Strict applications of these rules may decrease the data's usefulness. Data custodians need to set appropriate rule values that are informed by legislation and organisational policy, disclosure risk assessment and statistical theory. The advantage of using rules is that they are simple, clear, consistent, transparent and amenable to automation. They are ideally applied in situations where:

data is released in a similar manner (for example when the same kind of dataset is regularly released)
transparency is important to both data custodian and users
limited opportunity or requirement exists for engagement between data custodians and users

However, there may be situations where non-disclosive data is treated unnecessarily, or where untreated data is in fact disclosive.

Frequency rule

Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors to a cell. Under this rule, cells with counts below the threshold are deemed to pose an unacceptable disclosure risk and need to be protected. However, when using this rule you should also consider that:

there is no strict statistical basis for choosing one threshold value over another
higher values increase protection against re-identification but also reduce the utility of the original data
lower threshold values may be appropriate for sampled datasets compared to population datasets
by implication, everything greater than or equal to the threshold is defined as an acceptable disclosure risk
it may be that a cell below the threshold value is not a disclosure risk, while a cell value greater than the threshold value is disclosive

In Table 1 a frequency rule of 4 is chosen, with any cell of less than 4 contributors considered a disclosure risk. The 25-29 year old age group has 3 contributors in the Low income cell and therefore the cell needs to be protected.

Table 1: Example of the frequency rule in aggregate counts (threshold value = 4)
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-24	8	10	7	25
25-29	3	8	11	22
30-34	4	5	18	27
Total	31	23	36	90
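
As a minimal sketch, the frequency rule can be expressed as a scan over the unweighted cell counts. The function name `flag_low_counts` is hypothetical, and leaving empty cells (a count of 0) unflagged is an assumption made for this illustration, since they have no contributors to re-identify; zero cells can still carry other risks.

```python
# Unweighted counts from Table 1 (internal cells only).
table_1 = {
    "15-19": {"Low income": 16, "Medium income": 0, "High income": 0},
    "20-24": {"Low income": 8, "Medium income": 10, "High income": 7},
    "25-29": {"Low income": 3, "Medium income": 8, "High income": 11},
    "30-34": {"Low income": 4, "Medium income": 5, "High income": 18},
}


def flag_low_counts(table, threshold):
    """Return (row, column, count) for non-empty cells below the threshold."""
    flagged = []
    for row, cells in table.items():
        for column, count in cells.items():
            # Assumption: a count of 0 has no contributors, so it is not flagged.
            if 0 < count < threshold:
                flagged.append((row, column, count))
    return flagged


at_risk = flag_low_counts(table_1, threshold=4)
print(at_risk)  # [('25-29', 'Low income', 3)]
```

Raising the threshold to 5 would also flag the 30-34 Low income cell, which shows how sensitive the outcome is to the chosen value.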

Cell dominance rule

This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell's total value. Also called the cell concentration rule, the cell dominance rule defines the number of units that are allowed to contribute a defined percentage of the total.

This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.
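
The (n, k) check can be sketched in a few lines. The function name `breaks_dominance_rule` and the cell values are invented for illustration, and the sketch assumes all contributions are non-negative.

```python
def breaks_dominance_rule(contributions, n, k):
    """True if the n largest contributions exceed k% of the cell total."""
    total = sum(contributions)
    if total <= 0:
        # Assumption: empty or non-positive cells are assessed separately.
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100 * top_n / total > k


# Hypothetical cell values, invented for illustration only.
concentrated = [60, 25, 10, 5]   # top 2 contribute 85% of 100
spread_out = [30, 30, 20, 20]    # top 2 contribute 60% of 100

print(breaks_dominance_rule(concentrated, n=2, k=75))  # True
print(breaks_dominance_rule(spread_out, n=2, k=75))    # False
```

Because the rule says the top units "may not contribute more than" k%, a share of exactly k% is treated here as acceptable.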

P% rule

The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to protect the value of any contributor from being estimated to within P%, and it is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, an estimate of a contributor's value can be to its true value. For example, a rule of P%=20% means that the estimated value of any contributor must differ from its true value by at least 20%.

Using the cell dominance and P% rules together

Here is an example of using the dominance and P% rules together where (n, k)=(2, 75) and P%=20%. Table 2a shows profit for Industries A-D.

Table 2a: Example of profit by industry
Industry	Profit ($m)
A	267
B	302
C	212
D	34
Total	815

Although initially the table does not appear to contain a disclosure risk, there is a risk if information about the companies contributing the data to each industry is known.

This information could include the data in Table 2b which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified. If company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T's contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.

Table 2b: Contributors to Industry B
Company	Profit ($m)
S	150
T	93
U	21
V	13
W	8
X	8
Y	6
Z	3
Total	302

This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. Therefore the summary data in Table 2a requires treatment.
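
The arithmetic in this example can be reproduced with a short sketch. The helper names `top_share` and `estimate_error` are hypothetical, and the P% check follows the scenario described above, in which one main contributor subtracts its own value from the published total to estimate the other.

```python
# Contributions to Industry B from Table 2b ($m).
industry_b = {"S": 150, "T": 93, "U": 21, "V": 13, "W": 8, "X": 8, "Y": 6, "Z": 3}
total = sum(industry_b.values())  # 302


def top_share(contributions, n):
    """Percentage of the cell total contributed by the n largest units."""
    values = sorted(contributions.values(), reverse=True)
    return 100 * sum(values[:n]) / sum(values)


def estimate_error(contributions, attacker, target):
    """Percentage error when `attacker` estimates `target` as total minus own value."""
    estimate = sum(contributions.values()) - contributions[attacker]
    return 100 * abs(estimate - contributions[target]) / contributions[target]


share = top_share(industry_b, n=2)              # about 80.5, above k = 75
s_on_t = estimate_error(industry_b, "S", "T")   # about 63.4
t_on_s = estimate_error(industry_b, "T", "S")   # about 39.3

print(f"dominance rule broken: {share > 75}")               # True
print(f"P% rule broken: {min(s_on_t, t_on_s) < 20}")        # False
```

Both estimates miss by more than 20%, so only the dominance rule flags the Industry B cell.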

Tabular data treatment techniques

Due to the diverse nature of data, there is no one solution for managing all re-identification risks through the treatment of aggregate data. However, the key is to treat only the cells that have been assessed as an unacceptable disclosure risk.

The most common techniques are:

data reduction, which decreases the detail available to the user
data modification, which makes small changes to the data

How effective these techniques are depends on:

the structure of the dataset
the requirements of the data users
legislative or policy requirements
the available infrastructure to treat and disseminate the data

When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from the data. While it won't be possible to guarantee confidentiality, this effort must satisfy legislative requirements. For example, ABS legislation states that data must not be released in a manner that is likely to enable the identification of a particular person or organisation. This is important because once a table is made public, there are no further opportunities to control how the data will be used or to apply other confidentiality controls using the Five Safes framework.

Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive (Table 3).

It is recommended that data custodians start by using simple techniques (data reduction) and only proceed to more complex ones (data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder's perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables. Though potentially time-consuming, these methods can be used to allow data custodians to release data that would otherwise remain inaccessible.

Table 3: Comparison of techniques for treating tabular data
Confidentiality technique	Advantages	Disadvantages
Data reduction	relatively easy to implement; requires minimal education of users	does not reliably protect individuals from differencing between multiple overlapping tables; may reduce the data's usefulness; the data custodian chooses what data to remove without necessarily knowing what is most important to the data users; requires secondary suppression to protect the original primary suppressed cells; even with secondary suppression, some suppressed cells may still be estimated
Data modification	generally does not affect the data's overall utility; generally protects against differencing, zero cells and 100% cells; may be automated, requiring minimal human input	does not provide additivity within tables unless secondary modifications are applied; requires some education of users; may require significant setup time and costs; may reduce the data's usefulness, particularly when analysing small areas/populations

Data reduction

Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.

Data reduction involves:

combining variable categories
suppressing counts with a small number of contributors (as per the frequency rule), and considering the suppression of higher counts for sensitive items
suppressing cells with dominant contributors (as per the cell dominance rule)

Combining categories

This approach can be applied to Table 4a, where the value 3 does not meet a frequency threshold of 4. This cell can be protected in two ways:

combine the 20-24 and the 25-29 year old age groups to create a 20-29 year old range (Table 4b)
combine the Low and Medium income categories to create a single Low–medium category (Table 4c)

Table 4a: Unprotected income and age data (threshold value = 4)
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-24	8	10	7	25
25-29	3	8	11	22
30-34	4	5	18	27
Total	31	23	36	90

Table 4b: Treatment applied - age groups combined (20-29 years)
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-29	11	18	18	47
30-34	4	5	18	27
Total	31	23	36	90

Table 4c: Treatment applied - income categories combined (Low-medium income)
Age (years)	Low - medium income	High income	Total
15-19	16	0	16
20-24	18	7	25
25-29	11	11	22
30-34	9	18	27
Total	54	36	90

The choice of which categories to combine depends on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the 'Low' income range, collapsing it into a 'Low-Medium' range prevents this. There is no one answer when choosing which categories to combine: in either case, the data's utility may be affected.
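
The two treatments can be sketched as simple cell-wise additions. The helper names `combine_rows` and `combine_columns` are hypothetical, and only the internal cells of Table 4a are modelled.

```python
# Internal cells of Table 4a: each row holds
# (Low income, Medium income, High income) counts.
table_4a = {
    "15-19": (16, 0, 0),
    "20-24": (8, 10, 7),
    "25-29": (3, 8, 11),
    "30-34": (4, 5, 18),
}


def combine_rows(table, first, second, new_label):
    """Merge two row categories by adding their counts cell by cell."""
    combined = {}
    for label, counts in table.items():
        if label == first:
            combined[new_label] = tuple(a + b for a, b in zip(counts, table[second]))
        elif label != second:
            combined[label] = counts
    return combined


def combine_columns(table, i, j):
    """Merge column j into column i for every row (assumes i < j)."""
    return {
        label: counts[:i] + (counts[i] + counts[j],) + counts[i + 1:j] + counts[j + 1:]
        for label, counts in table.items()
    }


table_4b = combine_rows(table_4a, "20-24", "25-29", "20-29")
table_4c = combine_columns(table_4a, 0, 1)
print(table_4b["20-29"])  # (11, 18, 18)
print(table_4c["25-29"])  # (11, 11)
```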

Combining categories is often appropriate, but doesn't work in all situations. For example, if another table were produced containing the 20-24 year old row, the 25-29 year old values could be determined by subtracting the 20-24 row (in the new table) from the 20-29 row (in Table 4b).

This can be a complex process. Data custodians must carefully consider any other tables that include the same contributors, that:

have already been released
are being released at the same time
are likely to be released in the future
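
The differencing risk described above takes only a cell-by-cell subtraction. In this sketch the second table containing the 20-24 row is hypothetical, as in the example in the text.

```python
# Published rows: (Low income, Medium income, High income).
row_20_29_table_4b = (11, 18, 18)  # combined row released in Table 4b
row_20_24_other = (8, 10, 7)       # same population, released in a hypothetical second table

# Differencing: subtract the finer row from the combined row, cell by cell.
recovered_25_29 = tuple(
    combined - part for combined, part in zip(row_20_29_table_4b, row_20_24_other)
)
print(recovered_25_29)  # (3, 8, 11): the protected Low income count of 3 is exposed
```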

Suppression

Suppression involves removing cells considered to be a disclosure risk from a table.

In Table 4a, for example, the 3 could be replaced with 'not provided' or 'np'. This is called primary suppression. Sometimes secondary, or consequential, suppression is also required: in addition to suppressing the 3 cell, other cells need to be suppressed to prevent the primary suppressed cell from being calculated. The data in Table 4a could be treated by:

using primary and consequential suppression, that is, also not providing data from other rows and columns (Table 5a)
suppressing cells in the totals (Table 5b)

Other cell suppression combinations can be employed. The key is to ensure that the suppressed cells cannot be derived from the remaining information.

Table 5a: Primary and secondary suppression to protect tabular data
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-24	8	10	7	25
25-29	np	np	11	22
30-34	np	np	18	27
Total	31	23	36	90

Table 5b: Suppression of totals to protect tabular data
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-24	8	10	7	25
25-29	np	8	11	>19
30-34	4	5	18	27
Total	>28	23	36	>87
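
To see why primary suppression alone is not enough, the sketch below plays the role of a user who fills in any row or column that has exactly one suppressed cell. The function name `solve_by_subtraction` is hypothetical, and this is only the simplest possible check: a pattern that survives it can still be vulnerable to other attacks.

```python
def solve_by_subtraction(cells, row_totals, col_totals):
    """Repeatedly fill any row or column that has exactly one suppressed cell.

    `cells` is a list of rows; None marks a suppressed ('np') cell.
    Cells that cannot be derived this way stay None.
    """
    grid = [list(row) for row in cells]
    changed = True
    while changed:
        changed = False
        for r, row in enumerate(grid):
            gaps = [c for c, v in enumerate(row) if v is None]
            if len(gaps) == 1:
                row[gaps[0]] = row_totals[r] - sum(v for v in row if v is not None)
                changed = True
        for c in range(len(grid[0])):
            column = [row[c] for row in grid]
            gaps = [r for r, v in enumerate(column) if v is None]
            if len(gaps) == 1:
                grid[gaps[0]][c] = col_totals[c] - sum(v for v in column if v is not None)
                changed = True
    return grid


row_totals = [16, 25, 22, 27]
col_totals = [31, 23, 36]

# Table 4a with only the primary suppression applied.
primary_only = [[16, 0, 0], [8, 10, 7], [None, 8, 11], [4, 5, 18]]
# Table 5a: primary plus secondary suppression.
table_5a = [[16, 0, 0], [8, 10, 7], [None, None, 11], [None, None, 18]]

print(solve_by_subtraction(primary_only, row_totals, col_totals)[2])  # [3, 8, 11]
print(solve_by_subtraction(table_5a, row_totals, col_totals)[2])      # [None, None, 11]
```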

It is not usually recommended to suppress cells that contain a zero. For example if the two zeros in Table 5a were suppressed, it is apparent from the row total and the Low income 15-19 year old cell that they are zeros. In addition, it may be the case that a value must be zero by definition (often called structural zeros, for example the number of pregnant males in a population).

There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.

Limitations of data reduction

As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if Tables 5a and 5b were both publicly available, the suppressed 'Medium' income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.

Table 6a: Original data - income by age
Age (years)	Low income	Medium income	High income	Very high income	Total
15-19	1	2	3	5	11
20-24	6	3	2	7	18
25-29	2	7	8	4	21
30-34	4	11	15	4	34
Total	13	23	28	20	84

Table 6b: Suppressed data - income by age after applying threshold = 4
Age (years)	Low income	Medium income	High income	Very high income	Total
15-19	np	np	np	5	11
20-24	6	np	np	7	18
25-29	np	7	8	np	21
30-34	np	11	15	np	34
Total	13	23	28	20	84

Table 6c: Income by age with variables assigned
Age (years)	Low income	Medium income	High income	Very high income	Total
15-19	a	b	c	5	11
20-24	6	d	e	7	18
25-29	f	7	8	g	21
30-34	h	11	15	i	34
Total	13	23	28	20	84

In some cases, the results of data suppression may appear safer than they truly are. Tables 6a–c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. That is, while the data in Table 6b appears safe, someone could assign a variable to each of the 'np's to create Table 6c.

With variables assigned to Table 6c, the values in rows 1-2 and columns 2-3 can be used to generate the following equation:

(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28

The variables b, c, d and e cancel out in this equation to give:

a – 23 = –22

Therefore, a = 1

This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear-programming can be used to calculate or make reasonable estimates of the missing values in tables. In fact, the larger the table (or set of related tables available) the more accurately the results can be estimated. As computing power increases and more data is released, these sorts of attacks on data become easier.
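
The cancellation above can be checked mechanically. The sketch below is an illustration rather than a general attack tool: it adds the chosen row equations of Table 6b, subtracts the chosen column equations, and tracks the coefficient of each suppressed cell. The function name `combine` and the zero-based (row, column) keys are assumptions made for this example.

```python
# Table 6b as published: None marks a suppressed ('np') cell.
published = [
    [None, None, None, 5],   # 15-19
    [6, None, None, 7],      # 20-24
    [None, 7, 8, None],      # 25-29
    [None, 11, 15, None],    # 30-34
]
row_totals = [11, 18, 21, 34]
col_totals = [13, 23, 28, 20]


def combine(add_rows, subtract_cols):
    """Add row equations and subtract column equations.

    Returns (coeffs, constant) meaning sum(coeff * unknown) == constant,
    with unknowns keyed by their (row, column) position.
    """
    coeffs = {}
    constant = 0
    for r in add_rows:
        constant += row_totals[r]
        for c, value in enumerate(published[r]):
            if value is None:
                coeffs[(r, c)] = coeffs.get((r, c), 0) + 1
            else:
                constant -= value
    for c in subtract_cols:
        constant -= col_totals[c]
        for r, row in enumerate(published):
            if row[c] is None:
                coeffs[(r, c)] = coeffs.get((r, c), 0) - 1
            else:
                constant += row[c]
    # Drop unknowns whose coefficients cancelled to zero.
    return {cell: k for cell, k in coeffs.items() if k != 0}, constant


# Rows 1-2 minus columns 2-3 (zero-based: rows 0-1, columns 1-2).
coeffs, constant = combine([0, 1], [1, 2])
print(coeffs, constant)  # {(0, 0): 1} 1, that is, a = 1
```

Only the cell at position (0, 0), the variable a, survives the combination, which is why its value is fully determined.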

Problems with suppression:

The usefulness of the data is reduced (for example, Table 5a has lost one third of its data). There are cells that aren't disclosive, but that have been suppressed nonetheless.
It can be difficult and time consuming to select the best cells for secondary suppression for large tables especially. Software packages are available that optimise the suppression pattern in the table or set of tables.
The data is no longer machine readable because the table now includes symbols ('>') or letters ('np').

Although it can be relatively simple to suppress cells or combine categories, data custodians must be confident their outputs are not disclosive.

Data modification

In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table - not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table's overall usefulness for most purposes.

The two methods discussed below are:

rounding
perturbation (global or targeted)

Rounding

The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7b shows how original data values (Table 7a) would look with its values rounded to base 3.

Table 7a: Income by age original values (from Table 1)
Age (years)	Low income	Medium income	High income	Total
15-19	16	0	0	16
20-24	8	10	7	25
25-29	3	8	11	22
30-34	4	5	18	27
Total	31	23	36	90

Table 7b: Income by age with rounding to base 3
Age (years)	Low income	Medium income	High income	Total
15-19	15	0	0	15
20-24	9	9	6	24
25-29	3	9	12	21
30-34	3	6	18	27
Total	30	24	36	90

Using this technique, the data is still numerical (containing no symbols or letters) which is a practical advantage for users requiring machine readability.
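
Rounding to a base can be sketched with integer arithmetic. The function name `round_to_base` is hypothetical, and rounding exact ties upwards (which can only occur for even bases) is an assumption; the source does not say how ties are handled.

```python
def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of `base`.

    Assumption: exact ties (possible only for even bases) round up.
    """
    return ((value + base // 2) // base) * base


table_7a = [
    [16, 0, 0, 16],
    [8, 10, 7, 25],
    [3, 8, 11, 22],
    [4, 5, 18, 27],
    [31, 23, 36, 90],
]

# Every cell, including the totals, is rounded independently.
table_7b = [[round_to_base(v, 3) for v in row] for row in table_7a]
for row in table_7b:
    print(row)
```

The result matches Table 7b cell for cell.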

Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:

Users won't know whether the rounded value of '3' in Table 7b is actually a 2, 3 or 4.
Users won't know whether the zeros are true zeros. This mitigates the problem of group disclosure, whereas the original values showed that all 15-19 year olds were on a low income.
Even if the true grand total or marginal totals were known from other sources, the user is still unable to calculate the true values of the internal cells.

These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15-19 year olds in Low income was 14 in Table 7a, then the rounded counts for this row would still be 15, 0, 0 (total = 15). Now, if the true total was somehow known from other sources to be 14, then the true value of the Low income category must also be 14 (it can't be 15 or 16 since this would necessitate a negative number in one or both of the other two income categories).
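
This reasoning can be confirmed by enumerating every set of true counts that is consistent with the published row. The helper `candidates` is hypothetical and assumes values were rounded to the nearest multiple of 3.

```python
from itertools import product


def candidates(rounded, base=3):
    """Non-negative true counts that round to `rounded` (nearest multiple of base)."""
    return [
        v
        for v in range(max(0, rounded - base), rounded + base + 1)
        if ((v + base // 2) // base) * base == rounded
    ]


published_row = (15, 0, 0)  # rounded Low, Medium and High income counts for 15-19 year olds
known_true_total = 14       # assumed to be known from another source

feasible = [
    combo
    for combo in product(*(candidates(r) for r in published_row))
    if sum(combo) == known_true_total
]
print(feasible)  # [(14, 0, 0)]
```

Only one combination is feasible, so the rounded row plus the known total discloses the true counts.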

The main disadvantage to rounding is that there can be inconsistency within the table. For example in Table 7b, the internal cells of the 25-29 year-olds row sum to 24, whereas the total for that row is 21. Although controlled rounding can be used to ensure additivity within the table (where the totals add up), it may not provide consistency across the same cells in different tables.

Graduated rounding can also be used to round magnitude tables, which means the rounding base varies by the size of the cell. Table 8 shows how data could be protected by rounding the original values to base 100 (Industry A, B, C and Total) or 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.

Table 8: Example of profit by industry with graduated rounding
Industry	Original values (Profit $m)	Rounded values (Profit $m)
A	267	300
B	302	300
C	212	200
D	34	30
Total	815	800
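
Graduated rounding can be sketched by choosing the base from the size of the value. The cutoff of 100 used below is an assumption chosen to reproduce Table 8; the source does not state where the rounding base changes, and `graduated_round` is a hypothetical name.

```python
def graduated_round(value, cutoff=100, small_base=10, large_base=100):
    """Round to `large_base` at or above `cutoff`, otherwise to `small_base`.

    Assumption: the cutoff of 100 is illustrative, and exact ties round up.
    """
    base = large_base if value >= cutoff else small_base
    return ((value + base // 2) // base) * base


profits = {"A": 267, "B": 302, "C": 212, "D": 34, "Total": 815}
rounded = {industry: graduated_round(v) for industry, v in profits.items()}
print(rounded)  # {'A': 300, 'B': 300, 'C': 200, 'D': 30, 'Total': 800}
```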

Perturbation

Perturbation is another data modification method. This is where a change (often with a random component) is made to some or all non-zero cells in a table.

For count data (frequency tables), a randomised number is added to the original values. This is called additive perturbation.
For magnitude tables, the original values are multiplied by a randomised number. This is called multiplicative perturbation.

For both table types, this can be further broken down into targeted or global approaches.

Targeted perturbation is the approach taken when only those cells that are considered to be a disclosure risk are treated. Often this requires the application of secondary perturbation in order to maintain additivity within a table.

This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to each of Industries A and C.

Table 9: Example of profit by industry with targeted perturbation
Industry	Original values (Profit $m)	Perturbed values (Profit $m)
A	267	292
B	302	252
C	212	237
D	34	34
Total	815	815

There are two key advantages of this approach:

the total does not change (this is an important feature when the ABS releases economic data which then feed into National Accounts)
generally, there is minimal loss of information

A disadvantage is that some data that is not disclosive per se is altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to take the $50m removed from Industry B and place it in a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), significant manual effort is often required (more so when it is considered that all other tables produced need to have matching perturbed values).

This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data is not economically sensitive, then a global perturbation approach may be more appropriate, as it can be automated. An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table. The marginal totals may not be the same as the sum of their constituent cells. This methodology is applied in TableBuilder, an ABS product used for safely releasing both Census and survey data.

Tables 10a and 10b show how data might look before and after additive perturbation is applied.

Here the user has no chance of determining that the count of low income 15-19 year olds is '1'. They can still make reasonable estimates of the true value, but they are unable to confirm their guesses. Because perturbation only applies small changes and applies changes to every cell, the results are unbiased and for most purposes, the overall value of the table is retained.

Table 10a: Income data by age (original)
Age (years)	Low income	Medium income	High income	Very high income	Total
15-19	1	2	3	5	11
20-24	6	3	2	7	18
25-29	2	7	8	4	21
30-34	4	11	15	4	34
Total	13	23	28	20	84

Table 10b: Income data by age with additive perturbation
Age (years)	Low income	Medium income	High income	Very high income	Total
15-19	0	4	3	7	10
20-24	4	6	0	4	21
25-29	0	7	9	5	21
30-34	7	10	16	4	32
Total	12	25	25	21	83
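
The sketch below illustrates the general idea of additive perturbation; it is not the algorithm used in TableBuilder. The noise distribution (uniform integers between -3 and 3), the floor at zero and the function name `perturb_counts` are all assumptions made for illustration.

```python
import random


def perturb_counts(table, max_shift=3, seed=None):
    """Add independent random integer noise to every non-zero cell.

    Illustrative assumptions: noise is uniform on [-max_shift, max_shift],
    results are floored at zero, and zero cells are left unchanged.
    Flooring slightly biases very small counts upwards.
    """
    rng = random.Random(seed)
    perturbed = []
    for row in table:
        new_row = []
        for count in row:
            if count == 0:
                new_row.append(0)
            else:
                new_row.append(max(0, count + rng.randint(-max_shift, max_shift)))
        perturbed.append(new_row)
    return perturbed


# Table 10a, including the marginal totals, which are perturbed independently.
table_10a = [
    [1, 2, 3, 5, 11],
    [6, 3, 2, 7, 18],
    [2, 7, 8, 4, 21],
    [4, 11, 15, 4, 34],
    [13, 23, 28, 20, 84],
]
table_out = perturb_counts(table_10a, seed=2021)
for row in table_out:
    print(row)
```

Fixing the seed makes the output repeatable in this sketch, but it is not a substitute for a scheme that keeps the same cell consistent across different tables.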

Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company's income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total.

Table 11 shows how data might look before and after multiplicative perturbation is applied.

The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker does not know the extent to which each contributor's profit has been perturbed or therefore how close the perturbed total is to the true total.

Table 11: Top 3 contributing companies to an industry (Industry B) with multiplicative perturbation
Company	Original values (Profit $m)	Perturbed values (Profit $m)
S	150	123
T	93	104
U	21	18
V	13	13
W	8	8
X	8	8
Y	6	6
Z	3	3
Total	302	283
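
A minimal sketch of multiplicative perturbation follows. The choice of the three largest contributors, the maximum change of 20% and the function name `perturb_magnitudes` are assumptions for illustration, so the output will not reproduce the exact values in Table 11.

```python
import random


def perturb_magnitudes(contributions, top_n=3, max_pct=20, seed=None):
    """Scale the `top_n` largest contributions by random percentages.

    Illustrative assumptions: each factor is uniform on
    [1 - max_pct/100, 1 + max_pct/100] and results are rounded to whole units.
    Returns the perturbed contributions and the new cell total.
    """
    rng = random.Random(seed)
    largest = sorted(contributions, key=contributions.get, reverse=True)[:top_n]
    perturbed = dict(contributions)
    for unit in largest:
        factor = 1 + rng.uniform(-max_pct, max_pct) / 100
        perturbed[unit] = round(contributions[unit] * factor)
    return perturbed, sum(perturbed.values())


# Original contributions to Industry B ($m), as in Table 11.
industry_b = {"S": 150, "T": 93, "U": 21, "V": 13, "W": 8, "X": 8, "Y": 6, "Z": 3}
perturbed, new_total = perturb_magnitudes(industry_b, seed=7)
print(perturbed, new_total)
```

Only the new total would be published; the perturbed unit values stay with the data custodian.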

Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but there is robust protection of its individual contributors. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.

Table 12: Example of profit by industry with multiplicative perturbation
Industry	Original values (Profit $m)	Perturbed values (Profit $m)
A	267	296
B	302	283
C	212	185
D	34	38
Total	815	821

An important issue to keep in mind with all perturbation is that the perturbation itself may adversely affect the usefulness of the data. Tables 10b and 12 show perturbation where the totals are quite close to the true values, but individual cells can be changed by as much as 100%. When releasing data, data custodians need to clearly communicate the process by which they have perturbed the data (although they shouldn't provide exact details of key perturbation parameters that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Further analysis done on the unperturbed data should be carefully checked before being released into the public domain to ensure the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable anyway, either because of sampling error or because responses are not always accurate.

Hierarchical data treatment techniques

All of the methods described above are limited in how they deal with hierarchical datasets (datasets with information at different levels). For example, one file may contain a record for each family, while another file contains separate records for individual family members. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.

In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.

The following example shows a summary table (Table 13), which is derived from detailed information in Table 14.

On the surface, Table 13 appears non-disclosive: for example, it doesn't violate a frequency rule of 4. However, a closer look at the source data reveals the following disclosure risks:

The summary count of 'Pathology' in the 'Private' sector in 'East' location (61) is based on only 2 providers (Lisa and Stu). Thus, both Lisa and Stu could subtract their own contribution to determine the income of the other.
Only one clinic (Clinic D) is represented in this same cell.
The summary count of 'Surgery' in the 'Public' sector in 'West' location (5) is from only 2 patients and 1 provider (Pru). So other companies can use the service fee value to estimate Clinic E's income for 'Surgery'.

All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.

Table 13: Summary of health care service counts (based on Table 14 data)
Service type	East (Public)	West (Public)	East (Private)	West (Private)	Total
Treatment	0	47	95	0	142
Surgery	0	5	0	209	214
Pathology	0	10	61	7	78
Total	0	62	156	216	434

Table 14: Counts of health care patients and services
Sector	Location	Company	Clinic	Service	Provider	Patients	Services
Private	East	Q	D	Pathology	Lisa	15	29
Private	East	Q	D	Pathology	Stu	18	32
Private	East	Q	B	Treatment	Joe	3	5
Private	East	Q	B	Treatment	Jan	8	31
Private	East	Q	B	Treatment	Deb	6	22
Private	East	Q	B	Treatment	Em	5	31
Private	East	Q	B	Treatment	Fred	3	6
Private	West	Q	C	Pathology	Ian	7	7
Private	West	Q	C	Surgery	Bill	3	8
Private	West	R	E	Surgery	Tess	36	201
Public	West	P	A	Pathology	Meg	4	10
Public	West	P	E	Surgery	Pru	2	5
Public	West	P	A	Treatment	Rob	3	7
Public	West	P	A	Treatment	Al	14	40
Total	2	3	5		14	127	434

The example above shows the need to protect information at all levels. With hierarchical data, data treatment must be applied at every level, and any consequences of these changes need to be followed through to other levels. For example, if the number of providers in Table 14 is reduced to zero, then the count of services (in Tables 13 and 14) also needs to be zero.
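
One way to check every level at once is to count the distinct contributors behind each summary cell. The sketch below rebuilds the Table 13 cells from the Table 14 records and applies the same frequency threshold at each level; applying a single threshold to patients, providers, clinics and companies alike is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Records from Table 14:
# (sector, location, company, clinic, service, provider, patients, services)
records = [
    ("Private", "East", "Q", "D", "Pathology", "Lisa", 15, 29),
    ("Private", "East", "Q", "D", "Pathology", "Stu", 18, 32),
    ("Private", "East", "Q", "B", "Treatment", "Joe", 3, 5),
    ("Private", "East", "Q", "B", "Treatment", "Jan", 8, 31),
    ("Private", "East", "Q", "B", "Treatment", "Deb", 6, 22),
    ("Private", "East", "Q", "B", "Treatment", "Em", 5, 31),
    ("Private", "East", "Q", "B", "Treatment", "Fred", 3, 6),
    ("Private", "West", "Q", "C", "Pathology", "Ian", 7, 7),
    ("Private", "West", "Q", "C", "Surgery", "Bill", 3, 8),
    ("Private", "West", "R", "E", "Surgery", "Tess", 36, 201),
    ("Public", "West", "P", "A", "Pathology", "Meg", 4, 10),
    ("Public", "West", "P", "E", "Surgery", "Pru", 2, 5),
    ("Public", "West", "P", "A", "Treatment", "Rob", 3, 7),
    ("Public", "West", "P", "A", "Treatment", "Al", 14, 40),
]

THRESHOLD = 4  # frequency rule value used in the example


def new_cell():
    return {"services": 0, "patients": 0,
            "providers": set(), "clinics": set(), "companies": set()}


def cell_summary(rows):
    """Summarise each (service, sector, location) cell at every level."""
    cells = defaultdict(new_cell)
    for sector, location, company, clinic, service, provider, patients, services in rows:
        cell = cells[(service, sector, location)]
        cell["services"] += services
        cell["patients"] += patients
        cell["providers"].add(provider)
        cell["clinics"].add(clinic)
        cell["companies"].add(company)
    return cells


def failing_levels(cell, threshold=THRESHOLD):
    """Names of the levels whose contributor count is below the threshold."""
    counts = {
        "services": cell["services"],
        "patients": cell["patients"],
        "providers": len(cell["providers"]),
        "clinics": len(cell["clinics"]),
        "companies": len(cell["companies"]),
    }
    return [level for level, n in counts.items() if n < threshold]


summary = cell_summary(records)
for key, cell in sorted(summary.items()):
    print(key, cell["services"], failing_levels(cell))
```

Every published service count passes the threshold, yet the Pathology, Private, East cell fails at the provider level and the Surgery, Public, West cell fails at the patient level, matching the risks listed above.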
