Treating aggregate data

Data confidentiality guide

Disclosure risks for tables, identifying risky cells, and data treatment techniques

Released
8/11/2021

Tables and disclosure risks

Aggregate data is usually presented as tables, though it may also be presented as maps or graphs. The terms 'aggregate data' and 'tables' are often used interchangeably. Different types of tables carry different disclosure risks. There are additional disclosure risks if users can access multiple tables containing common elements. Users could potentially re-identify a person by differencing the outputs of two separate tables to which the person contributes.

Frequency tables

Each cell in a table contains the number of contributors (such as individuals, households or organisations). Disclosures could be made from a table where one or more cells have a low count (a small number of contributors). Assessments of whether a cell's value is too low must be based on the underlying count of its contributors, not its weighted estimate.

Magnitude tables

Each cell contains summarising information about the contributors to that cell. This may be in the form of a total, mean, median, mode or range. An example of a magnitude table is one reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (such as one or two businesses with extremely high turnover) dominate a cell value.

Rules to identify at-risk cells

Some common rules can help identify which cells may constitute a disclosure risk. By using these rules, a data custodian makes an implicit decision that all cells which break a rule are deemed an unacceptable disclosure risk, while all other cells are deemed non-disclosive. Each rule below provides protections against attackers trying to re-identify or disclose an attribute about a contributor. They also mitigate attacks where one contributor tries to discover information about another.

Strict applications of these rules may decrease the data's usefulness. Data custodians need to set appropriate rule values that are informed by legislation and organisational policy, disclosure risk assessment and statistical theory. The advantage of using rules is that they are simple, clear, consistent, transparent and amenable to automation. They are ideally applied in situations where:

  • data is released in a similar manner (for example when the same kind of dataset is regularly released)
  • transparency is important to both data custodian and users
  • limited opportunity or requirement exists for engagement between data custodians and users

However, there may be situations where non-disclosive data is treated unnecessarily, or where untreated data is in fact disclosive.

Frequency rule

Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors to a cell. Application of this rule means that cells with counts below the threshold are deemed to pose an unacceptable disclosure risk and need to be protected. However, when using this rule you should also consider that:

  • there is no strict statistical basis for choosing one threshold value over another
  • higher values increase protection against re-identification but also reduce the utility of the original data
  • lower threshold values may be appropriate for sampled datasets compared to population datasets
  • by implication, everything greater than or equal to the threshold is defined as an acceptable disclosure risk
  • it may be that a cell below the threshold value is not a disclosure risk, while a cell value greater than the threshold value is disclosive

In Table 1 a frequency rule of 4 is chosen, with any cell of less than 4 contributors considered a disclosure risk. The 25-29 year old age group has 3 contributors in the Low income cell and therefore the cell needs to be protected.

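Table 1: Example of the frequency rule in aggregate counts (threshold value = 4)

| Age (years) | Low income | Medium income | High income | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-24 | 8 | 10 | 7 | 25 |
| 25-29 | 3 | 8 | 11 | 22 |
| 30-34 | 4 | 5 | 18 | 27 |
| Total | 31 | 23 | 36 | 90 |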
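As a rough illustration, the frequency rule can be sketched in a few lines of Python. The function name `cells_below_threshold` and the dictionary layout are assumptions made for this example. Zero cells are skipped because Table 1 treats only the low non-zero count as at risk; whether zeros also need protection is a separate decision for the data custodian.

```python
# Unweighted counts from Table 1: {age group: {income band: count}}.
table_1 = {
    "15-19": {"Low": 16, "Medium": 0, "High": 0},
    "20-24": {"Low": 8, "Medium": 10, "High": 7},
    "25-29": {"Low": 3, "Medium": 8, "High": 11},
    "30-34": {"Low": 4, "Medium": 5, "High": 18},
}


def cells_below_threshold(table, threshold):
    """Return (row, column, count) for non-zero cells with fewer than `threshold` contributors."""
    return [
        (row, col, count)
        for row, cells in table.items()
        for col, count in cells.items()
        if 0 < count < threshold  # assumption: zero cells are assessed separately
    ]


at_risk = cells_below_threshold(table_1, threshold=4)
print(at_risk)  # [('25-29', 'Low', 3)]
```

The check must be run on the underlying unweighted counts, not on weighted estimates.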

Cell dominance rule

This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell's total value. Also called the cell concentration rule, the cell dominance rule defines the number of units that are allowed to contribute a defined percentage of the total. 

This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.
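A minimal sketch of the (n, k) check follows, assuming the data custodian can see the individual contributions to a cell. The function name `breaks_dominance_rule` is hypothetical; the figures are the Industry B contributions listed in Table 2b.

```python
def breaks_dominance_rule(contributions, n, k):
    """Return True if the n largest contributions exceed k% of the cell total."""
    total = sum(contributions)
    if total <= 0:
        return False  # assumption: empty or zero-total cells are assessed separately
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100 * top_n / total > k


# Contributions ($m) of companies S to Z in Industry B (Table 2b).
industry_b = [150, 93, 21, 13, 8, 8, 6, 3]

print(breaks_dominance_rule(industry_b, n=2, k=75))  # True: top 2 hold about 80.5%
print(breaks_dominance_rule(industry_b, n=2, k=85))  # False
```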

P% rule

The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to protect the value of any contributor from being estimated to within P%, and it is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, an estimate of a contributor's value can be to its true value. For example, a rule of P% = 20% means that the estimated value of any contributor must differ from its true value by at least 20%.
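One way to sketch this check is to assume that a contributor estimates another contributor's value by subtracting its own value from the published cell total, and then to look for the closest such estimate. This is a simplified reading of the rule; the function names are hypothetical, and the figures are the Industry B contributions listed in Table 2b.

```python
def closest_estimate_error(contributions):
    """Smallest percentage error achieved when any contributor estimates
    another contributor's value as (cell total - own value)."""
    total = sum(contributions)
    errors = []
    for j, own in enumerate(contributions):
        for i, target in enumerate(contributions):
            if i != j and target > 0:
                estimate = total - own
                errors.append(100 * abs(estimate - target) / target)
    # Assumption: a cell with fewer than two contributors offers no protection.
    return min(errors) if errors else 0.0


def breaks_p_percent_rule(contributions, p):
    """Return True if some contributor can be estimated to within p%."""
    return closest_estimate_error(contributions) < p


industry_b = [150, 93, 21, 13, 8, 8, 6, 3]  # $m, Table 2b
print(round(closest_estimate_error(industry_b), 1))  # 39.3
print(breaks_p_percent_rule(industry_b, p=20))       # False
```

The closest estimate here is company T's estimate of company S: (302 - 93 - 150)/150, or about 39%, so a 20% rule is not broken.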
 

Using the cell dominance and P% rules together

 Here is an example of using the dominance and P% rules together where (n, k)=(2, 75) and P%=20%. Table 2a shows profit for Industries A-D. 

Table 2a: Example of profit by industry

| Industry | Profit ($m) |
|---|---|
| A | 267 |
| B | 302 |
| C | 212 |
| D | 34 |
| Total | 815 |

Although initially the table does not appear to contain a disclosure risk, there is a risk if information about the companies contributing the data to each industry is known.

This information could include the data in Table 2b which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified. If company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T's contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.

Table 2b: Contributors to Industry B

| Company | Profit ($m) |
|---|---|
| S | 150 |
| T | 93 |
| U | 21 |
| V | 13 |
| W | 8 |
| X | 8 |
| Y | 6 |
| Z | 3 |
| Total | 302 |

This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. Therefore the summary data in Table 2a requires treatment.

Tabular data treatment techniques

Due to the diverse nature of data, there is no single solution for managing all re-identification risks through the treatment of aggregate data. However, the key is to treat only the cells that have been assessed as posing an unacceptable disclosure risk.

The most common techniques are:

  • data reduction, which decreases the detail available to the user
  • data modification, which makes small changes to the data

How effective these techniques are depends on:

  • the structure of the dataset
  • the requirements of the data users
  • legislative or policy requirements
  • the available infrastructure to treat and disseminate the data

When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from the data. While it won't be possible to guarantee confidentiality, this effort must satisfy legislative requirements. For example, ABS legislation states that data must not be released in a manner that is likely to enable the identification of a particular person or organisation. This is important because once a table is made public, there are no further opportunities to control how the data will be used or to apply other confidentiality controls using the Five Safes framework.

Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive (Table 3).

It is recommended that data custodians start by using simple techniques (data reduction) and only proceed to more complex ones (data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder's perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables. Though potentially time-consuming, these methods can be used to allow data custodians to release data that would otherwise remain inaccessible. 

Table 3: Comparison of techniques for treating tabular data

Data reduction

Advantages:
  • relatively easy to implement
  • requires minimal education of users

Disadvantages:
  • does not reliably protect individuals from differencing between multiple overlapping tables
  • may reduce the data's usefulness
  • the data custodian chooses what data to remove without necessarily knowing what is most important to the data users
  • requires secondary suppression to protect the original primary suppressed cells
  • even with secondary suppression, some suppressed cells may still be estimated

Data modification

Advantages:
  • generally does not affect the data's overall utility
  • generally protects against differencing, zero cells and 100% cells
  • may be automated, requiring minimal human input

Disadvantages:
  • does not provide additivity within tables unless secondary modifications are applied
  • requires some education of users
  • may require significant setup time and costs
  • may reduce the data's usefulness, particularly when analysing small areas/populations

Data reduction

Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.

Data reduction:

  • combines variable categories
  • suppresses counts with a small number of contributors (as per the frequency rule), and it considers suppressing higher counts for sensitive items
  • suppresses cells with dominant contributors (as per the cell dominance rule)

Combining categories

This approach can be applied to Table 4a, where the value 3 does not meet a frequency threshold of 4. This cell can be protected in two ways:

  • combine the 20-24 and the 25-29 year old age groups to create a 20-29 year old range (Table 4b)
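  • combine the Low and Medium income categories to create a single Low–medium category (Table 4c)

Table 4a: Unprotected income and age data (threshold value = 4)

| Age (years) | Low income | Medium income | High income | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-24 | 8 | 10 | 7 | 25 |
| 25-29 | 3 | 8 | 11 | 22 |
| 30-34 | 4 | 5 | 18 | 27 |
| Total | 31 | 23 | 36 | 90 |

Table 4b: Treatment applied - age groups combined (20-29 years)

| Age (years) | Low income | Medium income | High income | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-29 | 11 | 18 | 18 | 47 |
| 30-34 | 4 | 5 | 18 | 27 |
| Total | 31 | 23 | 36 | 90 |

Table 4c: Treatment applied - income categories combined (Low-medium income)

| Age (years) | Low-medium income | High income | Total |
|---|---|---|---|
| 15-19 | 16 | 0 | 16 |
| 20-24 | 18 | 7 | 25 |
| 25-29 | 11 | 11 | 22 |
| 30-34 | 9 | 18 | 27 |
| Total | 54 | 36 | 90 |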

The choice of which categories to combine depends on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the 'Low' income range, collapsing it into a 'Low-Medium' range prevents this. There is no one answer when choosing which categories to combine: in either case, the data's utility may be affected.
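Combining categories amounts to summing the affected rows (or columns) cell by cell. The sketch below reproduces the 20-29 row of Table 4b from Table 4a; the helper name `combine_rows` and the data layout are assumptions made for illustration.

```python
def combine_rows(table, rows_to_merge, new_label):
    """Merge the named rows into one row by summing column by column."""
    merged = {}
    combined = None
    for label, counts in table.items():
        if label in rows_to_merge:
            if combined is None:
                combined = [0] * len(counts)
                merged[new_label] = combined  # keeps the position of the first merged row
            for idx, value in enumerate(counts):
                combined[idx] += value
        else:
            merged[label] = list(counts)
    return merged


# Columns: Low, Medium, High income (Table 4a).
table_4a = {
    "15-19": [16, 0, 0],
    "20-24": [8, 10, 7],
    "25-29": [3, 8, 11],
    "30-34": [4, 5, 18],
}

table_4b = combine_rows(table_4a, {"20-24", "25-29"}, "20-29")
print(table_4b["20-29"])  # [11, 18, 18]
```

After combining, the smallest non-zero count in the table is 4, so the threshold of 4 is met.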

Combining categories is often appropriate, but doesn't work in all situations. For example, if another table were produced containing the 20-24 year old row, the 25-29 year old values could be determined by subtracting the 20-24 row (in the new table) from the 20-29 row (in Table 4b). 
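The differencing described above is simple subtraction. In this sketch the second table containing the 20-24 row is hypothetical; its values are taken from Table 4a.

```python
row_20_29 = {"Low": 11, "Medium": 18, "High": 18}  # released in Table 4b
row_20_24 = {"Low": 8, "Medium": 10, "High": 7}    # hypothetical second release

# Subtracting one release from the other recovers the protected row.
row_25_29 = {band: row_20_29[band] - row_20_24[band] for band in row_20_29}
print(row_25_29)  # {'Low': 3, 'Medium': 8, 'High': 11}
```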

Guarding against this kind of differencing can be a complex process. Data custodians must carefully consider any other tables that include the same contributors and that:

  • have already been released
  • are being released at the same time
  • are likely to be released in the future

Suppression

Suppression involves removing cells considered to be a disclosure risk from a table. 

In Table 4a, for example, the 3 could be replaced with 'not provided' or 'np'. This is called primary suppression. Sometimes secondary, or consequential, suppression is also required: in addition to suppressing the cell containing 3, other cells need to be suppressed to prevent the primary suppressed cell from being calculated. The data in Table 4a could be treated by:

  • using primary and consequential suppression, by not providing data from other rows and columns (Table 5a)
  • suppressing cells in the totals (Table 5b)

Other cell suppression combinations can be employed. The key is to ensure that the suppressed cells cannot be derived from the remaining information.
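To see why primary suppression alone is rarely enough, consider the 25-29 row of Table 4a with only the Low income cell suppressed. The layout below is an illustrative assumption; the arithmetic is the point.

```python
# 25-29 row of Table 4a with only primary suppression applied (None = 'np').
published_row = {"Low": None, "Medium": 8, "High": 11, "Total": 22}

# The suppressed cell is simply the row total minus the published cells.
recovered_low = published_row["Total"] - published_row["Medium"] - published_row["High"]
print(recovered_low)  # 3
```

The same subtraction works down the Low income column, which is why secondary suppression has to block both the row and the column.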

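Table 5a: Primary and secondary suppression to protect tabular data

| Age (years) | Low income | Medium income | High income | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-24 | 8 | 10 | 7 | 25 |
| 25-29 | np | np | 11 | 22 |
| 30-34 | np | np | 18 | 27 |
| Total | 31 | 23 | 36 | 90 |

Table 5b: Suppression of totals to protect tabular data

| Age (years) | Low income | Medium income | High income | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-24 | 8 | 10 | 7 | 25 |
| 25-29 | np | 8 | 11 | >19 |
| 30-34 | 4 | 5 | 18 | 27 |
| Total | >28 | 23 | 36 | >87 |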

It is not usually recommended to suppress cells that contain a zero. For example, if the two zeros in Table 5a were suppressed, it would be apparent from the row total and the Low income 15-19 year old cell that they are zeros. In addition, some values must be zero by definition (often called structural zeros), for example the number of pregnant males in a population.

There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.

Limitations of data reduction

As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if Tables 5a and 5b were both publicly available, the suppressed Medium income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.
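The attack on Tables 5a and 5b can be sketched as a simple loop that fills in any row or column with exactly one suppressed cell. The function name `fill_single_unknowns` is hypothetical, and this is only one of several ways an attacker might proceed.

```python
def fill_single_unknowns(rows, row_totals, col_totals):
    """Repeatedly fill any row or column that has exactly one suppressed (None) cell."""
    grid = [list(r) for r in rows]
    changed = True
    while changed:
        changed = False
        for r, row in enumerate(grid):
            missing = [c for c, v in enumerate(row) if v is None]
            if len(missing) == 1:
                row[missing[0]] = row_totals[r] - sum(v for v in row if v is not None)
                changed = True
        for c in range(len(col_totals)):
            column = [row[c] for row in grid]
            missing = [r for r, v in enumerate(column) if v is None]
            if len(missing) == 1:
                grid[missing[0]][c] = col_totals[c] - sum(v for v in column if v is not None)
                changed = True
    return grid


# Table 5a interior cells (columns: Low, Medium, High); None marks 'np'.
table_5a = [[16, 0, 0], [8, 10, 7], [None, None, 11], [None, None, 18]]
row_totals = [16, 25, 22, 27]
col_totals = [31, 23, 36]
table_5b_medium = [0, 10, 8, 5]  # Medium income column as published in Table 5b

alone = fill_single_unknowns(table_5a, row_totals, col_totals)
merged = [[row[0], medium, row[2]] for row, medium in zip(table_5a, table_5b_medium)]
broken = fill_single_unknowns(merged, row_totals, col_totals)

print(alone[2])              # [None, None, 11]: Table 5a alone resists this attack
print(broken[2], broken[3])  # [3, 8, 11] [4, 5, 18]
```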

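Table 6a: Original data - income by age

| Age (years) | Low income | Medium income | High income | Very high income | Total |
|---|---|---|---|---|---|
| 15-19 | 1 | 2 | 3 | 5 | 11 |
| 20-24 | 6 | 3 | 2 | 7 | 18 |
| 25-29 | 2 | 7 | 8 | 4 | 21 |
| 30-34 | 4 | 11 | 15 | 4 | 34 |
| Total | 13 | 23 | 28 | 20 | 84 |

Table 6b: Suppressed data - income by age after applying threshold = 4

| Age (years) | Low income | Medium income | High income | Very high income | Total |
|---|---|---|---|---|---|
| 15-19 | np | np | np | 5 | 11 |
| 20-24 | 6 | np | np | 7 | 18 |
| 25-29 | np | 7 | 8 | np | 21 |
| 30-34 | np | 11 | 15 | np | 34 |
| Total | 13 | 23 | 28 | 20 | 84 |

Table 6c: Income by age with variables assigned

| Age (years) | Low income | Medium income | High income | Very high income | Total |
|---|---|---|---|---|---|
| 15-19 | a | b | c | 5 | 11 |
| 20-24 | 6 | d | e | 7 | 18 |
| 25-29 | f | 7 | 8 | g | 21 |
| 30-34 | h | 11 | 15 | i | 34 |
| Total | 13 | 23 | 28 | 20 | 84 |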

In some cases, the results of data suppression may appear safer than they truly are. Tables 6a-c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. While the data in Table 6b appears safe, someone could assign a variable to each of the 'np's to create Table 6c.

With variables assigned to Table 6c, the values in rows 1-2 and columns 2-3 can be used to generate the following equation:

(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28

The variables b, c, d and e cancel out in this equation to give:

a – 23 = –22

Therefore, a = 1

This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear-programming can be used to calculate or make reasonable estimates of the missing values in tables. In fact, the larger the table (or set of related tables available) the more accurately the results can be estimated. As computing power increases and more data is released, these sorts of attacks on data become easier.
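The calculation above can be checked with a short script. It adds the totals of rows 1-2, subtracts the totals of columns 2-3, and removes the published cells that do not cancel. The variable names and zero-based indexing are assumptions made for this sketch.

```python
# Published margins of Table 6b.
row_totals = [11, 18, 21, 34]   # 15-19, 20-24, 25-29, 30-34
col_totals = [13, 23, 28, 20]   # Low, Medium, High, Very high

# Published (unsuppressed) interior cells, keyed by (row, column), zero-based.
known = {(0, 3): 5, (1, 0): 6, (1, 3): 7, (2, 1): 7, (2, 2): 8, (3, 1): 11, (3, 2): 15}

rows_used = [0, 1]   # rows 1-2 (ages 15-19 and 20-24)
cols_used = [1, 2]   # columns 2-3 (Medium and High)

# Cells lying in both a chosen row and a chosen column (b, c, d, e) cancel out.
# What remains is: a + (known cells in chosen rows only) - (known cells in chosen columns only).
in_rows_only = sum(v for (r, c), v in known.items() if r in rows_used and c not in cols_used)
in_cols_only = sum(v for (r, c), v in known.items() if r not in rows_used and c in cols_used)

rhs = sum(row_totals[r] for r in rows_used) - sum(col_totals[c] for c in cols_used)
a = rhs - (in_rows_only - in_cols_only)
print(a)  # 1
```

This shortcut works here because `a` is the only suppressed cell left after cancellation; larger problems would typically need a general linear or integer programming solver.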

Problems with suppression: 

  • The usefulness of the data is reduced (for example, Table 5a has lost one third of its data). Some cells that are not disclosive have been suppressed nonetheless.
  • It can be difficult and time consuming to select the best cells for secondary suppression, especially for large tables. Software packages are available that optimise the suppression pattern in a table or set of tables.
  • The data is no longer machine readable because the table now includes symbols ('>') or letters ('np').

Although it can be relatively simple to suppress cells or combine categories, data custodians must be confident their outputs are not disclosive.

Data modification

In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table - not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table's overall usefulness for most purposes.

The two methods discussed below are:

  • rounding
  • perturbation (global or targeted)

Rounding

The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7b shows how original data values (Table 7a) would look with its values rounded to base 3.

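Table 7a: Income by age original values (from Table 1)

| Age (years) | Low | Medium | High | Total |
|---|---|---|---|---|
| 15-19 | 16 | 0 | 0 | 16 |
| 20-24 | 8 | 10 | 7 | 25 |
| 25-29 | 3 | 8 | 11 | 22 |
| 30-34 | 4 | 5 | 18 | 27 |
| Total | 31 | 23 | 36 | 90 |

Table 7b: Income by age with rounding to base 3

| Age (years) | Low | Medium | High | Total |
|---|---|---|---|---|
| 15-19 | 15 | 0 | 0 | 15 |
| 20-24 | 9 | 9 | 6 | 24 |
| 25-29 | 3 | 9 | 12 | 21 |
| 30-34 | 3 | 6 | 18 | 27 |
| Total | 30 | 24 | 36 | 90 |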
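Rounding to the nearest multiple of a base can be sketched as follows. The helper name `round_to_base` is an assumption; applied independently to every cell of Table 7a, including the totals, it reproduces Table 7b.

```python
def round_to_base(value, base=3):
    """Round a non-negative integer to the nearest multiple of `base`.

    With an odd base there are no ties; with an even base, ties round up.
    """
    return base * ((value + base // 2) // base)


# Rows of Table 7a (Low, Medium, High, Total), with the Total row last.
table_7a = [
    [16, 0, 0, 16],
    [8, 10, 7, 25],
    [3, 8, 11, 22],
    [4, 5, 18, 27],
    [31, 23, 36, 90],
]

table_7b = [[round_to_base(v) for v in row] for row in table_7a]
for row in table_7b:
    print(row)
```

Because each cell is rounded on its own, row and column totals are not guaranteed to equal the sum of their rounded parts.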

Using this technique, the data is still numerical (containing no symbols or letters) which is a practical advantage for users requiring machine readability.

Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:

  • Users won't know whether the rounded value of '3' in Table 7b is actually a 2, 3 or 4.
  • Users won't know whether the zeros are true zeros. This mitigates the problem of group disclosure, whereas the original values showed that all 15-19 year olds were on a low income. 
  • Even if the true grand total or marginal totals were known from other sources, the user is still unable to calculate the true values of the internal cells. 

These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15-19 year olds in Low income was 14 in Table 7a, then the rounded counts for this row in Table 7b would still be 15, 0, 0 (total = 15). If the true total was somehow known from other sources to be 14, then the true value of the Low income category must also be 14 (it can't be 15 or 16, since this would require a negative number in one or both of the other two income categories).
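This reasoning can be checked by enumerating every set of true values that is consistent with the rounded row and the known total. The sketch assumes nearest-multiple rounding with an odd base; the function name `candidates` is hypothetical.

```python
from itertools import product


def candidates(rounded, base=3):
    """Non-negative integers that round to `rounded` (nearest multiple, odd base)."""
    half = base // 2
    return list(range(max(0, rounded - half), rounded + half + 1))


rounded_row = [15, 0, 0]   # Low, Medium, High for 15-19 year olds after rounding
known_total = 14           # true row total learned from another source

consistent = [
    combo
    for combo in product(*(candidates(r) for r in rounded_row))
    if sum(combo) == known_total
]
print(consistent)  # [(14, 0, 0)]
```

Only one combination survives, so the rounding provides no protection for this row once the true total is known.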

The main disadvantage to rounding is that there can be inconsistency within the table. For example in Table 7b, the internal cells of the 25-29 year-olds row sum to 24, whereas the total for that row is 21. Although controlled rounding can be used to ensure additivity within the table (where the totals add up), it may not provide consistency across the same cells in different tables. 

Graduated rounding can also be used to round magnitude tables, which means the rounding base varies by the size of the cell. Table 8 shows how data could be protected by rounding the original values to base 100 (Industry A, B, C and Total) or 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.

Table 8: Example of profit by industry with graduated rounding

| Industry | Original values: profit ($m) | Rounded values: profit ($m) |
|---|---|---|
| A | 267 | 300 |
| B | 302 | 300 |
| C | 212 | 200 |
| D | 34 | 30 |
| Total | 815 | 800 |
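Graduated rounding can be sketched by choosing the base from the size of the value. The cut-off of 100 between the two bases is an assumption chosen so that the output matches Table 8; the source does not state where the boundary lies.

```python
def graduated_round(value, cutoff=100, large_base=100, small_base=10):
    """Round to `large_base` at or above `cutoff`, otherwise to `small_base`."""
    base = large_base if value >= cutoff else small_base
    return base * ((value + base // 2) // base)


profits = {"A": 267, "B": 302, "C": 212, "D": 34, "Total": 815}
rounded = {name: graduated_round(value) for name, value in profits.items()}
print(rounded)  # {'A': 300, 'B': 300, 'C': 200, 'D': 30, 'Total': 800}

# The rounded industries sum to 830, not the rounded total of 800.
print(sum(v for name, v in rounded.items() if name != "Total"))  # 830
```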

Perturbation

Perturbation is another data modification method. This is where a change (often with a random component) is made to some or all non-zero cells in a table. 

  • For count data (frequency tables), a randomised number is added to the original values. This is called additive perturbation.
  • For magnitude tables, the original values are multiplied by a randomised number. This is called multiplicative perturbation.

For both table types, this can be further broken down into targeted or global approaches. 

Targeted perturbation is the approach taken when only those cells that are considered to be a disclosure risk are treated. Often this requires the application of secondary perturbation in order to maintain additivity within a table. 

This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to Industries A and C.

Table 9: Example of profit by industry with targeted perturbation

| Industry | Original values: profit ($m) | Perturbed values: profit ($m) |
|---|---|---|
| A | 267 | 292 |
| B | 302 | 252 |
| C | 212 | 237 |
| D | 34 | 34 |
| Total | 815 | 815 |
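The adjustment in Table 9 can be expressed as a set of changes that sum to zero, which is what keeps the total fixed. This is only a sketch of the bookkeeping; in practice, choosing the amounts is the hard part. The function name `apply_adjustments` is hypothetical.

```python
def apply_adjustments(cells, adjustments):
    """Apply per-cell adjustments that must net to zero, so the total is preserved."""
    if sum(adjustments.values()) != 0:
        raise ValueError("adjustments must sum to zero to preserve the total")
    return {name: value + adjustments.get(name, 0) for name, value in cells.items()}


original = {"A": 267, "B": 302, "C": 212, "D": 34}   # $m, Table 2a
perturbed = apply_adjustments(original, {"A": 25, "B": -50, "C": 25})

print(perturbed)                                      # {'A': 292, 'B': 252, 'C': 237, 'D': 34}
print(sum(original.values()), sum(perturbed.values()))  # 815 815
```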

There are two key advantages of this approach:

  • the total does not change (this is an important feature when the ABS releases economic data which then feed into National Accounts)
  • generally, there is minimal loss of information

A disadvantage is that some data that is not disclosive per se is altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to place the $50m taken from Industry B in a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), significant manual effort is often required, all the more so because all other tables produced need to have matching perturbed values.

This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data is not economically sensitive, then a global perturbation approach may be more appropriate, as it can be automated. An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table. The marginal totals may not be the same as the sum of their constituent cells. This methodology is applied in TableBuilder, an ABS product used for safely releasing both Census and survey data.
 
Tables 10a and 10b show how data might look before and after additive perturbation is applied.

Here the user has no chance of determining that the count of low income 15-19 year olds is '1'. They can still make reasonable estimates of the true value, but they are unable to confirm their guesses. Because perturbation only applies small changes and applies changes to every cell, the results are unbiased and for most purposes, the overall value of the table is retained.
 

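Table 10a: Income data by age (original)

| Age (years) | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| 15-19 | 1 | 2 | 3 | 5 | 11 |
| 20-24 | 6 | 3 | 2 | 7 | 18 |
| 25-29 | 2 | 7 | 8 | 4 | 21 |
| 30-34 | 4 | 11 | 15 | 4 | 34 |
| Total | 13 | 23 | 28 | 20 | 84 |

Table 10b: Income data by age with additive perturbation

| Age (years) | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| 15-19 | 0 | 4 | 3 | 7 | 10 |
| 20-24 | 4 | 6 | 0 | 4 | 21 |
| 25-29 | 0 | 7 | 9 | 5 | 21 |
| 30-34 | 7 | 10 | 16 | 4 | 32 |
| Total | 12 | 25 | 25 | 21 | 83 |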
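A toy version of additive perturbation is sketched below. It is not the TableBuilder algorithm, whose parameters are not published here; the function name, the uniform noise and the `max_change` bound are all assumptions made for illustration.

```python
import random


def perturb_counts(counts, max_change=2, seed=None):
    """Add independent random integer noise to each non-zero count.

    Zero cells are left unchanged and results are floored at zero.
    """
    rng = random.Random(seed)
    perturbed = []
    for count in counts:
        if count == 0:
            perturbed.append(0)
        else:
            noise = rng.randint(-max_change, max_change)  # hypothetical noise distribution
            perturbed.append(max(0, count + noise))
    return perturbed


# 15-19 row of Table 10a, including its total, perturbed cell by cell.
original_row = [1, 2, 3, 5, 11]
print(perturb_counts(original_row, seed=2021))
```

Because the total is perturbed independently of the interior cells, the perturbed row will not generally add up, which matches the behaviour described for global perturbation.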

Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company's income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total. 

Table 11 shows how data might look before and after multiplicative perturbation is applied. 

The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker does not know the extent to which each contributor's profit has been perturbed or therefore how close the perturbed total is to the true total. 

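Table 11: Top 3 contributing companies to an industry (Industry B) with multiplicative perturbation

| Company | Original values: profit ($m) | Perturbed values: profit ($m) |
|---|---|---|
| S | 150 | 123 |
| T | 93 | 104 |
| U | 21 | 18 |
| V | 13 | 13 |
| W | 8 | 8 |
| X | 8 | 8 |
| Y | 6 | 6 |
| Z | 3 | 3 |
| Total | 302 | 283 |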
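The following sketch shows the general shape of multiplicative perturbation applied to the largest contributors. The function name, the uniform distribution and the 20% bound are assumptions for illustration; they are not the parameters used to produce Table 11.

```python
import random


def perturb_top_contributors(values, top_n=3, max_pct=20, seed=None):
    """Multiply the top_n largest values by random factors within +/- max_pct percent.

    Returns the perturbed values and their sum, which becomes the published cell total.
    """
    rng = random.Random(seed)
    largest = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:top_n]
    perturbed = list(values)
    for i in largest:
        factor = 1 + rng.uniform(-max_pct, max_pct) / 100  # hypothetical distribution
        perturbed[i] = round(values[i] * factor)
    return perturbed, sum(perturbed)


industry_b = [150, 93, 21, 13, 8, 8, 6, 3]  # $m, companies S to Z
new_values, new_total = perturb_top_contributors(industry_b, seed=11)
print(new_values, new_total)
```

Smaller contributors are left untouched, so the published total moves only as far as the perturbed top values move it.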

Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but there is robust protection of its individual contributors. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.

Table 12: Example of profit by industry with multiplicative perturbation

| Industry | Original values: profit ($m) | Perturbed values: profit ($m) |
|---|---|---|
| A | 267 | 296 |
| B | 302 | 283 |
| C | 212 | 185 |
| D | 34 | 38 |
| Total | 815 | 821 |

An important issue to keep in mind with all perturbation is that the perturbation itself may adversely affect the usefulness of the data. Tables 10b and 12 show perturbation where the totals are quite close to the true value, but individual cells can be changed by 100%. When data custodians are releasing data, they need to clearly communicate the process by which they have perturbed the data (although they shouldn't provide exact details on key parameters of perturbation that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Further analysis done on the unperturbed data should be carefully checked before being released into the public domain to ensure the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable anyway, either because of sampling error or because responses are not always accurate.

Hierarchical data treatment techniques

All of the methods described above are limited in how they deal with hierarchical datasets (datasets with information at different levels). For example, a file may contain records for each family at one level and in another file separate records for individual family members. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.

In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.

The following example shows a summary table (Table 13), which is derived from detailed information in Table 14.

On the surface, Table 13 appears non-disclosive: for example, it doesn't violate a frequency rule of 4. However, a closer look at the source data reveals the following disclosure risks:

  • The summary count of 'Pathology' in the 'Private' sector in 'East' location (61) is based on only 2 providers (Lisa and Stu). Thus, both Lisa and Stu could subtract their own contribution to determine the income of the other. 
  • Only one clinic (Clinic D) is represented in this same cell.
  • The summary count of 'Surgery' in the 'Public' sector in 'West' location (5) is from only 2 patients and 1 provider (Pru). So other companies can use the service fee value to estimate Clinic E's income for 'Surgery'.

All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.

 

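Table 13: Summary of health care service counts (based on Table 14 data)

| Service type | Public, East | Public, West | Private, East | Private, West | Total |
|---|---|---|---|---|---|
| Treatment | 0 | 47 | 95 | 0 | 142 |
| Surgery | 0 | 5 | 0 | 209 | 214 |
| Pathology | 0 | 10 | 61 | 7 | 78 |
| Total | 0 | 62 | 156 | 216 | 434 |

Table 14: Counts of health care patients and services

| Sector | Location | Company | Clinic | Service | Provider | Patients | Services | Bulk bill services | Service fee ($) |
|---|---|---|---|---|---|---|---|---|---|
| Private | East | Q | D | Pathology | Lisa | 15 | 29 | | 165 |
| Private | East | Q | D | Pathology | Stu | 18 | 32 | | 160 |
| Private | East | Q | B | Treatment | Joe | 3 | 5 | | 90 |
| Private | East | Q | B | Treatment | Jan | 8 | 31 | | 95 |
| Private | East | Q | B | Treatment | Deb | 6 | 22 | | 105 |
| Private | East | Q | B | Treatment | Em | 5 | 31 | | 98 |
| Private | East | Q | B | Treatment | Fred | 3 | 6 | | 85 |
| Private | West | Q | C | Pathology | Ian | 7 | 7 | | 180 |
| Private | West | Q | C | Surgery | Bill | 3 | 8 | | 95 |
| Private | West | R | E | Surgery | Tess | 36 | 201 | | 105 |
| Public | West | P | A | Pathology | Meg | 4 | 10 | 6 | 140 |
| Public | West | P | E | Surgery | Pru | 2 | 5 | | 120 |
| Public | West | P | A | Treatment | Rob | 3 | 7 | 6 | 80 |
| Public | West | P | A | Treatment | Al | 14 | 40 | 7 | 80 |
| Total | 2 | 3 | 5 | | 14 | 127 | 434 | 19 | |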
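A check across levels can be sketched by grouping the Table 14 records into the cells of Table 13 and counting contributors at each level. The threshold of 4 providers and the variable names are assumptions made for this example.

```python
from collections import defaultdict

# (sector, location, company, clinic, service, provider, patients, services) from Table 14.
records = [
    ("Private", "East", "Q", "D", "Pathology", "Lisa", 15, 29),
    ("Private", "East", "Q", "D", "Pathology", "Stu", 18, 32),
    ("Private", "East", "Q", "B", "Treatment", "Joe", 3, 5),
    ("Private", "East", "Q", "B", "Treatment", "Jan", 8, 31),
    ("Private", "East", "Q", "B", "Treatment", "Deb", 6, 22),
    ("Private", "East", "Q", "B", "Treatment", "Em", 5, 31),
    ("Private", "East", "Q", "B", "Treatment", "Fred", 3, 6),
    ("Private", "West", "Q", "C", "Pathology", "Ian", 7, 7),
    ("Private", "West", "Q", "C", "Surgery", "Bill", 3, 8),
    ("Private", "West", "R", "E", "Surgery", "Tess", 36, 201),
    ("Public", "West", "P", "A", "Pathology", "Meg", 4, 10),
    ("Public", "West", "P", "E", "Surgery", "Pru", 2, 5),
    ("Public", "West", "P", "A", "Treatment", "Rob", 3, 7),
    ("Public", "West", "P", "A", "Treatment", "Al", 14, 40),
]

cells = defaultdict(lambda: {"services": 0, "patients": 0, "providers": set(), "clinics": set()})
for sector, location, company, clinic, service, provider, patients, services in records:
    cell = cells[(service, sector, location)]
    cell["services"] += services
    cell["patients"] += patients
    cell["providers"].add(provider)
    cell["clinics"].add((company, clinic))

MIN_PROVIDERS = 4  # hypothetical threshold applied at the provider level
flagged = {key for key, cell in cells.items() if len(cell["providers"]) < MIN_PROVIDERS}

for key in sorted(flagged):
    cell = cells[key]
    print(key, cell["services"], "services from", len(cell["providers"]), "provider(s)")
```

Every non-zero cell except Treatment in the Private sector in East is flagged at the provider level, even though each service count in Table 13 is above 4.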

The example above shows the need to protect information at all levels. With hierarchical data, data treatment must be applied at every level, and any consequences of these changes need to be followed through to other levels. For example, if the number of providers in Table 14 is reduced to zero, then the count of services (in Tables 13 and 14) also needs to be zero.