1160.0 - ABS Confidentiality Series, Aug 2017  
First issue, released at 11:30 am (Canberra time) 23/08/2017

MANAGING THE RISK OF DISCLOSURE: TREATING AGGREGATE DATA

This page contains the following:
Tables and disclosure risks
Rules to identify at-risk cells
Data treatment techniques: tabular data
1. Data reduction
2. Data modification
Data treatment techniques: hierarchical data


TABLES AND DISCLOSURE RISKS

Aggregate data are usually presented as tables, though they may also appear as maps or graphs. In the following discussion, ‘aggregate data’ and ‘tables’ are used interchangeably. Different types of tables carry different disclosure risks, and further risks may exist if users can access multiple tables containing common elements. For example, users could potentially re-identify a person by differencing the outputs of two separate tables to which that person contributes.

Frequency tables

Each cell in a frequency table contains the number of contributors (e.g. individuals, households or organisations). Disclosures can occur where one or more cells have a low count (i.e. a small number of contributors). Assessments of whether a cell’s value is too low must be based on the underlying count of its contributors, not on its weighted estimate.

Magnitude tables

Each cell in a magnitude table summarises the numerical contributions to that cell (often as a total or mean, but sometimes as a median, mode or range). A typical example is a table reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (e.g. one or two businesses with extremely high turnover) dominate the cell value.


RULES TO IDENTIFY AT-RISK CELLS
Some common rules can help identify which cells may constitute a disclosure risk. By using these rules, a data custodian makes an implicit decision that all cells which break a rule pose an unacceptable disclosure risk, while all other cells are deemed non-disclosive. Each rule below provides protection against attackers trying to re-identify a contributor or disclose an attribute about them. The rules also mitigate attacks where one contributor tries to discover information about another.

Recognising that strict applications of these rules may decrease the data’s usefulness, data custodians will need to set appropriate rule values that are informed by legislation and organisational policy, disclosure risk assessment and statistical theory. The advantage of using rules is that they are simple, clear, consistent, transparent and amenable to automation. They are ideally applied in situations where:

  • Data are released in a similar manner (for example when the same kind of dataset is regularly released).
  • Transparency is important to both data custodian and users.
  • Limited opportunity or requirement exists for engagement between data custodians and users.

However, strict application means that there may be situations where non-disclosive data are treated unnecessarily, or where non-treated data are in fact disclosive. The ABS is able to provide advice on these rules.

Frequency rule

Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors to a cell. Application of this rule means that cells with counts below the threshold are defined to pose an unacceptable disclosure risk and need to be protected. However, some things need to be kept in mind when using this rule:
  • There is no strict statistical basis for choosing one threshold value over another.
  • Higher values will increase protection against re-identification but also degrade the utility of the original data.
  • A lower threshold value may be appropriate for sampled datasets compared to population datasets.
  • By implication, everything greater than or equal to the threshold is defined as posing an acceptable disclosure risk.
  • It may be that a cell below the threshold value is not in fact a disclosure risk, while a cell value greater than the threshold value is disclosive.

Judgement needs to be exercised when considering cells with no contributors (i.e. ‘zero cells’) or where all contributors to a row or column are concentrated in one cell (i.e. ‘100% cells’), as these may also pose a disclosure risk.
A hypothetical application of the frequency rule is presented in Table 1. Because a frequency rule of 4 is chosen for this example, any cell with fewer than 4 contributors is deemed a disclosure risk. In this example, the 25–29 year old age group has only 3 contributors in the Low Income cell, and therefore the cell needs to be protected.


TABLE 1: EXAMPLE OF THE FREQUENCY RULE IN AGGREGATE COUNTS (THRESHOLD VALUE = 4)

                         Income
Age (years)     Low   Medium   High   Total
15-19            16        0      0      16
20-24             8       10      7      25
25-29             3        8     11      22
30-34             4        5     18      27
Total            31       23     36      90
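
As a minimal sketch (in Python, with hypothetical names), the frequency rule can be automated by flagging every cell whose unweighted count is greater than zero but below the threshold:

    THRESHOLD = 4  # the example threshold used in Table 1

    # Internal cells of Table 1, keyed by (age group, income category).
    cells = {
        ("15-19", "Low"): 16, ("15-19", "Medium"): 0,  ("15-19", "High"): 0,
        ("20-24", "Low"): 8,  ("20-24", "Medium"): 10, ("20-24", "High"): 7,
        ("25-29", "Low"): 3,  ("25-29", "Medium"): 8,  ("25-29", "High"): 11,
        ("30-34", "Low"): 4,  ("30-34", "Medium"): 5,  ("30-34", "High"): 18,
    }

    # Counts of 1 to THRESHOLD - 1 break the rule. Zero cells pass the rule
    # itself but still call for the separate judgement discussed above.
    at_risk = {cell: n for cell, n in cells.items() if 0 < n < THRESHOLD}
    print(at_risk)  # {('25-29', 'Low'): 3}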



Cell dominance rule
This applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent the re-identification of units that contribute a large percentage of a cell’s total value. Also called the cell concentration rule, the cell dominance rule defines the number of units that are allowed to contribute a defined percentage of the total.

This rule can also be referred to as the (n,k) rule, where n is the number of units which may not contribute more than k% of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total value to a cell. As with the frequency rule there is no strict statistical basis for choosing values for the dominance rule.

P% rule

The P% rule can be used in combination with the cell dominance rule to protect the attributes of contributors to magnitude data. It aims to prevent the value of any contributor from being estimated to within P%, and it is especially helpful where a user may know that a particular person or organisation is in a table. Applying this rule limits how close, in percentage terms, a contributor’s estimated value can be to its true value. For example, a rule of P% = 20% means that the estimated value of any contributor must differ from its true value by at least 20%.

Using the Cell dominance and P% rules together

Below is an example of the dominance and P% rules used together where (n, k)=(2, 75) and P%=20%. Consider the data in Table 2a, which shows profit for Industries A–D.


TABLE 2a: EXAMPLE OF PROFIT BY INDUSTRY

Industry   Profit ($m)
A                  267
B                  302
C                  212
D                   34
Total              815



Although at first glance the table does not appear to contain a disclosure risk, one does exist if information about the companies contributing the data to each industry is known.

This information could include the data in Table 2b, which shows the companies contributing to Industry B. In this table, the two companies S and T contribute (150 + 93)/302 = 80.5% of the total. This exceeds the dominance rule value of k=75%, which means that this cell (Industry B) needs to be protected. Had the P% rule been applied on its own, the disclosure risk in Table 2a would not have been identified: if company S or T subtracted their own contribution to try to estimate the contribution of the other main contributor, in both cases the result would be more than 20% different from the actual value. For example, if company S subtracted its own $150m contribution from the $302m total, the resulting $152m would differ from company T’s contribution by (152 – 93)/93 = 63%. Therefore the P% rule would not be broken.

This shows that a combination of approaches may be necessary to identify at-risk cells. Even though Table 2b data should never be released (because it refers to individual companies), anyone familiar with Industry B may know that Companies S and T are the two main contributors. The summary data in Table 2a would therefore require treatment, even if at first glance it appears safe.

TABLE 2b: CONTRIBUTORS TO INDUSTRY B

Company   Profit ($m)
S                 150
T                  93
U                  21
V                  13
W                   8
X                   8
Y                   6
Z                   3
Total             302
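
Both checks are easy to automate. The sketch below (Python; function names and parameters are illustrative, not from any standard library) applies them to the hypothetical Industry B contributions in Table 2b, with the P% check following the subtraction attack described above (other formulations of the P% rule exist):

    # Contributions of Companies S-Z to Industry B ($m), from Table 2b.
    contributions = [150, 93, 21, 13, 8, 8, 6, 3]

    def breaks_dominance(values, n=2, k=75.0):
        # (n, k) rule: the top n units may not contribute more than k% of the total.
        top_n = sum(sorted(values, reverse=True)[:n])
        return 100.0 * top_n / sum(values) > k

    def breaks_p_percent(values, p=20.0):
        # Each of the two largest units subtracts its own value from the cell
        # total and treats the remainder as an estimate of the other unit.
        # The rule is broken if either estimate lands within p% of the truth.
        v = sorted(values, reverse=True)
        total = sum(values)
        for attacker, target in ((v[0], v[1]), (v[1], v[0])):
            estimate = total - attacker
            if 100.0 * abs(estimate - target) / target < p:
                return True
        return False

    print(breaks_dominance(contributions))  # True: (150 + 93)/302 = 80.5% > 75%
    print(breaks_p_percent(contributions))  # False: both estimates are >20% off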



DATA TREATMENT TECHNIQUES: TABULAR DATA

Due to the diverse nature of data there is no ‘one size fits all’ solution for managing re-identification risks through the treatment of aggregate data. The key, however, is to treat only the cells that have been deemed an unacceptable disclosure risk.

The most common techniques are:
  • Data reduction (which decreases the detail available to the user).
  • Data modification (which makes small changes to the data).

How effective these techniques are depends on:
  • The structure of the dataset.
  • The requirements of the data users.
  • Legislative or policy requirements.
  • The available infrastructure to treat and disseminate the data.

When releasing aggregate data into the public domain, all reasonable efforts should be made to prevent disclosure from those data. While it will not be possible to guarantee confidentiality, these efforts must satisfy legislative requirements (for example, ABS legislation states that data may not be released “in a manner likely to identify”). This is important because once a table is made public, there are no further opportunities to control how the data will be used (e.g. to apply other confidentiality controls using the Five Safes Framework).

Data reduction techniques are generally easy to implement but can have significant confidentiality issues. By contrast, data modification techniques are harder to implement but generally produce tables that are less disclosive. These issues are highlighted in Table 3.

It is recommended that data custodians start by using simple techniques (i.e. data reduction) and only proceed to more difficult ones (i.e. data modification) if an objective assessment shows the results of the simple techniques have an unacceptable risk of disclosure. This assessment could be made at an organisational level, rather than repeated for every table. For example, the ABS has assessed that TableBuilder’s perturbation algorithm, despite its complex methodology, is safer than data reduction for frequency tables.

TABLE 3: COMPARISON OF TECHNIQUES FOR TREATING TABULAR DATA

Data reduction
  Advantages:
    • Relatively easy to implement
    • Requires minimal education of users
  Disadvantages:
    • Does not reliably protect individuals from differencing between multiple overlapping tables
    • May reduce the data’s usefulness
    • The data custodian chooses what data to remove without necessarily knowing what is most important to the data users
    • Requires secondary suppression (to protect the original primary suppressed cells)
    • Even with secondary suppression, some suppressed cells may still be estimated

Data modification
  Advantages:
    • Generally does not affect the data’s overall utility
    • Generally protects against differencing, zero cells and 100% cells
    • May be automated, requiring minimal human input
  Disadvantages:
    • Does not provide additivity within tables unless secondary modifications are applied
    • Requires some education of users
    • Entails significant setup costs
    • May reduce the data’s usefulness (particularly when analysing small areas/populations)



Approaches for applying these disclosure control techniques are outlined below. Though potentially time-consuming, they allow data custodians to release data that would otherwise remain inaccessible.


Data reduction

Data reduction protects tables by either combining categories or suppressing cells. It limits the risks of individual re-identification because cells with a low number of contributors or dominant contributors are not released.

Data reduction involves:
  • Combining variable categories.
  • Suppressing counts with a small number of contributors (as per the frequency rule), and considering suppression of higher counts for sensitive items.
  • Suppressing cells with dominant contributors (as per the cell dominance rule).

Combining categories:

This approach can be applied to Table 4a, where the value ‘3’ does not meet a frequency threshold of 4. This cell can be protected in either of two ways:
  • Combine the 20–24 and the 25–29 year old age groups to create a 20–29 year old range (see Table 4b).
  • Combine the Low and Medium income categories to create a single Low–Medium category (see Table 4c).

TABLE 4a: UNPROTECTED INCOME AND AGE DATA (THRESHOLD VALUE = 4)

                         Income
Age (years)     Low   Medium   High   Total
15-19            16        0      0      16
20-24             8       10      7      25
25-29             3        8     11      22
30-34             4        5     18      27
Total            31       23     36      90


TABLE 4b: TREATMENT APPLIED - AGE GROUPS COMBINED

                         Income
Age (years)     Low   Medium   High   Total
15-19            16        0      0      16
20-29            11       18     18      47
30-34             4        5     18      27
Total            31       23     36      90


TABLE 4c: TREATMENT APPLIED - INCOME CATEGORIES COMBINED

                        Income
Age (years)     Low-Medium   High   Total
15-19                   16      0      16
20-24                   18      7      25
25-29                   11     11      22
30-34                    9     18      27
Total                   54     36      90
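
A minimal sketch of the first option is shown below (Python; the merge mapping is hypothetical and reproduces Table 4b):

    # Internal cells of Table 4a.
    rows = {
        "15-19": {"Low": 16, "Medium": 0,  "High": 0},
        "20-24": {"Low": 8,  "Medium": 10, "High": 7},
        "25-29": {"Low": 3,  "Medium": 8,  "High": 11},
        "30-34": {"Low": 4,  "Medium": 5,  "High": 18},
    }

    # Map each original age group to its published (possibly combined) group.
    merge = {"15-19": "15-19", "20-24": "20-29", "25-29": "20-29", "30-34": "30-34"}

    combined = {}
    for age, counts in rows.items():
        target = combined.setdefault(merge[age], {"Low": 0, "Medium": 0, "High": 0})
        for income, n in counts.items():
            target[income] += n

    print(combined["20-29"])  # {'Low': 11, 'Medium': 18, 'High': 18}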



The choice of which categories to combine will depend on how important they are for further analysis and what the purpose is in releasing the data. For example, if researchers specifically want to analyse the ‘Low’ income range, collapsing it into a ‘Low–Medium’ range will prevent this. There is therefore no ‘right answer’ when choosing which categories to combine: in either case, the data’s utility may be affected.

Combining categories is often appropriate, but it is not foolproof. For example, if another table were produced containing the 20–24 year old row, the 25–29 year old values could be determined by subtracting the 20–24 row (in the new table) from the 20–29 row (in Table 4b).

Data custodians must carefully consider other tables, about the same contributors, that:
  • Are being released at the same time.
  • Are likely to be released in the future.
  • Have already been released.

This is not a trivial process and the ABS can provide advice on how these techniques can best be applied.

Suppression:

This approach involves removing cells deemed to be a disclosure risk from a table.

In Table 4a, for example, the ‘3’ could be replaced with ‘not provided’ or ‘np’. This is called primary suppression. Sometimes secondary, or consequential, suppression will also be required. For example, in addition to suppressing the ‘3’ cell, other cells will need to be suppressed to prevent the primary suppressed cell from being calculated. Table 5a provides an example of how primary and consequential suppression can be used to treat Table 4a data.

Other cell suppression combinations could be employed. Table 5b, for example, suppresses cells in the totals. The key is to ensure that the suppressed cells cannot be derived from the remaining information.
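
A minimal sketch of one simple (deliberately non-optimal) approach is given below: after primary suppression, any row or column left with exactly one suppressed cell has its smallest remaining non-zero cell suppressed as well, since a lone suppressed cell could be recovered from the marginal total (zero cells are skipped; see the note on zero cells below). On the Table 4a data this happens to reproduce the pattern of Table 5a:

    ages = ["15-19", "20-24", "25-29", "30-34"]
    incomes = ["Low", "Medium", "High"]
    counts = {
        ("15-19", "Low"): 16, ("15-19", "Medium"): 0,  ("15-19", "High"): 0,
        ("20-24", "Low"): 8,  ("20-24", "Medium"): 10, ("20-24", "High"): 7,
        ("25-29", "Low"): 3,  ("25-29", "Medium"): 8,  ("25-29", "High"): 11,
        ("30-34", "Low"): 4,  ("30-34", "Medium"): 5,  ("30-34", "High"): 18,
    }

    # Primary suppression: the frequency rule with a threshold of 4.
    suppressed = {cell for cell, n in counts.items() if 0 < n < 4}

    # Secondary suppression: repeat until no row or column contains
    # exactly one suppressed cell.
    lines = [[(a, i) for i in incomes] for a in ages] + \
            [[(a, i) for a in ages] for i in incomes]
    changed = True
    while changed:
        changed = False
        for line in lines:
            hidden = [c for c in line if c in suppressed]
            visible = [c for c in line if c not in suppressed and counts[c] > 0]
            if len(hidden) == 1 and visible:
                suppressed.add(min(visible, key=counts.get))
                changed = True

    print(sorted(suppressed))
    # [('25-29', 'Low'), ('25-29', 'Medium'), ('30-34', 'Low'), ('30-34', 'Medium')]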

TABLE 5a: PRIMARY AND SECONDARY SUPPRESSION TO PROTECT TABULAR DATA

                         Income
Age (years)     Low   Medium   High   Total
15-19            16        0      0      16
20-24             8       10      7      25
25-29            np       np     11      22
30-34            np       np     18      27
Total            31       23     36      90


TABLE 5b: SUPPRESSION OF TOTALS TO PROTECT TABULAR DATA

                         Income
Age (years)     Low   Medium   High   Total
15-19            16        0      0      16
20-24             8       10      7      25
25-29            np        8     11     >19
30-34             4        5     18      27
Total           >28       23     36     >87



Care needs to be taken when choosing the pattern of secondary suppression. In particular, the suppression of cells that contain a zero is generally not recommended. For example, if the two zeros in Table 5a were suppressed, it would be quite apparent from the row total and the Low Income 15-19 year old cell that they are in fact zeros. In addition, it may be the case that a value must be zero by definition (often called a structural zero; for example, the number of pregnant males in a population).

There is a large body of literature as well as commercially available software for determining the optimal pattern of secondary suppression.

Limitations of data reduction

As with combining categories, it is possible that other available tables may contain cell values or grand totals that could be used to break the suppression. For example, if Tables 5a and 5b were both publicly available, the suppressed ‘Medium’ income values in Table 5a could be replaced with the unsuppressed values from Table 5b. Table 5a would then contain enough information to deduce all of the remaining suppressed values.

In some cases, the results of data suppression may appear safer than they truly are. Tables 6a–c illustrate how suppressed cell values can be deduced using techniques such as linear algebra. That is, while the data in Table 6b appear safe, someone could assign a variable to each of the ‘np’s to create Table 6c—the consequences of which are explained below.

TABLE 6a: INCOME DATA BY AGE (BEFORE DATA SUPPRESSION)

                               Income
Age (years)     Low   Medium   High   Very High   Total
15-19             1        2      3           5      11
20-24             6        3      2           7      18
25-29             2        7      8           4      21
30-34             4       11     15           4      34
Total            13       23     28          20      84


TABLE 6b: INCOME DATA BY AGE (AFTER APPLYING THRESHOLD = 4)

                               Income
Age (years)     Low   Medium   High   Very High   Total
15-19            np       np     np           5      11
20-24             6       np     np           7      18
25-29            np        7      8          np      21
30-34            np       11     15          np      34
Total            13       23     28          20      84


TABLE 6c: INCOME DATA BY AGE (WITH VARIABLES ASSIGNED)

                               Income
Age (years)     Low   Medium   High   Very High   Total
15-19             a        b      c           5      11
20-24             6        d      e           7      18
25-29             f        7      8           g      21
30-34             h       11     15           i      34
Total            13       23     28          20      84



With variables assigned to Table 6c, the values in rows 1–2 and columns 2–3 can be used to generate the following equation:

(a+b+c+5) + (6+d+e+7) – (b+d+7+11) – (c+e+8+15) = 11 + 18 – 23 – 28

The variables b, c, d and e cancel out in this equation to give:

a – 23 = –22

Therefore, a = 1

This is the correct and disclosive value (from Table 6a) that the suppression (in Table 6b) was meant to protect. Thus linear algebra and other techniques such as integer linear programming can be used to calculate, or make reasonable estimates of, the missing values in tables. In fact, the larger the table (or the set of related tables available), the more accurately the results can be estimated. As computing power increases and more data are released, these sorts of attacks on data will become easier.
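
This deduction can be reproduced mechanically. A minimal sketch using sympy, solving all the row and column constraints of Table 6c simultaneously:

    from sympy import symbols, linsolve

    a, b, c, d, e, f, g, h, i = symbols("a b c d e f g h i")

    # One constraint per row and per column of Table 6c.
    constraints = [
        a + b + c + 5 - 11,    # row 15-19
        6 + d + e + 7 - 18,    # row 20-24
        f + 7 + 8 + g - 21,    # row 25-29
        h + 11 + 15 + i - 34,  # row 30-34
        a + 6 + f + h - 13,    # column Low
        b + d + 7 + 11 - 23,   # column Medium
        c + e + 8 + 15 - 28,   # column High
        5 + 7 + g + i - 20,    # column Very High
    ]

    # The system is underdetermined, but in the solution the first
    # coordinate (a) is pinned to 1 regardless of the free parameters.
    print(linsolve(constraints, (a, b, c, d, e, f, g, h, i)))

Adding the further constraint that every suppressed cell must be an integer between 1 and 3 (implied by the threshold rule) would narrow the remaining unknowns even more; this is the integer programming refinement mentioned above.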

Three other practical problems arise with suppression:
  • The usefulness of the data is reduced (for example, Table 5a has lost one third of its data). Specifically, there are cells that aren’t disclosive per se, but have been suppressed nonetheless.
  • It can be difficult and time consuming to select the best cells for secondary suppression (for large tables especially). Software packages are available that will optimise the suppression pattern in the table or set of tables that are provided to these packages.
  • The data are no longer machine readable (because the table now includes symbols [‘>’] or letters [‘np’]).

Although it can be relatively simple to suppress cells or combine categories, data custodians must still take care that they are confident their outputs are not disclosive.


Data modification

In contrast to data reduction techniques, data modification is generally applied to all non-zero cells in a table—not just those with a high disclosure risk. Data modification assumes that all cells could either be disclosive themselves or contribute to disclosure. It aims to change all non-zero cells by a small amount without reducing the table’s overall usefulness for most purposes.

The two methods discussed below are:
  • Rounding
  • Perturbation (global or targeted).

Rounding

The simplest approach to data modification is rounding to a specified base, where all values become divisible by the same number (often 3, 5 or 10). For example, Table 7 shows how the original data values (from Table 1) would look with their values rounded to base 3.

TABLE 7: EXAMPLE OF INCOME BY AGE (WITH ROUNDING TO BASE 3)

                Income (original values)       Income (rounded to base 3)
Age (years)     Low   Medium   High   Total    Low   Medium   High   Total
15-19            16        0      0      16     15        0      0      15
20-24             8       10      7      25      9        9      6      24
25-29             3        8     11      22      3        9     12      21
30-34             4        5     18      27      3        6     18      27
Total            31       23     36      90     30       24     36      90
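
A minimal sketch of rounding to base 3 is shown below; deterministic rounding to the nearest multiple, as used here, reproduces Table 7, though in practice random rounding (rounding up or down with probabilities based on the remainder) is also common:

    def round_to_base(value, base=3):
        # Round to the nearest multiple of base; Python's round() sends
        # exact halves to the even multiple.
        return base * round(value / base)

    row_25_29 = [3, 8, 11, 22]
    print([round_to_base(v) for v in row_25_29])  # [3, 9, 12, 21], as in Table 7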



It should be apparent that the data are still numerical (i.e. containing no symbols or letters), which is a practical advantage for users requiring machine readability.

Though the data loses some utility, rounding brings some significant advantages from a confidentiality point of view:
  • Users won’t know whether the rounded value of ‘3’ in Table 7 is actually a 2, 3 or 4.
  • Users won’t know whether the zeros are true zeros. This mitigates the problem of group disclosure; the original values showed that all 15–19 year olds were on a low income.
  • Even if the true grand total or marginal totals were known from other sources, the user is still unable to calculate the true values of the internal cells.

These advantages are not universally true and there may be situations where rounding can still lead to disclosure. For example, if the count of 15-19 year olds in Low Income was in fact 14 in Table 7, then the rounded counts for this row would still be 15, 0, 0 (total = 15). Now, if the true total was somehow known from other sources to be 14, then the true value of the low income category must also be 14 (it can’t be 15 or 16 since this would necessitate a negative number in one or both of the other two income categories).

The main disadvantage to rounding is that there can be inconsistency within the table (e.g. in Table 7, the internal cells of the 25–29 year-olds row sum to 24, whereas the total for that row is 21). Although controlled rounding can be used to ensure additivity within the table (i.e. that the totals add up), it may not provide consistency across the same cells in different tables.

Graduated rounding, where the rounding base varies with the size of the cell value, can also be used to round magnitude tables. Table 8 shows how data could be protected by rounding the original values to base 100 (Industries A, B, C and the Total) or base 10 (Industry D). Again, the total is not equal to the sum of the internal cells, but it is now much harder to estimate the relative contributions of units to these industry totals.

TABLE 8: EXAMPLE OF PROFIT BY INDUSTRY (WITH GRADUATED ROUNDING)

Industry   Original profit ($m)   Rounded profit ($m)
A                           267                   300
B                           302                   300
C                           212                   200
D                            34                    30
Total                       815                   800
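
A minimal sketch of graduated rounding; the size band (base 100 at or above $100m, base 10 below) is a hypothetical choice that reproduces Table 8:

    def graduated_round(value):
        # Larger values receive a larger rounding base.
        base = 100 if value >= 100 else 10
        return base * round(value / base)

    profits = [267, 302, 212, 34, 815]  # Industries A-D and the total
    print([graduated_round(v) for v in profits])  # [300, 300, 200, 30, 800]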



Perturbation

A second approach to data modification is perturbation. This is where a change (often with a random component) is made to some or all non-zero cells in a table.
  • For count data (frequency tables), a randomised number is added to the original values. This is called additive perturbation.
  • For magnitude tables, the original values are multiplied by a randomised number. This is called multiplicative perturbation.

For both table types, this can be further broken down into targeted or global approaches.

Targeted perturbation is the approach taken when only those cells that are deemed a disclosure risk are treated. Often this will require the application of secondary perturbation in order to maintain additivity within a table.

This approach is used by the ABS to release some economic data. An example (as shown in Table 9) would be to remove $50m from Industry B and add $25m to Industries A and C.

TABLE 9: EXAMPLE OF PROFIT BY INDUSTRY (WITH TARGETED PERTURBATION)

Industry   Original profit ($m)   Perturbed profit ($m)
A                           267                     292
B                           302                     252
C                           212                     237
D                            34                      34
Total                       815                     815
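
A minimal sketch of the treatment in Table 9: an amount is removed from the at-risk cell and returned to other cells so that the total is preserved. The choice of adjustment and recipient cells is, in practice, a manual and judgement-based step:

    profits = {"A": 267, "B": 302, "C": 212, "D": 34}

    adjustment = 50
    profits["B"] -= adjustment            # treat the dominated cell
    for industry in ("A", "C"):           # redistribute to preserve the total
        profits[industry] += adjustment // 2

    print(profits, sum(profits.values()))
    # {'A': 292, 'B': 252, 'C': 237, 'D': 34} 815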



There are two key advantages of this approach:
  • The total does not change (this is an important feature when the ABS releases economic data which then feed into National Accounts).
  • Generally, there is minimal loss of information.

A disadvantage is that some data that are not disclosive per se are altered to protect another cell (for example, the values of Industries A and C in Table 9). An alternative is to move the $50m taken from Industry B into a new sundry category. Although some of the processes for targeted perturbation are amenable to automation (such as using a constrained optimisation approach), significant manual effort is often required (more so when all other tables produced need to have matching perturbed values).

This requirement for manual treatment means targeted perturbation often requires a skilled team to maintain the usefulness of the data (such as those specialising in releasing economic trade data in the ABS). If this is not feasible for data custodians, and particularly when the data are not economically sensitive, then a global perturbation approach may be more appropriate (as it can be automated). An important feature of global perturbation is that each non-zero cell (including totals) is perturbed independently. Perturbation can also be applied in a way that ensures consistency between tables, but it cannot guarantee consistency within a table (i.e. the marginal totals may not be the same as the sum of their constituent cells). This methodology is applied to TableBuilder, an ABS product used for safely releasing both Census and survey data.

Table 10 shows how data might look before and after additive perturbation is applied.

Here the user has no chance of determining that the count of low income 15–19 year olds is ‘1’. They can, of course, still make reasonable estimates of the true value, but they will be unable to have confidence in their guesses. Because perturbation applies only small changes, and applies them to every cell, the results will be unbiased and, for most purposes, the overall value of the table is retained.

TABLE 10: INCOME DATA BY AGE (WITH ADDITIVE PERTURBATION)

                Income (original)                         Income (with additive perturbation)
Age (years)     Low   Medium   High   Very High   Total   Low   Medium   High   Very High   Total
15-19             1        2      3           5      11     0        4      3           7      10
20-24             6        3      2           7      18     4        6      0           4      21
25-29             2        7      8           4      21     0        7      9           5      21
30-34             4       11     15           4      34     7       10     16           4      32
Total            13       23     28          20      84    12       25     25          21      83
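
A minimal sketch of global additive perturbation is shown below. The ±2 range and the plain random draw are illustrative only; production systems (such as the TableBuilder methodology mentioned above) derive the adjustment from the records contributing to each cell, so that the same cell always receives the same adjustment across tables:

    import random

    rng = random.Random(0)  # fixed seed for a reproducible illustration

    def perturb_count(count, max_shift=2):
        if count == 0:
            return 0  # only non-zero cells are perturbed
        return max(0, count + rng.randint(-max_shift, max_shift))

    row_15_19 = [1, 2, 3, 5, 11]  # the original 15-19 row of Table 10, with its total
    print([perturb_count(v) for v in row_15_19])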




Perturbation can also be applied to magnitude tables (assuming targeted perturbation was not appropriate or viable), but in these cases it is multiplicative. This is because adding or subtracting a few dollars to a company’s income is unlikely to reduce the disclosure risk in any meaningful way. With multiplicative perturbation, the largest contributors to a cell total have their values changed by a percentage. The total of these perturbed values then becomes the new published cell total.
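
A minimal sketch for a single magnitude cell is given below; the number of perturbed contributors (3) and the ±20% range are illustrative assumptions:

    import random

    rng = random.Random(1)  # fixed seed for a reproducible illustration

    def perturb_magnitude_cell(contributions, top_n=3, max_pct=0.20):
        # Scale the largest top_n contributors by a random percentage, then
        # republish the cell total as the sum of the perturbed contributions.
        ranked = sorted(contributions, reverse=True)
        perturbed = [v * (1 + rng.uniform(-max_pct, max_pct)) if rank < top_n else v
                     for rank, v in enumerate(ranked)]
        return round(sum(perturbed))

    industry_b = [150, 93, 21, 13, 8, 8, 6, 3]  # Table 11 contributors ($m)
    print(perturb_magnitude_cell(industry_b))   # a published total near, but not at, 302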

Table 11 shows how data might look before and after multiplicative perturbation is applied.

The total has been changed by only about 6%, but the individual values of companies S and T have been protected. An attacker will not know the extent to which each contributor’s profit has been perturbed, nor therefore how close the perturbed total is to the true total.

TABLE 11: TOP 3 CONTRIBUTING COMPANIES TO AN INDUSTRY (INDUSTRY B) - WITH MULTIPLICATIVE PERTURBATION

Company   Original profit ($m)   Perturbed profit ($m)
S                          150                     123
T                           93                     104
U                           21                      18
V                           13                      13
W                            8                       8
X                            8                       8
Y                            6                       6
Z                            3                       3
Total                      302                     283



Table 12 shows how data from Table 11 might be combined with other industry totals in a released table. As with rounding, the table no longer adds up, but its individual contributors are robustly protected. For magnitude tables it may also be necessary to suppress low counts of contributors (or perturb all counts) as an extra protection.

TABLE 12: EXAMPLE OF PROFIT BY INDUSTRY - WITH MULTIPLICATIVE PERTURBATION

Industry   Original profit ($m)   Perturbed profit ($m)
A                           267                     296
B                           302                     283
C                           212                     185
D                            34                      38
Total                       815                     821



An important issue to keep in mind with all perturbation is that the perturbation itself may adversely affect the usefulness of the data. Tables 10 and 12 show perturbation where the totals are quite close to the true values, but individual cells can be changed by 100%. When data custodians release data, they need to clearly communicate the process by which the data have been perturbed (although they should not provide exact details of key perturbation parameters that could be used to undo the treatment). This communication should also include a caveat that further analysis of the perturbed data may not give accurate results. Any further analysis done on the unperturbed data should be carefully checked before being released into the public domain, to ensure the analysis and the perturbed table cannot be used in conjunction to breach confidentiality. On the other hand, it should also be recognised that small cell values may be unreliable in any case (due to sampling, or because responses are not always accurate).


DATA TREATMENT TECHNIQUES: HIERARCHICAL DATA

All of the methods described above are limited in how they deal with hierarchical datasets (these are datasets comprising information at different levels). For example, a file may comprise records for each family at one level and below that separate records for individual family members. The same structure could apply to datasets containing businesses and employees, or to health care services, providers and their clients.

In hierarchical datasets, contributors at all levels must be protected. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a table cell that passes the cell frequency rule at one level may not pass the same rule at a higher level.

The following hypothetical situation shows a summary table (Table 13), which is derived from detailed information in Table 14.

On the surface, Table 13 appears non-disclosive (for example, it doesn’t violate a frequency rule of 4). However, a closer look at the source data reveals the following disclosure risks:
  • The summary count of ‘Pathology’ in the ‘Private’ sector in ‘East’ location (61) is based on only 2 providers (Lisa and Stu). Thus, both Lisa and Stu could subtract their own contribution to determine the income of the other.
  • Only one clinic (Clinic D) is represented in this same cell.
  • The summary count of ‘Surgery’ in the ‘Public’ sector in ‘West’ location (5) comes from only 2 patients and 1 provider (Pru). Other companies could therefore use the service charge value to estimate Clinic E’s income for ‘Surgery’.

All of these scenarios are confidentiality breaches. In a real-world situation Table 14 would not be released because it contains names. But it should be assumed that someone familiar with the sector would know the contributors and may even be one of them.
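
A minimal sketch of a multi-level safety check on such data is shown below; the record layout is hypothetical (modelled on Table 14) and a threshold of 4 is applied at the patient, provider and clinic levels alike:

    records = [
        # (sector, location, service, clinic, provider, patients)
        ("Private", "East", "Pathology", "Clinic D", "Lisa", 15),
        ("Private", "East", "Pathology", "Clinic D", "Stu", 18),
        ("Public", "West", "Surgery", "Clinic E", "Pru", 2),
    ]

    def cell_is_safe(records, sector, location, service, threshold=4):
        rows = [r for r in records if r[:3] == (sector, location, service)]
        patients = sum(r[5] for r in rows)
        providers = {r[4] for r in rows}
        clinics = {r[3] for r in rows}
        # The cell must clear the threshold at every level of the hierarchy.
        return min(patients, len(providers), len(clinics)) >= threshold

    print(cell_is_safe(records, "Private", "East", "Pathology"))  # False: 2 providers, 1 clinic
    print(cell_is_safe(records, "Public", "West", "Surgery"))     # False: 2 patients, 1 provider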


TABLE 13: SUMMARY OF HEALTH CARE SERVICE COUNTS (BASED ON TABLE 14 DATA)

                       Public           Private
Service type      East    West     East    West    Total
Treatment            0      47       95       0      142
Surgery              0       5        0     209      214
Pathology            0      10       61       7       78
Total                0      62      156     216      434


TABLE 14: COUNTS OF HEALTH CARE PATIENTS AND SERVICES

Sector    Location   Corporation   Clinic     Service Type   Provider   Patients   Services   Bulk Bill Services   Service Charge ($)
Private   East       Company Q     Clinic D   Pathology      Lisa             15         29                    -                  165
Private   East       Company Q     Clinic D   Pathology      Stu              18         32                    -                  160
Private   East       Company Q     Clinic B   Treatment      Joe               3          5                    -                   90
Private   East       Company Q     Clinic B   Treatment      Jan               8         31                    -                   95
Private   East       Company Q     Clinic B   Treatment      Deb               6         22                    -                  105
Private   East       Company Q     Clinic B   Treatment      Em                5         31                    -                   98
Private   East       Company Q     Clinic B   Treatment      Fred              3          6                    -                   85
Private   West       Company Q     Clinic C   Pathology      Ian               7          7                    -                  180
Private   West       Company Q     Clinic C   Surgery        Bill              3          8                    -                   95
Private   West       Company R     Clinic E   Surgery        Tess             36        201                    -                  105
Public    West       Company P     Clinic A   Pathology      Meg               4         10                    6                  140
Public    West       Company P     Clinic E   Surgery        Pru               2          5                    -                  120
Public    West       Company P     Clinic A   Treatment      Rob               3          7                    6                   80
Public    West       Company P     Clinic A   Treatment      Al               14         40                    7                   80
Total: 2 sectors; 3 corporations; 6 clinics; 14 providers; 127 patients; 434 services; 19 bulk bill services.



The example above shows the need to protect information at all levels. With hierarchical data, data treatment must be applied at every level, and any consequences of these changes need to be followed through to other levels. For example, if the number of providers in Table 14 is reduced to zero, then the count of services (in Tables 13 and 14) also needs to be zero.