Confidentiality in ABS microdata

The ABS’s practices in enabling data access under the Five Safes framework

Released
19/03/2021

Abstract

The competing obligations of providing useful statistical information while protecting the privacy and secrecy of data providers are key considerations for many National Statistical Organisations (NSOs). Meeting these obligations involves balancing data utility and disclosure risk of data products. However, decisions about access and use of data are subjective and context-dependent. The Australian Bureau of Statistics (ABS) meets these obligations with an approach based on the internationally recognised Five Safes framework. Our approach enables us to make informed decisions guided by continual methodological research and extensive experience. This paper uses ABS cross-sectional microdata as an example to share our practices in using the Five Safes to design different data access solutions for different data products. This paper describes different unit risk measures for microdata. It argues that different unit risk measures are useful tools to assist decisions on releasing datasets. The decisions on releasing datasets should be considered with the context and mode of data access in mind, to ensure that data utility and disclosure risk are appropriately balanced.

Edwin Lu, Rian Jenkins

methodology@abs.gov.au

1 Introduction

There is growing recognition that data is a strategic resource, and that realising the full potential of data will unlock insights that benefit Australian society. The ABS enables this by making ABS data as accurate and accessible as possible. Further, we are committed to protecting the privacy and secrecy of data providers. This is to meet our legal obligation under the Census and Statistics Act 1905 and to maintain public trust so that data providers are happy to continue supplying us with quality data.

Providing useful statistical information while protecting the privacy and secrecy of data providers is challenging. The more detailed the statistical information, the greater the risk that the information will disclose private information about a data provider. This is the risk-utility trade-off. To meet our competing obligations, the ABS uses the Five Safes framework to make informed decisions about access and use of data, with the aim of maximising data utility without compromising the privacy and secrecy of data providers (Australian Bureau of Statistics, 2017a). The internationally recognised framework was developed by the United Kingdom Office for National Statistics. Statistics New Zealand, Statistics Canada, the Australian Institute of Health and Welfare, and many other National Statistical Organisations and research organisations also use the framework.

Decisions about data access are context-dependent and need to be made after careful consideration of a number of dimensions. This paper shares the ABS’s practices in using the Five Safes to guide decisions about access and use of microdata, and to optimally balance data utility with disclosure risk. Our decisions are informed by continual methodological research and extensive experience. We illustrate our data confidentiality approach with ABS cross-sectional microdata, beginning with an overview of controls we apply under the Five Safes. We then elaborate on the Safe Data dimension of the Five Safes by discussing our choice of unit risk measures. Finally, we introduce differential privacy and synthetic data methods as potential additions to our suite of methods under the Five Safes.

We structure the paper as follows: Section 2 describes the overarching considerations when enabling access to ABS microdata. Section 3 explains why and how we apply the Five Safes to ABS microdata. Section 4 elaborates on the Safe Data dimension of the Five Safes by describing several unit risk measures and discussing our choice of unit risk measures for ABS microdata. Section 5 describes differential privacy and synthetic data methods as potential additions to our suite of methods under the Five Safes. Section 6 concludes by summarising our key points.

2 Considerations when enabling microdata access

The overarching consideration when enabling access to public sector microdata is providing useful microdata while protecting the privacy and secrecy of data providers. The ABS aims to enhance data utility while ensuring low disclosure risk.

Statistical agencies must protect the privacy and secrecy of data providers and also ensure that microdata is accessible so that it can be used to inform decisions that have significant impact on the public interest (Chien, Welsh & Moore, 2020). The fact that utility and disclosure risk are interdependent complicates this. For example, microdata with high utility would be very detailed and therefore potentially have high disclosure risk if not subjected to controls on the user, the access environment and so on. Further, reducing disclosure risk through statistical disclosure control (SDC) could also reduce the utility of the microdata. Thus, it is not possible to simultaneously maximise utility and minimise disclosure risk. A data confidentiality approach that focuses solely on mitigating disclosure risk without concern for preserving utility will diminish the value of the microdata to users.

To provide good data utility, we must address the question of what utility means. A simple definition is that utility is the value of a dataset for analytical and research purposes, referring to both the completeness of the dataset and the accuracy of the values within. However, not all variables and values in a microdata set are equally important – it depends on the context. The relative importance of the variables and values in the microdata set ultimately depends on the intended purpose and intended users of the data. For example, research aimed at understanding the relationship between education and employment outcomes would place more weight on preserving the completeness and accuracy of education and employment variables. This causes difficulty in developing a satisfactory measure of utility that is adequate for all contexts.

To ensure low disclosure risk, we must determine what level of risk is acceptable. This inevitably depends on legal and business contexts and is subjective. For example, the legislation might describe requirements in non-quantitative terms which are subject to interpretation. The business environment in which a data custodian operates might influence the risk appetite. These contextual factors vary geographically and differ from one data custodian to another. This causes difficulty in developing a satisfactory measure of disclosure risk that is adequate for all contexts.

Therefore, decisions about access and use of data are context-dependent and need to be made with careful consideration of a number of dimensions, not just based on objective mathematical calculations. Even seemingly objective measures of utility and disclosure risks are based on subjective assumptions and, sometimes, hypothetical risk scenarios (Desai et al., 2016; Hafner et al., 2015; Skinner, 2012). It is partly for this reason that the ABS uses the Five Safes framework. As Desai et al. (2016) point out, the real value of the Five Safes framework is in identifying different solutions based on what matters most in different contexts. It is not designed to deliver a confidentiality approach that fits all contexts.

3 Enhancing utility using the Five Safes framework

The ABS uses the Five Safes framework to guide decisions about access and use of data, and to expand the limits of what we can achieve in terms of enhancing utility while ensuring low disclosure risk. The Five Safes framework introduces a holistic approach to managing disclosure risk that incorporates context in the decision-making, especially decisions about what and how much SDC to apply to the data.

Section 3.1 introduces the Five Safes framework and gives two key reasons why the ABS uses it. Section 3.2 describes how the ABS applies the Five Safes framework to ABS microdata.

3.1 What is the Five Safes framework and why do we use it

The ABS views the Five Safes framework as a multi-dimensional approach for managing disclosure risk and for making informed decisions about access and use of data (Australian Bureau of Statistics, 2017a). Table 1 (adapted from Ritchie, 2017) describes the five dimensions of the framework. For an in-depth discussion, we refer readers to Desai et al. (2016) and Ritchie & Green (2020). The ABS uses the Five Safes framework for two key reasons.

Table 1. The Five Safes framework
Dimension | Description | Type of control
Safe Projects | Is the data to be used for an appropriate purpose? | Managerial control
Safe People | Is the researcher authorised to access and use the data appropriately? | Managerial control
Safe Settings | Does the access environment limit unauthorised use? | Managerial control
Safe Data | Has appropriate and sufficient protection been applied to the data? | Statistical control
Safe Outputs | Are the statistical results non-disclosive? | Statistical control

Reason 1: to guide decisions regarding data access and use

Decisions about access and use of data are inherently subjective as they depend on the context, including intended purpose of the data, intended users of the data, and the legal and business contexts. Even objective measures of utility and disclosure risks are based on subjective assumptions and, sometimes, hypothetical risk scenarios (Desai et al., 2016; Hafner et al., 2015; Skinner, 2012). Relying solely on measures that are based on arbitrary statistical models leads to spurious objectivity (Hafner et al., 2015). Recognising the limitations of a data-only approach to data confidentiality, which relies only on utility and risk measures and SDC, was the original motivation for the ABS to use the Five Safes framework.

The Five Safes framework explicitly acknowledges that decisions about access and use of data are subjective (Ritchie & Green, 2020). With additional dimensions that reflect managerial controls (see Table 1), the framework ensures that contextual considerations are front of mind when making decisions. The framework does not prescribe a method or a solution. Rather, it allows flexibility for users of the framework to choose the types and levels of control to apply in each dimension, and they can be tailored to the particular context of a data access activity. The controls under the five dimensions are jointly evaluated to decide whether a particular data access activity is acceptable and, if not, what controls should be strengthened or introduced to enable the activity. While decisions are ultimately subjective, the ABS makes informed decisions based on continual methodological research and extensive experience.

Reason 2: to expand the limits of what is achievable in terms of risk-utility trade-off

The Five Safes framework expands the limits of what we can achieve in terms of enhancing utility while ensuring low disclosure risk. This allows us to produce data products with higher utility than is possible with a data-only approach which relies only on SDC. To explain this, we contrast a data-only approach against a Five Safes approach using a conceptual risk-utility (R-U) map (Figures 1 and 2). It is ‘conceptual’ because although its axes are ‘disclosure risk’ and ‘utility’, our view is that there is no definitive measure of disclosure risk or utility. The axes are presented to illustrate a concept.

Consider the case where we use a data-only approach. A simple way to eliminate all disclosure risk from a data product would be to suppress every unit so that we have an empty dataset. However, such a product would have no utility. Conversely, to produce a detailed dataset with no protections would mean high utility, but also unconscionably high disclosure risk. There is a trade-off between these two extremes: a greater level of SDC tends to reduce both disclosure risk and utility. Figure 1 depicts the trade-off in the form of an R-U trade-off frontier, which is a production possibilities frontier in economic terms. Each data product corresponds to a point in the map, either on or to the left of the frontier. The frontier represents the theoretical limits of what is achievable with a data-only approach, in the following sense: even if we exhaustively try all methods of SDC available with our current knowledge and technology, we are unable to produce a product to the right of the frontier. If we can specify a maximum acceptable disclosure risk, then a sensible choice is to produce products on the frontier at that level of risk.

Now consider the case where we use the Five Safes approach. Managerial controls from Safe Projects, Safe People and Safe Settings reduce disclosure risk without reducing the utility of the data that users can access, although there are costs from reduced accessibility, fixed set up costs for restricted facilities, and ongoing costs from people, project and output assessment (Desai et al., 2016). In Figure 2, managerial controls shift a data product vertically downwards (e.g. shifts point A down), resulting in a product with lower disclosure risk but with utility unchanged. These controls have a similar downward-shifting effect on any product in the R-U map. This ‘expansion’ of the frontier allows us to produce products in the desirable lower right region of the map, which is unreachable with the data-only approach. Consequently, we can produce products with higher utility using the Five Safes approach.

    Line graph titled ‘Risk-utility trade-off frontier’. The Y axis is the level of disclosure risk. The X axis is the level of utility. The graph explains that the maximum acceptable disclosure risk is ‘low’. A product created with a ‘data-only’ approach might have a medium level of utility while maintaining low disclosure risk.

    Figure 1. The R-U trade-off frontier represents the theoretical limits of what is achievable, given all methods of SDC currently available. Data products correspond to points on or to the left of the frontier. If a maximum acceptable disclosure risk can be specified, we aim to create products on the frontier at that level of risk to enhance utility.

      Line graph titled ‘Five Safes expands risk-utility trade-off frontier’. The Y axis is the level of disclosure risk. The X axis is the level of utility. The graph explains that the maximum acceptable disclosure risk is ‘low’. A product created with a ‘data-only’ approach might have a medium level of utility while maintaining low disclosure risk. A product created with a Five Safes approach could have higher utility while maintaining the same level of low disclosure risk.

      Figure 2. The Five Safes expands the R-U trade-off frontier. For each level of utility, we can achieve lower disclosure risk. E.g. we cannot create product A with the data-only approach as it is above the maximum acceptable disclosure risk, but we can create it if we apply managerial controls under the Five Safes approach.

      3.2 Five Safes applied to ABS microdata

      ABS microdata products include basic and detailed microdata, in which each row corresponds to a population unit and each column corresponds to a variable (categorical or numeric). Microdata products may contain data for all units in a population, or only for units that are selected in a survey. Basic microdata is for simple modelling and analysis, and is accessible by a wide range of users. Detailed microdata is for complex modelling and analysis, and is aimed at facilitating research that cannot be done using basic microdata alone. Detailed microdata offers higher utility to safe users for vetted research projects but may have higher disclosure risk due to the level of detail in the data.

      Detailed microdata and basic microdata differ mainly in the level of detail in the data and are consequently subject to different controls on users, projects, outputs and access environment. Table 2 describes controls placed on ABS microdata products through the Five Safes framework (Australian Bureau of Statistics, 2017b, 2019).

      The Five Safes framework allows the ABS to balance disclosure controls across five dimensions, rather than only focusing on making data safe. Designing controls based on context allows the ABS to offer detailed microdata with high utility to safe users for vetted research projects, while still providing basic microdata with good utility for more general users.

      Table 2. How confidentiality in ABS microdata is maintained by the Five Safes
      Safe Settings
      Basic microdata:
      • Users must store data in their own secure environment.
      • Users must limit access to approved users within their organisation.
      Detailed microdata:
      • Access is via the ABS DataLab, which has a two-factor authentication for user login.
      • All user access and actions are logged and may be audited.
      • No data can be removed from the ABS DataLab and external data cannot be brought into the ABS DataLab without being checked and approved by the ABS.

      Safe Projects
      Basic microdata:
      • Data must be used for statistical or research purposes.
      Detailed microdata:
      • Projects are assessed by the ABS, with the assessment considering the statistical purpose, public value and type of analysis.

      Safe People
      Basic microdata:
      • The head of the user’s organisation must sign legally binding undertaking and agree to conditions of use.
      • Breaches of protocols or disclosure of information may be subject to legal sanctions and/or legal proceedings.
      Detailed microdata:
      • Users must be authorised by the ABS, sign legally binding undertaking, declaration of compliance, and agree to conditions of use.
      • Users must undertake training related to data confidentiality and roles and responsibilities.
      • Breaches of protocols or disclosure of information may be subject to sanctions and/or legal proceedings.

      Safe Outputs
      Basic microdata:
      • Users must comply with certain rules for outputs.
      Detailed microdata:
      • A clearance process is applied to the final output before it can be taken outside the ABS DataLab.
      • Outputs may be compared for consistency with the original project proposal.

      Safe Data
      Basic microdata:
      • Data is survey data or Census sample.
      • Direct identifiers are removed.
      • Data is available at broad levels only (e.g. state, ‘capital city’/‘balance of state’).
      • Variables are aggregated (e.g. 5 year age groupings).
      • Units with high disclosure risk may be treated by having their values rounded, perturbed, top/bottom-coded, suppressed, or swapped with another unit, or the unit may be removed.
      Detailed microdata:
      • Data is mostly survey data or Census sample, but complex integrated administrative data is also available.
      • Direct identifiers are removed.
      • For more sensitive data, only the variables needed for a project are provided to a user.
      • Units with high disclosure risk may be treated by having their values rounded, perturbed, top/bottom-coded, suppressed, swapped with another unit, or the unit may be removed.

      4 Unit risk measures for microdata

      Under the Safe Data dimension of the Five Safes, we apply unit risk measures to assist in managing disclosure risks. Unit risk measures quantify the disclosure risk of records or units in a microdata set. However, we emphasise that all risk measures are based on subjective assumptions and hypothetical intruder scenarios, even if they are based on objective mathematical calculations. Thus, it is important to avoid using unit risk measures as the sole driver of all decisions about access and use of data. Further, we choose unit risk measures and apply SDC with controls under Safe Projects, Safe People and Safe Settings in mind. For example, if we plan to enable access to a detailed microdata set through the DataLab, which is subject to strong managerial controls (see Table 2), we choose unit risk measures that are appropriate for that context and apply less SDC compared to basic microdata sets, which are subject to more relaxed managerial controls (see Table 2). This ensures we apply appropriate SDC to preserve utility while reducing residual disclosure risk that has not been mitigated by the controls under Safe Projects, Safe People and Safe Settings.

      In this section, we focus on unit risk measures for assessing re-identification risk in cross-sectional microdata, and on measuring rather than treating risks. These unit risk measures inform the SDC methods described in the last row of Table 2. The risk measures in this section are chosen to illustrate the thinking used by the ABS to choose the measures we currently use. We are not attempting to give an overview of all risk measures in the literature, nor are we attempting to formally describe the risk measures. For the sake of clarity and brevity, we may use terminology that is different from the terminology in other research papers.

      Section 4.1 describes some unit risk measures for cross-sectional microdata in both plain and mathematical language, with an example for illustration. Section 4.2 compares the risk measures against a set of criteria that is important to the ABS and explains our choice of risk measures for ABS microdata.

      4.1 Unit risk measures

      4.1.1 Cross-tabulation analysis

      Cross-tabulation analysis finds how many units have a particular combination of attributes in the population, e.g. married male earning $1,000-$1,499 income. If this number is low, then this combination of attributes is rare and units in the microdata set with these attributes potentially have high re-identification risk.

      Let \(M\) and \(P\) be the sets of units in the microdata set and population respectively, where \(M\subset P\). Let \(Q\) be a set of quasi-identifiers of interest, e.g. \(Q=\left\{Sex,\ Marital\ Status,\ Income\right\}\). Quasi-identifiers are variables that alone do not lead to re-identification of an individual, but when considered together may allow re-identification as they are often public information that could be known by an external entity. For cross-tabulation analysis, numeric quasi-identifiers such as \(Income\) are grouped into ranges that match those in the data product. For each cell under \(Q\), compute the population cell count in \(P\) (or the sample cell count in \(M\) if population data is not available). Cells with cell count less than a threshold \(C\), where \(C>0\), are considered rare and potentially have high re-identification risk. Note, this is equivalent to applying a threshold \(\frac{1}{C}\) to the inverse cell count, where cells with inverse cell count greater than the threshold are considered rare. Thus, the inverse cell count of a cell can be viewed as a re-identification risk measure for all units belonging to the cell. Cross-tabulation analysis can be repeated with different sets of quasi-identifiers.

      Example

      We illustrate cross-tabulation analysis using the Privatopia population dataset and its extract (see Appendix). Suppose we wish to enable access to the extract but need to assess its re-identification risk using the population dataset. First modify the aggregation levels in the population dataset to match the extract, e.g. \(Income\) is grouped into ranges matching the extract. Guided by context, choose a set of quasi-identifiers that is likely to expose risky units, e.g. \(Q=\left\{Sex,\ Marital\ Status,\ Income\right\}\). To determine if unit 2 has high risk, find the population cell count for the cell \(\left(M,\ married,\ \$500-599\right)\) in the population dataset, then check if this cell count is below a threshold. E.g. if the population cell count is \(5\) and the threshold is \(C=10\), then all units belonging to that cell (including unit 2) potentially have high risk. We can view \(\frac{1}{5}\) as a measure of re-identification risk of units in the cell (including unit 2), and all units in the cell potentially have high risk because their risk is above the threshold \(\frac{1}{C}=\frac{1}{10}\). If the population dataset is not available, check the sample cell count in the extract instead of the population cell count.
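
The cell-count check above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ABS implementation: the records, the field names (`sex`, `marital`, `income`), the function names and the threshold of 3 are all hypothetical, and the data is not the Privatopia dataset.

```python
from collections import Counter

# Hypothetical population records (illustrative only, not the Privatopia data).
population = [
    {"id": 1, "sex": "M", "marital": "married", "income": "500-599"},
    {"id": 2, "sex": "M", "marital": "married", "income": "500-599"},
    {"id": 3, "sex": "F", "marital": "widowed", "income": "400-499"},
    {"id": 4, "sex": "M", "marital": "married", "income": "500-599"},
    {"id": 5, "sex": "F", "marital": "divorced", "income": "500-599"},
]
# Hypothetical extract: a subset of the population (M is contained in P).
extract = [u for u in population if u["id"] in (2, 3, 5)]


def cell_key(unit, quasi_identifiers):
    """The cell a unit falls into under the chosen quasi-identifiers."""
    return tuple(unit[q] for q in quasi_identifiers)


def risky_units(microdata, reference, quasi_identifiers, threshold):
    """Map unit id -> inverse cell count for units in cells rarer than `threshold`.

    `reference` is the population if available, otherwise the microdata itself.
    """
    counts = Counter(cell_key(u, quasi_identifiers) for u in reference)
    flagged = {}
    for unit in microdata:
        count = counts[cell_key(unit, quasi_identifiers)]
        # Assumes every microdata unit also appears in `reference`, so count >= 1.
        if 0 < count < threshold:
            flagged[unit["id"]] = 1 / count
    return flagged


Q = ["sex", "marital", "income"]
flagged = risky_units(extract, population, Q, threshold=3)
print(flagged)
```

With this toy data, unit 2 shares its cell with two other population units and is not flagged, while units 3 and 5 are alone in their cells and are flagged with an inverse cell count of 1.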

      4.1.2 Special uniques detection algorithm (SUDA)

      SUDA measures the ‘special uniqueness’ of each unit and uses it as a proxy for re-identification risk. There are different versions of this measure (Elliot et al., 2005) but all are based on the same idea. We describe one version. A unit that is unique on a set of quasi-identifiers, e.g. \(\left\{Sex,\ Marital\ Status,\ Age,\ Income\right\}\), may also be unique on some subsets, e.g. \(\left\{Age,\ Income\right\}\) and \(\left\{Sex,\ Marital\ Status,\ Income\right\}\). Special uniqueness reflects two factors that contribute to the risk of a unit:

      1. Risk increases if a unit is unique on a greater number of subsets.
        E.g. uniqueness on two subsets \(\left\{Age,\ Income\right\}\) and \(\left\{Sex,\ Marital\ Status,\ Income\right\}\) contributes more to risk than uniqueness on only one subset \(\left\{Age,\ Sex,\ Marital\ Status\right\}\).
      2. Risk increases if a unit is unique on subsets with fewer variables.
        E.g. uniqueness on \(\left\{Age,\ Income\right\}\) contributes more to risk than uniqueness on \(\left\{Sex,\ Marital\ Status,\ Income\right\}\) because the former has fewer variables.

      For a given set of quasi-identifiers \(Q\), SUDA first performs a cross-tabulation of the population microdata set (or sample microdata set if the population microdata set is not available) corresponding to each nonempty subset of \(Q\) with cardinality greater than a specified integer. This leads to a total of up to \(2^{\left|Q\right|}-1\) cross-tabulations. For each unit, SUDA computes a weighted tally of the number of times that unit is unique among the cross-tabulations. Double-counting is eliminated in the tallying process, i.e. if a unit is unique on subset \(Q^\prime\subset Q\), it is also unique on subset \(Q''\supset Q^\prime\), but its uniqueness on \(Q''\) is excluded from the tallying process. By ‘weighted’, we mean uniqueness on larger subsets of \(Q\) is assigned lower weight, equal to \(\left(\left|Q\right|-\left|Q^\prime\right|\right)!\) for subset \(Q^\prime\subset Q\). For each unit, SUDA also measures how much each quasi-identifier in \(Q\) contributes to the risk of that unit. SUDA can be repeated for different sets of quasi-identifiers.
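
One way to write the weighted tally compactly, as a reading of the description above rather than a formal definition, is shown here. The symbol \(\mathcal{U}(x)\) is our own shorthand (an assumption, not notation from the SUDA literature) for the collection of subsets \(Q^\prime\) of \(Q\) on which unit \(x\) is unique and which contain no smaller subset on which \(x\) is already unique:

\[
\mathrm{SUDA}(x)=\sum_{Q^\prime\in\,\mathcal{U}(x)}\left(\left|Q\right|-\left|Q^\prime\right|\right)!
\]

Restricting the sum to \(\mathcal{U}(x)\) is what removes the double-counting, and the factorial weight is larger for smaller subsets.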

      Example

      We illustrate SUDA using the Privatopia population. Suppose we wish to enable access to the extract but need to assess its re-identification risk using the population dataset. First modify the aggregation levels in the population dataset to match the extract, e.g. \(Income\) is grouped into ranges matching the extract. Guided by context, choose a set of quasi-identifiers that is likely to expose risky units, e.g. \(Q=\left\{Sex,\ Marital\ Status,\ Age,\ Income\right\}\). To calculate the risk of unit 2, perform the cross-tabulations shown in Table 3 on the population dataset (or the extract, if the population dataset is not available). Suppose unit 2 is unique on the five cross-tabulations in Table 4. Disregard some cross-tabulations to eliminate double-counting. For example, since unit 2 is unique on \(\left\{Age,\ Income\right\}\), it is also unique on \(\left\{Sex,\ Age,\ Income\right\}\), so we disregard its uniqueness on \(\left\{Sex,\ Age,\ Income\right\}\). Unit 2 is then only unique on two cross-tabulations. Uniqueness on \(\left\{Age,\ Income\right\}\) and \(\left\{Sex,\ Marital\ Status,\ Income\right\}\) have weights \(2!\) and \(1!\) respectively because the former is a smaller set and adds more risk. Unit 2’s risk is \(2!+1!=3\).

      Table 3. Cross-tabulations performed on the Privatopia population dataset to measure unit 2’s risk
      Number of variables at a time | Cross-tabulations
      1 | {Sex}, {Marital Status}, …
      2 | {Sex, Marital Status}, {Sex, Age}, …
      3 | {Sex, Marital Status, Age}, {Sex, Marital Status, Income}, …
      4 | {Sex, Marital Status, Age, Income}
      Table 4. Cross-tabulations where unit 2 is unique. Disregard some cross-tabulations in the calculation of unit 2’s SUDA score to eliminate double-counting
      Number of variables at a time | Cross-tabulations where unit 2 is unique
      2 | {Age, Income}
      3 | {Sex, Age, Income} [disregarded], {Marital Status, Age, Income} [disregarded], {Sex, Marital Status, Income}
      4 | {Sex, Marital Status, Age, Income} [disregarded]
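
The tally for unit 2 can be reproduced with a short Python sketch. It follows the version of SUDA described in this section only: it enumerates every nonempty subset of \(Q\) (no restriction on subset size), skips supersets of subsets already found unique, and applies the weight \(\left(\left|Q\right|-\left|Q^\prime\right|\right)!\). The four records, the field names and the function name `suda_score` are hypothetical and were constructed so that unit 2 is unique on exactly the subsets listed in Table 4. It does not compute per-variable contributions.

```python
from itertools import combinations
from math import factorial

# Hypothetical population (illustrative only, not the Privatopia data).
population = [
    {"id": 2, "sex": "M", "marital": "married", "age": "30-34", "income": "500-599"},
    {"id": 7, "sex": "M", "marital": "married", "age": "30-34", "income": "400-499"},
    {"id": 8, "sex": "M", "marital": "divorced", "age": "45-49", "income": "500-599"},
    {"id": 9, "sex": "F", "marital": "married", "age": "45-49", "income": "500-599"},
]


def suda_score(unit, reference, quasi_identifiers):
    """Return (score, minimal unique subsets) for `unit`, which must be in `reference`."""
    n = len(quasi_identifiers)
    minimal_unique = []
    # Visit subsets from smallest to largest so minimal ones are found first.
    for size in range(1, n + 1):
        for subset in combinations(quasi_identifiers, size):
            # Uniqueness on a superset of a unique subset is double-counting: skip it.
            if any(set(m) <= set(subset) for m in minimal_unique):
                continue
            key = tuple(unit[q] for q in subset)
            matches = sum(1 for u in reference if tuple(u[q] for q in subset) == key)
            if matches == 1:  # only the unit itself falls in this cell
                minimal_unique.append(subset)
    score = sum(factorial(n - len(m)) for m in minimal_unique)
    return score, minimal_unique


Q = ["sex", "marital", "age", "income"]
score, subsets = suda_score(population[0], population, Q)
print(score, subsets)
```

For this toy population the two retained subsets are {Age, Income} and {Sex, Marital Status, Income}, giving the score \(2!+1!=3\) from the example.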

      4.1.3 Nearest neighbour (NN) algorithm

      The nearest neighbour algorithm measures how dissimilar each unit is to its nearest neighbour and uses the dissimilarity as a proxy for re-identification risk. Note that this is different to the \(k\)NN algorithm used for predictive modelling. To avoid confusion, we refer to the algorithm we discuss in this section as the NN algorithm. The rationale behind the NN algorithm is as follows: Suppose two units are similar in their attributes so that a typical user cannot determine with certainty which unit is which individual without referring to information external to the dataset. Those units may reduce each other’s risk. In contrast, if a unit has no units similar to it (i.e. even the most similar unit – the nearest neighbour – has very different characteristics), then that unit potentially has high risk.

      Choose a dissimilarity \(d\left(x,y\right)\) (e.g. Gower dissimilarity) that measures dissimilarity between any two units \(x\) and \(y\). Choose a threshold \(C>0\) that determines whether two units are sufficiently similar to reduce each other’s re-identification risk. For each unit \(x\in M\) that is unique under \(Q\), find its nearest neighbour \(y\in P\), \(y\neq x\) (or \(y\in M\) if population data is not available) using the chosen dissimilarity and use \(d\left(x,y\right)\) as a measure of risk of \(x\). If \(d\left(x,y\right)\) is greater than \(C\), then \(x\) potentially has high risk. The algorithm can be repeated with different sets of quasi-identifiers.

      Example

      We illustrate the NN algorithm using the Privatopia population. Suppose we wish to enable access to the extract but need to assess its re-identification risk using the population dataset. Guided by context, choose a set of quasi-identifiers that is likely to expose risky units, e.g. \(Q=\left\{Sex,\ Marital\ Status,\ Age,\ Income\right\}\). To determine if unit 2 has high risk, find its nearest neighbour in the population dataset (or in the extract if the population dataset is not available). The nearest neighbour is a population unit that is most similar to unit 2 with respect to attributes under \(Q\), according to some dissimilarity measure. Suppose the nearest neighbour is a unit with attributes \(\left(M,\ married,\ 34\ years,\ \$540\right)\). If unit 2’s dissimilarity with this nearest neighbour is greater than a threshold \(C\), then unit 2 potentially has high risk.
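
A minimal Python sketch of the nearest neighbour check follows, using a basic form of Gower dissimilarity (a 0/1 mismatch for categorical variables, and the absolute difference divided by the variable's range for numeric variables, averaged with equal weights). The records, field names, function names and the threshold \(C=0.1\) are hypothetical, and for brevity the sketch does not first restrict to units that are unique under \(Q\).

```python
# Hypothetical population with ungrouped age and income (illustrative only).
population = [
    {"id": 2, "sex": "M", "marital": "married", "age": 32, "income": 560},
    {"id": 7, "sex": "M", "marital": "married", "age": 34, "income": 540},
    {"id": 54, "sex": "F", "marital": "widowed", "age": 77, "income": 450},
    {"id": 345, "sex": "F", "marital": "divorced", "age": 47, "income": 520},
]
categorical = ["sex", "marital"]
numeric = ["age", "income"]
# Range of each numeric variable over the reference data, used for scaling.
numeric_ranges = {
    v: max(u[v] for u in population) - min(u[v] for u in population) for v in numeric
}


def gower(x, y, categorical, numeric_ranges):
    """Equal-weight Gower dissimilarity between two units, in [0, 1]."""
    parts = [0.0 if x[v] == y[v] else 1.0 for v in categorical]
    # Numeric variables: absolute difference scaled by the variable's range.
    parts += [abs(x[v] - y[v]) / r if r else 0.0 for v, r in numeric_ranges.items()]
    return sum(parts) / len(parts)


def nearest_neighbour(unit, reference, categorical, numeric_ranges):
    """Return (dissimilarity, id) of the reference unit most similar to `unit`."""
    others = [u for u in reference if u["id"] != unit["id"]]
    return min((gower(unit, u, categorical, numeric_ranges), u["id"]) for u in others)


C = 0.1  # hypothetical threshold
by_id = {u["id"]: u for u in population}
dist2, nn2 = nearest_neighbour(by_id[2], population, categorical, numeric_ranges)
dist54, nn54 = nearest_neighbour(by_id[54], population, categorical, numeric_ranges)
print(nn2, dist2 > C)
print(nn54, dist54 > C)
```

In this toy data, unit 2 has a close neighbour (unit 7) and falls below the threshold, whereas unit 54's nearest neighbour is still quite dissimilar, so unit 54 would be flagged as potentially high risk.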

      4.1.4 Personal information factor (PIF)

      The personal information factor (PIF) has been proposed as a measure of the amount of personal information in microdata, which is then a proxy for disclosure risk. The PIF method introduces a unit risk measure called the row information gain (RIG). Development of the PIF is still ongoing. Our discussion relates to the latest version of the PIF at the time of writing (Australian Computer Society Inc., 2019). As the PIF method is complicated to describe concisely, we illustrate it using an example instead of providing a mathematical explanation.

      Example

      Using the Privatopia population, suppose we wish to enable access to the extract (reproduced in Table 5) but need to assess its disclosure risk with the help of the population dataset. First modify the aggregation levels in the population dataset to match the extract, e.g. \(Income\) is grouped into ranges matching the extract.

      Table 5. The Privatopia extract, reproduced from the Appendix. The marital status of unit 2 is used as an example to illustrate calculations in the PIF method.
      | ID | Sex | Marital Status | Employment Status | Age (years) | Income ($) |
      | --- | --- | --- | --- | --- | --- |
      | 2 | M | married | employed | 30-34 | 500-599 |
      | 54 | F | widowed | unemployed | 75-79 | 400-499 |
      | 345 | F | divorced | employed | 45-59 | 500-599 |
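
      The regrouping step mentioned before Table 5 can be sketched as follows. The helper name `to_band` is hypothetical, and the sketch assumes non-negative integer values and bands that start at multiples of the band width, which matches the 5-year age ranges and $100 income ranges used in the extract.

```python
def to_band(value: int, width: int) -> str:
    """Map a non-negative integer to its range label, e.g. 33 with width 5 -> '30-34'."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


# Unit 2 in the population dataset has Age 33 and Income 554.
age_band = to_band(33, 5)
income_band = to_band(554, 100)
```

      Applied to unit 2, this reproduces the grouped values shown for record 2 in Table 5.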

      For each entry in the extract, the PIF method first calculates the cell information gain (CIG). The CIG measures the amount of personal information that a user can learn about an entry using the extract. Suppose record 2 is John. Assume a hypothetical intruder knows John’s sex, employment status, age and income but does not know his marital status, i.e. the intruder knows all but one attribute of John in the extract.

      Suppose the intruder’s prior knowledge of John’s marital status (before accessing the extract), based on the distribution across the population, is:

      • 30% chance of being never married
      • 40% chance of being married
      • 15% chance of being divorced
      • 15% chance of being widowed

      After accessing the extract, suppose the intruder examines the subset of employed males aged 30-34 years earning $500-599, in an attempt to learn about John’s marital status. Suppose the distribution of \(Marital\ Status\) across this subset is:

      • 45% never married
      • 50% married
      • 3% divorced
      • 2% widowed

      The extract gives the intruder the posterior knowledge that John is most likely married and almost definitely not divorced or widowed.

      The CIG of John’s marital status (i.e. record 2 in Table 5) measures the amount of personal information gained by the intruder. It is given by the Kullback-Leibler divergence \(D_{KL}(P\ ||\ Q)\), where \(P\) is the distribution for the intruder’s posterior knowledge of John’s marital status and \(Q\) is the distribution for the intruder’s prior knowledge. The CIG is:


      \(CIG=0.45\log_2{\frac{0.45}{0.30}}+0.50\log_2{\frac{0.50}{0.40}}+0.03\log_2{\frac{0.03}{0.15}}+0.02\log_2{\frac{0.02}{0.15}}\approx 0.30\)
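
      The calculation above can be checked numerically. The sketch below is illustrative only; the function name is hypothetical, and it assumes the prior assigns positive probability to every category that has positive posterior probability.

```python
import math


def cell_information_gain(posterior: dict[str, float], prior: dict[str, float]) -> float:
    """Kullback-Leibler divergence D_KL(posterior || prior) in bits."""
    return sum(
        p * math.log2(p / prior[category])
        for category, p in posterior.items()
        if p > 0  # categories with zero posterior mass contribute nothing
    )


# Prior: distribution of Marital Status across the population.
prior = {"never married": 0.30, "married": 0.40, "divorced": 0.15, "widowed": 0.15}
# Posterior: distribution within the subset matching John's known attributes.
posterior = {"never married": 0.45, "married": 0.50, "divorced": 0.03, "widowed": 0.02}

cig = cell_information_gain(posterior, prior)
```

      The result agrees with the value of about 0.30 given above.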


      Repeat this type of calculation for each entry in the extract. If the extract contains 2,000 records and 10 variables, there will be 2,000×10=20,000 CIG values for the extract.

      The row information gain (RIG) gives the risk of a record in the extract. It is the sum of CIG values for the record (i.e. the marginal row totals of CIG values). The feature information gain (FIG) gives the risk of a variable in the extract. It is the sum of CIG values for the variable (i.e. the marginal column totals of CIG values). The PIF gives the overall disclosure risk of the extract. Suppose, in the extract, the first record has an RIG of 0.354 and there are five records with the same RIG, the second record has an RIG of 0.536 and there are eight records with the same RIG value, the third record has an RIG of 0.132 and there are three records with the same RIG value, etc. Then:


      \(PIF=\max{\left\{\frac{0.354}{5},\frac{0.536}{8},\frac{0.132}{3},\ \ldots\right\}}\)
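
      As a rough sketch of the aggregation steps described above, the following code computes RIG and FIG values from a table of CIG values and then the PIF. It follows the worked example in this section rather than the full specification in the source report, which may differ. The function names are hypothetical, and rounding RIG values to three decimal places to decide which records share "the same RIG" is an assumption.

```python
from collections import Counter


def row_information_gains(cig_table: list[list[float]]) -> list[float]:
    """RIG per record: the row totals of the CIG table."""
    return [sum(row) for row in cig_table]


def feature_information_gains(cig_table: list[list[float]]) -> list[float]:
    """FIG per variable: the column totals of the CIG table."""
    return [sum(column) for column in zip(*cig_table)]


def personal_information_factor(rigs: list[float], ndigits: int = 3) -> float:
    """Maximum over distinct RIG values of (RIG / number of records sharing it)."""
    counts = Counter(round(rig, ndigits) for rig in rigs)
    return max(rig / count for rig, count in counts.items())


# Only the three RIG values quoted in the text, with their multiplicities.
rigs = [0.354] * 5 + [0.536] * 8 + [0.132] * 3
pif = personal_information_factor(rigs)
```

      For these three values alone, the maximum is attained by the first group, i.e. 0.354 divided by 5.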

      4.2 Unit risk measures for ABS microdata

      ABS microdata products are subject to controls under the Five Safes framework to manage disclosure risk (see section 3). This includes controls placed on users, the access environment, the data, the projects for which the data are used, and outputs produced from the projects. These correspond to the five dimensions of the framework. Controls under Safe Data and Safe Outputs are applied last and are aimed at reducing residual disclosure risk that has not been mitigated by the other three Safes.

      We carefully design controls under Safe Data (i.e. unit risk measures, SDC) with managerial controls under the Five Safes in mind, to avoid reducing the utility of the data more than is necessary. Unit risk measures should aim to detect units in the microdata set with potentially high disclosure risk. We manually review those units to decide which units have high risk and require SDC. The review process requires unit risk measures to be easy to interpret. This motivates a set of criteria for assessing the usefulness of unit risk measures for ABS microdata. Table 6 compares the unit risk measures in section 4.1 against this set of criteria.

      Table 6. Comparison of several unit risk measures against criteria that are important for the ABS
      | Criterion | Cross-tabulation analysis | SUDA | NN algorithm | PIF |
      | --- | --- | --- | --- | --- |
      | Types of variables allowed | Nominal; ordinal; interval (turned into ordinal); ratio (turned into ordinal) | Nominal; ordinal; interval (turned into ordinal); ratio (turned into ordinal) | Nominal; ordinal; interval; ratio | Nominal; ordinal; interval (turned into ordinal); ratio (turned into ordinal) |
      | Allows variables to be weighted based on importance | No | No | Yes | No |
      | Interpretability | The unit risk measure has a simple relationship with cell count, which simply reflects rarity of a combination of attributes. | The unit risk measure is a weighted tally of the number of times a unit is a sample unique. Additional measures are available with the SUDA method to indicate which attributes cause uniqueness. | The unit risk measure is abstract but we can compare a unit with its nearest neighbour to decide if SDC is needed. | The unit risk measure is information gain, which is abstract and difficult to interpret. |
      | Applicability to microdata sets that contain a sample from a population | Applicable, better if population data is also available. | Applicable, better if population data is also available. | Applicable, better if population data is also available. | Not applicable if population data is unavailable. Distributions of variables across the population are needed as proxies for the intruder’s prior knowledge. |

      Cross-tabulation analysis, SUDA and the NN algorithm are strictly better than the PIF method when assessed using the set of criteria in Table 6. The three methods each have their relative advantages which make them useful for assessing disclosure risk in ABS microdata. The ABS considers these three methods as part of a suite of methods under Safe Data for managing residual disclosure risk.

      Detailed microdata products which are accessible only through the DataLab are subject to strong managerial controls and strong controls under Safe Outputs (see Table 2). These controls significantly reduce a data user’s ability and likelihood to match detailed microdata with external data to re-identify records. This means DataLab users can analyse detailed microdata with high utility for complex modelling and analysis. In contrast, basic microdata products are also subject to strong managerial controls but these are more relaxed than those for detailed microdata (see Table 2). Thus, stronger controls are needed under Safe Data for basic microdata than for detailed microdata. We might apply multiple unit risk measures and more SDC to a basic microdata set than to a detailed microdata set (see Table 2). This allows users of basic microdata to perform simple modelling and analysis. The ABS tailors controls under the Five Safes to provide different data access solutions.

      5 Future research: differential privacy and synthetic data

      Many approaches to maintaining data confidentiality can be viewed through the lens of the Five Safes framework. The ABS is currently exploring differential privacy methods and synthetic data methods as potential additions to our suite of methods under Safe Data and Safe Outputs dimensions of the Five Safes.

      Differential privacy aims to ensure that the presence or absence of an individual record in the microdata does not significantly affect statistical outputs produced from the microdata (Dwork & Roth, 2014). Since statistical outputs are insensitive to the presence or absence of individual records, differential privacy limits how much information data users can learn from the statistical outputs about any individual record. Variations of the concept of differential privacy exist with similar aims. In some researchers’ views (e.g. Culnane, Rubinstein, Watts, 2020), differential privacy removes the need for the Five Safes. However, the value of differential privacy relies on implicit assumptions about the data generation process (Kifer & Machanavajjhala, 2011). In light of this, differential privacy can benefit from additional controls applied to data users, to the projects the data is used for and to the data access environment. We view differential privacy methods as part of a suite of methods we can apply under Safe Data and Safe Outputs.

      We are exploring one such method (Australian Bureau of Statistics, 2020) that is based on the Pufferfish framework for differential privacy (Kifer & Machanavajjhala, 2014). Differential privacy methods need to be applied with care because they implicitly assume some form of worst-case intruder scenario, which could result in protections that destroy or significantly reduce data utility (Bambauer, Muralidhar, Sarathy, 2013). In contexts where such scenarios are very unlikely, for example, due to other controls under the Five Safes, we currently do not recommend differential privacy methods.
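
      For reference, the basic notion of differential privacy discussed in this section is usually formalised as follows (Dwork & Roth, 2014). The symbols \(\mathcal{M}\), \(D\), \(D'\), \(S\) and \(\varepsilon\) are introduced here for illustration only and are not used elsewhere in this paper. A randomised mechanism \(\mathcal{M}\) is \(\varepsilon\)-differentially private if, for every pair of datasets \(D\) and \(D'\) that differ in a single record, and for every set \(S\) of possible outputs:

      \(\Pr\left[\mathcal{M}(D)\in S\right]\le e^{\varepsilon}\,\Pr\left[\mathcal{M}(D')\in S\right]\)

      A smaller \(\varepsilon\) forces the two output distributions closer together, which gives stronger protection but typically requires more perturbation of the outputs. This is one way to see the utility cost noted above.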

      Synthetic data methods aim to create synthetic versions of a microdata set that preserve important aggregate level information about the data. While doing so, they reduce disclosure risk by replacing real records in the data with synthetic ones. Researchers can use synthetic microdata for their research without ever accessing the real microdata. We view such methods as potential additions to our suite of methods under Safe Data and plan to explore them more in the near future.

      6 Conclusion

      The goal of enhancing data utility while protecting the privacy and secrecy of data providers cannot be achieved by a prescriptive, standardised approach to data confidentiality. While the pursuit of quantitative measures for disclosure risk assessment is useful, decisions about data access and use are inherently subjective and necessarily context-dependent. The ABS enhances utility by using the Five Safes framework to design different data access solutions for different data products. Data-only methods are just one of five dimensions. Our approach considers context when making decisions about access and use of data, and allows us to expand the limits of what we can achieve in terms of enhancing utility while ensuring low disclosure risk. Continual methodological research and extensive experience ensure that our decisions, while subjective, are well-informed. We also continue to research new methods such as differential privacy and synthetic data to incorporate into the Five Safes. Through diligent development and implementation of our approach, the ABS provides access to useful data, with confidence that the privacy and secrecy of data providers are protected.

      Appendix

      Table A1 is a snippet of a hypothetical Privatopia population dataset, which contains a record for every unit in the population of Privatopia. Table A2 is a snippet of a Privatopia extract produced from a sample of records in the Privatopia population dataset. The extract groups \(Age\) into 5-year ranges and \(Income\) into $100 ranges. The extract includes record 2, which we use as a recurring example throughout this paper to illustrate various unit risk measures. The variables and the values they can take in the population dataset are shown in Table A3. Since every unit has at most one record both in the population dataset and in the extract, we use the terms ‘unit’ and ‘record’ interchangeably in the illustrative examples throughout the paper.

      Table A1. Snippet of a hypothetical Privatopia population dataset. We use record 2 to illustrate various unit risk measures discussed in this paper.
      | ID | Sex | Marital Status | Employment Status | Age (years) | Income ($) |
      | --- | --- | --- | --- | --- | --- |
      | 1 | M | never married | unemployed | 21 | 0 |
      | 2 | M | married | employed | 33 | 554 |
      | 3 | F | married | employed | 28 | 635 |

      Table A2. Snippet of a Privatopia extract containing a sample of records from the Privatopia population dataset. The sample includes record 2.
      | ID | Sex | Marital Status | Employment Status | Age (years) | Income ($) |
      | --- | --- | --- | --- | --- | --- |
      | 2 | M | married | employed | 30-34 | 500-599 |
      | 54 | F | widowed | unemployed | 75-79 | 400-499 |
      | 345 | F | divorced | employed | 45-59 | 500-599 |

      Table A3. The variables and values in the Privatopia population dataset
      | Variable | Values/attributes |
      | --- | --- |
      | Sex | {male (M), female (F), intersex (I)} |
      | Marital Status | {never married, married, divorced, widowed} |
      | Employment Status | {employed, unemployed} |
      | Age | {0, 1, 2, …, 120} |
      | Income | integers |

      Glossary

      Cell

      A combination of attributes under a given set of quasi-identifiers. E.g. given the set of quasi-identifiers \(\left\{Age,\ Marital\ Status,\ Occupation\right\}\), cells include \(\left(20-24\ years,\ married,\ farmer\right)\) and \(\left(25-29\ years,\ never\ married,\ nurse\right)\).

      Cell count

      The frequency of units belonging to a given cell, either in the population or in a microdata set. The former is the population cell count and the latter is the sample cell count (assuming the microdata set only contains a sample of population units). For example, if there are 2,600 married farmers who are aged 20-24 years old in the population but only 130 of them are included in the sample that makes up the microdata set, then the population cell count for the cell \(\left(20-24\ years,\ married,\ farmer\right)\) is 2,600 while the sample cell count is 130.

      Confidentiality

      The protection of privacy and secrecy of information collected from individuals and organisations. For the ABS, this means ensuring no data is released in a manner likely to enable their identification. The form of protection could include alterations of the data, controls placed on the user, controls placed on the access environment, controls placed on the project for which the data is being used, and/or controls placed on the outputs from the project.

      Cross-sectional microdata

      Microdata collected at and pertaining to a point in time. Each population unit corresponds to at most one record in the cross-sectional microdata set.

      De-identified

      Personal information is de-identified if the information is no longer about an identifiable individual or an individual who is reasonably identifiable. This is different to ‘unidentified’. Personal information is unidentified when direct identifiers are removed or altered into an unidentifiable form. Unidentified data often requires further controls to be considered de-identified, such as controls from the Five Safes.

      Direct identifiers

      Variables that unambiguously identify individuals in a microdata set. E.g. name, address, tax file number.

      Disclosure

      The identification of a person or organisation in a supposedly de-identified dataset, or the attribution of information in the data to them. The former is called re-identification and the latter is called attribute disclosure. Disclosure risk is the probability that disclosure occurs.

      Intruder

      A person who deliberately attempts to breach protection measures that have been applied to some data, with the aim of gaining information about a person or organisation to which the data relates.

      Microdata set

      Dataset in which each row is a record belonging to a population unit (usually an individual or an organisation) and each column is a variable that contains information about an attribute of the population units. Multiple records may belong to the same population unit. Not all population units are necessarily included in the dataset (i.e. the dataset may contain data for only a sample of population units).

      Personal information

      Information or an opinion about an identified individual, or an individual who is reasonably identifiable, whether the information or opinion is true or not, and whether the information or opinion is recorded in material form or not.

      Population

      The set of real-world units from which a dataset is drawn.

      Population unique

      A population unit that has a unique combination of attributes in the population under a given set of quasi-identifiers.

      Quasi-identifiers

      Variables in a microdata set that alone do not lead to re-identification of an individual, but when considered together may allow re-identification. These variables are often considered public information that could be known by an external entity. E.g. age, marital status, occupation.

      Re-identification

      The identification of a person or organisation in a supposedly de-identified dataset. Re-identification risk is the probability that this occurs.

      Sample

      A subset of units from a population.

      Sample unique

      A population unit that has a unique combination of attributes in a given sample under a given set of quasi-identifiers.

      Statistical disclosure control (SDC)

      Statistical methods that alter the data by changing or suppressing some values in the data or in the outputs produced from the data, for the purpose of maintaining data confidentiality. Units with high disclosure risk may be treated by having their values rounded, perturbed, top/bottom-coded, suppressed, swapped with another unit, or the unit may be removed, among other possible treatments.

      Unit

      An entity that could be an individual, a household, an organisation, etc. For cross-sectional microdata, each unit corresponds to at most one record in the microdata set, so ‘unit’ and ‘record’ can be used interchangeably.

      Utility

      The value of a dataset for analytical and research purposes, referring to both the completeness of the dataset and the accuracy of the values within.

      References

      Australian Bureau of Statistics, 2017, 1160.0 – ABS Confidentiality Series, Aug 2017, Australian Bureau of Statistics. Available at:
      https://www.abs.gov.au/ausstats/abs@.nsf/Latestproducts/1160.0Main%20Features4Aug%202017?opendocument&tabname=Summary&prodno=1160.0&issue=Aug%202017&num=&view= [Accessed 4 June 2020]
      https://www.abs.gov.au/ausstats/abs@.nsf/Latestproducts/1160.0Main%20Features7Aug%202017?opendocument&tabname=Summary&prodno=1160.0&issue=Aug%202017&num=&view= [Accessed 4 June 2020]

      Australian Bureau of Statistics, 2019, Compare access options, Australian Bureau of Statistics. Available at:
      https://www.abs.gov.au/websitedbs/D3310114.nsf/89a5f3d8684682b6ca256de4002c809b/c00ee824af1f033bca257208007c3bd5!OpenDocument [Accessed 4 June 2020]

      Australian Bureau of Statistics, 2020, Pufferfish differential privacy, Australian Bureau of Statistics. Available at:
      https://www.abs.gov.au/ausstats/abs@.nsf/Latestproducts/1504.0Main%20Features2Dec%202020?opendocument&tabname=Summary&prodno=1504.0&issue=Dec%202020&num=&view= [Accessed 2 Feb 2021]

      Australian Computer Society Inc., 2019, Privacy Preserving Data Sharing Frameworks, Australian Computer Society Inc., Sydney.

      Chien, C.-H., Welsh, A. H., Moore, J. D., 2020, ‘Synthetic Business Microdata: An Australian Example’, Journal of Privacy and Confidentiality, 10 (2). doi: 10.29012/jpc.733.

      Culnane, C., Rubinstein, B., Watts, D., 2020, ‘Not fit for Purpose: A critical analysis of the ‘Five Safes’’, arXiv article. Available at:
      https://arxiv.org/ftp/arxiv/papers/2011/2011.02142.pdf [Accessed 1 January 2021]

      Desai, T., Ritchie, F., Welpton, R., 2016, ‘Five Safes: designing data access for research’, Economics Working Paper Series 1601.

      Dwork, C., Roth, A., 2014, ‘The Algorithmic Foundations of Differential Privacy’, Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211-407.

      Elliot, M. J., Manning, A., Mayes, K., Gurd, J., Bane, M., 2005, ‘SUDA: A Program for Detecting Special Uniques’, Joint UNECE/Eurostat work session on statistical data confidentiality.

      Hafner, H., Lenz, R., Ritchie, F., Welpton, R., 2015, ‘Evidence-based, context-sensitive, user-centred, risk-managed SDC planning: designing data access solutions for scientific use’, Joint UNECE/Eurostat work session on statistical data confidentiality.

      Kifer, D., Machanavajjhala, A., 2014, ‘Pufferfish: A Framework for Mathematical Privacy Definitions’, ACM Transactions on Database Systems (TODS), vol. 39, no. 1.

      Ritchie, F., 2017, ‘The “Five Safes”: a framework for planning, designing and evaluating data access solutions’. Paper presented at Data for Policy 2017, London, UK.

      Ritchie, F., Green, E., 2020, ‘Frameworks, principles and accreditation in modern data management’. Available at:
      https://www2.uwe.ac.uk/faculties/BBS/BUS/Research/BCEF/Frameworks.pdf [Accessed 20 August 2020]

      Skinner, C., 2012, ‘Statistical Disclosure Risk: Separating Potential and Harm’, International Statistical Review, vol. 80, issue 3, pp. 349-368.

      Acknowledgements

      We thank our colleagues at the ABS for providing input to this paper, including colleagues in the following sections: Data Access and Confidentiality Methodology Unit, Microdata Resources, National Data Commissioner Support, Data Services, Data Integration Delivery Assurance and Strategy, Census Futures, Strategic Partnership Managers, and Strategic Communications. In particular, we thank Michelle Gifford (ABS) and Lachlan MacRae (ABS) for formatting this paper for publishing, and our director, Joseph Chien (ABS), for reviewing this paper many times throughout the multiple rounds of edits. We also thank Felix Ritchie from UWE Bristol for providing helpful comments.