Glossary

Data confidentiality guide

Explanation of statistical and confidentiality terms used in this guide

Released

8/11/2021

Administrative data

information (including personal information) collected by agencies for the administration of programs, policies or services
can be microdata (unit-record data) or macrodata (aggregate data)
may be used for statistical or research purposes

Aggregate data

produced by grouping information into categories and combining values within these categories
example: a count of the number of people of a particular age (obtained from the question 'In what year were you born?').
also known as tabular data or macrodata
aggregate data is often presented in tables
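
As a small illustration of the example above, the following sketch turns unit records into aggregate data by counting people at each age. The birth years and the reference year are made-up values, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical unit records: answers to 'In what year were you born?'
birth_years = [1990, 1985, 1990, 2001, 1985, 1990]
REFERENCE_YEAR = 2021  # assumed year used to derive age from birth year


def count_by_age(years, reference_year):
    """Aggregate unit records into a count of people at each age."""
    return Counter(reference_year - year for year in years)


age_counts = count_by_age(birth_years, REFERENCE_YEAR)
for age in sorted(age_counts):
    print(age, age_counts[age])
```

Each printed row is one cell of a simple frequency table: the individual records are no longer visible, only the count per category.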

Attribute disclosure

occurs when previously unknown information is revealed about an individual, group or organisation (without necessarily formally re-identifying them)

Big Data

extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions

Cell concentration rule

used to assess whether a table cell may enable re-identification or attribute disclosure
also called the cell dominance rule
finds cells where a small number of data providers contribute a large percentage to the cell. If a cell fails this rule, further investigation or data treatment is needed to ensure the attributes of predominant data providers are not disclosed
aggregate data treatment method
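
One common way to formalise this rule is an (n, k) test: a cell fails if its largest n contributors account for more than k% of the cell total. The sketch below assumes that form. The function name and the values of n and k are illustrative only, since this glossary does not specify them.

```python
def fails_dominance_rule(contributions, n=2, k=75.0):
    """Return True if the top n contributors exceed k% of the cell total.

    n and k are illustrative parameters, not values taken from the guide.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # an empty or zero cell has no dominant contributor
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


# Hypothetical cells: turnover reported by five businesses each
dominated_cell = [900, 50, 20, 20, 10]
balanced_cell = [210, 200, 200, 195, 195]
print(fails_dominance_rule(dominated_cell))
print(fails_dominance_rule(balanced_cell))
```

In the first cell the two largest businesses supply 95% of the total, so publishing the cell would closely reveal their combined value. In the second cell no small group dominates.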

Confidentiality

protecting the secrecy and privacy of information collected from individuals and organisations, and ensuring that no data is released in a manner likely to enable their identification

Confidentiality rules

applied to all cells in an aggregate dataset in order to identify elements that pose a risk of disclosure
frequency rule and cell dominance rule are two common rules applied to aggregate data
microdata treatment rules may also be applied to unit record data

Data access context

the environment and manner in which data is released
data custodians need to consider who will have access to the data, the purpose for which the data will be used and the release environment itself (whether physical, IT or legal)

Data custodian

organisation or agency responsible for the collection, management and release of data
they have legal and ethical obligations to keep the information they are entrusted with confidential

Data laboratory

secure data environment where researchers can perform detailed analysis of microdata
also known as secure research centres
can be accessed virtually (remotely) or on-site
ABS data laboratory is called the DataLab

Data modification

technique used to treat data to limit re-identification or other disclosure
changes all non-zero cells by a small amount while aiming to maintain the table's overall usefulness
examples include rounding and perturbation

Data provider

an individual, household, business or other entity that supplies data for statistical or administrative purposes
also known as a respondent

Data reduction

a technique for statistical disclosure control
methods to control or limit the amount of detail available in a table to prevent individuals or organisations from being re-identified
methods include combining variables or categories, or suppressing (removing) information in unsafe cells
can be applied to aggregate data or microdata

Data rounding

slightly altering cells in a table to make them all divisible by the same number
common numbers used for rounding are 3, 5 or 10
may be random or controlled
prevents the original data values from being known with certainty while ensuring the usefulness of the data is not significantly affected
aggregate data treatment method
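
A minimal sketch of random rounding to base 3 follows, assuming the widely used unbiased scheme in which a count is rounded up with probability proportional to its remainder. The function name and sample counts are hypothetical, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def random_round(value, base=3, rng=random):
    """Randomly round a non-negative count to an adjacent multiple of base.

    Values already divisible by base are left unchanged. Otherwise the value
    is rounded up with probability remainder/base and down otherwise, so the
    expected value equals the original count.
    """
    remainder = value % base
    if remainder == 0:
        return value
    lower = value - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
original = [1, 4, 6, 7, 11, 30]
rounded = [random_round(v, base=3, rng=rng) for v in original]
print(rounded)
```

Every published value is a multiple of 3 and differs from the true count by at most 2, so a user cannot tell whether a published 3 was originally 1, 2, 3, 4 or 5.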

Data swapping

process of moving the values of one or more variables from one microdata record to another record, so it no longer poses a disclosure risk
microdata treatment method

Differencing or differencing attack

where someone with access to multiple tables can deduce the true values of cells that had been modified or suppressed
individual tables may be non-disclosive, but when the tables are compared, the difference between cells across the tables may be disclosive
example: if a user accessed a table with information on 20-25 year olds and then accessed a subsequent table with information on 20-24 year olds, the difference between the two tables will reveal information about 25 year olds only
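
The age-group example above can be sketched as a simple subtraction of two tables. All counts below are invented for illustration.

```python
# Hypothetical counts from two separately released tables
table_20_to_25 = {"employed": 140, "unemployed": 12}
table_20_to_24 = {"employed": 118, "unemployed": 11}


def difference_tables(wider, narrower):
    """Subtract matching cells to recover counts for the group only in `wider`."""
    return {category: wider[category] - narrower[category] for category in wider}


aged_25_only = difference_tables(table_20_to_25, table_20_to_24)
print(aged_25_only)
```

Neither released table contains a small cell, yet the difference shows a single unemployed 25 year old, which is exactly the kind of cell a frequency rule is meant to protect.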

Direct identification

when the data includes an identifier (such as name or address) that can be used, without any additional information, to establish the identity of a person, group or organisation

Disclosure or disclosive

a breach of confidentiality, where a person, group or organisation is identified or has previously unknown characteristics (attributes) associated with them as a result of releasing data

Disclosure control

the process of limiting the risk of an individual or organisation being directly or indirectly identified
can be via statistical (data focused) or non-statistical (data context-focused) techniques or processes

Disclosure risk management

In the context of confidentiality, determining whether released datasets (or sections of released datasets) constitute a risk of disclosure or re-identification, and then putting in place controlling mechanisms to mitigate those risks
the Five Safes framework provides a way of assessing risk within the constraints provided by policies and legislation

Five Safes framework

multi-dimensional approach to managing disclosure risk, consisting of safe people, safe projects, safe settings, safe data and safe outputs
each safe is considered both individually and in combination to determine disclosure risks and to put in place mitigation strategies for releasing and accessing data

Frequency rule

sets a particular value for the minimum number of unweighted contributors (such as people, households or businesses) to any cell in the table
cells with very few contributors (small cells) may pose a disclosure risk
common threshold values are 3, 5 or 10
if a cell fails this rule, further investigation or action is needed to ensure the cell is adequately protected
also called the threshold rule
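
A minimal sketch of the frequency rule, assuming a threshold of 5 and a table stored as a mapping from cell labels to unweighted contributor counts. The table contents and function name are hypothetical, and skipping zero cells is an assumption of this sketch rather than part of the rule.

```python
def cells_failing_frequency_rule(table, threshold=5):
    """Return cells whose unweighted contributor count is below threshold.

    Zero cells are skipped here; whether empty cells need protection is a
    separate decision for the data custodian.
    """
    return {cell: n for cell, n in table.items() if 0 < n < threshold}


# Hypothetical table: (region, age group) -> number of contributors
table = {
    ("Region A", "15-24"): 42,
    ("Region A", "85+"): 2,
    ("Region B", "15-24"): 5,
    ("Region B", "85+"): 0,
}
failing = cells_failing_frequency_rule(table, threshold=5)
print(failing)
```

Only the cell with 2 contributors fails. A cell with exactly 5 contributors meets the minimum and passes.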

Hierarchical data

datasets that contain more than one level
example: a dataset containing unit records with information about individual people (such as personal income) may also contain information about the families these people are part of (such as household income)

Identified data

data that includes information that refers directly to an individual or organisation, such as name, address, ABN or Medicare number

Identifier(s)

information that directly establishes the identity of an individual or organisation
examples include name, address, driver's licence number, Medicare number and ABN
also known as direct identifiers

Indirect identification

occurs when the identity of an individual, group or organisation is disclosed due to a unique combination of characteristics (that are not direct identifiers) in a dataset
example: a famous individual may be identifiable on the basis of their age, sex, occupation, geography and income

List matching

where a user compares records from one dataset with records from another in an attempt to find records that have corresponding information, so that it may be concluded that the two records belong to the same individual
this is a clear breach of the Privacy Act and other legislation governing data access where this is done in an attempt to re-identify that individual

Macrodata

see aggregate data

Microdata

datasets of unit records where each record contains information about a person, organisation or other type of unit
can include individual responses to a census, survey or administrative form

Open data

Data that is made available with no restriction on access or use (excluding possible copyright or licensing requirements). In terms of the Five Safes framework, the only control is on safe data.
Data on data.gov.au is open data as any researcher can download files
Data underlying ABS TableBuilder is not considered open data as there is a safe setting control - users cannot directly access the underlying microdata
Aggregate output (tables, graphs or maps) from TableBuilder is open data

Outlier

an unusual record that stands out from the rest of the population or sample because it has an extreme value for one or more data items
outliers are potentially risky for confidentiality

P% rule

statistical disclosure control rule that prevents any user from estimating the value of a cell contributor to within P% (where P is defined by the data custodian)
aggregate data treatment method
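
A commonly used formulation of this rule considers the best-placed attacker, the second-largest contributor, who can subtract their own value from the cell total to estimate the largest contributor. The cell fails if the remaining contributions are smaller than P% of the largest value. The sketch below assumes that formulation. The function name, P = 10 and the sample cells are illustrative.

```python
def fails_p_percent_rule(contributions, p=10.0):
    """Return True if the largest contributor can be estimated to within p%.

    Assumes the attacker is the second-largest contributor, who knows the
    cell total and their own value.
    """
    values = sorted(contributions, reverse=True)
    if not values:
        return False  # nothing to disclose in an empty cell
    largest = values[0]
    second = values[1] if len(values) > 1 else 0
    # What is left after removing the two largest values is the attacker's
    # estimation error for the largest contributor.
    remainder = sum(values) - largest - second
    return remainder < (p / 100.0) * largest


risky_cell = [800, 150, 30, 20]   # remainder 50 is less than 10% of 800
safer_cell = [400, 300, 200, 100]  # remainder 300 exceeds 10% of 400
print(fails_p_percent_rule(risky_cell))
print(fails_p_percent_rule(safer_cell))
```

Under this formulation any cell with only one or two non-zero contributors fails, because the remainder is zero and the largest value can be recovered exactly.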

Personal information

information that identifies, or could identify, a person
can include not only names and addresses, but also medical records, bank account details, photos, videos, and even information about what a person likes or where they work
information can still be personal without having a name attached to it
example: date of birth and postcode may be enough to identify someone
see also Sensitive information

In the Privacy Act 1988, personal information is "information or an opinion about an identified individual, or an individual who is reasonably identifiable:

whether the information or opinion is true or not true; and
whether the information or opinion is recorded in a material form or not."

Perturbation

a statistical disclosure control technique used for count or magnitude data (aggregate data) or for microdata
data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much data content and structure as possible
data rounding is a type of perturbation

Privacy

not specifically defined in the Privacy Act
an individual's right to have their personal information kept confidential unless informed consent has been given to release the information, or a legal authority exists - this is in accordance with the requirements of the Privacy Act 1988

Re-identification

the act of determining the identity of a person or organisation using publicly or privately held information about that individual or organisation

Remote analysis facility

remote access facilities are used by agencies around the world
enables approved researchers to submit data queries from their desktops through a secure online interface
requests are run against microdata that is securely stored within the data custodian's control

Remarkable characteristics

rare characteristics or attributes in the data that can pose an identification risk, depending on how extraordinary or noticeable they are
may include unusual jobs, very large families or very high income
remarkable characteristics (or remarkable combinations of characteristics) can lead to re-identification of individuals, households or organisations

Respondent

see data provider

Response knowledge

information that is publicly or privately known about a respondent
may be used to breach confidentiality

Rounding

see data rounding

Safe data

one of the Five Safes, safe data poses the question: has appropriate and sufficient protection been applied to the data?
at a minimum, direct identifiers such as name and address must be removed or encrypted
further statistical disclosure control may be needed depending on the context in which data is released

Safe outputs

one of the Five Safes, safe outputs poses the question: are the statistical results non-disclosive?
the final check, aiming for negligible risk of disclosure
all data made available outside of the data custodian's IT environment must be checked for disclosure
example: statistical experts may check all outputs for inadvertent disclosure before the data leave a secure data centre

Safe people

one of the Five Safes, safe people poses the question: is the researcher appropriately authorised to access and use the data?
by placing controls on the way data is accessed, the data custodian requires the researcher to take some responsibility for preventing re-identification
as the detail in the data increases, so should the level of user authorisation required

Safe projects

one of the Five Safes, safe projects poses the question: is the data to be used for an appropriate purpose?
before users can access detailed microdata, they may need to demonstrate to the data custodian that their project has a valid research aim, public benefit and/or statistical purpose
depends on the context in which the data is accessed

Safe settings

one of the Five Safes, safe settings poses the question: does the access environment prevent unauthorised use?
can be considered in terms of both the IT and physical environment
in some data access contexts, such as open data, safe settings are not applicable
at the other end of the spectrum, sensitive information is accessed through secure research centres

Secure Research Centre

see data laboratory

Security

safe storage of, and access to, data held by organisations or individuals
covers both IT security and the physical security of buildings

Sensitive information (data)

sensitive information is considered a subset of personal information
under the Privacy Act, is of greater importance in terms of confidentiality (in particular where it leads to worse consequences for a re-identified individual)
the Office of the Australian Information Commissioner lists a number of characteristics about an individual that are defined as sensitive
community and ethical expectations may not consider this list to be exhaustive (example: financial information is not present)
all personal information can be potentially sensitive depending on the context and the individual concerned
businesses may consider much of their information to be sensitive, but the Privacy Act applies only to personal information

Statistical or research purposes

purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, the dissemination of those outputs and the information describing them
statistical or research purposes may be distinguished from administrative, regulatory, compliance, law enforcement or other purposes that affect the rights, privileges or benefits of particular individuals or organisations

Suppression

not releasing information that is considered a disclosure risk
aggregate data:
1. removing specific values from a table so that people and organisations cannot be re-identified from the released data
2. initial suppression is known as primary suppression
3. additional cells needing suppression are known as consequential or secondary suppression
microdata:
1. removing specific records from the microdata file
2. removing specific data items for all records on the microdata file
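
For aggregate data, the relationship between primary and secondary suppression can be sketched on a single table row whose total is published. This is a deliberately simplified illustration: the threshold of 5, the function name and the rule of hiding the smallest remaining cell are assumptions, and real secondary suppression has to consider row and column totals together.

```python
def suppress_row(cells, threshold=5):
    """Apply primary then secondary suppression to one table row.

    `cells` maps category -> count; the row total is assumed to be published.
    Returns a copy in which suppressed cells are replaced by None.
    """
    # Primary suppression: non-zero cells below the threshold
    suppressed = {c for c, n in cells.items() if 0 < n < threshold}
    # Secondary suppression: a single hidden cell could be recovered by
    # subtracting the visible cells from the published total, so hide the
    # smallest remaining cell as well.
    if len(suppressed) == 1:
        remaining = {c: n for c, n in cells.items() if c not in suppressed}
        if remaining:
            suppressed.add(min(remaining, key=remaining.get))
    return {c: (None if c in suppressed else n) for c, n in cells.items()}


row = {"North": 40, "South": 3, "East": 12, "West": 25}
print(suppress_row(row))
```

Here South is hidden by primary suppression and East by secondary suppression. With both hidden, the published total reveals only their sum.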

Tabular data

see aggregate data

Threshold rule

see frequency rule

Uniqueness

where an individual has a characteristic or combination of characteristics that are different to all other members in a population or sample
determined by the size of the population or sample, the degree to which it is segmented (for example by geographic information), and the number and detail of characteristics provided for each unit in the dataset
records that are unique are not necessarily re-identifiable, as this also depends on the remarkability of the characteristics and the availability of other information or knowledge held by the researcher (response knowledge)
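
A minimal sketch of how uniqueness depends on the number of characteristics provided for each unit. The records, variable names and function name are hypothetical.

```python
from collections import Counter

# Hypothetical microdata records
records = [
    {"age_group": "30-39", "sex": "F", "occupation": "teacher"},
    {"age_group": "30-39", "sex": "F", "occupation": "teacher"},
    {"age_group": "40-49", "sex": "M", "occupation": "astronaut"},
    {"age_group": "30-39", "sex": "M", "occupation": "teacher"},
]


def unique_records(rows, keys):
    """Return rows whose combination of values for `keys` occurs exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if combos[tuple(row[k] for k in keys)] == 1]


by_age_only = unique_records(records, ["age_group"])
by_all_three = unique_records(records, ["age_group", "sex", "occupation"])
print(len(by_age_only), len(by_all_three))
```

Adding variables increases the number of unique records, which is why the level of detail released matters. A unique record is still not necessarily re-identifiable, as noted above.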

Unit record data

see microdata
