1160.0 - ABS Confidentiality Series, Aug 2017  
Latest ISSUE Released at 11:30 AM (CANBERRA TIME) 23/08/2017  First Issue

GLOSSARY


Administrative data
Information (including personal information) collected by agencies for the administration of programs, policies or services and with the potential to be used for statistical purposes. Administrative data can be microdata (unit-record data) or macrodata (aggregate data).

Aggregate data
Aggregate data is produced by grouping information into categories and combining values within these categories, for example, a count of the number of people of a particular age (obtained from the question ‘In what year were you born?’). Also referred to as tabular data or macrodata, aggregate data is typically presented in tables.
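
As a minimal sketch of the grouping described above, the following turns a handful of unit records into a count of people by age. The records, the field names and the reference year are all made up for illustration, and age is approximated as the difference between years.

```python
from collections import Counter

# Hypothetical unit records (microdata): one entry per respondent.
unit_records = [
    {"person_id": 1, "birth_year": 1990},
    {"person_id": 2, "birth_year": 1985},
    {"person_id": 3, "birth_year": 1990},
    {"person_id": 4, "birth_year": 1972},
]
REFERENCE_YEAR = 2017  # assumed collection year, for illustration only

# Group by age (the category) and count the people in each group.
age_counts = Counter(REFERENCE_YEAR - r["birth_year"] for r in unit_records)

# Present the aggregate data as a simple table of age and count.
for age in sorted(age_counts):
    print(age, age_counts[age])
```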

Attribute disclosure
This occurs when previously unknown information is revealed about an individual, group or organisation (without necessarily formally re-identifying them).

Cell concentration rule
See cell dominance rule.

Cell dominance rule
A rule used to assess whether a table cell may enable re-identification or attribute disclosure. The cell dominance rule (also called the cell concentration rule) finds cells where a small number of data providers contribute a large percentage to the cell. If a cell fails this rule, further investigation or data treatment is needed to ensure the attributes of predominant data providers are not disclosed.
For more information see Part 4: Managing the risk of disclosure: Treating aggregate data.
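
One way such a check might be sketched is shown below. The function name and the parameter values (the 2 largest contributors, a 75% limit) are assumptions chosen for illustration; in practice the data custodian sets these parameters.

```python
def fails_dominance_rule(contributions, n=2, k_percent=75.0):
    """Return True if the n largest contributors supply more than
    k_percent of the cell total. n and k_percent are hypothetical."""
    total = sum(contributions)
    if total <= 0:
        return False  # nothing to dominate in an empty cell
    top_n = sorted(contributions, reverse=True)[:n]
    return 100.0 * sum(top_n) / total > k_percent

# Made-up cells: each list holds the values contributed by data providers.
cell_a = [900, 50, 30, 20]  # top 2 supply 95% of the total
cell_b = [30, 25, 25, 20]   # top 2 supply 55% of the total

print(fails_dominance_rule(cell_a))
print(fails_dominance_rule(cell_b))
```

A cell flagged by this check is only a candidate for further investigation or treatment; the check itself does not protect the data.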

Confidentialised Unit Record Files (CURFs)
Files produced by the ABS which consist of microdata that have had direct identifiers (such as name and address) removed and further data treatment applied to limit the risk of information about individuals or organisations being disclosed.

Confidentiality
Confidentiality refers to protecting the secrecy and privacy of information collected from individuals and organisations, and ensuring that no data is released in a manner likely to enable their identification.
For more information, see Part 1: What is confidentiality and why is it important?

Confidentiality rules
Rules applied to all cells in a dataset in order to identify elements that pose a risk of disclosure. Two common rules applied to aggregate data are the frequency rule and the cell dominance rule.

Data access context
The environment and manner in which data are released. To effectively manage the risk of disclosure, data custodians need to consider who will have access to the data, the purpose for which the data will be used and the release environment itself (whether physical, IT or legal).

Data custodian
The organisation or agency responsible for the collection, management and release of data. Data custodians have legal and ethical obligations to keep the information they are entrusted with confidential.

Data laboratory
A secure data environment where researchers can perform detailed analysis of microdata. Data laboratories (also known as secure research centres) can be accessed virtually (i.e. remotely) or on-site. The ABS data laboratory is called the DataLab.

Data modification
A technique used to treat data to limit re-identification or other disclosure. Data modification changes all non-zero cells by a small amount while aiming to maintain the table’s overall usefulness. The methods discussed in this series are rounding and perturbation.
For more information, see Part 4: Managing the risk of disclosure: Treating aggregate data.

Data provider
An individual, household, business or other entity that supplies data for statistical or administrative purposes.

Data reduction
A technique for statistical disclosure control. Data reduction methods control or limit the amount of detail available in a table to prevent individuals or organisations from being re-identified. Methods include combining variables or categories, or suppressing (i.e. removing) information in unsafe cells.
For more information, see:

  • Part 4: Managing the risk of disclosure: Treating aggregate data
  • Part 5: Managing the risk of disclosure: Treating microdata.

Data rounding
This involves slightly altering cells in a table to make them all divisible by the same number. (Common numbers used for rounding are 3, 5 or 10.) Data rounding may be random or controlled. It prevents the original data values from being known with certainty while ensuring the usefulness of the data is not significantly affected.
For more information, see Part 4: Managing the risk of disclosure: Treating aggregate data.
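
A minimal sketch of random rounding follows, assuming a base of 3 and an unbiased scheme in which a value is rounded up with probability equal to its remainder divided by the base. The function name, the seed and the table values are illustrative only, and this is not necessarily the scheme used by any particular agency.

```python
import random

def random_round(value, base=3, rng=random):
    """Randomly round a non-negative count to a multiple of base.

    Values already divisible by base are left unchanged. Otherwise the
    value is rounded up with probability remainder / base, so the
    expected rounded value equals the original value.
    """
    remainder = value % base
    if remainder == 0:
        return value
    lower = value - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower

rng = random.Random(2017)  # fixed seed so the sketch is repeatable
table = [7, 12, 1, 25]     # made-up cell counts
rounded = [random_round(v, base=3, rng=rng) for v in table]
print(rounded)
```

Every rounded cell is divisible by 3 and differs from the original by less than 3, so a user cannot tell with certainty what the original value was.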

Data swapping
A process of moving the values of one or more variables in a record of a Unit Record File to where that record will not pose a disclosure risk.
For more information, see Part 5: Managing the risk of disclosure: Treating microdata.

Differencing, or differencing attack
This is where someone with access to multiple tables can deduce the true values of cells that had been modified or suppressed. The individual tables may be non-disclosive, but when the tables are compared, the difference between cells across the tables may be disclosive.
For example, if a user accessed a table with information on 20–25 year olds and then accessed a subsequent table with information on 20–24 year olds, the difference between the two tables will reveal information about 25 year olds only.
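
The age-range example above can be sketched with two small tables. All counts and category names are invented for illustration.

```python
# Two made-up tables, each of which might look safe on its own.
table_20_to_25 = {"employed": 140, "unemployed": 12}
table_20_to_24 = {"employed": 121, "unemployed": 11}

# Subtracting one table from the other isolates the 25 year olds.
derived_25_only = {
    category: table_20_to_25[category] - table_20_to_24[category]
    for category in table_20_to_25
}
print(derived_25_only)
```

Here the derived table contains a cell with a single person, even though neither released table had a cell that small.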

Direct identification
This occurs when the data includes an identifier (e.g. name or address) that can be used, without any additional information, to establish the identity of a person, group or organisation.

Disclosure
A breach of confidentiality, where a person, group or organisation is identified, or has previously unknown characteristics (attributes) associated with them, as a result of releasing data.

Disclosure control
The process of limiting the risk of an individual or organisation being directly or indirectly identified. This can be via statistical (i.e. data focussed) or non-statistical (i.e. data context-focussed) techniques or processes.

Disclosure risk management
In the context of confidentiality, this involves determining whether released datasets (or sections of released datasets) constitute a risk of disclosure or re-identification, and then putting in place controlling mechanisms to mitigate those risks. The Five Safes Framework provides a way of assessing risk within the constraints provided by policies and legislation.
For more information, see Part 3: Managing the risk of disclosure: the Five Safes Framework.

Five Safes Framework
A multi-dimensional approach to managing disclosure risk, consisting of Safe People, Safe Projects, Safe Settings, Safe Data and Safe Outputs. Each safe is considered both individually and in combination to determine disclosure risks and to put in place mitigation strategies for releasing and accessing data.
For more information, see Part 3: Managing the risk of disclosure: the Five Safes Framework.

Frequency rule
Also called the threshold rule, this sets a particular value for the minimum number of unweighted contributors (e.g. people, households or businesses) to any cell in the table. Cells with very few contributors ('small cells') may pose a disclosure risk. Common threshold values are 3, 5 and 10. If a cell fails this rule, further investigation or action is needed to ensure the cell is adequately protected.
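
A minimal sketch of a frequency rule check is given below. The threshold of 5, the function name and the counts are assumptions for illustration, as is the choice to treat empty cells as not failing the rule.

```python
def small_cells(table, threshold=5):
    """Return the labels of cells whose unweighted contributor count is
    above zero but below the threshold. Treating zero cells as passing
    is an assumption made for this sketch."""
    return [label for label, count in table.items() if 0 < count < threshold]

# Made-up unweighted counts of contributors per age group.
counts = {"15-24": 42, "25-34": 3, "35-44": 17, "45-54": 0}
print(small_cells(counts))
```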

Hierarchical data
Datasets that contain more than one level. For example, a dataset containing unit records with information about individual people (e.g. personal income) may also contain information about the families these people are part of (e.g. household income).

Identified data
Data that includes information that refers directly to an individual or organisation (e.g. name, address, ABN or Medicare number).

Identifier(s)
An identifier, or direct identifier, is information that directly establishes the identity of an individual or organisation. The following are examples of identifiers: name, address, driver's licence number, Medicare number and Australian Business Number.

Indirect identification
This occurs when the identity of an individual, group or organisation is disclosed due to a unique combination of characteristics (that are not direct identifiers) in a dataset. For example, a famous individual may be identifiable on the basis of their age, sex, occupation and income.

List matching
This is where a user compares records from one dataset with records from another in an attempt to find records that have corresponding information, such that it may be concluded that the two records belong to the same individual. Where this is done in an attempt to re-identify that individual, there is a clear breach of the Privacy Act and other legislation governing data access.

Macrodata
See aggregate data.

Microdata
Datasets of unit records where each record contains information about a person or organisation. This information can include individual responses to questions on surveys or administrative forms.

Open data
Situations where data is made available with no restriction on access or use (excluding possible copyright or licensing requirements). In terms of the Five Safes Framework, the only control is on Safe Data. Thus data on data.gov.au would be considered open data. On the other hand, data underlying the ABS TableBuilder product would not be (as there is a Safe Setting control); however once tables are produced they are considered open data.

Outlier
An unusual record that, because it has an extreme value for one or more data items, stands out from the rest of the population or sample.

P% rule
A statistical disclosure control rule that prevents any user from estimating the value of a cell contributor to within P% (where P is defined by the data custodian).
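
One common formulation of this rule in the statistical disclosure control literature, assumed here rather than taken from this series, treats the second-largest contributor as the best-placed user: they can subtract their own value from the cell total, and whatever remains beyond the largest contribution is their estimation error. The cell fails if that remainder is less than P% of the largest contribution. The function name and the value P = 10 are illustrative.

```python
def fails_p_percent_rule(contributions, p=10.0):
    """Return True if the contributions other than the two largest sum
    to less than p percent of the largest contribution (a common
    formulation, assumed for this sketch)."""
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False  # an empty cell has no contributor to estimate
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * ordered[0]

# Made-up cells of contributor values.
cell_a = [1000, 200, 30, 20]    # remainder 50 is below 10% of 1000
cell_b = [400, 350, 150, 100]   # remainder 250 is above 10% of 400

print(fails_p_percent_rule(cell_a))
print(fails_p_percent_rule(cell_b))
```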

Personal information
According to the Privacy Act 1988, personal information is 'information or an opinion about an identified individual, or an individual who is reasonably identifiable:
(a) whether the information or opinion is true or not true; and
(b) whether the information or opinion is recorded in a material form or not’.
In other words, personal information is information that identifies, or could identify, a person. This can include not only names and addresses, but also medical records, bank account details, photos, videos and even information about what a person likes and where they work. Information can still be personal without having a name attached to it. For example, in some cases, date of birth and postcode may be enough to identify someone.

Perturbation
A statistical disclosure control technique used for count or magnitude data. Perturbation is a data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much data content and structure as possible. Perturbation techniques include data rounding.
For more about statistical disclosure control techniques, see Part 4: Managing the risk of disclosure: Treating aggregate data.

Privacy
Although not specifically defined in the Privacy Act, privacy is generally considered as an individual’s right to have their personal information kept confidential unless informed consent has been given to release the information, or a legal authority exists. This is in accordance with the requirements of the Privacy Act 1988.

Re-identification
Re-identification is the act of determining the identity of a person or organisation using publicly or privately held information about that individual or organisation.

Remote analysis facility
Remote analysis facilities are used by agencies around the world. These facilities enable approved researchers to submit data queries from their desktops through a secure online interface. Requests are run against a Confidentialised Unit Record File (CURF) that is securely stored inside the data custodian's computing environment.

Remarkable characteristics
Rare characteristics or attributes in the data that can pose an identification risk, depending on how extraordinary or noticeable they are. These might include unusual jobs, very large families or very high income. Remarkable characteristics can lead to re-identification of individuals, households or organisations.

Respondent
See data provider.

Response knowledge
Information that is publicly or privately known about a respondent. This information may be used to breach confidentiality.

Rounding
See data rounding.

Safe Data
One of the Five Safes, Safe Data poses the question: has appropriate and sufficient protection been applied to the data? At a minimum, direct identifiers such as name and address must be removed or encoded. Further statistical disclosure control may also need to be applied depending on the context in which data is released.

Safe Outputs
One of the Five Safes, Safe Outputs poses the question: are the statistical results non-disclosive? This is the final check, which aims for negligible risk of disclosure. All data made available outside of the data custodian’s IT environment must be checked for disclosure. For example, statistical experts may check all outputs for inadvertent disclosure before the data leave a secure data centre.

Safe People
One of the Five Safes, Safe People poses the question: is the researcher appropriately authorised to access and use the data? By placing controls on the way data is accessed, the data custodian invests some responsibility for preventing re-identification in the researcher. The general rule is that as the detail in the data increases, so should the level of user authorisation required.

Safe Projects
One of the Five Safes, Safe Projects poses the question: is the data to be used for an appropriate purpose? Before users can access detailed microdata, they may need to demonstrate to the data custodian that their project has a valid research aim, public benefit or statistical purpose. Again the requirements under Safe Projects will depend on the context in which the data is accessed.

Safe Settings
One of the Five Safes, Safe Settings poses the question: does the access environment prevent unauthorised use? The environment here can be considered in terms of both the IT and physical environment. In some data access contexts, such as Open Data, Safe Settings are not applicable. At the other end of the spectrum, sensitive information is accessed through secure research centres.

Secure Research Centre
See data laboratory.

Security
Safe storage of, and access to, held data. Security covers both IT security and the physical security of buildings.

Sensitive information (data)
Sensitive information is considered a subset of personal information and, within the Privacy Act, is afforded a greater importance in terms of confidentiality (in particular leading to worse consequences for a re-identified individual). The Office of the Australian Information Commissioner lists a number of characteristics about an individual that are defined as sensitive.
Community expectations and ethical considerations, however, may mean this list is not regarded as exhaustive (e.g. financial information is not on it). Indeed, it could be argued that all personal information is potentially sensitive, depending on the context and the individual concerned. In addition, businesses may consider much of their information to be sensitive.

Spontaneous recognition
Where a user inadvertently recognises an individual or organisation in a dataset, without deliberately attempting to identify them. Data custodians should expect that this may occur in the normal course of data analysis. Generally this risk only applies to microdata, but it could also apply to aggregate data if the outputs have not been checked rigorously enough (see Safe Outputs).

Statistical purposes
Purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, the dissemination of those outputs and the information describing them. Statistical purposes may be distinguished from administrative, regulatory, compliance, law enforcement or other purposes that affect the rights, privileges or benefits of particular individuals or organisations.

Suppression
This means not releasing information deemed to be a disclosure risk. Data suppression involves removing specific values from a table so that people and organisations cannot be re-identified from the released data.
For more information, see Part 4: Managing the risk of disclosure: Treating aggregate data.
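
A minimal sketch of suppressing unsafe cells is shown below, assuming cells are judged unsafe by a simple count threshold. The threshold of 5, the "np" marker and the region counts are hypothetical.

```python
def suppress_small_cells(table, threshold=5, marker="np"):
    """Replace counts above zero but below the threshold with a marker.
    The threshold and marker are illustrative choices."""
    return {
        label: (marker if 0 < count < threshold else count)
        for label, count in table.items()
    }

# Made-up counts by region.
counts = {"Region A": 120, "Region B": 2, "Region C": 48}
released = suppress_small_cells(counts)
print(released)
```

This sketch removes only the flagged cells. If row or column totals were also released, a suppressed value could be recovered by subtraction, so further cells may need treatment in practice.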

Tabular data
See aggregate data.

Threshold rule
See frequency rule.

Uniqueness
The situation where an individual can be distinguished from all other members in a population or sample. The existence of uniqueness is determined by the size of the population or sample, the degree to which it is segmented (e.g. by geographic information), and the number and detail of characteristics provided for each unit in the dataset.
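
As an illustrative sketch, uniqueness within a sample can be checked by counting how many records share each combination of characteristics. The records, variable names and function name are invented for this example.

```python
from collections import Counter

def find_unique_records(records, keys):
    """Return the records whose combination of values for the given
    keys appears exactly once in the dataset."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]

# Made-up unit records with direct identifiers already removed.
records = [
    {"age_group": "30-39", "sex": "F", "occupation": "Teacher"},
    {"age_group": "30-39", "sex": "F", "occupation": "Teacher"},
    {"age_group": "60-69", "sex": "M", "occupation": "Astronaut"},
]

unique = find_unique_records(records, ["age_group", "sex", "occupation"])
print(unique)
```

Adding more characteristics or finer categories to the keys tends to increase the number of unique records, which is why the level of detail matters.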

Unit record data
See microdata.

If you have questions or feedback, please email: microdata.access@abs.gov.au