Managing Confidentiality and Learning about SEIFA
Since the release of data from the 2001 Census, the ABS has taken the opportunity to investigate potential areas for improvement in the delivery and usefulness of Census data. Over previous editions of Census Update we’ve provided you with details of many of these changes, including:
• Mesh Blocks
• Census Data Enhancement
With about 15 months to go until the first release of data from the 2006 Census, we’ve been very busy refining our plans. If you’d like to keep abreast of developments or gain some further insight into the technical details of our work, the Census Articles link from the Census home page (www.abs.gov.au/census) will provide you with easy access to the information you’re interested in.
At present there are two major pieces of work we are continuing to progress to the point where we can present some useful information to our users. The first is a new perturbation algorithm to protect the confidentiality of data from the 2006 Census. The second is our plan for the calculation and dissemination of the SEIFA indexes for the 2006 Census. Some details are included below, with further information available from our website.
New Perturbation Algorithm
The ABS is proposing to apply a new confidentiality algorithm to the release of data from the 2006 Census. Many users of Census data have expressed an interest in understanding more about how the algorithm will be applied and the effect it will have on data. It’s very important that users:
• Understand the need for it,
Under the Census and Statistics Act it is an offence to release any information collected under the Act that is likely to enable identification of any particular individual or organisation. In accordance with that requirement the ABS has, since the early 1980s, deployed an algorithm designed to protect the confidentiality of its tabular data. The existing algorithm removes values of 1 and 2 from tables prior to their release.
In the past the ABS has released static tables of data, and the existing confidentiality algorithm has worked well in this environment, with a few minor shortcomings. However, the dissemination plans for the 2006 Census include provision for the ABS to release data in a more dynamic environment. This will give users better access to data and will allow them to interact with it and to design and populate their own tables of Census data.
This new environment exposes the ABS to increased levels of risk. In particular, the capacity for clients to design and populate their own tables greatly increases the risk of clients obtaining unrandomised data by ‘differencing’ the contents of two separate tables. Differencing occurs when a user is able to access two tables which vary only slightly in content. Subtracting the values in one table from those in the other will often reveal unrandomised data. The risk from differencing is so great that the ABS would not be able to continue with its plans for a more dynamic user environment without implementing a more sophisticated protection mechanism.
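To make the differencing risk concrete, here is a minimal sketch in Python. The area, the occupation categories and every count are invented for illustration, and the helper name `difference` is our own; nothing here is drawn from real Census output.

```python
# Two hypothetical tables for the same small area that differ only in
# the age range they cover. All figures are invented for illustration.
table_aged_15_plus = {"Manager": 14, "Labourer": 23, "Clerk": 9}
table_aged_15_to_64 = {"Manager": 13, "Labourer": 23, "Clerk": 9}


def difference(table_a, table_b):
    """Subtract table_b from table_a, cell by cell."""
    return {cell: table_a[cell] - table_b[cell] for cell in table_a}


# Neither table contains a very small count, yet the difference between
# them is a table for people aged 65 and over that isolates one person.
aged_65_plus = difference(table_aged_15_plus, table_aged_15_to_64)
print(aged_65_plus)
```

In this made-up case neither published table looks risky on its own. The small count only appears once the two are compared.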
Methodologists within the ABS have come up with a relatively simple yet elegant solution to this problem. The existing perturbation algorithm, i.e. removing values of 1 and 2, alters very few cells in relatively few tables. The new perturbation algorithm is designed to potentially alter every cell in every table by a small amount. In doing so it adds sufficient ‘noise’ to each cell so that by differencing, users would end up with more noise than real data.
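The idea of adding a small amount of noise to every cell can be sketched as follows. This is only an illustration under stated assumptions: the bound `MAX_NOISE`, the uniform random adjustment and the function name `perturb_table` are hypothetical choices of ours, not the parameters or method of the ABS algorithm, and the counts are invented.

```python
import random

MAX_NOISE = 2  # hypothetical bound on the adjustment; not an ABS figure


def perturb_table(table, rng):
    """Adjust every cell by a small random amount, never going below zero."""
    return {
        cell: max(0, count + rng.randint(-MAX_NOISE, MAX_NOISE))
        for cell, count in table.items()
    }


rng = random.Random(2006)  # fixed seed so the illustration is repeatable

# Invented counts for two tables that vary only slightly in content.
table_aged_15_plus = {"Manager": 14, "Labourer": 23, "Clerk": 9}
table_aged_15_to_64 = {"Manager": 13, "Labourer": 23, "Clerk": 9}

released_15_plus = perturb_table(table_aged_15_plus, rng)
released_15_to_64 = perturb_table(table_aged_15_to_64, rng)

# What a user would obtain by differencing the two released tables.
apparent_65_plus = {
    cell: released_15_plus[cell] - released_15_to_64[cell]
    for cell in released_15_plus
}
print(apparent_65_plus)
```

Under these assumptions each released cell is within 2 of its true value, so larger counts are barely affected. A differenced cell, however, can be out by as much as 4, which swamps a true difference of 0 or 1.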
Having decided to adopt this approach, we have several options that can help ensure the outcomes are the best possible. For example, it is desirable that the same table is always ‘randomised’ in exactly the same way. It is also desirable that table totals are preserved between tables with common geographies. For example, one table of Age by Occupation and one of Religion by Sex for the same geographical area should produce an identical table total.
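One way to picture these two properties is to derive each cell's adjustment from a description of the cell itself rather than from a fresh random draw. The sketch below does this with a hash. It is an assumption made purely for illustration: the article does not say how the ABS achieves repeatability, and the names `cell_noise`, `release` and `MAX_NOISE`, the key format and the counts are all hypothetical.

```python
import hashlib

MAX_NOISE = 2  # hypothetical bound on the adjustment; not an ABS figure


def cell_noise(cell_key: str) -> int:
    """Derive a repeatable adjustment in [-MAX_NOISE, MAX_NOISE] from a cell description."""
    digest = hashlib.sha256(cell_key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value % (2 * MAX_NOISE + 1) - MAX_NOISE


def release(true_count: int, cell_key: str) -> int:
    """Return the adjusted count for a cell; the same cell always gets the same result."""
    return max(0, true_count + cell_noise(cell_key))


AREA = "Area A"      # hypothetical geographical area
TRUE_TOTAL = 412     # invented count of persons in that area

# The total cell is described by the area alone, so it does not matter
# which table it appears in.
total_key = f"{AREA}|total persons"
age_by_occupation_total = release(TRUE_TOTAL, total_key)
religion_by_sex_total = release(TRUE_TOTAL, total_key)

print(age_by_occupation_total == religion_by_sex_total)
```

Because the adjustment depends only on the cell's description, requesting the same table twice gives the same figures, and the total for an area is identical in every table that shares that geography. In this simple sketch the interior cells are adjusted independently, so they will not necessarily sum to the released total.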
For many users of Census data the change of algorithm will be unnoticeable. The new algorithm has no greater impact on the quality of the data than our existing algorithm. However, some of our more sophisticated users will notice some changes in the data, particularly in tables that are sparsely populated.
• Be confident that it is not unnecessarily corrupting data, and
• Understand the extent of its ‘influence’ and ‘allow’ for it when using Census data.
Socio-Economic Indexes for Areas (SEIFA)
SEIFA indexes continue to grow in popularity amongst Census users. In the past we’ve published a number of articles in Census Update relating to SEIFA indexes. The feedback we received from users regarding the release of SEIFA indexes has highlighted a number of issues we need to resolve before we release the next round of SEIFA indexes.
The first and most important issue is the release date for SEIFA indexes. Due to a few last-minute technical issues, the release of SEIFA indexes for the 2001 Census was delayed. We are currently working to resolve all of our technical issues so that once the data becomes available the indexes can be calculated and released without undue delay. The planned release date for SEIFA indexes for the 2006 Census is March 2008. This is six months earlier in the Census cycle than our previous release.
Another important issue is the manner in which SEIFA indexes are compared and the conclusions that are drawn. Unlike many other numbering systems, SEIFA indexes are based on ordinal rather than cardinal numbers. The SEIFA indexes are designed solely to indicate relative order. They are not designed to provide a direct comparison in the level of disadvantage or advantage. For example, an area with a disadvantage index of 600 is not necessarily twice as disadvantaged as an area with a disadvantage index of 1200. To limit the misuse of these statistics we are investigating the feasibility of releasing the indexes ranked by the decile in which they lie. For example, an area with an index in the lowest 10 percent of the distribution would be given a ranking of 1, while an area with an index in the highest 10 percent of the distribution would be given a ranking of 10.
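A decile ranking of this kind can be sketched in a few lines of Python. The area names and scores are invented, the function name `assign_deciles` is our own, and the handling of ties and of area counts that do not divide evenly by ten is an assumption; the approach eventually adopted for SEIFA may differ.

```python
def assign_deciles(scores: dict) -> dict:
    """Rank areas from lowest to highest score and map each to a decile from 1 to 10."""
    ordered = sorted(scores, key=scores.get)  # area names, lowest score first
    n = len(ordered)
    return {
        area: (position * 10) // n + 1
        for position, area in enumerate(ordered)
    }


# Twenty hypothetical areas with invented index scores from 600 to 1170.
scores = {f"Area {i}": 600 + 30 * i for i in range(20)}
deciles = assign_deciles(scores)

print(deciles["Area 0"], deciles["Area 19"])
```

The decile records only where an area sits in the ordering. It deliberately discards the size of the gap between scores, which is the part of the index that tends to be over-interpreted.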
Lastly, we are investigating the release of single scores for aggregated areas. The larger the aggregation, for example data at the Statistical Local Area or Local Government Area level, the more likely it is that there are large differences in advantage and disadvantage within the area. The distribution of indexes within the area is likely to convey as much useful information as the single index for the area as a whole. To better communicate this level of information we are examining the release of additional data to indicate the distribution of index scores within larger aggregated areas.
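The following sketch shows why a single score can hide as much as it reveals. It assumes, purely for illustration, that the single score for a larger area behaves like a simple average of the scores of the smaller areas inside it; the article does not describe how the single index is calculated. The scores and the function name `summarise` are invented.

```python
import statistics


def summarise(scores):
    """Return a single score (a simple mean, by assumption) plus a distribution summary."""
    lower_quartile, median, upper_quartile = statistics.quantiles(scores, n=4)
    return {
        "single_score": statistics.mean(scores),
        "min": min(scores),
        "lower_quartile": lower_quartile,
        "median": median,
        "upper_quartile": upper_quartile,
        "max": max(scores),
    }


# Invented scores for the small areas inside two hypothetical larger areas.
uniform_area = [980, 990, 1000, 1010, 1020]
mixed_area = [700, 850, 1000, 1150, 1300]

uniform_summary = summarise(uniform_area)
mixed_summary = summarise(mixed_area)

print(uniform_summary["single_score"], mixed_summary["single_score"])
print(uniform_summary["max"] - uniform_summary["min"],
      mixed_summary["max"] - mixed_summary["min"])
```

Both hypothetical areas end up with the same single score, but the spread of scores within them is very different. A distribution summary released alongside the single score would let users see that difference.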
Over the coming months we’ll be providing more information to users about the progress of our work on the next generation of SEIFA indexes.