ASSESSING A DATA SOURCE
When assessing a potential data source, it is important to understand and document the aspects of the source that are critical to the design of the system that will process the acquired data. It is also important to provide information that supports sound use of the data by helping users understand their quality and limitations. This includes the information likely to be extracted for statistical analysis and those features of the environment in which the administrative system exists that can influence quality, e.g. the way the information is collected, or public sensitivities about the collector or the system.
As with the questions used to help clarify a data need, a list of questions for assessing a data source can be compiled using the ABS Data Quality Framework.
Examples of questions for assessing a data source include, but are not limited to, the following:
- Institutional Environment
- What type of organisation collected the data?
- Under what authority/legislation was the data collected?
- Does the data get compiled by a different organisation from the data custodian that collected it? If so, what type of organisation is it?
- Can confidentiality be maintained?
- How was the data collected? e.g. self-report, proxy, or automatic/observation.
- Population - does it match user requirements? Is it close enough?
- Do the data need transforming to match the required population (e.g. aggregation to different units), or do they need adjustments or supplementary collection to cover the missing population? That is, decide on scope and coverage and determine which records are in and which are out.
- Will longitudinal analysis be of value for meeting user needs? If so, some form of unique identifier will need to be used for each record to allow information to be built up over time.
- Who or what was the data collected about?
- Do the data meet the demographic requirements for the statistical purpose (i.e. age, sex, employment status, geographic locations, etc.)?
- What was the original purpose for collecting the data?
- Have standard classifications been used?
- Are all records (census) required or will a sample be adequate?
- How often are the data collected?
- When are the data available?
- Over what period are the data collected?
- Are updates or revisions to the data likely after release?
- What cut-off dates should apply for obtaining records? How will late or missing records be dealt with? Will amendments to the data be made over time (i.e. will revisions to series be made)?
- Do the data items collected adequately fit within the reference period? Has the reference period been consistently applied to the collection of all data?
- How often will the data be extracted / transferred to the receiving agency for use?
- Have the data been adjusted in any way? e.g. imputed, substituted, edited. If so, what adjustments were made?
- What is the collection size?
- What are the rates of non-response or under reporting?
- Have any parts of the population been unaccounted for?
- What has been done to manage under counts or over counts, if any were present?
- Are there sensitive or biased questions or topics collected? e.g. age or income.
- Are there hierarchical records? Are all levels of the hierarchy required or do the records require some transformation into a flat record? Do the records require some grouping or consolidation of reporting units into a statistical unit?
- Can the receiving agency ask the data custodian to do any follow-up work with regard to missing data?
- Are the data consistent over time and across organisations (i.e. can data from different jurisdictions be compared or are the data incomparable)?
- Are rates or percentages calculated using the same data source? If not, then further information needs to be provided on how these figures have been created (i.e. what differences affect the comparability and what impact do the differences have?).
- What standards and classifications of data items are used?
- Is a time series of the data available?
- Have there been changes to the underlying data collection?
- Are there any real world events (e.g. new laws, tourist events or disease outbreaks) that could impact on the data?
- Are there any contextual issues that need to be raised?
- Is other information available to help users understand the data source?
- Are there ambiguous or technical terms that may require further explanation?
- Can data that are not released be requested? If so, what is the correct process for the request?
- What format are the data available in? e.g. spreadsheet, column specifications etc.
- Are there any privacy or confidentiality issues to be considered?
- What security will be in place for the transfer of the information from the data custodian to the receiving agency?
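The question above about hierarchical records can be made concrete with a small sketch. It assumes a hypothetical extract in which each enterprise (treated here as the statistical unit) holds a list of reporting units; the field names `enterprise_id`, `reporting_units`, `unit_id` and `employees` are illustrative only and are not taken from any real administrative system. The sketch shows one possible way to flatten such records and then consolidate reporting units into a single row per statistical unit.

```python
from collections import defaultdict

# Hypothetical hierarchical extract: each enterprise (statistical unit)
# has one or more reporting units beneath it. All names are illustrative.
hierarchical = [
    {"enterprise_id": "E1", "reporting_units": [
        {"unit_id": "U1", "employees": 10},
        {"unit_id": "U2", "employees": 5},
    ]},
    {"enterprise_id": "E2", "reporting_units": [
        {"unit_id": "U3", "employees": 7},
    ]},
]


def flatten(records):
    """Return one flat row per reporting unit, carrying the parent identifier."""
    return [
        {"enterprise_id": parent["enterprise_id"], **child}
        for parent in records
        for child in parent["reporting_units"]
    ]


def consolidate(flat_rows):
    """Sum employees across reporting units to give one total per enterprise."""
    totals = defaultdict(int)
    for row in flat_rows:
        totals[row["enterprise_id"]] += row["employees"]
    return dict(totals)


flat = flatten(hierarchical)
by_enterprise = consolidate(flat)
```

Whether a flat file, a consolidated file, or the full hierarchy is needed depends on the statistical purpose, so the choice of consolidation rule (a simple sum here) would need to be agreed with the data custodian.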
For more examples of questions for assessing administrative data sources, please see "http://www.nss.gov.au/dataquality/PDFs/DQO_Admin.pdf" on the National Statistical Service website.
In conjunction with the list of questions for assessing administrative data sources at the acquisition stage, it would be useful to also compile a list of quality measures (or indicators) that will help in an assessment of the data, once again using the ABS Data Quality Framework as a guide to ensure all aspects of quality are measured. These measures can be qualitative (e.g. Yes / No) or quantitative (e.g. 97%). This assessment could be repeated each time the data custodian provides the administrative data to the receiving agency, in order to provide the most up-to-date quality information about the statistics. This information may then be included in the report that accompanies the statistical outputs derived from the administrative data.
Some examples of quality measures for assessing a data source could include, but are not limited to:
- Institutional Environment
- Changes in the data custodian's underlying legislation;
- Frequency distribution for key variables;
- Outstanding returns;
- Coverage of population;
- Date of receipt of records;
- Number of missing values for key variables;
- Population counts (which can be monitored over time by themselves and/or against predicted likely counts);
- Business births since last period;
- Business deaths since last period;
- Changes to units;
- Changes to definitions; and
- Number of new staff directly involved in managing the administrative data at the data custodian agency.
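Several of the quantitative measures listed above could be computed automatically each time a file is received. The following is a minimal sketch only, assuming records arrive as a list of dictionaries, that a missing value is represented by `None` or an empty string, and that an expected population count is available from some other source. The function names, field names and the crude coverage ratio (records received divided by expected population) are assumptions made for illustration, not an ABS method.

```python
from collections import Counter

MISSING = (None, "")  # assumed representations of a missing value


def quality_measures(records, key_variables, expected_population):
    """Compute a few simple quality indicators for one delivery of data."""
    n = len(records)
    # Number of missing values for each key variable.
    missing = {
        var: sum(1 for r in records if r.get(var) in MISSING)
        for var in key_variables
    }
    # Frequency distribution of the non-missing values of each key variable.
    frequencies = {
        var: Counter(r[var] for r in records if r.get(var) not in MISSING)
        for var in key_variables
    }
    # Crude coverage ratio: records received over the expected population.
    coverage = n / expected_population if expected_population else None
    return {
        "record_count": n,
        "missing": missing,
        "frequencies": frequencies,
        "coverage": coverage,
    }


def births_and_deaths(previous_ids, current_ids):
    """Units new since last period (births) and units no longer present (deaths)."""
    previous, current = set(previous_ids), set(current_ids)
    return current - previous, previous - current


# Illustrative data only.
records = [
    {"id": 1, "sex": "F", "state": "NSW"},
    {"id": 2, "sex": "M", "state": ""},
    {"id": 3, "sex": None, "state": "VIC"},
    {"id": 4, "sex": "F", "state": "NSW"},
]
measures = quality_measures(records, ["sex", "state"], expected_population=5)
births, deaths = births_and_deaths([1, 2, 9], [r["id"] for r in records])
```

Storing the returned values for each delivery would allow the measures to be monitored over time and reported alongside the statistical outputs.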