Data Processing

Introduction
Data processing involves translating the answers on a questionnaire into a form that can be manipulated to produce statistics. In general, this involves coding, editing, data entry, and monitoring the whole data processing procedure. The main aim of checking the various stages of data processing is to produce a file of data that is as error free as possible. Adopting a methodical and consistent approach to each processing task is important to ensure that the processing is completed satisfactorily and on schedule. The following discussion will give a brief introduction to the main stages of data processing.

Stages in Data Processing
Functions involved in data processing include: despatch and collection control; data capture and coding; and editing.

Despatch and Collection Control
A despatch and collection control (DACC) system can be employed to organise the distribution and receipt of the survey forms. It is also used to perform mark-in functions, initiate reminder action and, importantly, generate management information reports on the collection's progress.

All of the above tasks could be performed manually but, if the survey is large, they are more likely to be automated using some kind of generalised system. The functionality required in the DACC phase can vary significantly depending on the collection technique, i.e. whether a mail survey, personal interview, telephone survey or a combination of techniques is used.

DACC segments include population file set-up and maintenance, label production, hardcopy/electronic collection control registers, mark-in, reminder control, intensive follow-up control and management information. Management information covers response rate progress, returns to sender, timing and effects of each phase such as reminder action.
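
The mark-in, reminder control and management information segments can be illustrated with a small sketch. The Python below is only an illustration: the register layout, the status labels and the function names are hypothetical, and the response rate is assumed to be forms returned divided by forms despatched.

```python
from collections import Counter

# Hypothetical collection control register: form identifier -> current status.
register = {
    "F001": "returned",
    "F002": "outstanding",
    "F003": "returned",
    "F004": "return_to_sender",
    "F005": "outstanding",
}


def progress_report(register: dict) -> dict:
    """Summarise mark-in progress as counts and a response rate."""
    counts = Counter(register.values())
    despatched = len(register)
    returned = counts["returned"]
    return {
        "despatched": despatched,
        "returned": returned,
        "return_to_sender": counts["return_to_sender"],
        # Assumed definition: returned forms as a share of despatched forms.
        "response_rate": returned / despatched if despatched else 0.0,
    }


def reminder_list(register: dict) -> list:
    """Forms still outstanding, i.e. candidates for reminder action."""
    return sorted(form for form, status in register.items() if status == "outstanding")


report = progress_report(register)
```

Running the report before and after each reminder phase would give a simple view of the timing and effect of that phase.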

Data Capture and Coding
The questionnaire can be used as a working document for the transfer of data on to a computer file, therefore removing the need for a separate data input form. This also removes a stage that could produce transcription errors. Consequently, it is important to design the questionnaire to facilitate data entry.

Coding
Unless all the questions on a questionnaire are "closed" questions, some degree of coding is required before the survey data can be sent for punching. The appropriate codes should be devised before the questionnaires are processed, and are usually based on the results of pilot tests.

Most surveys are too large and complex to be analysed by checking questionnaires and counting responses. Surveys usually require a system by which the responses can be transferred onto a computer file for analysis. This system would involve translating responses from "closed" questions into numerical codes.

Coding consists of labelling the responses to questions in a unique and abbreviated way, using numerical codes, in order to facilitate data entry and manipulation. Codes should be simple: for example, if Question 1 has four possible types of response, those responses could be given the codes 1, 2, 3 and 4. The advantage of coding is that data are stored simply as a code of a few digits, rather than as lengthy alphabetical descriptions that would almost certainly be difficult to categorise.

Coding is a relatively expensive task in terms of resource effort. However, improvements are always being sought by developing automated techniques to cover this task.

The coding frame for most questions can be devised before the main interviewing begins or forms are despatched. That is, the likely responses are obvious from previous similar surveys or through pilot testing, allowing those responses and relevant codes to be printed on the questionnaire. An "Other" answer code is often added to the end of a coding frame with space for respondents or interviewers to write the answer.
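
A coding frame of this kind can be represented as a simple lookup table. The following is a minimal sketch in Python, assuming a hypothetical question with four printed response categories plus an "Other" code; the category labels, the code values and the function name code_response are invented for the example.

```python
# Hypothetical coding frame for one question: four printed categories.
CODING_FRAME = {
    "employed full-time": 1,
    "employed part-time": 2,
    "unemployed": 3,
    "not in the labour force": 4,
}
OTHER_CODE = 5  # assumed code for write-in answers that are not in the frame


def code_response(answer: str) -> int:
    """Return the numeric code for a response, or OTHER_CODE if it is not in the frame."""
    key = answer.strip().lower()
    return CODING_FRAME.get(key, OTHER_CODE)


# "Student" is not a printed category, so it falls into the "Other" code.
coded = [code_response(a) for a in ["Unemployed", "employed full-time", "Student"]]
```

Here coded holds [3, 1, 5]. In practice the write-in text behind each "Other" code would be kept so that the frame can be reviewed if one answer turns up frequently.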

A major function of the ABS is the development of statistical standards, classifications and frameworks. These classifications cover a wide range of subjects, including the:

ASGC Australian Standard Geographical Classification (Cat. No. 1216.0);
ASCO Australian Standard Classification of Occupations (Cat. No. 1222.0);
ANZSIC Australian & New Zealand Standard Industrial Classification (Cat. No. 1292.0); and
ABS Standards for Statistics on the Family (Cat. No. 1286.0).

ABS classifications should always be considered when designing questions for a questionnaire because they are easily coded and allow for comparison with ABS data.

Data Capture
Data capture methods are very expensive in terms of both staff and time. Some of the advances in recent times that reduce the number of errors introduced by data capture are given below.

In a mail-based survey errors may occur when clerical staff enter the data using a computer terminal. These errors can often be reduced by improving the design of the data entry system. This is increasingly possible with the introduction of computer aided data entry (CADE) systems which can be tailored to the form being entered. A good CADE system would present screen images of pages from the paper form, making it easy for the clerk to enter the respondent's values at the appropriate points.

Another approach is to minimise manual data entry by using machine-readable returns. Optical mark reading (OMR) can be used to read form identifiers, and in some cases to read responses consisting of marks. Optical character recognition (OCR) allows the computer to read handwritten forms directly.

In interviewer-based surveys errors may occur as the interviewer fills out the form and also later when the data is entered into the computer. If the double transfer of data can be avoided the opportunity for errors can be reduced. Approaches to this are computer assisted telephone interviewing (CATI) and computer assisted personal interviewing (CAPI) which involve the interviewer entering data directly into a computer terminal rather than onto a form.

Factors to Consider
When changing to a new data capture system, there are several issues that need to be considered. These include the accuracy of results, the time in which results can be output, the staffing required for the new system and the costs of the changeover. In particular, some of the considerations might be:

in on-going surveys, a change in system can cause a drift in the results, not allowing comparability between different survey periods;
a reliance on a small number of data capture machines can cause timeliness problems when breakdowns occur;
fewer data punchers may be needed if the data capture machines process forms more quickly;
when changing systems, there are large up-front costs in introducing hardware, but improvements in the capture system should lower staff costs.

Editing
What is editing?
Editing is the process of correcting faulty data, in order to allow the production of reliable statistics.

Data editing does not exist in isolation from the rest of the collection processing cycle and the nature and extent of any editing and error treatment will be determined by the aims of the collection. In many cases it will not be necessary to pay attention to every error.

Errors in the data may have come from respondents or have been introduced during data entry or data processing. Editing aims to correct a number of non-sampling errors, which are errors that may occur in both censuses and sample surveys. Non-sampling errors include those introduced by misunderstanding questions or instructions, interviewer bias, miscoding, non-availability of data, incorrect transcription, non-response and non-contact. Editing will not reveal all non-sampling errors, however: while an editing system could be designed to detect transcription errors, missing values and inconsistent responses, other problems such as interviewer bias may easily escape detection.

Editing should aim:

to ensure that outputs from the collection are mutually consistent, for example, a component should not exceed an aggregate value; two different methods of deriving the same value should give the same answer;
to detect major errors, which could have a significant effect on the outputs;
to find any unusual outputs and their causes.

The required level of editing
The function of editing is to help achieve the aims of a collection. Before edits are created or modified it is therefore important to know these aims, since they largely determine the nature of the editing system created for the given collection. We need to know about features such as:

the outputs from the collection
the level at which outputs are required
their required accuracy
how soon after the reference period the outputs are needed
the users and uses of the collection. A collection may be simple (with limited data collected) and designed to meet the requirements of only one type of user (e.g., Survey of new Motor Vehicle Registration and Retail Trade) or it may collect more complex data and aim to meet the needs of many different types of users (e.g. Agricultural Finance Survey, Household Expenditure Survey, etc.). If there are many types of users there is a likelihood of conflicting requirements amongst the users, which can lead to stresses on the collection.
the reliability of each item (e.g. is the definition easily understood or is the item sensitive?)

While the goal of editing is to produce data that represent as closely as possible the activity being measured, there are usually a number of constraints (such as the time and number of people available for data editing) within which editing is conducted. These constraints will also influence the design of the editing system for the given collection.

The structure of an edit
An edit is defined by specifying:

the test to be applied,
the domain, which is a description of the set of data that the test should be applied to, and
the follow-up action if the test is failed.

The test
This is a statement of something that is expected to be true for good data. A test typically consists of data items connected by arithmetic or comparison operators. Ideas for suitable tests may arise from people with a knowledge of the subject matter, the aims of the collection or relationships that should hold between items.

Examples

this item should not be missing
the sum of these items equals that item

The Domain
The domain is defined by specifying the conditions which the data must satisfy before the test can be applied.

Example

A test may only be relevant to those businesses in a certain industry and the domain will therefore consist of all records belonging to that industry.

The Follow-up
The edit designer must also think about the appropriate follow-up action if a test is failed. Some edit failures are minor: they simply require human attention, and the data do not need to be amended. Other edit failures are major and require both human attention and an amendment. The treatment given to an edit failure is commonly determined by classifying edits by grade of severity, such as fatal, query and warning.

Example

Where a record lacks critical information which is essential for further processing a fatal error should be displayed.
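
The three parts of an edit (test, domain and follow-up) map naturally onto a small data structure. The Python sketch below is one possible arrangement rather than a description of any particular editing system; the Edit class, the apply_edits function, the field names and the industry label are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # a unit record, assumed here to be a plain dictionary


@dataclass
class Edit:
    name: str
    domain: Callable[[Record], bool]  # which records the test applies to
    test: Callable[[Record], bool]    # expected to be True for good data
    severity: str                     # follow-up grade: "fatal", "query" or "warning"


def apply_edits(record: Record, edits: list) -> list:
    """Return (edit name, severity) for every edit the record fails."""
    failures = []
    for edit in edits:
        # The test is only applied to records inside the edit's domain.
        if edit.domain(record) and not edit.test(record):
            failures.append((edit.name, edit.severity))
    return failures


# Hypothetical edits for illustration only.
edits = [
    # Applies to every record: the identifier must not be missing.
    Edit("id_present", lambda r: True, lambda r: r.get("id") is not None, "fatal"),
    # Applies only to records in one industry: turnover must be reported.
    Edit(
        "retail_turnover_reported",
        lambda r: r.get("industry") == "retail",
        lambda r: r.get("turnover") is not None,
        "query",
    ),
]

failures = apply_edits({"id": None, "industry": "retail", "turnover": None}, edits)
```

A record from another industry with a missing turnover would not fail the second edit, because it falls outside that edit's domain.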

It is important to note that even if we go through comprehensive editing processes errors may still occur, as editing can identify only noticeable errors. Information wrongly given by respondents or wrongly transcribed by interviewers can only be corrected when there are clues that point to the error and provide the solution. Thus, the final computer file will not be error-free, but hopefully should be internally consistent.

Generally different levels of editing are carried out at several stages during data processing. Some of the stages involved are provided below.

Clerical Coding
This stage includes mark-in of the forms as they are returned, all manual coding (e.g. country and industry coding) and manual data conversion (e.g. miles to kilometres).

Clerical Editing
This stage includes all editing done manually by clerks before the unit data are loaded into a computer file.

Input Editing
Input editing deals with each respondent independently and includes all "within record" edits which are applied to unit records. It is carried out before any aggregation for the production of estimates is done. An important consideration in input editing is the setting of the tolerances for responses. Setting tight tolerances will result in the generation of large numbers of edit failures, which impacts directly on resources and on the meeting of timetables.

Ideally, an input edit system is designed only after carefully considering edit tolerances, clerical scrutiny levels, resource costs (against benefits), respondent load and timing implications.

Output Editing
Output editing includes all edits applied to the data once it has been weighted and aggregated in preparation for publication. If a unit contributes a large amount to a cell total, then the response for that unit should be checked with a follow-up.
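
One way to find units that contribute a large amount to a cell total is to compare each unit's weighted value with the total. The Python sketch below assumes a hypothetical function name, dominant_units, and an arbitrary 50% threshold; a real collection would set its own threshold.

```python
def dominant_units(weighted_values: dict, threshold: float = 0.5) -> list:
    """Return the units whose share of the cell total exceeds the threshold.

    weighted_values maps a unit identifier to its weighted contribution to
    one output cell. The 0.5 default is an assumption for illustration.
    """
    total = sum(weighted_values.values())
    if total == 0:
        return []  # no contributions, so no unit can dominate
    return sorted(
        unit for unit, value in weighted_values.items() if value / total > threshold
    )


# Unit "A" supplies 900 of a cell total of 1000, so it is flagged for follow-up.
flagged = dominant_units({"A": 900.0, "B": 50.0, "C": 50.0})
```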

Output editing is not restricted to examination of aggregates within the confines of a survey. A good output edit system will incorporate comparisons against other relevant statistical indicators.

Types of Edits Commonly Used
Validation Edit
Checks the validity or legality of basic identification or classificatory items in unit data.

Examples

the respondent's reference number is of a legal form
state code is within the legal range
sex is coded as either M or F
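
These three validation checks could be written as follows. This is a sketch only: the reference number pattern (two letters followed by six digits), the state code range of 1 to 8 and the field names are assumptions made for the example.

```python
import re

# Assumed formats, for illustration only.
REFERENCE_PATTERN = re.compile(r"[A-Z]{2}\d{6}")
VALID_STATE_CODES = range(1, 9)  # assumed legal range: 1 to 8 inclusive
VALID_SEX_CODES = {"M", "F"}


def validation_failures(record: dict) -> list:
    """Return the names of identification or classificatory items that are not valid."""
    failed = []
    if not REFERENCE_PATTERN.fullmatch(str(record.get("reference", ""))):
        failed.append("reference")
    if record.get("state") not in VALID_STATE_CODES:
        failed.append("state")
    if record.get("sex") not in VALID_SEX_CODES:
        failed.append("sex")
    return failed
```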

Missing Data Edit
Checks that data that should have been reported were in fact reported. An answer to one question may determine which other questions are to be answered and the editing system needs to ensure that the right sequence of questions has been answered.

Examples

in an employment survey a respondent should report a value for employment
a respondent who has replied NO to the question: Do you have any children? should not have answered any of the questions about the ages, sexes and education of any children
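
The sequencing check in the second example can be sketched in Python. The item names and the YES/NO coding are hypothetical, and the final check (child items expected after a YES) is an assumed extension of the example.

```python
# Hypothetical names for the follow-on questions about children.
CHILD_ITEMS = ("child_ages", "child_sexes", "child_education")


def missing_data_failures(record: dict) -> list:
    """Check that the filter question and its follow-on questions agree."""
    failed = []
    has_children = record.get("has_children")
    if has_children is None:
        failed.append("has_children not reported")
    answered = [item for item in CHILD_ITEMS if record.get(item) is not None]
    if has_children == "NO" and answered:
        # Follow-on questions should have been skipped.
        failed.append("child items answered despite NO")
    if has_children == "YES" and len(answered) < len(CHILD_ITEMS):
        # Assumed rule: all follow-on questions are expected after a YES.
        failed.append("child items missing despite YES")
    return failed
```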

Logical Edit
Ensures that two or more categorical items in a record do not have contradictory values.

Example

a respondent claiming to be 16 years old and receiving the age pension would clearly fail an edit
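
A logical edit of this kind compares two items in the same record. In the Python sketch below the field names are hypothetical and the minimum age is an assumed threshold chosen for illustration, not an official eligibility rule.

```python
MIN_AGE_PENSION_AGE = 60  # assumed threshold for illustration only


def age_pension_edit_passes(record: dict) -> bool:
    """Return False when a record claims the age pension below the assumed minimum age."""
    if record["income_source"] != "age pension":
        return True  # outside the domain of this edit
    return record["age"] >= MIN_AGE_PENSION_AGE
```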

Consistency (or reconciliation) edits
Checks that precise arithmetical relationships hold between continuous numeric variables that are subject to such relationships. Consistency edits could involve the checking of totals or products.

Examples

totals: a reported total should equal the sum of the reported components
products: if one item is a known percentage of another then this can be checked (e.g. land tax paid should equal the product of the taxable value and the land tax rate)
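
Both kinds of consistency edit reduce to comparing a reported value with a derived one. The Python sketch below uses hypothetical function names and allows a small absolute tolerance (an assumption, set here to half a cent) so that rounding in floating-point arithmetic does not cause spurious failures.

```python
import math


def total_edit_passes(components, reported_total, abs_tol=0.005):
    """Totals: the reported total should equal the sum of the reported components."""
    return math.isclose(sum(components), reported_total, rel_tol=0.0, abs_tol=abs_tol)


def product_edit_passes(taxable_value, tax_rate, tax_paid, abs_tol=0.005):
    """Products: tax paid should equal taxable value multiplied by the tax rate."""
    return math.isclose(taxable_value * tax_rate, tax_paid, rel_tol=0.0, abs_tol=abs_tol)
```

For example, total_edit_passes([10.0, 20.5, 4.5], 35.0) passes, while total_edit_passes([10, 20], 31) fails.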

Range Edit
Checks that approximate relationships hold between numeric variables that are subject to such relationships. A range edit can be thought of as a loosening of a consistency edit, and its definition will include a description of the range of acceptable values (or tolerance range).

Examples

If a company's value for number of employees increases by more than a certain predefined amount (or proportion) then the unit will fail. Note that both the absolute change and the proportional change should be considered, since the two can rank changes differently: a change from 500 to 580 is large in absolute terms (80) but only 16% in proportional terms, while a change from 6 to 10 is small in absolute terms (4) but about 67% in proportional terms. So, if the edit were defined to accept the record when the current value is within 20% of the previous value, the change from 500 to 580 would be accepted and the change from 6 to 10 would be queried.
In a survey which collects total amount loaned for houses and total number of housing loans from each lending institution it would probably be sensible to check that the derived item average housing loan is within an acceptable range.
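
The two range edits above can be sketched in Python. The function names, the optional absolute limit and the treatment of a zero previous value are assumptions made for the example; values are assumed to be non-negative counts or amounts.

```python
def movement_edit(previous, current, max_proportion=0.20, max_absolute=None):
    """Return "accept" or "query" for a period-to-period movement.

    A record is queried if the proportional change exceeds max_proportion or,
    when max_absolute is given, if the absolute change exceeds it.
    """
    change = abs(current - previous)
    if max_absolute is not None and change > max_absolute:
        return "query"
    if previous == 0:
        # Assumed treatment: any movement away from zero is queried.
        return "query" if change > 0 else "accept"
    if change / previous > max_proportion:
        return "query"
    return "accept"


def average_loan_edit(total_loaned, number_of_loans, low, high):
    """Check that the derived average housing loan lies within [low, high]."""
    if number_of_loans <= 0:
        return "query"  # the average cannot be derived
    average = total_loaned / number_of_loans
    return "accept" if low <= average <= high else "query"
```

With the default 20% tolerance, movement_edit(500, 580) returns "accept" and movement_edit(6, 10) returns "query". Adding max_absolute=50 would also query the change from 500 to 580.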
