OPTICAL CHARACTER RECOGNITION (OCR) FORMS
Section 1: Introduction
This section of the manual is intended to serve several groups of users by:
These standards for OCR forms should be fully implemented for any new or significantly changed scanned forms as they come up. They consolidate several features that have already been implemented on an ad hoc basis.
Areas designing OCR forms should consult with Data Collection Methodology (DCM), the Collection Management Unit (CMU) and the Despatch and Collection Unit (DCU) as early as possible in their design planning process.
Changes in this update
The main changes in this update relate to:
These standards apply to all forms (OCR and non-OCR) and now appear in the general form design standards Front of Form and Layout :
Converting 'conventional' forms to OCR forms
Converting a form to OCR format can take longer than first estimated. The InDesign work required to change instruction and data box formats, adjust spacing around data entry boxes, and accommodate the requirements of bar-coding and drop-out colours takes time, and often means that page and question spacing has to be modified.
In general, OCR forms require somewhat more space on a page than non-OCR forms. This means that forms can appear to be longer, and may require more pages to accommodate the same number of data entry fields. OCR also requires colour printing (not black and white) to allow some form details to drop out (be invisible to OCR). Black is the standard text colour for OCR and non-OCR forms.
These factors may moderately increase printing and postage costs, but generally, any increases will be more than offset by savings in other aspects of processing.
In addition, system changes may be required to support modified mark-in, load or edit processes. It is important to involve the DCU early in the process. The DCU tests the form for scanning and paints the recognition screens, and sufficient time must be allowed for these processes. The DCU requires six weeks, after receipt of metadata and printed forms, to create and test the IFP definitions.
Three months (elapsed) is a reasonable amount of time to allow for converting forms.
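The DCU lead time above can be turned into a simple planning check. The following is a minimal sketch only: the function name and the example date are hypothetical, and the six-week figure is the only value taken from this standard.

```python
from datetime import date, timedelta

# From the standard: the DCU requires six weeks after receipt of
# metadata and printed forms to create and test the IFP definitions.
DCU_LEAD_TIME = timedelta(weeks=6)


def latest_handover_to_dcu(scanning_start: date) -> date:
    """Latest date on which metadata and printed forms can reach the DCU.

    `scanning_start` is a hypothetical planning input: the date on which
    scanning of returned forms is expected to begin.
    """
    return scanning_start - DCU_LEAD_TIME


if __name__ == "__main__":
    # Example with an illustrative (not real) scanning start date.
    print(latest_handover_to_dcu(date(2025, 9, 1)))
```

A check of this kind covers only the DCU step; the three months of elapsed time for the whole conversion still needs to be planned separately.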
Density of forms
Forms which are sparse, in terms of the number of data items to be captured per page, will be less cost-effective for survey areas because OCR data capture through scanning is currently cost recovered on a per page basis. However, the temptation to make 'savings' by reducing the number of pages captured must be carefully assessed against the generally negative impact on data quality that arises from crowding questions on forms. Forms that are overcrowded with questions can lead to increased respondent burden and scanning errors, both of which may impact negatively on the quality of the data. Placing instructions on pages separate from data capture fields to reduce OCR costs is not an acceptable practice.
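The effect of per page cost recovery on sparse forms can be illustrated with simple arithmetic. This is a sketch under stated assumptions: the per page charge used here is a made-up figure, and the function name is hypothetical.

```python
def capture_cost_per_item(pages: int, data_items: int, cost_per_page: float) -> float:
    """Scanning cost attributed to each data item when charging is per page."""
    if pages <= 0 or data_items <= 0:
        raise ValueError("pages and data_items must be positive")
    return pages * cost_per_page / data_items


if __name__ == "__main__":
    # Hypothetical charge of 1.00 per page, 40 data items on the form.
    print(capture_cost_per_item(pages=8, data_items=40, cost_per_page=1.0))
    print(capture_cost_per_item(pages=6, data_items=40, cost_per_page=1.0))
```

The calculation shows only the cost side. It does not capture the data quality cost of crowding, which the standard treats as the more important consideration.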
'Label' and contact details changes and the Sample and Frame Maintenance Procedures (SFMP)
Current OCR technology can capture handwriting only when it is written in segmented text boxes. It cannot capture free text handwriting, in particular any that may be written on or near the address/contact ('label') information on the front of the form. This presents a problem for following the SFMP in cases where there are changes other than in 'return to sender' situations (see Section 7: Text handling and capture).
Master pages used for OCR forms design in InDesign
The master pages 'OCR/Full page' and 'OCR/Split page' have been created for OCR forms. The grids displayed on both master pages are in exactly the same position as the master pages used for conventional forms except for the registration marks.
The registration marks have been created and positioned at the corners of the page. They are vital items on the page, as they act as reference points for OCR read areas, and must be printed in solid black ink (PMS 405U). See Section 4 for more on registration marks.
Summary of introduction
OCR is the primary mode of data capture for self-completed paper forms. There are a number of issues which will impact on the design and development of individual forms.
Areas designing OCR forms should consult with DCM, CMU and the DCU as early as possible in their design planning process.
As well as catering to the technical requirements of the OCR scanning equipment, it is very important that (as with non-OCR forms) design standards are applied consistently throughout the form, primarily to make the form as easy as possible for the provider to complete and to maximise the quality of reported data.
The standard methods for developing and evaluating forms, such as observational studies and error analysis (described in the Forms Development and Evaluation Manual), should be used to evaluate the effectiveness of OCR forms.
Section 2: OCR front of form
The main content of this section relates to the use of bar codes and patch codes. Other front of form standards are as for non-OCR paper forms, including address box and mandatory wording.
Standard patch code
The patch code is positioned on the front of OCR forms and non-OCR forms which are also imaged. Its purpose is to indicate to the scanner the start of a new form. Patch code position must be consistent across all forms which are required to be scanned together in a batch. Horizontal and vertical alignment of the patch code is critical for proper operation. If the patch code is placed improperly on the document, the patch sensors may fail to sense the patch and the start of the next form. Patch codes are to be printed in solid black ink [PMS 405U].
Patch code placement
Patches must appear with the bars parallel to the leading edge of the document. There must be at least 5mm of space between the patch code and any other printed information. For an example of patch code placement see the Front of Form standards.
All forms are barcoded before they are despatched from the DCU, to facilitate mark-in and unit identification on receipt/scanning. The IFP barcode is based on the reference period (6 characters), the form identifier and the unit identifier.
A significant horizontal space on the front of the form is required to contain the barcode. The barcode will always be printed horizontally across the page, usually below the address box. The size of the barcode is determined by the number of digits used in the survey identifier. As a general guide the barcode will require a blank area of dimensions: minimum 12.5 cm wide x 2 cm.
For booklets the barcode must be on the front page and cannot be at the very top or bottom of the front page because the printer does not have sufficient tension to successfully print a barcode in either extremity of the form.
During scanning, the barcode, as well as other information printed on the form, tends to show through to the other side of the page. To avoid barcode recognition problems, keep the top one third of the first inside page (i.e. the second page) of a form free of any data entry boxes.
Information that is collected in the 'Contact details' section of the front of form is covered in the Front of Form standards. For OCR forms, the standard is for all contact details response boxes (except for the signature box) to be segmented, to allow the information to be captured and loaded to PIMS (if this requirement is specified by the BSC). Text provided in non-segmented boxes is unable to be captured by IFP.
Section 3: Colour
Drop out colour
Several colours that are widely used for paper forms with manually entered data do not scan well because, despite instruction to the contrary, data providers use blue pens to complete blue or bluish forms. Hence, there are fewer standard form colours for use with OCR forms.
The Despatch and Collection Unit (DCU) decides the actual colour used for forms based on operational considerations, for example, to assist with the "streaming" process, where incoming forms are sorted for processing. Survey areas should advise the DCU if they have several similar surveys in the field at the same time, so that different colours can be considered to avoid confusing respondents.
The following colours (generally in the red end of the spectrum) have been chosen for OCR forms because they are produced by a simple mix of two basic ink colours. This will make quality control more reliable. This standard uses the Pantone® Matching System (PMS), shown by the letters 'PMS' followed by a three or four digit number (for example, PMS 405, PMS 2735).
Black (PMS 405U) is the only acceptable colour for the main text of OCR forms. Particular care should be taken to ensure that the 3mm clear zone is maintained around OCR fields to avoid black text interfering with scanning. PMS 405U should also be used for patch codes and any other 'black' elements printed on the form.
Other text (i.e. text that is near, or forms part of, OCR fields) should be in a 100% screen of the background colour.
Section 4: Layout
This section covers broad OCR layout and is closely related to Section 5 (Other OCR Design Elements) which covers specific OCR-design elements and objects.
Avoid data entry boxes on reverse of front page barcode position
To avoid barcode problems, keep the top one third of the first inside page (i.e. the second page) of a form free of any data entry boxes. This is not generally a problem because this page usually contains the 'Please read this first' and 'General Instructions' sections.
The front of form templates used by the CMU when creating OCR forms incorporate the standard address box and provide space for the barcode directly below the address information (see Diagram 1).
Exemptions from the standard front of form, or variations from the standard need to be discussed with CMU, DCM and DCU, prior to approval.
Measuring zones / 'no go' area
Eye-guides and other form elements that are not in the drop-out colour are not to be within 3mm vertically or 3mm horizontally of any data entry box.
In some cases a need for 5mm horizontal clear space may be identified in form scanning testing. These cases include over-long lines of black text and those with long response spaces such as 'name' or some address fields.
A minimum of 3mm vertical spacing between data entry boxes is to be used throughout OCR forms. Separating scannable areas by a minimum of 3mm reduces the chance of scanning overlap, which reduces the frequency of scanning errors and subsequent repair costs. Exemption from this standard may be granted in the case of sparse matrices (where it is known that most respondents will only report a small number of data items). Requests for exemption from the standard must be discussed with CMU, DCU and DCM.
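The clear zone rule can be expressed as a geometric check on a form layout. The following is a minimal sketch, assuming each form element is described by a bounding rectangle in millimetres; the Rect class and function names are hypothetical and are not part of InDesign or IFP. The rule is interpreted here as: no non-drop-out element may intrude into a data entry box expanded by the clear margin on every side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding rectangle of a form element, in millimetres."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def expanded(self, margin: float) -> "Rect":
        """Return this rectangle grown by `margin` on every side."""
        return Rect(self.x - margin, self.y - margin,
                    self.width + 2 * margin, self.height + 2 * margin)

    def overlaps(self, other: "Rect") -> bool:
        """True if the two rectangles share any interior area."""
        return (self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)


def clear_zone_violations(boxes, elements, margin_mm=3.0):
    """List (box, element) pairs where an element sits inside a box's clear zone."""
    return [(box, element)
            for box in boxes
            for element in elements
            if box.expanded(margin_mm).overlaps(element)]
```

An element exactly 3mm from a box passes the check. Passing margin_mm=5.0 covers the cases where scanning tests identify a need for 5mm of horizontal clear space, although this sketch then applies the larger margin vertically as well.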
Data entry box borders and any lines inside text boxes must be in the drop-out colour.
To assist the OCR system in recognising and identifying forms, registration marks, page numbers and at least one other unique, numeric identifier are required on each page to act as reference points for OCR read areas and to identify the layout of the page being read.
On the front page where the page number is not available, the ABS logo and registration marks are used as reference points.
All other pages are to have page numbers at the top only, with two exceptions. These are:
It is important that these reference points are accurately located in the same position on each page. This also means that any trimming or guillotining of the form must be consistent for all pages of all copies of the form. Care needs to be taken to ensure that each sheet is an accurately sized sheet with undamaged edges available.
For OCR forms and non-OCR forms which are also imaged, registration marks must be printed in solid black ink and be placed near the four corners of each page (except for the front page, where the ABS logo provides the reference point in the top left corner and therefore a registration mark is not required here). A minimum 3 mm clear space should be provided around each registration mark. Each registration mark should be 5mm wide and 2mm high. All registration marks should be at least 5mm from the side edges and top (or bottom) of the page.
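The registration mark dimensions above lend themselves to an automated pre-print check. This is a sketch only: the function name is hypothetical, the A4 page size (210mm x 297mm) is an assumption not stated in this standard, and the 3mm clear space around each mark is not checked here.

```python
import math

MARK_WIDTH_MM = 5.0         # from the standard
MARK_HEIGHT_MM = 2.0        # from the standard
MIN_EDGE_DISTANCE_MM = 5.0  # from the standard


def registration_mark_problems(x, y, width, height,
                               page_width=210.0, page_height=297.0):
    """Return a list of problems for one registration mark.

    x and y are the left and top edges of the mark in millimetres,
    measured from the top left corner of the page. The default page
    size is A4, which is an assumption made for this sketch.
    """
    problems = []
    if not (math.isclose(width, MARK_WIDTH_MM)
            and math.isclose(height, MARK_HEIGHT_MM)):
        problems.append("mark is not 5mm wide and 2mm high")
    if min(x, page_width - (x + width)) < MIN_EDGE_DISTANCE_MM:
        problems.append("mark is less than 5mm from a side edge")
    if min(y, page_height - (y + height)) < MIN_EDGE_DISTANCE_MM:
        problems.append("mark is less than 5mm from the top or bottom edge")
    return problems
```

An empty list means the mark meets the size and edge distance requirements checked by this sketch.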
Registration marks are included in the OCR graphic templates used by the CMU.
The page number is also used as an identification mark for the OCR system. Page numbers are to be set in a sans serif typeface, because sans serif characters (whose strokes have little variation in thickness) are easily recognised by the scanner. Verdana is the standard sans serif type for ABS forms. A page number is required for scanning on every page except the front page.
The minimum space from the bottom of the page number to the top of the text in the first question/heading on a page is 5mm. Page recognition is compromised if the page number and question text are too close.
Intentionally blank pages must also include registration marks and page numbers.
Section 5: Other OCR design elements
This section covers specific OCR-friendly design elements and is closely related to section 4 (Layout) which covers broader OCR layout and design.
Data entry boxes
When creating forms, CMU staff use InDesign graphic libraries, which provide graphic objects such as answer boxes that are of an acceptable size, spacing and alignment. If the need for a new object is identified then the standard libraries will be updated by the CMU.
Numeric data entry boxes
Each digit is to have its own box drawn in the drop-out colour. If the number is large the fields are to be grouped by the use of commas (Diagram 2):
Black text is not to be used for scannable commas, decimals or trailing zeros in data entry boxes.
Several numeric data entry box formats suitable for OCR forms are included in the standard InDesign Templates, and where a need for additional formats is identified these will be included as necessary.
Numeric data entry boxes should allow a white space of 8mm x 4mm per character. Older forms currently using the previous standard of 6mm x 4mm may continue to do so, however any forms converting to OCR or new survey forms should use the 8mm x 4mm standard.
Commas and decimal points
Fields with decimal points impact on form definition and data files. If decimal places are to be reported they must be black if they are required to be visible on the image. They must occupy their own character space and must not touch the edge of an adjoining number box (Diagram 3).
Commas must be in drop-out colour in all cases.
Date separators ('/') are to be black (Diagram 4).
Check or 'ballot' boxes
Check or ballot boxes, for instance those used for 'Mark all that apply' or 'yes/no' responses, are to be 5mm square.
Where a negative value may be reported, an instruction to denote negatives with a sign and not brackets should be given above the box:
Free text answer spaces
Five different types of written-in or free text occur on survey forms and these are treated differently in form design and processing. See Section 7 for details on text handling.
Section 6: Special completion instructions ('Please read first')
Special completion instructions are required to instruct respondents in how to complete scannable (or OCR) forms. These include instructions on the use of a black ball point pen to complete the form, printing within the boxes provided, correcting errors and reporting negative values.
Standard OCR completion instructions
OCR-specific instructions must be included at the start of all OCR forms in the 'Please read this first' box:
Points of special note
Capturing negative values is difficult with OCR forms. The Bureau's scanning system can read both negatives and brackets, but there is more risk of brackets being misread (e.g. as '1') and leading to recognition errors.
Where negatives may be reported, data entry box titles or data entry column headers are to include an appropriately worded instruction such as 'Show a loss (deficit) with a negative sign'. (Brackets will still be recognised by IFP but incidence of use should be reduced.)
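The two ways respondents may report a negative value can be normalised after recognition. The following is a minimal sketch of that idea and does not describe how IFP itself works: the function name is hypothetical, and it assumes the recognised field content is available as a text string.

```python
import re
from typing import Optional


def parse_reported_value(raw: str) -> Optional[int]:
    """Convert recognised field text to an integer, or None if not numeric.

    A leading minus sign and surrounding brackets are both treated as
    negative, so '-1,250' and '(1,250)' give the same result.
    """
    text = raw.strip().replace(",", "").replace(" ", "")
    bracketed = re.fullmatch(r"\((\d+)\)", text)
    if bracketed:
        return -int(bracketed.group(1))
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    return None  # e.g. 'nil', 'n/a' or an empty field


if __name__ == "__main__":
    for sample in ["300", "-1,250", "(1,250)", "nil"]:
        print(repr(sample), parse_reported_value(sample))
```

A sketch like this only helps once the characters have been recognised correctly. It does not reduce the risk of a bracket being misread as '1', which is why the instruction to use a negative sign is still required.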
The details of the completion instructions are to be modified where necessary to reflect the reporting requirements - e.g. some forms do not have any need for the 'reporting of negatives' provision and some collect data in units other than $ '000.
Forms can also incorporate changes that are more relevant to a particular collection to increase respondent recognition and to tailor the form for them. For example, the correction instruction for 'Income' could refer instead to 'Total Assets' or 'Total Turnover'.
Respondents are to be instructed to use a tick ('✓') as a check-mark.
If the form uses ballot or check-mark selection boxes, such as for 'mark all that apply' or 'yes/no' questions, an additional instruction to use a tick ('✓'), tailored to the actual use of the boxes, can be included in the 'Please read this first' box, for instance:
'Mark with a tick the direction and size of any expected change for each item'.
The column titles for vertical lists with check boxes are to read either 'Tick one box' or 'Tick all that apply' as appropriate.
Section 7: Text handling and capture
Five different types of 'write-in' or free text occur on survey forms. These are 'Body-of-form text', 'General Comments', characters in numeric data entry boxes, front of form alterations, and miscellaneous text on the form. All are treated differently in form design and processing.
'Body-of-form-text' needs to be scanned and recognised for editing and coding and is captured in 8mm x 4mm segmented boxes.
The question must include the instruction '.... in BLOCK letters ...' with 'block' in upper case (See Diagram 7).
'Body-of-form' text will be scanned and recognised but will generally not be repaired because of the cost of repair. Using segmented text boxes means that sufficient letters are usually recognised to allow processing staff to make sense of the text in context even if some letters fail recognition. For example, a text response specifying capital expenditure as 'comput~rs' is recognisable and can be coded to 'computers'.
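The 'comput~rs' example can also be illustrated with approximate string matching. This is a sketch only, using Python's standard difflib module. It is not the method used by processing staff or by IFP, and the function name, code list and similarity cutoff are all hypothetical.

```python
import difflib
from typing import Optional, Sequence


def suggest_code(recognised: str, code_list: Sequence[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Suggest the closest entry in code_list for partly recognised text.

    Returns None when nothing is similar enough, so that the response
    can be left for a person to inspect.
    """
    matches = difflib.get_close_matches(recognised.lower(), code_list,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical code list for a capital expenditure question.
    codes = ["computers", "vehicles", "buildings"]
    print(suggest_code("COMPUT~RS", codes))
```

Input is lower-cased before matching because respondents are asked to write in BLOCK letters, while the code list in this sketch is held in lower case.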
All business forms are required to have a final 'Comments' box, and some forms also include boxes for comments in the body of the form. 'General comments' text can be quite lengthy, and segmented text boxes (as for body-of-form text) do not work as well for multiple lines. 'General comments' boxes are therefore not segmented, and this leads to respondents using free text handwriting in them.
Scanners can detect but not read free text handwriting. Scanning such text results in a recognition error, which is flagged so that the presence of text is known. Lists of forms with comments are then available to the relevant work areas, which can check the image of the comment. For this to work, comments boxes on OCR forms use the drop-out colour for the boxes and the lines in the boxes. 'General comments' are split into two boxes and workflow routed to responsible areas (Diagram 8).
Non-numeric characters written in numeric data entry boxes
Non-numeric characters are often written in numeric data entry boxes. These include 'nil', 'n/a', 'negative' etc., and also include use of dashes ( - ) to indicate not in scope or a null response and brackets to denote negative values.
Completion instructions are not to use 'nil', 'n/a' or draw lines in data entry boxes because the scanning system can read common characters in numeric boxes, though with less reliability than for numbers.
There are no standard instructions for the use of numeric zero ('0'). Most BSCs discourage the use of zero to save respondent effort and recognition/repair effort. However, sometimes it may be desirable for respondents to record zero to indicate a particular result (e.g. crop failure), or to demonstrate that they have actually read a particular question if sequencing errors are anticipated. Survey-specific instructions can be used if appropriate (e.g. " ... complete failure of any crop ... denoted by '0' in the production column").
'Front of form changes'
Changes made by respondents to the address and ABN are identified on receipt. They are then scanned and manually updated in the Provider Integration Management System (PIMS).
Telephone numbers, fax numbers and contact names are only captured for those surveys where special arrangements have been made to have them scanned and loaded to processing systems. The BSC must determine whether they want the contact details information to be captured, and alert the DCU to this requirement during the forms definitions stage; otherwise, the information will not be captured.
Comments are sometimes written next to data entry boxes or in empty spaces anywhere on a form. If these are too close to a recognition area (i.e. a data entry box) then they will be picked up as scanning errors and can be manually repaired or inspected. Otherwise they will be missed. Providing adequate space at the end of the form for respondents to write comments may help reduce the number of comments written elsewhere on the form.
Section 8: Checklist
This checklist is for use when designing and constructing OCR forms to ensure that key design elements have been included.
Planning, conversion, overall design
Front and back of form
Data entry areas