# Raising Survey Response Rates by Using Machine Learning to Predict Gold Providers

The second research paper released as a part of the Australian Statistician's Technical Series

Released
27/04/2022

## Abstract

Maintaining high response rates for surveys is becoming more difficult for the Australian Bureau of Statistics (ABS) and other National Statistical Organisations. With limited budgets for data collection, this has led to a search for more effective strategies for following up respondents. This paper focuses on predicting survey respondents who will complete their survey without requiring any follow-up calls – hereinafter referred to as Gold Providers (GPs). Accurately predicting GPs enables follow up efforts to focus on the rest of the providers – those for whom follow up is likely to improve the likelihood they will respond. This responsive data collection protocol of allocating follow up resources is called the GP strategy.

This paper examines a live trial of this GP strategy for the 2018-19 cycle of the Rural Environment and Agricultural Commodities Survey (REACS), one of the ABS surveys struggling to achieve its target response rate. Two approaches were used to predict GPs: a rule based descriptive approach and a model based response propensity approach. The model based response propensity approach used a machine learning method called the random forests with regression trees method.

In the live trial, the machine learning approach outperformed the rule based approach by more accurately predicting GPs and non-GPs, and enabling more flexibility to set the required proportion of GPs in the full sample.

Key words: Gold Provider strategy; Intensive Follow-up; response propensity; machine learning; random forests method

## Introduction

Maintaining consistently high response rates for surveys is becoming more difficult for the ABS and other National Statistical Organisations. This fact, coupled with the increasing cost and constrained budgets for data collection, has led to these agencies searching for more effective follow-up strategies that aim to increase the response rates of those that will help to decrease non-response bias.

Where the target survey variables of interest are statistically independent of response propensity, this can be achieved through a responsive data collection protocol called the Gold Provider (GP) strategy, which strategically delays what is known in the ABS as Intensive Follow Up (IFU) efforts (i.e. telephone calls) otherwise spent on GPs and redirects them to non-GPs during the GP strategy period. Here GPs are the survey respondents who self-respond, i.e. complete their survey without requiring any follow-up calls. During the GP strategy period, the non-GPs receive normal follow up calls, while the GPs are purposely not followed up, and the follow up calls saved from them are re-allocated to the non-GPs. Once the GP strategy period has ended, the GPs that haven’t self-responded are followed up in the same way as the non-GPs. Given that the key element of the GP strategy is to delay rather than stop or cancel IFU resources towards the GPs, conceptually it won’t cause any additional non-response bias in the estimates, and therefore won’t pose any significant statistical risk to data quality. On the other hand, the calls saved from delayed GP follow up can be used to follow up non-GPs, thus helping improve response rates and reduce non-response bias. Therefore, this GP strategy ultimately aims to direct follow up efforts most efficiently to improve overall response rates with no reduction in, or even an improvement to, data quality.

Conducted annually, the Rural Environment and Agricultural Commodities Survey (REACS) is one of the ABS business surveys that have been faced with the difficulty in achieving their target response rates. For the REACS, the IFU period is around 3 months with 3 important milestones, namely the second and third Reminder Letter and the end of the IFU. We should note that the first Reminder Letter is not considered as a critical milestone due to its early occurrence. Throughout the IFU period, the key IFU strategy is to prioritise the IFU resources (i.e. calls) towards non-respondents in the Completely Enumerated (CEd) sector, followed by the sampled sector. Here the CEd sector refers to the one that contains the respondents that have a selection probability of 1 because they have a significant impact on the quality of the estimates. Nevertheless, within the CEd sector and likewise the sampled sector, there is an implicit assumption that all respondents have an identical propensity to respond and require equal resource to respond. Therefore, resource allocation intensity within the sectors is not differentiated for different respondents.

Clearly this is a less cost-effective IFU strategy than the GP strategy. To illustrate the efficacy of the GP strategy, we conducted a live trial on half of the sample for the 2018-19 cycle of REACS.

This paper provides an overview of the methodology used to predict and select the GPs from the 2018-19 sample, the setting up of the live trial, and assessment of the efficacy of the GP strategy in terms of prediction accuracy, cost saving as well as impact on data quality.

## Approaches to define GPs

The key to the success of a GP strategy is to accurately identify GPs. There are different ways to achieve that. A straightforward approach is to use a rule based descriptive (RBD) method to define and predict GPs based on their response behaviour in previous survey cycles. This has historically been used for a limited number of business surveys in the ABS.

Alternatively, since model-predicted survey response propensities have been used to develop data collection protocols (McCarthy et al., 2009; Peytchev et al., 2010; Earp et al., 2013; Buskirk et al., 2013; Phipps and Toth, 2012; Wilson et al., 2015; Plewis and Shlomo, 2017), a model based response propensity (MBRP) method could also be tailored to define and predict GPs. For the live trial, both the RBD and MBRP methods have been assessed to compare their performance in accurately predicting GPs.

To develop the rules for RBD or train the algorithm for MBRP to predict GPs for the REACS 2018-19 cycle, we used a dataset that pulled together survey information from 4 historical survey cycles ranging from 2014-15 to 2017-18. The consolidated data set comprised survey information such as the response status during each IFU stage (i.e. the second and third Reminder Letter and the end of the IFU), the final response status, the final response date, the total number of Reminder Letters sent to a respondent before response, the number of calls made to a respondent during each IFU milestone period, etc. After data consolidation and confrontation, a total of 157,000 observations from around 100,000 respondents (many of whom occurred in more than 1 cycle) were included in the data set.

It is worth noting that GP status was predicted only for the 2018-19 survey respondents that had historical information in the consolidated data set. Respondents that were newly selected into the 2018-19 cycle or did not have complete historical information were automatically classified as non-GPs.

### The RBD approach

With the previously used GP strategy in the ABS, GPs were defined as survey respondents who completed their survey without requiring any follow up in the previous survey cycle (referred to as "Definition 1").

For this live trial, a couple of alternative definitions with a specific relaxation on the rule relating to the number of calls at different milestones were also explored. These were:

• Survey respondents that responded before Reminder Letter 3 with 2 calls or less in the previous survey cycle (referred to as "Definition 2")
• Survey respondents that responded by IFU end with 2 calls or less in the previous survey cycle (referred to as "Definition 3")

A retrospective analysis was conducted on all three definitions from the perspectives of prediction accuracy on GPs, and potential savings on the number of IFU calls.

From the perspective of prediction accuracy, we assessed how accurately each definition predicted GPs. For the 2015-16 REACS cycle, the respondents predicted as GPs under each definition, using information from the 2014-15 cycle, were compared against their actual response status and the IFU effort actually attempted in 2015-16. The results showed that Definition 3 provided the most accurate prediction of GPs, with an accuracy rate over 80%, against 75% and 70% for Definitions 1 and 2 respectively. The assessment for the REACS survey cycles of 2016-17 and 2017-18 showed consistent results.

From the perspective of cost saving, potential call savings from not following up GPs that were predicted by each definition were assessed. The results showed that the number of calls that would have been saved by using Definition 3 are larger than those from using Definitions 1 and 2. More specifically, compared to Definition 1, the percentage of calls that would have been saved from Definition 3 is 6.3%, 10.5% and 8.5% more for cycle 2015-16, 2016-17 and 2017-18 respectively. And compared to Definition 2, the percentage of calls that would have been saved from Definition 3 is even bigger, namely, 7.8%, 15.4% and 13.2% more for cycle 2015-16, 2016-17 and 2017-18 respectively.

Based on results from the retrospective analysis, Definition 3 was chosen as the final RBD definition. That is, the GPs are survey respondents that responded by the end of IFU with 2 calls or less in the previous survey cycle.
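The chosen rule and the accuracy check used in the retrospective analysis can be sketched in a few lines of Python. This is a minimal illustration only: the `PriorCycle` record, its field names and the toy respondents are assumptions made for the example, not the ABS data structures.

```python
from dataclasses import dataclass


@dataclass
class PriorCycle:
    """Hypothetical summary of one respondent's previous survey cycle."""
    responded_by_ifu_end: bool  # a response was received before the IFU period closed
    calls: int                  # IFU calls made to the respondent in that cycle


def is_gp_definition_3(prior, max_calls=2):
    """Definition 3: responded by IFU end with `max_calls` calls or fewer."""
    return prior.responded_by_ifu_end and prior.calls <= max_calls


def gp_accuracy(predicted_gp_ids, actual_gp_ids):
    """Share of predicted GPs that turned out to be GPs in the following cycle."""
    predicted = set(predicted_gp_ids)
    if not predicted:
        return 0.0
    return len(predicted & set(actual_gp_ids)) / len(predicted)


# Toy history (hypothetical respondents, not REACS data).
history = {
    "u1": PriorCycle(True, 0),
    "u2": PriorCycle(True, 2),
    "u3": PriorCycle(True, 4),
    "u4": PriorCycle(False, 5),
}
predicted = [uid for uid, prior in history.items() if is_gp_definition_3(prior)]
actual_next_cycle = {"u1", "u3"}  # hypothetical outcome in the next cycle
accuracy = gp_accuracy(predicted, actual_next_cycle)
```

Relaxing `max_calls` from 0 to 2 is what distinguishes this rule from the stricter historical definition; the accuracy function is the quantity compared across definitions above.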

### The MBRP approach

#### The random forests with regression trees method

A number of methods to model response propensities have been advocated in the literature. Traditionally, response propensities are estimated by fitting a logit or probit regression model (Black et al., 2010; Peytchev et al., 2010; Chen et al., 2012; Whiting and McNaughtan, 2013; Plewis and Shlomo, 2017). With a logit or probit regression model, there are multiple validating measures to test the robustness and fitness of the model results. However, Earp et al. (2013) pointed out that logit regression requires the analyst to hypothesise variables thought to be associated with non-response and then use the observed data to fit the model parameters. Therefore, there is a risk that these “explanatory” variables may be mis-specified or under-identified, and the logistic regression models are often difficult to interpret because of interactions between the characteristic variables.

Newer machine learning (ML) techniques for classification and prediction, including classification and regression trees (CART) (Phipps and Toth, 2012; Valliant et al., 2013; Earp et al., 2013; Toth and Phipps, 2012; Buskirk and Kolenikov, 2015; Lohr et al., 2015; Wilson et al., 2015) and random forests (Breiman, 2001; Buskirk et al., 2013; Buskirk and Kolenikov, 2015), have proven to be powerful tools for predicting survey response propensities. Whilst these ML techniques also suffer from model misspecification if the explanatory variables are not correctly and comprehensively identified, they have an advantage over logistic regression models because they do not require the assumption of linearity in the modelling. More importantly, the automatic interaction detection inherent in trees provides a straightforward method to account for and easily interpret interactions between auxiliary data and paradata and the propensity to respond (Earp et al., 2013; Toth and Phipps, 2014; Buskirk and Kolenikov, 2015).

Of these ML techniques, the random forests method is a nonparametric “ensemble” tree-based method: it generates estimates by combining the results of several classification or regression trees rather than using the results of a single tree. By aggregating estimates across many trees, random forests tend to generate more stable estimates with less variance than those generated from a single tree, as they mitigate overfitting by using bootstrapped datasets and limiting the number of features considered by the algorithm at each node (Breiman, 2001).

A forest of classification trees and a forest of regression trees are the two main propensity estimation methods that have been developed in random forests. For this live trial, the random forests with regression trees method was chosen as the MBRP approach as it can generate continuous response propensities.

#### Selection of predictors

The predictors included in the random forests play a vital role in the model's fit and prediction accuracy. Therefore, effort is required to select relevant predictors to further improve the prediction accuracy.

Standard response models which only include survey variables as predictive variables have been shown by the literature (Durrant et al., 2017) to perform poorly in terms of prediction. Instead, an ABS study (Black et al., 2010) recommends using a framework that covers 6 main categories, including area characteristics, business characteristics, survey design features, respondent characteristics, interviewer characteristics, and interviewer observations. The first three categories are survey variables, while the last three categories are referred to as paradata. Using this framework, descriptive analysis was conducted to determine which survey variables and paradata items to be adopted as predictors for the live trial. These included state, industry, size, significance level and weighting contribution towards estimates (as the survey variables) and number of calls being made and reminder letters being sent (as the paradata). Due to lack of data, characteristics from the categories of interviewer characteristics and interviewer observations were not able to be included as predictors.

#### Parameter tuning and the 10-fold cross validation

Parameters used in random forests can increase the predictive power of the model. Excess use of parameters can, however, “overfit” the model and cause prediction bias. Therefore, it is very important to tune the parameters during the modelling process to achieve optimal predictive performance of the model. The most common parameters we select to tune include ntree (the number of trees to grow in the forest), mtry (the number of variables randomly sampled as candidates at each split) and nodesize (the minimum size of terminal nodes).

To determine parameter choices for this work, we performed preliminary tests by running the random forests model on the data set of 15,700 observations as the training data set. We then chose the set of parameters that produced the lowest error rate in the prediction. All calculations for this work were performed using the R package randomForest.

The tests indicated that for the regression trees method, stable error rates for the forests would be achieved using $$ntree = 300$$ with $$mtry=4$$ and $$nodesize = 20$$.

To estimate the “out of sample” prediction error rate, we used an n-fold cross validation approach to produce final response propensity scores, utilising the chosen set of parameters determined in the test model. The general idea of n-fold cross validation is to divide the data into two parts: the training data set, which is used to build the model, i.e. growing the trees; and the testing data set, which is used to validate and evaluate the model, i.e. assessing out-of-sample prediction accuracy. 10-fold cross validation is most commonly used in machine learning. With 10-fold cross validation, the original data set is randomly separated into 10 subsets of equal size, each containing 1/10 of the data. One subset is chosen as the testing set and the other 9 form the training set. The procedure is repeated 10 times with a different testing set each time, so that every observation is tested exactly once. The estimate is obtained by combining the results from each testing set.

For this work, the 157,000 observations were randomly grouped into 10 subsets of approximately 15,700 observations each. Each time we chose a different subset as the testing set and combined the remaining 9 subsets as the training set. By repeating the procedure 10 times, all of the 157,000 observations were tested once. The response propensity was estimated by combining the results from each testing set. These estimated response propensity scores were used to predict GPs, and the predictions were compared with the actual status of the respondents to determine error rates. For the purpose of the live trial, respondents with a predicted propensity of 0.85 and above were considered to be GPs.
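The fold mechanics described above can be sketched in plain Python. This is an illustrative sketch, not the R code used for the paper: the helper names are invented for the example, and a trivial "mean of the training folds" predictor stands in for the random forest so that the block stays self-contained.

```python
import random


def k_fold_indices(n_obs, k=10, seed=0):
    """Randomly partition observation indices 0..n_obs-1 into k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Taking every k-th shuffled index keeps fold sizes within one of each other.
    return [indices[i::k] for i in range(k)]


def cross_validated_predictions(n_obs, fit, predict, k=10, seed=0):
    """Return one out-of-sample prediction per observation."""
    predictions = [None] * n_obs
    for test_fold in k_fold_indices(n_obs, k, seed):
        held_out = set(test_fold)
        train = [i for i in range(n_obs) if i not in held_out]
        model = fit(train)  # built on the other k-1 folds only
        for i in test_fold:
            predictions[i] = predict(model, i)
    return predictions


# Toy stand-in for the forest: predict the mean response of the training folds.
responded = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 3  # hypothetical response flags


def fit_mean(train):
    return sum(responded[i] for i in train) / len(train)


def predict_mean(model, i):
    return model


scores = cross_validated_predictions(len(responded), fit_mean, predict_mean)

# With 157,000 observations, each of the 10 folds holds 15,700 of them.
folds = k_fold_indices(157_000)
```

Because each index appears in exactly one fold, every observation receives exactly one prediction from a model that never saw it during training.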

As mentioned in the previous section, by pulling together survey information from four cycles ranging from 2014-15 to 2017-18, a total of 157,000 observations from around 100,000 respondents were included in the training data set. Therefore, many of the respondents occurred in more than 1 cycle and would have been predicted with more than 1 response propensity. For these respondents, their predicted response propensities from different cycles were averaged to generate the final response propensities.

Similar to the RBD approach, which explored using different stages of the IFU period to decide its GP definition, the MBRP approach also predicted response propensities at different stages, i.e. the third Reminder Letter and IFU end. It was found that the response propensities produced at the end of IFU provided the most accurate prediction. The relaxation on the number of calls was also adopted by the MBRP approach as part of the definition.

Therefore, the final GPs definition for the MBRP approach is: the GPs are survey respondents whose averaged predicted propensities to respond before the end of IFU with two or fewer calls are above a certain cut-off threshold (0.85 for this live trial).
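The averaging and thresholding steps can be illustrated as follows. The tuple layout, the respondent identifiers and the scores are hypothetical, and the sketch assumes a score equal to the cut-off counts as a GP ("0.85 and above").

```python
from collections import defaultdict


def average_propensity(scores):
    """Average per-cycle propensities for each respondent.

    `scores` is an iterable of (respondent_id, cycle, propensity) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for respondent_id, _cycle, propensity in scores:
        totals[respondent_id][0] += propensity
        totals[respondent_id][1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


def classify_gps(averaged, cutoff=0.85):
    """Flag respondents whose averaged propensity reaches the cut-off (assumed inclusive)."""
    return {rid: score >= cutoff for rid, score in averaged.items()}


# Hypothetical cross-validated scores from two cycles.
cv_scores = [
    ("A", "2016-17", 0.90),
    ("A", "2017-18", 0.84),
    ("B", "2017-18", 0.80),
]
averaged = average_propensity(cv_scores)
gp_flags = classify_gps(averaged)
```

Respondent "A" is treated as a GP here because the average of its two scores clears the cut-off even though one cycle's score does not.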

## Set-up of the live trial to implement the GP strategy

### Final Selection of The GPs

Based on the final definition for each approach, the total sample of 27,159 respondents for REACS 2018-19 cycle were predicted as GPs or non-GPs. It is worth noting that there were respondents whose GP status couldn't be predicted by either approach due to lack of historical information. These respondents were automatically classified as non-GPs.

The GPs predicted from both approaches for the 2018-19 cycle are listed as follows:

• 10,722 GPs were predicted under the RBD definition
• 12,452 GPs were predicted under the MBRP definition (with the averaged response propensity score of 0.85 as the cut-off threshold); and
• 7,955 GPs were commonly predicted across both definitions.

In order to simplify workflows and have a large enough treatment group, it was decided to classify 40% of the total sample size as GPs using a combination of RBD and MBRP approaches. GPs for the trial thus comprised:

• All of the 7,955 respondents that were commonly predicted as GPs under both definitions
• A random sample of 1,454 respondents uniquely predicted as GP by the RBD definition; and
• A random sample of 1,454 respondents uniquely predicted as GP by the MBRP definition.

This collective way of recognising GPs for the live trial is called the GP_final approach. The other 60% of the total sample, including 1,718 CEd respondents and 14,155 sampled respondents were treated as non-GPs for this live trial.
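The composition of the GP_final set follows from simple set arithmetic, sketched below with synthetic identifiers standing in for the real respondents. The function name and the ID ranges are assumptions for illustration; only the counts come from the text.

```python
import random


def build_gp_final(rbd_gps, mbrp_gps, n_unique_each, seed=0):
    """Union of the commonly predicted GPs and equal random samples of the unique ones."""
    common = rbd_gps & mbrp_gps
    rbd_only = sorted(rbd_gps - mbrp_gps)
    mbrp_only = sorted(mbrp_gps - rbd_gps)
    rng = random.Random(seed)
    return (
        common
        | set(rng.sample(rbd_only, n_unique_each))
        | set(rng.sample(mbrp_only, n_unique_each))
    )


# Synthetic IDs sized to match the reported counts:
# 10,722 RBD GPs, 12,452 MBRP GPs, 7,955 in common.
rbd_gps = set(range(10_722))
mbrp_gps = set(range(7_955)) | set(range(10_722, 10_722 + 4_497))

gp_final = build_gp_final(rbd_gps, mbrp_gps, n_unique_each=1_454)
share_of_sample = len(gp_final) / 27_159
```

With these counts, 2,767 GPs are unique to RBD and 4,497 are unique to MBRP, and 7,955 + 1,454 + 1,454 = 10,863 respondents, which is about 40% of the sample of 27,159.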

### Setting up control and treatment groups

Both the predicted GPs and non-GPs were evenly split into two homogenous sub-groups – a control group and a treatment group – making sure to balance the number of total respondents, taking into consideration the similarity of specifications including the number of total new respondents, the number of total CEd respondents, the number of total GPs, and the number of CEd respondents that are GPs. It is worth noting that the even split was conducted at stratum level and the REACS sample is stratified by characteristics including size and geographic location. Therefore, the distribution by these characteristics between control and treatment group should also be similar.
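A stratum-level even split can be sketched as below. This is a simplified illustration under stated assumptions: it randomises within each stratum and alternates which group receives the odd unit, whereas the trial also balanced new, CEd and GP counts explicitly. The stratum labels and unit identifiers are hypothetical.

```python
import random
from collections import defaultdict


def split_by_stratum(units, seed=0):
    """Assign each (unit_id, stratum) pair to 'control' or 'treatment'.

    Units are shuffled within each stratum and halved; when a stratum has an
    odd number of units, the extra unit alternates between the two groups.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit_id, stratum in units:
        by_stratum[stratum].append(unit_id)

    assignment = {}
    extra_to_control = True
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        rng.shuffle(ids)
        n_control = len(ids) // 2
        if len(ids) % 2 == 1:
            if extra_to_control:
                n_control += 1
            extra_to_control = not extra_to_control
        for unit_id in ids[:n_control]:
            assignment[unit_id] = "control"
        for unit_id in ids[n_control:]:
            assignment[unit_id] = "treatment"
    return assignment


# Hypothetical strata of sizes 4, 5 and 7.
units = [(f"{s}-{i}", s) for s, n in (("A", 4), ("B", 5), ("C", 7)) for i in range(n)]
groups = split_by_stratum(units)
n_control = sum(1 for g in groups.values() if g == "control")
n_treatment = sum(1 for g in groups.values() if g == "treatment")
```

Splitting within strata is what keeps the distribution of size and location similar between the two groups.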

The detailed composition of the allocated control and treatment groups is provided in Tables 1 and 2 below.

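Table 1: Detailed composition of control group

| | Non-GPs: Continuing | Non-GPs: New | Non-GPs: Total | GPs: Continuing | GPs: New | GPs: Total |
|---|---|---|---|---|---|---|
| Sampled | 6542 | 747 | 7289 | 4655 | 1 | 4656 |
| CEd | 803 | 55 | 858 | 775 | 0 | 775 |
| Total | 7345 | 802 | 8147 | 5430 | 1 | 5431 |

Table 2: Detailed composition of treatment group

| | Non-GPs: Continuing | Non-GPs: New | Non-GPs: Total | GPs: Continuing | GPs: New | GPs: Total |
|---|---|---|---|---|---|---|
| Sampled | 6542 | 747 | 7289 | 4650 | 6 | 4656 |
| CEd | 804 | 56 | 860 | 776 | 0 | 776 |
| Total | 7346 | 803 | 8149 | 5426 | 6 | 5432 |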

### Set-up of the GP strategy

For the live trial, while the control group would have "normal" IFU follow up being undertaken towards them throughout the entire data collection period, the treatment group was set up to be implemented with the GP strategy, with the aim to increase the overall response rates using the same IFU resources. The components of this were as follows:

• The GP strategy period was set to last for about two months starting from the beginning of the IFU until the sending out of the third Reminder Letter
• During the GP strategy period, while the non-GPs will have "normal" IFU action being undertaken towards them similar to the ones in the control group, i.e. being followed up with calls, the GP units will not have any IFU action being undertaken. Additionally, the IFU resources saved from these GP units will be re-allocated to the non-GPs; and
• After the GP strategy period, all non-respondents, whether GPs or not, will be resumed with "normal" IFU action until the end of IFU period, the same as for the control group.
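The three components above amount to a simple eligibility rule for IFU calls, sketched here. The function and argument names are illustrative assumptions, not part of the ABS follow-up system.

```python
def receives_ifu_calls(group, is_gp, has_responded, in_gp_strategy_period):
    """Return True if a unit is eligible for IFU calls under the trial set-up.

    group: "control" or "treatment" (labels assumed for this sketch).
    """
    if has_responded:
        return False  # no follow up is needed once a response is in
    if group == "treatment" and is_gp and in_gp_strategy_period:
        return False  # follow up of predicted GPs is delayed, not cancelled
    return True  # normal IFU action in every other case


# A non-responding GP in the treatment group, before and after Reminder Letter 3.
during_period = receives_ifu_calls("treatment", True, False, True)
after_period = receives_ifu_calls("treatment", True, False, False)
```

The only units treated differently from the control group are predicted GPs in the treatment group during the GP strategy period.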

## Assessment on the efficacy of the GP strategy

The live trial of the GP strategy was successfully conducted under the guideline of a rigorous implementation and monitoring framework. Upon its completion, an assessment was made by comparing the performance of treatment and control groups from three main perspectives, namely the prediction accuracy, cost-effectiveness and the impact on data quality. The assessment aimed to evaluate whether the GP strategy was effective in increasing the overall response rate by delaying the IFU actions towards the GPs and re-allocating these additional IFU resources to non-GPs during the GP strategy period. Additionally, the assessment was critical in informing the decisions of whether and on what scale to adopt the GP strategy on an ongoing basis for REACS, and identifying any future improvements that could be implemented to enhance its efficacy.

### Prediction accuracy of respondents’ GP status

To evaluate the success of the GP strategy, the first and foremost aspect was to assess how accurately the GPs were predicted from a retrospective perspective.

Table 3 below presents an overview of GP status defined by different approaches, as compared to the actual GP status for the 2018-19 cycle (numbers in the final row). Here the column "Unclassified" refers to the situation where a respondent's GP status couldn't be predicted by either approach due to lack of historical information. From Table 3, we can see that actual GPs made up about 60% of the total sample, 20 percentage points higher than the 40% that we set up. The GP_MBRP only approach identified a total number of GPs closer to the actual number. Additionally, it had a much smaller number of unclassified respondents than the RBD only approach.

Table 3: Overview of GP status defined by different approaches

| GP defining approach | GP | Non-GP | Unclassified | Total |
|---|---|---|---|---|
| GP_final | 10863 | 16296 | NA | 27159 |
| GP_RBD only | 10722 | 3654 | 12783 | 27159 |
| GP_MBRP only | 12452 | 8713 | 5994 | 27159 |
| Actual 2018-19 GP status | 16285 | 10874 | NA | 27159 |

Table 4 below presents the prediction accuracy of different approaches in rate terms. From the results we can see clearly that all approaches achieved high accuracy in predicting GPs, with accuracy rates ranging from 76% to 80%. We can further observe that the GP_MBRP only approach outperformed the RBD only approach in achieving high accuracy rates for predicting both GPs and non-GPs.

Table 4: Overall GP status prediction accuracy rate by different approaches

| GP defining approach | GP | Non-GP & unclassified |
|---|---|---|
| GP_final | 79.55% | 53.10% |
| GP_RBD Only | 76.52% | 50.84% |
| GP_MBRP Only | 79.06% | 56.21% |

To further compare the prediction accuracy between the RBD and MBRP approaches, the 10,863 respondents that were predicted as GPs using the GP_final approach were broken down by the approach from which they were predicted, and their prediction accuracies were assessed against their actual GP status. This breakdown is presented in Table 5 below, from which we can observe that of the 7,955 GPs that were commonly predicted by both approaches, 6,495 were actual GPs, resulting in an accuracy rate of around 82%. Of the two sets of 1,454 GPs that were uniquely predicted by one approach, the set predicted by the MBRP approach achieved a much higher accuracy rate (78.27%) than the set predicted by the RBD approach (69.39%), which again indicates the higher prediction accuracy of the MBRP only approach.

Table 5: Breakdown of the classification accuracy rates by GP subgroups

| Predicted GP subgroup | Actual non-GPs | Actual GPs | Total | Accuracy rate |
|---|---|---|---|---|
| GP_InCommon | 1460 | 6495 | 7955 | 81.65% |
| GP_RBD only | 445 | 1009 | 1454 | 69.39% |
| GP_MBRP only | 316 | 1138 | 1454 | 78.27% |
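The accuracy rates in Table 5 are the share of predicted GPs that were actual GPs. The short sketch below recomputes them from the counts in the table and pools the three subgroups, which gives the GP_final accuracy reported in Table 4; the dictionary layout is an assumption made for the example.

```python
# (actual non-GPs, actual GPs) among the predicted GPs, from Table 5.
table5 = {
    "GP_InCommon": (1460, 6495),
    "GP_RBD only": (445, 1009),
    "GP_MBRP only": (316, 1138),
}


def accuracy_rate(actual_non_gps, actual_gps):
    """Share of predicted GPs that were actual GPs."""
    return actual_gps / (actual_non_gps + actual_gps)


rates = {name: accuracy_rate(*counts) for name, counts in table5.items()}

# Pooling the three subgroups gives the accuracy of the GP_final set.
pooled_gps = sum(gp for _non_gp, gp in table5.values())
pooled_total = sum(non_gp + gp for non_gp, gp in table5.values())
pooled_rate = pooled_gps / pooled_total

for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")
print(f"GP_final (pooled): {pooled_rate:.2%}")
```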

From the retrospective analysis conducted above, we concluded that the GP identification approach that was implemented for the live trial was a successful one.  In addition, it could potentially be further improved by adopting the MBRP only approach as it outperformed the others in predicting both the GPs and non-GPs. Moreover, the MBRP only approach was more flexible in allowing the user to specify the overall GP proportions by adjusting the cut-off threshold of the predicted response propensity scores, and thus was more adaptable to changes and improvements based on historical information. As mentioned above, the actual GP respondents were about 60% of the full sample, rather than the 40% that we set up for the live trial. Therefore, for next cycle, to reflect this actual GP status, we should lower the cut-off threshold from 0.85 to for example 0.75 to allow more GPs to be predicted by the MBRP approach.
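One way to use this flexibility is to derive the cut-off from a target GP share rather than fixing it in advance. The sketch below is an assumption about how this could be done, not the procedure used in the trial: it ranks the scored respondents and returns the score of the last one needed to reach the target share. The scores are hypothetical, and unscored respondents (who are treated as non-GPs) would have to be accounted for separately.

```python
def cutoff_for_target_share(propensities, target_share):
    """Return the score such that roughly `target_share` of the scored
    respondents have a propensity at or above it (more if scores tie)."""
    ranked = sorted(propensities, reverse=True)
    n_gps = round(target_share * len(ranked))
    n_gps = max(1, min(len(ranked), n_gps))
    return ranked[n_gps - 1]


# Hypothetical averaged propensity scores.
scores = [0.95, 0.90, 0.88, 0.80, 0.76, 0.70, 0.60, 0.50, 0.40, 0.30]
cutoff_40 = cutoff_for_target_share(scores, 0.40)
cutoff_60 = cutoff_for_target_share(scores, 0.60)
```

Raising the target share lowers the cut-off, which mirrors the suggestion above to move the threshold down from 0.85 when more GPs are expected.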

### Cost-effectiveness of the GP Strategy

Given that the IFU resources remained the same and the main objective was to increase the overall response rate, it is critical to analyse the success of the GP strategy from the cost-effectiveness perspective, more specifically, the response rate achieved (i.e. effectiveness) and the IFU resources allocated as indicated by calls during the IFU period (i.e. costs) for the GP and non-GP respondents between the control and treatment group. From Table 6 below, we can see that the GPs in the treatment group achieved a similar response rate to those in the control group with 708 fewer calls made to them. The GPs in the treatment group also required far fewer calls per response on average (0.11) than the GPs in the control group (0.24). These were strong indications of the effectiveness of the GP strategy in generating savings in IFU resources. However, the live trial also demonstrated that diverting saved resources to non-GPs was not cost effective. As seen in Table 6, there was hardly any improvement in the non-GP response rate even with an increase of over 900 calls to them.

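Table 6: Response rate versus IFU resources by GP status and trial groups

| GP status | Number of responses (Treatment) | Number of responses (Control) | Response rate (Treatment) | Response rate (Control) | Number of calls (Treatment) | Number of calls (Control) | Average calls/response (Treatment) | Average calls/response (Control) |
|---|---|---|---|---|---|---|---|---|
| GPs | 4961 | 5026 | 91.33% | 92.45% | 523 | 1231 | 0.11 | 0.24 |
| Non-GPs | 5025 | 5011 | 68.82% | 68.40% | 3906 | 3001 | 0.78 | 0.60 |
| Total | 9986 | 10037 | 78.17% | 78.39% | 4429 | 4232 | 0.44 | 0.42 |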
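The call savings and the average calls per response quoted from Table 6 can be recomputed directly from the response and call counts. The dictionary layout below is an assumption made for this illustration.

```python
# (number of responses, number of calls) from Table 6.
table6 = {
    "GPs": {"treatment": (4961, 523), "control": (5026, 1231)},
    "Non-GPs": {"treatment": (5025, 3906), "control": (5011, 3001)},
}


def calls_per_response(responses, calls):
    """Average number of IFU calls made per response received."""
    return calls / responses


avg_calls = {
    status: {group: calls_per_response(*counts) for group, counts in groups.items()}
    for status, groups in table6.items()
}

# Calls saved on GPs, and extra calls made to non-GPs, in the treatment group.
gp_calls_saved = table6["GPs"]["control"][1] - table6["GPs"]["treatment"][1]
extra_non_gp_calls = table6["Non-GPs"]["treatment"][1] - table6["Non-GPs"]["control"][1]
```

Here `gp_calls_saved` is 708 and `extra_non_gp_calls` is 905, the two quantities referred to in the paragraph above.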

Table 7 looks further at the split of non-GPs between the control and treatment groups in terms of the response status and the IFU resources allocated. We can see that for the treatment group, an additional 875 calls (228 plus 647) unexpectedly resulted in an overall smaller number of responders of 24 (38 minus 14), as compared to the control group. We can also see that the average number of calls allocated to respondents within the treatment group was bigger than that of the control group, without an overall higher response rate being achieved.

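Table 7: Response rate versus IFU resources of the non-GP units by trial groups

| Response status | Number of units (Treatment) | Number of units (Control) | Number of units (Diff) | Number of calls (Treatment) | Number of calls (Control) | Number of calls (Diff) | Average calls/unit (Treatment) | Average calls/unit (Control) |
|---|---|---|---|---|---|---|---|---|
| Non-response | 2277 | 2315 | 38 | 1009 | 781 | 228 | 0.44 | 0.34 |
| Response | 5025 | 5011 | 14 | 2735 | 2088 | 647 | 0.54 | 0.42 |
| Out of scope (a) | 847 | 821 | 26 | 162 | 132 | 30 | 0.19 | 0.16 |
| Total | 8149 | 8147 | 2 | 3906 | 3001 | 905 | 0.48 | 0.37 |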

(a) Out of scope refers to the respondents that were identified as out of scope for the 2018-19 cycle after being selected.

Results from both Table 6 and 7 suggested that, for the treatment group, the calls that were saved from the GPs and redirected to non-GPs did not result in a better response outcome. This may have been due to the ineffective re-allocation of these resources to non-GPs. Therefore, to conduct the re-allocation more efficiently, the additional resources should be targeted on specific non-GPs that are more likely to respond with more intense follow up calls, i.e. the ones with relatively higher predicted response propensities. To validate this proposed approach, the response rate achieved and the IFU resources allocated of the non-GPs between control and treatment groups were examined further by their response propensity score ranges as presented in Table 8 below.

From Table 8 we can see that, as expected, for both the treatment and control groups there was a strong correlation between response propensities and response rates. This again was a strong indication of the high predictive power of the MBRP approach in predicting GPs. Similarly, for each response propensity score range, we can also observe a positive relationship between the percentage of respondents and the percentage of calls for both the treatment and control groups. We note, however, that in the treatment group the shares of calls allocated to the respondents within the response propensity score ranges annotated (a) and (b) in Table 8 were disproportionally bigger and smaller, respectively, than those of the control group, without converting to a commensurately higher or lower response rate. This suggests that if these additional calls had been re-allocated towards the non-GPs based on their predicted response propensity score ranges from highest to lowest, the overall cost-effectiveness could have been improved, as indicated by the results of the control group. It is worth noting that the results in Table 8 should be interpreted with a focus on the RP score ranges that have a substantive number of units, i.e. the ones starting from (0.2, 0.3].

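Table 8: Response rate versus the number of calls of the non-GPs by trial groups and response propensity scores ranges

| RP score range | Unit_Total (Treat) | Unit_Total (Control) | Unit_Respond (Treat) | Unit_Respond (Control) | Call_Total (Treat) | Call_Total (Control) | Call/unit (Treat) | Call/unit (Control) | % of all units (Treat) | % of all units (Control) | % of all calls (Treat) | % of all calls (Control) | Response rate (Treat) | Response rate (Control) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [0, 0.1] | 100 | 89 | 34 | 33 | 40 | 27 | 0.40 | 0.30 | 1.23% | 1.09% | 1.02% | 0.90% | 34.00% | 37.08% |
| (0.1, 0.2] | 84 | 78 | 28 | 34 | 51 | 40 | 0.61 | 0.51 | 1.03% | 0.96% | 1.31% | 1.33% | 33.33% | 43.59% |
| (0.2, 0.3] | 178 | 174 | 62 | 51 | 86 | 67 | 0.48 | 0.39 | 2.18% | 2.14% | 2.20% | 2.23% | 34.83% | 29.31% |
| (0.3, 0.4] | 303 | 316 | 100 | 97 | 168 | 133 | 0.55 | 0.42 | 3.72% | 3.88% | 4.30% | 4.43% | 33.00% | 30.70% |
| (0.4, 0.5] | 502 | 521 | 199 | 203 | 294 | 186 | 0.59 | 0.36 | 6.16% | 6.39% | 7.53% (a) | 6.20% (a) | 39.64% | 38.96% |
| (0.5, 0.6] | 642 | 643 | 337 | 319 | 438 | 257 | 0.68 | 0.40 | 7.88% | 7.89% | 11.21% (a) | 8.56% (a) | 52.49% | 49.61% |
| (0.6, 0.7] | 653 | 683 | 382 | 369 | 470 | 324 | 0.72 | 0.47 | 8.01% | 8.38% | 12.03% (a) | 10.80% (a) | 58.50% | 54.03% |
| (0.7, 0.85] | 1003 | 962 | 713 | 675 | 654 | 535 | 0.65 | 0.56 | 12.31% | 11.81% | 16.74% (b) | 17.83% (b) | 71.09% | 70.17% |
| (0.85, 1] | 1670 | 1701 | 1342 | 1383 | 454 | 369 | 0.27 | 0.22 | 20.49% | 20.88% | 11.62% (b) | 12.30% (b) | 80.36% | 81.31% |
| Missing | 3014 | 2980 | 1828 | 1847 | 1251 | 1063 | 0.42 | 0.36 | 36.99% | 36.58% | 32.03% (b) | 35.42% (b) | 60.65% | 61.98% |
| Total | 8149 | 8147 | 5025 | 5011 | 3906 | 3001 | 0.48 | 0.37 | 100.00% | 100.00% | 100.00% | 100.00% | 61.66% | 61.51% |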
(a) Disproportionally smaller
(b) Disproportionally bigger

To test this assumption, potential new response rates for each score range of the treatment group were simulated, assuming that the treatment group's calls-per-response conversion in each range remained the same and that the distribution of calls across the non-GP score ranges was aligned to that of the control group. The simulation results presented in Table 9 suggest that, with the same total of 3,906 calls made to the non-GPs within the treatment group, the number of responders could potentially have increased from 5,025 to 5,189, raising the overall response rate by about 2 percentage points (from 61.66% to 63.68%). These simulation results are conservative, as in theory the new approach of targeted call re-allocation should be more effective than the allocation used for the control group.

Table 9: Simulated results of number of calls and response rates for the non-GPs within treatment group

| RP score range | Calls/response conversion | Calls_Total | Units_Responded | Response rate |
|---|---|---|---|---|
| [0, 0.1] | 1.176 | 35 | 30 | 29.87% |
| (0.1, 0.2] | 1.821 | 52 | 29 | 34.03% |
| (0.2, 0.3] | 1.387 | 87 | 63 | 35.32% |
| (0.3, 0.4] | 1.68 | 173 | 103 | 34.01% |
| (0.4, 0.5] | 1.477 | 242 | 164 | 32.64% |
| (0.5, 0.6] | 1.3 | 335 | 257 | 40.09% |
| (0.6, 0.7] | 1.23 | 422 | 343 | 52.49% |
| (0.7, 0.85] | 0.917 | 696 | 759 | 75.69% |
| (0.85, 1] | 0.338 | 480 | 1420 | 85.01% |
| Missing | 0.684 | 1384 | 2022 | 67.08% |
| Total | 0.777 | 3906 | 5189 | 63.68% |
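The re-allocation arithmetic behind Table 9 can be sketched in a few lines of Python. This is an illustrative reconstruction from the counts in Table 8, not the code used in the study. The function and variable names are hypothetical, and the sketch assumes that each range's share of calls is taken from the control group's unrounded call counts and that the treatment group's calls-per-response conversion stays fixed within each range.

```python
# Counts transcribed from Table 8 (non-GPs only):
# (RP score range, treatment units, treatment responders,
#  treatment calls, control calls)
TABLE_8 = [
    ("[0, 0.1]", 100, 34, 40, 27),
    ("(0.1, 0.2]", 84, 28, 51, 40),
    ("(0.2, 0.3]", 178, 62, 86, 67),
    ("(0.3, 0.4]", 303, 100, 168, 133),
    ("(0.4, 0.5]", 502, 199, 294, 186),
    ("(0.5, 0.6]", 642, 337, 438, 257),
    ("(0.6, 0.7]", 653, 382, 470, 324),
    ("(0.7, 0.85]", 1003, 713, 654, 535),
    ("(0.85, 1]", 1670, 1342, 454, 369),
    ("Missing", 3014, 1828, 1251, 1063),
]


def simulate_reallocation(rows):
    """Spread the treatment group's calls across score ranges using the
    control group's call distribution, holding the treatment group's
    calls-per-response conversion fixed within each range (assumption)."""
    total_treat_calls = sum(r[3] for r in rows)
    total_control_calls = sum(r[4] for r in rows)
    results = []
    for label, units, responders, treat_calls, control_calls in rows:
        calls_per_response = treat_calls / responders
        # Same total call budget, control group's share for this range.
        new_calls = total_treat_calls * control_calls / total_control_calls
        new_responders = new_calls / calls_per_response
        results.append(
            (label, calls_per_response, new_calls,
             new_responders, new_responders / units)
        )
    return results


results = simulate_reallocation(TABLE_8)
total_units = sum(r[1] for r in TABLE_8)
total_responders = sum(r[3] for r in results)
overall_rate = total_responders / total_units

for label, conv, calls, resp, rate in results:
    print(f"{label:12s} {conv:6.3f} {calls:7.0f} {resp:7.0f} {rate:7.2%}")
print(f"{'Total':12s} {'':6s} {sum(r[2] for r in results):7.0f} "
      f"{total_responders:7.0f} {overall_rate:7.2%}")
```

Under these assumptions the per-range figures should land close to those in Table 9, with small differences possible from rounding.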

To summarise, the assessment of cost-effectiveness suggested that the GP strategy was effective in saving IFU resources from the GPs by delaying the commencement of IFU resources towards them, without causing a reduction in response rate for these respondents. This result was a strong indication of the success of the GP strategy in diverting resources away from respondents that didn't need IFU. However, the assessment also indicated that the IFU resources that were saved from the GPs and redirected to non-GPs did not result in a better response outcome due to ineffective re-allocation of these resources. A further simulation study showed that if these saved resources had been directed to non-GPs in the same manner as in the control group, a response rate about 2 percentage points higher could have been achieved. This suggests that, if the saved resources were directed only to the non-GPs with the highest response propensity, even greater gains in response might be achieved. Further research should be conducted to develop an effective deployment of the saved resources to non-GPs with a view to reducing non-response bias.

### Assessment on data quality

As stated in the introduction, because the key part of the GP strategy is to delay rather than stop IFU resources towards the GPs, conceptually it should not cause any non-response bias in the estimates, and therefore should pose little, if any, statistical risk to the data quality of the survey. To validate this statement quantitatively, we also compared the estimates of the treatment and control groups.

Based on a comparison of the control and treatment groups' estimates by state and variable of interest, as presented in the scatter plot, no systematic bias (i.e. over- or under-estimation) was found in either direction. Therefore, we can conclude that the implementation of the GP strategy in the way the live trial was conducted did not introduce any additional non-response bias into the overall estimates.

## Conclusion

To help increase the overall response rate while maintaining the same data collection budget, the GP strategy was conducted through a live trial for the REACS 2018-19 cycle. The GP strategy aims to increase response rates by redirecting unnecessary follow-up contacts from GPs to non-GPs, i.e. from the respondents that are more likely to self-respond, to the ones that are less likely to self-respond.

This paper provides an overview of a live trial of the GP strategy focusing on the aspects of predicting and selecting the GP respondents from the 2018-19 sample, setting up the live trial with control and treatment groups, and assessing the efficacy of the GP strategy in terms of prediction accuracy, cost saving and data quality.

To assess prediction accuracy, two methods were adopted to predict GPs. For the RBD approach, GPs were defined as survey respondents that responded by the end of IFU with two or fewer calls in the previous survey cycle. For the MBRP approach, GPs were defined as survey respondents whose average model-predicted propensities to respond before the end of IFU with two or fewer calls were above a certain cut-off threshold (0.85 for this live trial). Combining both approaches, a total of 10,863 respondents (40% of the total sample size) were predicted as GPs, while the remaining 60% were classified as non-GPs. These GPs and non-GPs were evenly and randomly split into two subgroups - control and treatment groups, taking into consideration the similarity of specifications including the number of total new respondents, the number of total CEd respondents, the number of total GP respondents, and the number of CEd respondents that are GPs.
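The two GP definitions above can be written as simple predicates. The sketch below is illustrative only: the `Provider` record and its field names are hypothetical, the propensity score is assumed to have been produced elsewhere by the random forests model, and because the exact way the two approaches were combined is not restated here, the rules are kept separate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    # Hypothetical record; None marks information that is unavailable,
    # e.g. a unit that is new to the sample this cycle.
    responded_last_cycle: Optional[bool]
    calls_last_cycle: Optional[int]
    propensity: Optional[float]  # average model-predicted propensity


def is_gp_rbd(p: Provider, max_calls: int = 2) -> bool:
    """Rule based: responded in the previous cycle with two or fewer calls."""
    return (
        bool(p.responded_last_cycle)
        and p.calls_last_cycle is not None
        and p.calls_last_cycle <= max_calls
    )


def is_gp_mbrp(p: Provider, threshold: float = 0.85) -> bool:
    """Model based: predicted propensity above the cut-off threshold."""
    return p.propensity is not None and p.propensity > threshold


def gp_share(propensities: List[Optional[float]], threshold: float) -> float:
    """Share of units classified as GPs by the model at a given threshold."""
    flagged = sum(1 for s in propensities if s is not None and s > threshold)
    return flagged / len(propensities)
```

Varying `threshold` in `gp_share` shows how the model based approach can tune the proportion of GPs in the sample, which the fixed rule based definition cannot do.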

For the live trial, the control group had "normal" IFU action undertaken towards it throughout the entire data collection period, while the treatment group was subject to the GP strategy. This included the following elements:

• The GP strategy period was set to run from the beginning of the IFU until the third Reminder Letter was sent out;
• During the GP strategy period, the non-GP units had "normal" IFU action undertaken towards them, as in the control group, i.e. they were followed up with calls, while the GP units had no IFU action taken towards them. Additionally, the IFU resources saved from these GP units were re-allocated to the non-GPs; and
• After the GP strategy period, "normal" IFU action resumed for all non-respondents, whether GPs or not, until the end of the IFU period, the same as for the control group.
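The follow-up rule described by these elements can be expressed as a small decision function. This is a minimal sketch with hypothetical names, not the operational logic of the collection system. It captures only who is eligible for follow-up calls, not how the saved calls were distributed among the non-GPs.

```python
def ifu_calls_allowed(group: str, is_gp: bool,
                      has_responded: bool, in_gp_period: bool) -> bool:
    """Return True if a unit is eligible for follow-up calls right now.

    group is "control" or "treatment"; in_gp_period is True from the
    start of IFU until the third Reminder Letter is sent.
    """
    if has_responded:
        return False  # responders need no further follow-up
    if group == "treatment" and is_gp and in_gp_period:
        return False  # follow-up for predicted GPs is delayed, not cancelled
    return True       # everyone else receives "normal" IFU action


# A treatment-group GP is held back during the GP strategy period and
# picked up again afterwards if it still has not responded.
print(ifu_calls_allowed("treatment", True, False, True))
print(ifu_calls_allowed("treatment", True, False, False))
```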

Upon successful completion of the live trial, an assessment was made by comparing the treatment and control groups in terms of prediction accuracy, cost-effectiveness and the impact on data quality. The assessment aimed to evaluate whether the GP strategy was effective in increasing the overall response rate by delaying the IFU actions towards the GP respondents and re-allocating these additional IFU resources to the non-GPs during the GP strategy period. Additionally, the assessment was critical in informing the decisions on whether, and on what scale, to adopt the GP strategy on an ongoing basis for REACS, and identifying any future improvements that could be implemented to enhance its efficacy.

The assessment of prediction accuracy confirmed that both the RBD and MBRP approaches were effective in identifying GPs. It also showed that the MBRP approach outperformed the RBD approach in predicting both the GPs and non-GPs. In addition, the MBRP approach was more flexible in changing the overall GP proportion by adjusting the cut-off threshold of the predicted response propensity scores, and thus was more adaptable to changes and improvements based on historical information.

The assessment on cost-effectiveness suggested that the GP strategy was effective in saving IFU resources from the GPs by delaying the commencement of IFU resources towards them, without causing a reduction in response rate for these respondents. This result was a strong indication of the success of the GP strategy in diverting resources away from respondents that didn't need IFU. However, the assessment also indicated that the IFU resources that were saved from the GPs and redirected to non-GPs did not result in a better response outcome due to ineffective re-allocation of these resources in the live trial.

The assessment on data quality confirmed that no additional data quality risk has been posed by the implementation of the GP strategy.

Based on the findings from the assessment, it can be concluded that the MBRP based GP strategy was effective in identifying GPs and in saving IFU resources and can be adopted for REACS on an ongoing basis. However, further research would be needed to develop effective deployment strategies for these resources with a view to further improving response rates and reducing non-response bias for future REACS.

## References

Black, M., Brent, G., Bell, P., Starick, R. and Zhang, M. (2010). Empirical Models for Survey Cost, Response Rate and Bias Using Paradata, cat. no. 1352.0.55.113, ABS, Canberra.

Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5-32.

Burks, A.T. and Buskirk, T. D. (2012). Can Response Propensities Grow on Trees? Exploring Response Propensity Models Based on Random Forests Using Ancillary Data Appended to an ABS Sampling Frame. Paper presented at the 2012 Midwest Association of Public Opinion Research, Chicago, IL. http://www.mapor.org/confdocs/progarchives/mapor_2012.pdf (accessed on 20/12/2017).

Buskirk, T. D., Burks, A-T. and West, B.T. (2013). Can Survey Response Propensities Grow on Trees? Comparing the Validity of Random Forests and Logistic Regression Models Using Population Variables Appended to an ABS Sampling Frame. Poster presented at the 2013 Conference on Statistical Practice, New Orleans.

Buskirk, T. D. and Kolenikov, S. (2015). Finding Respondents in the Forest: A Comparison of Logistic Regression and Random Forest Models for Response Propensity Weighting and Stratification. Survey Insights: Methods from the Field, Weighting: Practical Issues and ‘How to’ Approach. Retrieved from http://surveyinsights.org/?p=5108 (accessed on 20/12/2017).

Chen, Q., Gelman, A., Tracy, M., Norris, F. H., & Galea, S. (2012). Weighting Adjustments for Panel Nonresponse. Available at http://www.stat.columbia.edu/~gelman/research/unpublished/weighting%20adjustments%20for%20panel%20surveys.pdf (accessed on 20/12/2017).

Durrant, G.B., Maslovskaya, O. and Smith, P. W.F. (2017). Using Prior Wave Information and Paradata: Can They Help to Predict Response Outcomes and Call Sequence Length in a Longitudinal Study? Journal of Official Statistics, 33-3, 801–833.

Earp, M., Toth, D., Phipps, P., and Oslund, C. (2013). Identifying and Comparing Characteristics of Nonrespondents throughout the Data Collection Process. Available at https://www.bls.gov/osmr/pdf/st130090.pdf (accessed on 20/12/2017).

McCarthy, J.T., Jacob, T. and Atkinson, D. (2009). Innovative Uses of Data Mining Techniques in the Production of Official Statistics. Federal Committee on Statistical Methodology Papers. https://www.nass.usda.gov/Education_and_Outreach/Reports,_Presentations_and_Conferences/reports/conferences/FCSM/data%20mining%202009%20fcsm.pdf (accessed on 20/12/2017).

Peytchev, A., Riley, S., Rosen, J., Murphy, J. and Lindblad, M. (2010). Reduction of Nonresponse Bias in Surveys through Case Prioritization. Survey Research Methods, 4-1, 21-29.

Phipps, P. and Toth, D. (2012). Analyzing Establishment Nonresponse Using an Interpretable Regression Tree Model with Linked Administrative Data. Annals of Applied Statistics, 6, 772-794.

Phipps, P. and Toth, D. (2012). Regression Tree Models for Analyzing Survey Response. Available at https://www.bls.gov/osmr/pdf/st140160.pdf (accessed on 20/12/2017).

Plewis, I. and Shlomo, N. (2017). Using Response Propensity Models to Improve the Quality of Response Data in Longitudinal Studies. Journal of Official Statistics, 33-3, 753–779.

Lohr, S., Hsu, V., and Montaquila, J. (2015). Using Classification and Regression Trees to Model Survey Nonresponse. Available at https://ww2.amstat.org/sections/srms/Proceedings/y2015/files/234054.pdf (accessed on 20/12/2017).

Valliant, R., Dever, J., and Kreuter, F. (2013). Practical Tools for Designing and Weighting Survey Samples. Springer, New York.

Whiting, J., and McNaughtan, R. (2013). Response Modelling for the 2016 Census Enumeration Model, cat. no. 1352.0.55.136, ABS, Canberra.

Wilson, T., McCarthy, J., and Dau, A. (2015). Adaptive Design in an Establishment Survey: Targeting, Applying and Measuring ‘Optimal’ Data Collection Procedures in the Agricultural Resource Management Survey. Paper presented at the 2016 International Conference on Establishment Surveys, Geneva, Switzerland. http://ww2.amstat.org/meetings/ices/2016/proceedings/047_ices15Final00159.pdf (accessed on 20/12/2017).

## Acknowledgements

I would like to express my special thanks to Dr. David Gruen AO, Australian Statistician, for providing his insights on this paper and for endorsing its publication as part of the Australian Statistician's Technical Series. My deep and sincere gratitude also goes to Dr. Siu-Ming Tam, Former Chief Methodologist, Dr. Anders Holmberg, Chief Methodologist, and Paul Schubert, Program Manager, for making time to review this paper throughout multiple rounds of edits. I also thank Professor Natalie Shlomo from the University of Manchester for providing helpful comments.

The Gold Provider project that this paper is based on was a collaborative achievement, with insightful input and hard work from my ABS colleagues in various areas, including Business Statistics Methodology, Agricultural Statistics Program, Data Collection Design Centre, National Data Acquisition Centre and Modelling, Analysis and Visualisation. I am very grateful for their support and dedication. In particular, I thank Justin Farrow and Lyndon Ang for their thoughtful guidance, Susan Fletcher and Tom Davidson for their tireless efforts, and Kirrilie Horswill and Sean Geltner for assisting with the publishing of this paper.

Summer Wang
Assistant Director
Methodology Division

## Further information on the 2018-19 REACS

Further information on the ABS' 2018-19 agricultural survey is available in the publication Agricultural Commodities, Australia.

Further information on the methodology used in the ABS' 2018-19 annual agricultural survey can be found in Agricultural Commodities, Australia methodology.

## Further information on the Australian Statistician's Technical Series

The Australian Statistician's Technical Series presents analysis and discussion of new developments in the statistical methods used by the ABS.

The series aims to inform the Australian community, stimulate discussion and invite feedback about important technical issues.

Further information can be found in the following media release.