Application of Multiple Imputations to Freight Transportation Survey Data: A Case Study of Commodity Flow Survey

Problem statement: Freight transportation data are an indispensable input to transportation planning. In Thailand, efforts have been made to collect freight movement data by conducting a road side survey and a commodity flow survey (CFS). These surveys did not produce consistent shipment volumes because of limited sampling coverage and non-response. Nevertheless, the freight distribution patterns derived from the two surveys were reasonably consistent with each other. Approach: The objective of this study was to propose an approach to improving the quality of the commodity flow survey data in terms of total shipment weight. The scope of the study was limited to consumer goods and foodstuffs. Multiple imputations were performed to correct for non-response. The shipment weight was then further adjusted by taking into account the probability of no shipment in a particular quarter. Results: Comparison between the adjusted weights and the road side survey data showed that the discrepancies in total weight were significantly reduced. Conclusion: Total shipment weights of the CFS after the adjustments were compared to those of the road side survey. A plausible result was obtained for consumer goods, while that for foodstuffs was still notably different.


INTRODUCTION
Freight transportation plays an important role in economic development. It is used to move raw materials and products from origins to destinations. Efficient freight transportation management is a key success factor for a country. Government is responsible for the development of infrastructure to promote efficiency and cost reduction in freight transportation. In order to build infrastructure in the most cost effective manner, the public sector needs to analyze freight transportation demand and supply. Freight movement data are an indispensable input to transportation planning. Data collection for freight transportation, however, is particularly costly and time consuming. As a result, given budget limitations, population parameters have to be estimated from sample data.
Various survey methods may be employed to collect freight transportation data. For example, the Commodity Flow Survey (CFS) collects data from shippers, while road side interviews may be used to gather data from drivers. Each survey technique has its own advantages and drawbacks. The CFS, for instance, offers rich, detailed data on origin-destination and mode share, which cannot be easily obtained by other techniques. It is an indispensable input to truck origin-destination matrix estimation, which is very useful for freight transportation planning. However, it needs sustained cooperation from respondents, as the survey period is designed to cover all quarters in a year. A road side survey, in contrast, requires less cooperation from interviewees because weigh-in-motion equipment is available at the survey locations. Nonetheless, a road side survey suffers from incomplete coverage of survey locations and from double counting, which can be corrected to some extent (Kuwahara and Sullivan, 1987).
In Thailand, related agencies have long relied on a freight transportation database and model that were established in the 1990s (OECD, 1997). Changes in demography, economy and transportation network have made this model obsolete. Thus, in recent years the Department of Land Transport and the National Statistical Office independently conducted freight movement surveys. The Department of Land Transport (2009) opted to conduct road side interviews in 10 selected provinces, which are major economic activity centers of Thailand. Given budget limitations, the agency expected that these survey locations would cover a majority of the freight movement volume in the country.
The National Statistical Office, on the other hand, opted to conduct a shipper survey, following the framework of the 2002 CFS in the U.S. (Bureau of Transportation Statistics, 2007). The survey's goal is to estimate the characteristics associated with goods shipments. The sampling methodology is theoretically sound, but the survey results turned out to be underestimated in comparison to other data sources (SSHC, 2008). This is not surprising, given that a considerable number of respondents reported no outbound shipment activity in all survey quarters. Zmud (2006) also mentioned a high level of unit non-response in the 2002 CFS conducted in the United States.
It may be stated that, at present, freight survey data in Thailand need to be refined because of limited sampling coverage in the case of the road side survey and non-response in the case of the CFS. The CFS data are selected as the datasets in this study because of their relatively solid statistical framework and the richness of the data. Hence, this study aims to investigate an approach to improving the Thailand CFS data by correcting for non-response. Parametric regression-based multiple imputation is chosen as it was found to be comparable to non-parametric methods. We focused on only two commodity types, namely foodstuffs and consumer products, because of the low seasonality of their shipment volume. The lack of reliable freight transportation data makes it impossible to test the validity of our approach. Nevertheless, comparison to the other data source should give some insight into the benefits of the proposed method.
The study starts with a brief summary of the Thailand 2007 CFS data. Next, comparison with the other data source provides evidence of the need to adjust the CFS data. Subsequent sections focus on multiple imputation as a tool for correcting non-response. Finally, the comparison between the adjusted results and the road side survey data is discussed.
Commodity flow survey in Thailand: SSHC (2001) conducted the Thailand Commodity Flow Survey to collect freight movement data, which are necessary for transportation infrastructure planning and development. The sampling methodology is based on that of the 2002 CFS conducted in the United States. The survey covered establishments in various industries including agriculture, mining, manufacturing, wholesale and retail trade. In order to avoid double counting, only outbound shipments were sampled from all target industries except agriculture. The reason is that it would be very costly to conduct the survey at farms. Thus, agricultural products, such as rice or sugarcane, were sampled from inbound shipments of the establishments in the agro-industrial sector. The population is stratified by geography, industry and size of establishment. Establishments in each stratum were sampled based on a three-stage sample design (i.e., establishments, reporting weeks and shipments). Shipment data collected in the survey include:
• Total number of shipments in the specified week
• Value and weight of shipments
• Type of product
• Origin-destination
• Mode of transport
Response rate and pattern: CFS data can be classified into two categories, namely response and non-response. A response is a record in which the respondent reports shipment in at least one quarter. A non-response is a record in which the respondent reports no shipment in all quarters. We opted to classify such records as non-response because it is very unlikely that an establishment has no shipment in any quarter throughout a year. Response rates vary with product types and survey periods. The response rate of foodstuffs is about 53%, while that of consumer goods is just around 18%, as illustrated in Fig. 1. At present, non-response has been treated as zero shipment, resulting in considerably low estimates of shipment weight, as will be discussed later.
Table 1: Response patterns by number of responded quarters (count of establishments and percentage* for each of the two commodity types)
Number of responded quarters   Quarter(s)   Count   %*    Count   %*
1                              1            70      48    89      44
                               2            28      19    42      21
                               3            15      11    31      15
                               4            32      22    41      20
2                              12           31      38    41      32
                               13           9       11    12      9
                               14           14      17    19      15
                               23           7       9     14      11
                               24           7       9     7       5
                               34           13      16    36      28
3                              123          17      21    52      28
                               124          24      29    54      29
                               134          11      13    34      18
                               234          30      37    48      25
4                              all          140     100   462     100
Total                                       448           982
*: Represents the percentage of total for each case of number of responded quarters
Response data were further analyzed to examine the response behavior of the respondents. The proportion of establishments having shipments decreased slightly in quarters 2 and 3, then rebounded in quarter 4, as illustrated in Fig. 2. A longitudinal analysis of response behavior over the quarters shows that respondents are more likely to report shipment in quarter 1, as can be seen by the largest share of quarter 1 in every row of Table 1. Given the low seasonality of the products, this pattern suggests that establishments tend to give non-response in the later quarters. Nonetheless, the transition of states between response and non-response should be reversible.
If response had always changed to non response irreversibly, the response rate would have dropped continuously. A rebound of response rate in quarter 4 indicated a possibility of transition from non-response to response.

Comparison of the CFS with road side survey data:
The CFS data were compared to the road side survey data in terms of total shipment weight and distance distribution. The lack of a reliable source of secondary data in Thailand makes it impossible to validate the survey results. As a result, this comparison was done to empirically examine the commonalities and differences between these surveys, which can be used to define the assumptions of this study.
Weight: The total shipment weights of foodstuffs and consumer goods from the CFS and the road side survey are shown in Table 2. Although it is not unusual to see high variability in the results of this kind of survey, it is apparent that the estimates from the CFS data are too low. Underestimation is not surprising, as non-response has been treated as zero shipment and the response rates are considerably low, as mentioned above. The comparison in Table 2 gives some idea of how much the CFS data are underestimated.
Distance distribution: The distance distributions of the shipments derived from the CFS and the road side survey data were compared in order to examine the reliability of the results. Given that road side interviews were conducted in only 10 provinces, only CFS data in the corresponding provinces were used for the comparison. Road side survey data are believed to provide unbiased estimates of the trip length distribution because vehicles were randomly selected at the survey locations. Thus, they serve as a benchmark for testing the distance distribution of the CFS data (Hirun and Sirisoponsilp, 2010). A chi-square test was performed to see whether these distributions differ. The trip length distributions derived from these surveys are not significantly different at the 95% confidence level, as illustrated in Fig. 3. This suggests that, although the weights are underestimated, the distance distribution of the CFS data is reasonable to some extent. As a result, missingness of the CFS data may be treated as random, irrespective of shipment distance.
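The chi-square comparison of two binned trip length distributions can be sketched as follows. This is only an illustration under stated assumptions: the distance bins and the counts are hypothetical, not the survey data, and the function name is ours. The statistic is the usual Pearson chi-square for a 2 x k table, compared against the tabulated 5% critical value.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for two samples binned into the same k classes.

    Returns (statistic, degrees_of_freedom). Every bin must be non-empty
    in at least one sample, otherwise the expected count is zero.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples need the same distance bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        exp_a = total_a * col / grand  # expected count if distributions agree
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat, len(counts_a) - 1


# Upper 5% critical values of the chi-square distribution by degrees of freedom.
CRITICAL_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

# Hypothetical shipment counts per distance band (not the survey data).
cfs_counts = [120, 90, 60, 30]
roadside_counts = [230, 185, 125, 60]

stat, dof = chi_square_homogeneity(cfs_counts, roadside_counts)
different = stat > CRITICAL_95[dof]
print(f"chi-square = {stat:.3f}, dof = {dof}, significantly different: {different}")
```

With these illustrative counts the statistic stays well below the critical value, which is the situation the text describes for the two surveys.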
The methodology of this study is summarized in Fig.  4. First, the CFS data are classified into 2 types namely response and non-response. Next, the response data are classified into 2 types of commodity, which are the focus of this study, namely foodstuffs and consumer goods.
Multiple imputations are performed based on parametric regression, in which macroeconomic data and establishment-specific data are used as independent variables. In the next step, no shipment pattern of response data is examined by using Markov chain process.
Probabilities of no shipment in each quarter are then used to adjust the original CFS and imputed weights. Finally, total adjusted weights are obtained and compared to those of road side survey.
Multiple imputations: Multiple Imputation (MI) is a technique for handling data sets with missing values. It involves imputing m values for each missing item and creating m (m>1) complete datasets. Across these complete data sets, the observed values are the same, but the missing values are filled in with different imputations to reflect the level of uncertainty (Little and Rubin, 2002; Landerman et al., 1997). This approach retains the advantage of single imputation, namely that a complete data set is used and standard statistical methods can therefore be applied. However, by allowing more than one value to be estimated for a missing variable, MI corrects for sampling variability and thus improves upon single imputation techniques. In addition, random error in the imputation process yields approximately unbiased estimates of all parameters, which no deterministic method can achieve. Repeated imputation also allows for good estimates of the standard errors. Rubin (1987) and Rubin and Schenker (1986) have demonstrated that, even in extreme cases where the proportion of missing information constitutes about a third of the data set, no more than 5 replicates of the model provide efficient estimates.
Assumption: Multiple imputation is performed under the following assumptions:
• Missing At Random (MAR), which requires that the cause of the missing data is unrelated to the missing values but may be related to the observed values of other variables, is assumed in this study. Most imputation methods assume MAR as it is less restrictive than Missing Completely At Random (MCAR)
• MI assumes that the variables are jointly multivariate normal
• Parametric regression-based multiple imputation is chosen
MI obviously is an approximation, as few data sets have variables that are all continuous and unbounded, much less multivariate normal. Yet researchers have found it to work as well as more complicated alternatives specially designed for categorical or mixed data sets (Schafer, 1997; Rubin and Schenker, 1991).
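The statement about a small number of imputations can be illustrated with the approximate relative efficiency usually attributed to Rubin (1987), (1 + γ/m)^(-1), where γ is the fraction of missing information and m is the number of imputations. The snippet below is a minimal sketch; the function name is illustrative and does not come from the study.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Approximate efficiency of m imputations relative to infinitely many.

    gamma is the fraction of missing information (between 0 and 1).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return 1.0 / (1.0 + gamma / m)


# About one third of the information missing, as in the case cited in the text.
for m in (1, 3, 5, 10):
    print(m, round(relative_efficiency(1 / 3, m), 4))
```

For γ of one third and m = 5 the efficiency is about 94%, so additional imputations bring only small gains.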

Regression analysis:
In order to apply regression-based imputation, we have to determine the relationship between the dependent variable and the independent variables.

Fig. 4: Methodology
Shipment weight is the dependent variable, while the independent variables are chosen from the number of employees in the establishment and socioeconomic data such as gross provincial product, population and income of the province in which the establishment is located. Regression analysis yielded the following relationship between the dependent and independent variables (Eq. 1):

$$\ln(Y) = \beta_0 + \beta_1 \ln(X_1) + \beta_2 \ln(X_2) \qquad (1)$$

Where:
$Y$ = Shipment weight ($Y_c$ for consumer goods, $Y_f$ for foodstuffs)
$X_1$ = Number of employees in the establishment
$X_2$ = Population of the province in which the establishment is located
$\beta_0, \beta_1, \beta_2$ = Regression parameters

The log-log relationship was found to produce better results than the linear one. One of its advantages is that it always produces a positive weight, while a linear relationship may result in negative values. In addition, the log-log relationship is more robust over a wide range of data than the linear form.
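How a log-log regression can generate several imputed weights for one establishment is sketched below. The coefficients, the residual standard deviation and the establishment attributes are hypothetical placeholders, not the fitted values of this study. The sketch only adds random residual noise on the log scale; a full multiple imputation procedure would also draw the regression parameters to reflect their uncertainty.

```python
import math
import random


def impute_weights(employees, population, beta, sigma, m, seed=0):
    """Draw m imputed shipment weights from ln(Y) = b0 + b1 ln(X1) + b2 ln(X2) + e.

    e is Gaussian noise with standard deviation sigma on the log scale,
    so every imputed weight is positive after exponentiation.
    """
    rng = random.Random(seed)
    b0, b1, b2 = beta
    mean_log = b0 + b1 * math.log(employees) + b2 * math.log(population)
    return [math.exp(mean_log + rng.gauss(0.0, sigma)) for _ in range(m)]


# Hypothetical inputs for illustration only.
beta = (1.5, 0.8, 0.1)   # assumed regression parameters
sigma = 0.5              # assumed residual standard deviation (log scale)
draws = impute_weights(employees=40, population=1_200_000,
                       beta=beta, sigma=sigma, m=10)
print([round(w, 1) for w in draws])
```

Each of the 10 draws would fill the missing weight in one of the 10 completed datasets.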
The multiple imputation results for consumer goods and foodstuffs are shown in Table 3. The imputation results are based on 10 complete datasets, which were used to determine the averages and standard errors. It should be noted that the probability of no shipment has not been taken into account in the multiple imputation.
Adjustment for no shipment pattern: As mentioned earlier, a considerable number of respondents reported that there was no outbound shipment from their establishments in the specified week. Respondents who are not willing to provide data on their shipments may intentionally state that there was no shipment in that period. However, some of these reports may be true, as it is possible that no shipment is made in a whole week. Since we have no reliable information to cross check the validity of the answer to this question, it is necessary to make certain assumptions about the probability of no shipment in a given period.
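Averages and standard errors across completed datasets are commonly obtained with Rubin's combining rules. The text does not state which pooling formula was used, so the following is a sketch of the standard rules under that assumption; the per-dataset estimates and variances are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates with Rubin's rules.

    estimates: point estimate from each completed dataset
    variances: squared standard error of each estimate
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = sum(estimates) / m                                # pooled estimate
    within = sum(variances) / m                               # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between                # total variance
    return q_bar, math.sqrt(total)


# Hypothetical total-weight estimates from three completed datasets.
pooled, se = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(round(pooled, 3), round(se, 3))
```

The between-imputation term is what lets the standard error reflect the uncertainty caused by the missing data.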
We believe that the possibility of an establishment having no shipment in all quarters should be very low. Hence, in this study it is assumed that such data are non-response. In order to correct the non-response, we need to examine the probability of no shipment among the responses, which are defined as establishments reporting shipments in at least one quarter. Some of these establishments may report shipment in all quarters, while others may report no shipment in some quarters. Accordingly, we extract the pattern of the probability of no shipment from these establishments. The shipment pattern of an establishment in a given quarter is assumed to follow a stationary Markov chain process (Breiman, 1986). Three states are assumed to exist: having some shipments, $S_{t,i} = 1$; having no shipment, $S_{t,i} = 0$; and non-response, $S_{t,i} = 2$. Initial probabilities of these states are assumed to equal the observed data in the first quarter. The initial probability of state 1, $p(1)$, is set equal to the proportion of establishments having shipments in quarter 1. The initial probability of state 0, $p(0)$, is set equal to the proportion of establishments having no shipment in quarter 1. Since we have no reliable information on non-response, the initial probability of non-response in the first quarter, $p(2)$, is assumed to be zero. Because of the relatively low seasonality of consumer goods and foodstuffs in Thailand (Big C Super Center, 2010), it is assumed for the sake of simplicity that the number of establishments truly having no shipment should be identical in each quarter. Nevertheless, the observed data indicate that the number of establishments reporting no shipment grows in successive quarters compared to that of quarter 1. The increase is attributed to the propensity to falsely report no shipment in order to avoid prolonged survey procedures.
Transition probabilities between the three states are assumed to be time independent and are collected in the transition probability matrix $P$ (Eq. 2), whose element $p_{i \to j}$ is the probability of moving from state $i$ in one quarter to state $j$ in the next. The probability of each state in quarter $t+1$ is obtained from the state probabilities in quarter $t$ and $P$ (Eq. 3-5).
Where:
$P$ = Transition probability matrix
$\Pr(S_{t,i} = 0)$ = Probability of establishment i having no shipment in quarter t
$\Pr(S_{t,i} = 1)$ = Probability of establishment i having some shipments in quarter t
$\Pr(S_{t,i} = 2)$ = Probability of establishment i falsely reporting no shipment in quarter t
The time independent transition probabilities between the three states can be estimated from the observed data by maximum likelihood estimation (Zamani and Ismail, 2010). The likelihood function is given in Eq. 6, and taking its logarithm yields Eq. 7. The right side of Eq. 7 can be rearranged as Eq. 8, so that the objective function may be rewritten as Eq. 10. Taking derivatives with respect to $p_{i \to j}$, we obtain Eq. 11-12, and the Lagrange multiplier, $L_i$, is given by Eq. 13. Obviously, $p_{i \to j} \propto n_{ij}$, where $n_{ij}$ is the observed number of transitions from state $i$ to state $j$. The transition probabilities obtained from the above formulation are summarized in Eq. 14.
Transition probabilities of foodstuffs, $P_f$, and consumer goods, $P_c$, exhibit a comparable pattern. The probability of staying in the no shipment state is approximately one third in both cases. As we expected, the possibility of an establishment having no shipment in all quarters is very low, at just around 1%. The probability of an establishment having shipments in two consecutive quarters is in a range of 0.60-0.76, implying that cooperative respondents are more likely to provide data consistently. On the other hand, non-cooperative respondents are also more likely to repeat non-response, as reflected by a probability of 0.38-0.42. Another interesting result is that the probability of moving into state 2, i.e., non-response, is the same whether the establishment starts from state 0 or state 1, which is intuitive.
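For a stationary Markov chain, maximizing $\sum_{i,j} n_{ij} \ln p_{i \to j}$ subject to each row of $P$ summing to one gives the familiar estimator $p_{i \to j} = n_{ij} / \sum_k n_{ik}$. The sketch below counts transitions and applies this estimator, then propagates state probabilities by one quarter. It assumes that every quarter of every establishment has already been labeled with state 0, 1 or 2; how the study separates true no shipment from false reporting is not reproduced here, and the sequences are hypothetical.

```python
from collections import Counter

STATES = (0, 1, 2)  # 0 = no shipment, 1 = some shipments, 2 = non-response


def estimate_transition_matrix(sequences):
    """Maximum likelihood transition matrix: row-normalized transition counts."""
    counts = Counter()
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[(current, nxt)] += 1
    matrix = []
    for i in STATES:
        row_total = sum(counts[(i, j)] for j in STATES)
        if row_total == 0:
            # State never observed as a starting point: leave the row at zero.
            matrix.append([0.0 for _ in STATES])
        else:
            matrix.append([counts[(i, j)] / row_total for j in STATES])
    return matrix


def next_state_probs(probs, matrix):
    """State probabilities in quarter t+1 from those in quarter t."""
    return [sum(probs[i] * matrix[i][j] for i in STATES) for j in STATES]


# Hypothetical quarterly state sequences for three establishments.
sequences = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]]
P = estimate_transition_matrix(sequences)
q2 = next_state_probs([0.0, 1.0, 0.0], P)  # all establishments ship in quarter 1
print(P)
print(q2)
```

Repeated application of `next_state_probs` gives the state probabilities for quarters 3 and 4 under the time independence assumption.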
Adjustment of shipment weight for both response and non-response data is necessary in order to take false reporting into account. Response data need an adjustment because of our assumption that a number of establishments falsely reported no shipment in quarters 2-4. The probabilities of non-response in these quarters were used to adjust shipment weights. Adjusted shipment weights of the response data are summarized in Table 4. Shipment weight in quarter 1 remains unchanged, since we assumed that the data in this quarter are free of non-response, while those in the remaining quarters increase marginally.
Shipment weight imputation was obtained by imputing weight for all establishments. Thus, the results need an adjustment to take the probabilities of no shipment into account. Adjusted shipment weight in each quarter is summarized in Table 5. After the adjustment, imputed weights were reduced by 20-30%.
Finally, total shipment weights of the two products, as shown in Table 6, were compared with the road side survey data. It should be mentioned that the CFS was conducted one year prior to the road side survey, so it is not surprising that the results differ somewhat. The total shipment weights of consumer products are almost equivalent, while those of foodstuffs differ by about 33%. This result is attributed to the fact that the response rate of consumer goods is very low. Once non-response has been taken into account, shipment weight increases considerably. One of the advantages of the multiple imputation technique is the ability to determine the standard error of the estimates. Given the high variability of this kind of survey, it may be reasonable to state that the proposed approach helps enhance the CFS data quality by taking non-response into account.
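The adjustment formula itself is not spelled out in the text. One plausible reading, stated here purely as an assumption, is that the weight imputed for every establishment is scaled by the probability that a shipment actually occurs in the quarter, i.e., one minus the probability of no shipment. The weights and probabilities below are hypothetical.

```python
def adjust_imputed_weight(imputed_weight, p_no_shipment):
    """Scale an imputed weight by the probability of having a shipment.

    Assumed form: adjusted = imputed * (1 - p_no_shipment).
    """
    if not 0.0 <= p_no_shipment <= 1.0:
        raise ValueError("p_no_shipment must lie in [0, 1]")
    return imputed_weight * (1.0 - p_no_shipment)


# Hypothetical imputed weights and no-shipment probabilities for quarters 1-4.
quarterly_imputed = [1000.0, 1000.0, 1000.0, 1000.0]
quarterly_p_no_shipment = [0.20, 0.25, 0.30, 0.25]

adjusted = [adjust_imputed_weight(w, p)
            for w, p in zip(quarterly_imputed, quarterly_p_no_shipment)]
reductions = [1.0 - a / w for a, w in zip(adjusted, quarterly_imputed)]
print([round(a, 1) for a in adjusted])
print([round(r, 2) for r in reductions])
```

Under this reading, no-shipment probabilities of 0.20-0.30 correspond directly to the 20-30% reduction reported for the imputed weights.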

CONCLUSION
A commodity flow survey was conducted in Thailand in 2007 in order to collect freight movement data of the country. Initial expansion of the survey data produced underestimated shipment weights compared to the other data source. Nevertheless, the trip length distribution of the CFS seemed to agree with that of the road side survey. Investigating the response rate and pattern led to the hypothesis that the transition between response and non-response should be reversible. Non-response data, defined as units reporting no shipment in all quarters, were the target for correction by multiple imputation. A parametric regression-based multiple imputation was employed to correct non-response under the assumption of Missing At Random (MAR). Shipment weight was found to correlate with the number of employees in the establishment and the population of the province.
Transitions between states are assumed to follow a Markov chain process, and the transition probabilities can be estimated from the observed data under certain assumptions. Three states are assumed to exist: having some shipments, having no shipment and non-response. Within the response data, non-response is defined as an establishment falsely reporting no shipment in a particular quarter. Given the low seasonality of the products in this study, the proportion of establishments having no shipment in each quarter is assumed to be constant. The transition probability matrices of foodstuffs and consumer goods turned out to be quite similar. The results also confirmed our prior expectation that an establishment having no shipment in all quarters is rare.
Multiple imputation results need to be adjusted for no shipment, based on the probabilities derived previously. The adjusted multiple imputation results were 20-30% lower than the original ones. Under the same assumption, non-response is also believed to be mixed into the response data. Consequently, by taking non-response into account, the original shipment weights in each quarter of the CFS were also adjusted. Total shipment weights of the CFS after the adjustments were compared to those of the road side survey. A plausible result was obtained for consumer goods, while that for foodstuffs was still notably different.
It should be noted that this study has not addressed the problem of under-reporting, in which respondents intentionally understate the number of shipments and/or shipment weights. It was assumed that the number of shipments and the shipment weights reported by respondents are true. It is likely that the seemingly low estimates of total shipment weight of foodstuffs may be attributed to under-reporting. Because secondary data for cross checking are unavailable, this problem would be a challenging topic for future study.