Boosted Regression Estimates of Spatial Data: Pointwise Inference

In this study simple nonparametric techniques have be n adopted to estimate the trend surface of the Swiss rainfall data. In particular w e employed the Nadaraya-Watson smoother and in addition, an adapted-by boosting-version of it. Add itionally, we have explored the use of the Nadaraya-Watson estimator for the construction of p ointwise confidence intervals. Overall, boosting does seem to improve the estimate as much as previo us examples and the results indicate that crossvalidation can be successfully used for parameter s el ction on real datasets. In addition, our estimat ors compare favorably with most of the techniques previ ously used on this dataset.


INTRODUCTION
Machine learning (Michie et al. [1] p. 2) is generally taken to encompass automatic computing procedures based on logical or binary operations, that learn a task from a series of examples. Attention has mostly focused on methods developed for discrimination tasks. In this case the data take the form {(x i , y i ), i=1,…,n}, where i i1 ip x (x , , x ) = … T is an attribute vector and i y {1, ,g} ∈ = … G is a class label. Given such data, the goal is to estimate a rule, say p : δ → ℝ G , which will assign a new observation x to a class in G . The rule is assessed by comparing the true class of x (which is not used in the learning of δ ) with the predicted class. Since different methods will produce different rules, the methods themselves are then judged by the quality of the rule that is output, though this is highly dependent on the type and quantity of data which is available. Boosting (Shapire [2] ; Freund [3] ) has become a popular method in machine learning. Given that the goal is to obtain rules which are as accurate as possible, the basic idea of boosting is to enhance a method by adaptation, whereby the rule is modified according to its performance on the original data. More specifically, a B-steps boosting algorithm iteratively computes B estimates by applying a given method, called a weak learner, to B different re-weighted samples. The estimates are then combined into a single one which is the final output. This ensemble rule can be viewed as a "powerful committee", which is expected to be significantly more accurate than every single estimate. In the original setting, the weak learner was a classification tree, often with only one split (and hence weak), but recently other classifiers have been boosted. Statistical learning (Vapnik [4] ) has been used to encompass three previously used methods within statistical data analysis: density estimation (often referred to as unsupervised learning and a prelude to cluster analysis) discrimination (sometimes called classification, or pattern recognition) and regression (or prediction). All three are commonly used in real-life applications and each has its own historical development. In all three domains, methods exist which make use of a kernel function (kernel density estimation, kernel classifiers and kernel regression); these are often referred to as simply "nonparametric". Making use of these kernel methods, Di Marzio and Taylor [5][6][7] have indicated how boosting derives its success: namely, by reducing the bias of the estimators, with only moderate increases in variance. Using this result, one is able to use larger smoothing parameters and improve the overall quality of the final estimate. Other insights into why boosting works are given by Bülmann and Yu [8] (who consider boosting of splines in regression), Friedman et al. [9] (who use logistic regression in classification) and Friedman [10] .
Di Marzio and Taylor [7] investigated the use of Nadaraya-Watson (N-W) kernel regression estimators as a weak learner for L 2 Boosting. Their study focused on the one-dimensional case ( i i x , y ∈ ℝ ) and the theoretical results were illustrated with simulations. In that study, no attempt was made to derive data-based methods in which the optimal choice of smoothing parameter h and number of boosting iterations B could be obtained from the data. Firstly, we focus on the 2-d ( p 2 = ) case. Extensions to higher dimensions are then straightforward, but our application is that of some spatial data. Whilst simulations are very useful to validate theoretical results, real data is often more challenging, since many of the assumptions are violated in ways which are hard to quantify. The dataset on which we focus has been previously used as a challenging spatial interpolation problem. Within this context, we are thus forced to consider data-based methods of optimal selection of (h, B) and since the data have been previously studied we are also then able to compare our results with alternative methods. Finally, we consider the problem of obtaining confidence intervals for the predictions.

EXPLORATORY DATA ANALYSIS
In 1997 the AI-GEOSTATS mailing list set up a "Spatial Interpolation Comparison". The participants were asked to estimate the daily rainfall values at 367 locations using data of 100 observed measurements (at different locations on the same day). Only after the predictions were made, was the actual data made available. Further details are given, with the results of the competition, in Dubois et al. [11] .
The training data were 100 randomly selected sites from a database of 467 sites in Switzerland. The response variable was the amount of rainfall on 8th May 1986 (measured in 1/10th mm). Fig. 1  . As a first, very naive prediction of the test data, we simply consider y and this gives a root mean squared error (RMSE) value of 111.14. We also note that a naive 95% confidence interval y 2 However, these predictions take no account whatsoever of the spatial structure and correlations within the data. A slightly less naive approach is to use a nearest-neighbour prediction, that is to predict the test value i y by the training observation j y such that k i k j arg min d(x , x ) = , with d denoting the euclidean distance between the sites x i and x k . This nearest neighbour predictor gives a RMSE of 84.17, but no confidence interval is readily available. A digital elevation model was also made available and this is also shown as an image in Fig. 1. Rainfall often depends on elevation and so the nature and strength, of this relationship was explored. We note that height above sea level, s, can be negative, whereas rainfall y 0 ≥ in general. There may be physical models available but throughout this study, we adopt the principle of letting the data speak for themselves. In Switzerland all the elevations s c 0 ≥ > where c 200 ≈ , so we can consider transformations of the form: y α with 0 α > and s β or log(s) and then a linear model. A plot of the data and the smooth fit shown in Fig. 2 suggests a quadratic model may fit to log(s) . The residual plots from the first fitted model indicate some shortcomings (Fig. 3), so we also tried a transformation of the response variable and this seemed to improve the fit somewhat. The final fitted model has three estimated parameters and is given by We conclude that fit is not that good. Moreover, points which are close in space are likely to have similar rainfalls and have similar heights. So it is unlikely that height will be of much help and it was not considered further in this analysis.

Motivation:
The most studied and used interpolation technique is kriging (see, for example, Stein [12] ). Unfortunately, standard kriging yields unbiased predictions only if restrictive assumptions -typically some kind of stationarity or isotropy -are satisfied. Thereby, a rigorous check of them is always necessary.
In fact, a model might not hold across all spatial observations, especially if large spatial datasets are used.
In the last three decades, a number of recent researchers focus on nonparametric regression techniques as a flexible alternative to kriging. A nonparametric analysis seems suitable for exploratory purposes in the selection stage of a parametric model, or if the information on the specific case study does not allow parametric assumptions at all. We can distinguish two tendencies: entirely nonparametric or mixed approaches, in which nonparametric techniques and kriging coexist. A brief outline follows.
In pioneering research, Yakowitz and Sziradowsky [13] study the robustness of kriging in the cases of perturbed data and incorrect variogram selection. As a more robust alternative to kriging, they extensively discuss a fully nonparametric regression technique. In their examples, the nonparametric estimator performs similarly to kriging when data are correlated and better in presence of a spatial trend. Other fully nonparametric methods based on splines include works by: Wahba [14] , Hutcinson and Gessler [15] and Laslett [16] . A common conclusion is that splines constitute a serious contender to kriging in several cases. Finally, Azari and Müller [17] suggest a particular nonparametric estimator that in their case study outperforms kriging.
Concerning the mixed approach, Høst [18] and Altman [19] adopt the same philosophy in enveloping techniques where the low-frequency signal (trend) is grasped by nonparametric regression techniques, whilst the high frequency signal (autocorrelation) by kriging. In a similar logic, Opsomer et al. [20] propose a complex algorithm where nonparametric techniques are used to estimate a variance function, their goal is to carry out the variogram fitting step in a standard kriging procedure.
A nonparametric method suitable for local fitting of spatial data is local polynomial regression [21] . Prominent features of local polynomial regression are: i) a polynomial mapping is selected, but note that polynomials constitute a class of response surfaces much wider than the commonly used parametric families; ii) not particularly restrictive smoothness assumptions are needed; iii) not all data are involved, but only those lying in a neighborhood of the estimation point, with an importance proportional to the their inverse distance from it; iv) the possibility to easily give specific directions the smoothing process by properly structuring the bandwidth matrix. Although nothing works better than a properly selected parametric model, the above features appear certainly promising when spatial phenomena are to be studied and parametric assumptions are hard to motivate on the basis of the available information.
Here, we focus on the N-W estimator (the zerodegree polynomial fit) because, as it will be explained later, it is ideally suited for boosting.
Kernel regression: Given three random variables, 2 X ∈ ℝ , Y ∈ ℝ and ε ∈ ℝ , assume the following where X and ε are independent. Assuming that n i.i.d.
. This is the random design model, in the fixed design model as design observations we have a set of fixed, ordered points so the sample elements are We will assume model (1). Recall that our data is given by If m '(x) exists, then we can use the N-W estimator Here, we used the multiplicative kernel For the simplest motivation, note that a N-W fit is a locally weighted average of the responses. Clearly, the shape of the kernel weights is determined by κ that in our case is the univariate Normal density and the degree of smoothing along the coordinates by the scale h. So the multiplicative kernel amounts to a bivariate gaussian density with a diagonal covariance matrix. Other than for sake of simplicity, we make this choice because in a spatial context it seems natural to use the same degree of smoothing in each coordinate, though there could be anisotropy is some applications. The univariate properties of the N-W estimator are detailed below; it is straightforward to extend to higher dimensions.
Properties: Given x f ∈supp , assume the notation Similarly, using we have the approximation and so the bias in m(x) is in which ( j) m (x) is the N-W estimate which uses all the data except the jth observation: We have plotted CV(h) , as given by Equation (5) (2) to estimate the rainfall over a grid of values. A contour plot of the predicted rainfall is shown in Fig. 4. As expected, it can be seen that there is broad agreement between the y values and the fitted values.
The fitted model, in which m(x) is estimated from the training data, can be used to obtain fitted values for the training data and to predict the test data. As expected, the RMSE is much reduced (from 69.69 to 34.43) when the training data is simply resubstituted, but the RMSE from the test data is 61.17, which is very similar to the minimized CV estimate. Fig. 5 shows the residuals from the fitted model, the locations of the larger residuals and how the residuals are related to the predicted values for the test data. The two main issues are the bandwidth selection and the choice of the number of boosting iterations B . As the end is to get a weak learner, a natural and direct way for reducing the complexity of whatever kernel method is oversmoothing. This is because large values of the bandwidth reduce the locality of the method and consequently, overfitting. Thus the smoothing parameter can be viewed as a potential component of regularization. A quite similar point of view is supported by Vapnik [4] (pp. 327-330) who upholds that in kernel density methods regularization can be achieved by modifying the window width. This is because regularization is interpreted as a method that "makes robust" problems whose solutions have big changes for small changes in data, as kernel smoothing is considered. From this perspective we can well understand that big bandwidths regularize the learning process.
We propose to boost the N-W estimator in an obvious L 2 Boosting manner. Our boosting algorithm is described by the following pseudocode: . Note that our choice of using a fixed smoothing parameter along the iterations seems appropriate. In fact, if we optimally select the smoothing parameter for every estimation task, we would encourage the overfitting tendency, since the "learning rate" of every single step is maximized. However a formal justification is presented later, in which small bias properties are proved to hold when the bandwidth is fixed over boosting iterations.

L 2 boostNW reduces the bias of the N-W estimator:
Here, we will show (for the univariate case, d 1 = ) how boosting reduces the asymptotic bias, up to boundary effects, of the N-W estimator. The result clearly extends to d 2 ≥ by considering multivariate Taylor series' expansions.
Assume conditions (a)-(e) hold, after the first boosting step we have We take the expectation of the numerator and denominator as before. We already have r(x) E ⌢ from Equation (3) and f (x) E ⌢ from equation (4). So the only thing we need is the expectation of the second term in the numerator. By ignoring the non-stochastic term (when i j = ), expanding in a Taylor series and integrating, Di Marzio and Taylor [7] eventually obtain the following expression for the asymptotic expectation up to terms of order 2 O(h ) As a consequence, we observe a reduction in the asymptotic bias from O(h 2 ) to o(h 2 ). This conclusion is consistent with that found by Di Marzio and Taylor [5,6] , where boosting kernels gives higher order bias for both density estimation and classification. However, note that the current result uses L 2 Boosting for regression, rather than the Adaboost-like algorithms used in classification and density estimation. Remarkably, note that we have reduced the bias without requiring any new smoothness assumption. Although p-order polynomials smoothers become less biased when p increases, they require that at the same time the quantity Boosting the Swiss rainfall data predictions: We need to find the optimal pair (h, B) for our data and this can be done by leave-one-out cross-validation. Fig. 6 shows the estimates of RMSE for various values of B and h. The optimal value was found for B=2 and h=1269.4, which gave a resulting CV estimate of RMSE of 69.244. This is only a very small improvement on B=1 (no boosting). Using the pair (1269.4, 2) the RMSE on the test data was 60.99 which is a again a very small improvement on B=1 (61.17). The resulting trend surface of the boosted model is also shown in Fig. 6; it is very similar to that of Fig. 4.

COMPARISONS
Trend surface analysis: A standard method for the analysis of spatial data, is to fit a trend surface and then carry out kriging for predictions [22] . To be consistent with the previous approach we use cross-validation for parameter estimation and model selection. Firstly, we obtained the proper order of the fitted trend surface. The results are shown in Table 1 which indicates a quadratic (or possibly linear) model is optimal. An assessment of the spatial structure was made by examining a correlogram (Fig. 7) of the residuals, which indicated that points close in space tended to be similar. Various models for the covariance function were fitted: gaussian, exponential and spherical. Using cross-validation based on the training data alone, it was found that the gaussian model was best (giving an estimated RMSE of 74.5 using leave-one-out CV on the training data). The gaussian model was used to predict the test data, with a resulting RMSE of 72.66 and the fitted surface is shown in Fig. 7. Note that the fitted surface is less smooth than that of the N-W estimator in Fig. 4, or of the boosted N-W estimator in Fig. 6.
Other predictions: The edited volume Dubois et al. [11] contains many results from the original competition. Table 2 contains a summary. The kriging values given there are slightly better than we obtained and our N-W estimator (and its boosted version, in particular) performs reasonably well overall.
We note that the best combination of (h, B) (chosen with reference to the test data) is h 2040.8, B 3 = = which gave RMSE=56.37 and so we might conclude that boosting N-W works reasonably well for this dataset.

Confidence intervals:
Here, we consider confidence intervals for N-W estimates. In particular, we present two naive approaches and a paired bootstrap strategy discussed by Härdle [23] . We firstly discuss the naive approaches.  [11] , with page numbers as given.
A totally naive method is to simply use Y±2×SD(Y)=(L, U) (in which the mean and SD are estimated from the training data).
Using the kernel regression model we could simply use: ˆm(x) 2 (L, U) σ ± = in whichσ is estimated by CV. These methods (1 & 2) give, respectively,   We have drawn 1000 bootstrap samples of size n 100 = (with replacement) and got 248 counts, but with a quite low meansize of 140. Of the 119 which lie outside the confidence intervals, 71 are outside the lower interval and 48 are outside the upper interval. However, although 248/367 is only 68% (rather than 95%), note that the CIs are not simultaneous and those which are not in the CIs tend to be either clustered together, near the boundary, or are in regions where the density of test points is low; Fig. 8. However, the main cause of this rather poor coverage rate lies in the bias of the estimator. In particular, note that the above naive algorithm does not explicitly take into account accuracy when deriving the coverage rate. In fact, Hall [24] points out that the naive procedure is doomed to a poor coverage rate because of the bootstrap estimator is not centered on m(x) but on ˆm(x) m(x) (m(x)) = + error . Hall [25] argues that undersmoothing is preferable to an explicit bias estimation step as a strategy for improving the coverage rate. But undersmoothing worsens the point estimate. So the conclusion seems to be that the value of h optimal for the coverage rate differs from that one optimized for point estimation. Actually, if in our case study h is reduced to 450, then the coverage increases to 75% and the mean width increases to 164. However, the RMSE becomes 77 and further reductions in h lead to numerical instability in the N-W predictor.
Recall that the boosted N-W estimator has a biasreduction property, so deriving confidence intervals for the boosted estimate may yield improved coverage rates. In a similar manner to that described above, we can resample from the data and calculate a boosted N-W estimate of the test data using the bootstrap sample (a "boosted bootstrap"). In order to follow this approach, we have drawn 1000 bootstrap samples of size 50 and estimated by using, for every sample, the CV-optimzed values of (h, B). The resulting confidence intervals have coverage 337/367=92%, with an average width of 234 (searching a grid of h [300, 2400] ∈ and B in the range (1,2,3,4). So there is good evidence that higher order bias potentialities of boosting can be conveniently employed to improve the coverage rate.

CONCLUSIONS
The Swiss rainfall data are an extensively studied spatial dataset [11] . As seen, several approaches were employed to fit these data, both traditional and parametric like kriging and more recent nonparametric techniques. We have seen that locally adaptive techniques have been successfully employed for estimating the signal content of spatial data taken as a whole. This is additional evidence that a nonparametric analysis seems suitable for exploratory purposes in the selection stage of a parametric model as well as when the information on the specific case study does not allow parametric assumptions at all.
In this study we have proposed a new approach based on kernel smoothing and an adapted version of it by boosting. Our smoother was the popular N-W regression estimator with product kernels and crossvalidated bandwidths. Quite interestingly, the N-W smoother has given very good performance, ranking among the methods with the best behavior. Moreover, its boosted version drastically improves on the construction of confidence intervals. Note that all of our results have been obtained with an automated approach using cross-validation and the selection of bandwidth and number of boosting steps has worked well. Overall, we should not draw conclusions from only one dataset, particularly given various user-inputs which can vary in a subjective manner and further simulations are necessary. However, our results confirm that confidence interval construction based on the N-W estimator is still an open and quite challenging problem.