Riemann Estimation for Replicated Environmental Sampling Designs

In many environmental surveys the population under study is made up of biological units scattered over a planar region. A variable is considered on each unit and the target parameter generally turns out to be the population total of the variable. In order to estimate the population total, field scientists commonly replicate a suitable design on the study region. Replicated environmental designs basically rely on the selection of a set of sample points, in such a way that each sample point corresponds to a single design replicate. Frequently, the sample points are located uniformly and independently over the planar region, even if more effective strategies are actually available. The population total is subsequently estimated by using the mean of the estimates obtained in each design replicate. However, this pooled estimator may be improved by considering a suitable weighted mean rather than the simple mean of the estimates. Thus, we propose a Riemann estimator of the population total which is actually borrowed from the Monte Carlo integration setting. The suggested estimator displays appealing performance from both theoretical and practical perspectives.


INTRODUCTION
The aim of many quantitative environmental and agricultural studies is to estimate the total of a variable in the considered population. In order to collect the sampling information, some replicates of a suitable environmental design are carried out in the field [1]. For example, in the forestry setting, replicated line-intercept sampling is commonly adopted to estimate the canopy coverage in a delineated region [2,3], while replicated Bitterlich sampling is considered when the total basal area in a forest is the target parameter [4]. In ecological studies, replicated plot sampling is used to estimate species composition and density [4].
It is worth noting that the designs arising in environmental studies may be embedded in a unique theoretical framework, i.e. the "continuous population" paradigm [5,6]. Under this approach, the design is carried out by selecting a point on a continuum such as a portion of a straight line or a finite planar region. Actually, as pointed out by Barabesi and Pisani [7], practical environmental designs may be partitioned into two large families: the first family encompasses designs which are implemented by selecting a point on the baseline (the projection of the study region onto a line of arbitrary direction), while the second family includes designs which are carried out by selecting a point on the whole study region. In the present study we exclusively focus on the first family of designs, which in any case comprises many important practical designs such as line-intercept sampling [8], strip sampling [1], line-transect sampling under Burnham-Anderson's detection model [8] and line-transect sampling under Hayne's detection model [9], among others. Hence, once a suitable design is chosen, n replicates of the design are performed in the field, i.e. in this case n sample points are selected on the baseline. The usual population total estimate is simply obtained by averaging the n estimates obtained in the n design replicates. Therefore, in order to achieve accurate population total estimation, the focus boils down to the optimal placement of the n sample points.
Barabesi [5,10] has shown the equivalence of the strategies adopted for the placement of the sample points either in replicated designs or in Monte Carlo integration. Indeed, under the continuous population paradigm the population total may be represented as the integral of a certain function which depends on the chosen design. Thus, an optimal Monte Carlo integration strategy may be adopted in order to select the sample points in replicated designs. The modified Monte Carlo integration method introduced by Haber [11,12] is a highly suitable strategy. This Monte Carlo integration method involves partitioning the baseline into n equal segments and generating n independent random points, one in each segment. From an environmental sampling perspective the strategy is basically the so-called nonaligned systematic sampling of points suggested in the U.S. EPA QA/G-5S Guidance [13]. When this strategy is considered, Barabesi and Marcheselli [14,15] have shown that very accurate estimators, even displaying an $O(n^{-3})$ variance rate, may be achieved.
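The two placement strategies discussed here can be contrasted with a minimal sketch. It assumes the baseline is the unit interval and uses hypothetical function names (`crude_points`, `stratified_points`); it illustrates only how the sample points are generated, not any particular design.

```python
import random

def crude_points(n, rng):
    """n independent points, each uniform on the whole baseline [0, 1)."""
    return [rng.random() for _ in range(n)]

def stratified_points(n, rng):
    """One uniform point in each of the n equal segments [i/n, (i+1)/n).

    This mirrors the partition-then-sample idea described in the text
    (nonaligned systematic sampling); the function name is an assumption.
    """
    return [(i + rng.random()) / n for i in range(n)]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
crude = crude_points(5, rng)
stratified = stratified_points(5, rng)
```

With the stratified strategy every segment of the baseline receives exactly one point, whereas crude placement may leave some segments empty and others crowded.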
Unfortunately, the data are often collected in the field by means of independent sample points uniformly placed over the baseline [8]. This sampling strategy is equivalent to the crude Monte Carlo integration method, which solely produces a population total estimator with an $O(n^{-1})$ variance rate. However, it is again possible to achieve accurate estimation by adopting the Riemann Monte Carlo estimators [16]. The suggested Riemann Monte Carlo estimator of the population total is based on the weighted mean, rather than the simple mean, of the estimates obtained in the n replicates. Even if the Riemann Monte Carlo estimator is biased, it displays an $O(n^{-2})$ mean square error and hence it improves over the simple mean.

PRELIMINARIES
Let us consider a well-defined planar study region and a population of N units scattered over this region at fixed locations. Furthermore, let $(y_1, y_2, \ldots, y_N)$ be the values of the target variable on the N units, in such a way that $T_y = \sum_{l=1}^{N} y_l$ represents the population total. Moreover, let us assume that the estimation of $T_y$ is performed using the replication of a design which is implemented by selecting a sample point on the baseline. Without loss of generality and for the sake of simplicity, let us suppose that the baseline is given by the interval $(0,1)$. Moreover, let u be the position of a point selected on the baseline.
Once a suitable design is chosen, the inclusion set $P_l$ of the l-th unit is a suitable interval contained in the baseline [10] and the l-th unit is selected, and $y_l$ is measured, if $u \in P_l$. As an example, let us consider a population of plants and let $y_l$ be the biomass of the l-th plant, in such a way that the target parameter is the total biomass in the forest. If line-intercept sampling is adopted, the inclusion sets are the projections of the plant crowns onto the baseline. Indeed, a plant is selected if the corresponding crown is intercepted by a line perpendicular to the baseline at location u.
In order to obtain a suitable representation for $T_y$, it is worth noting that, if solely the l-th unit is considered, the intensity of the target variable over the baseline at location u is $y_l I_{P_l}(u)/p_l$, where $p_l$ is the length of $P_l$ and $I_A$ is the usual indicator of a set A. Hence, the intensity of the variable at location u is given by
$$y(u) = \sum_{l=1}^{N} \frac{y_l}{p_l} I_{P_l}(u). \qquad (1)$$
Incidentally, it is at once apparent that $y(u)$ is simply the Horvitz-Thompson estimate of $T_y$ when location u is selected. The total intensity of the variable of interest over the study region turns out to be
$$\int_0^1 y(u)\,du = \sum_{l=1}^{N} \frac{y_l}{p_l} \int_0^1 I_{P_l}(u)\,du = T_y,$$
i.e. an integral representation is achieved and hence the estimation of $T_y$ reduces to an integration problem.
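As an illustration of the integral representation, the following minimal sketch evaluates the intensity at a location u by adding, for every unit whose inclusion interval contains u, its value divided by the length of that interval. The two-unit population and the name `intensity` are hypothetical and chosen only so that the arithmetic can be checked by hand.

```python
def intensity(u, units):
    """Horvitz-Thompson estimate at baseline location u.

    `units` is a list of (y_l, (a_l, b_l)) pairs, where (a_l, b_l) is the
    inclusion interval of unit l on the baseline (0, 1). This data layout
    is an assumption of the sketch.
    """
    total = 0.0
    for y_l, (a, b) in units:
        if a < u < b:                 # unit l is selected at location u
            total += y_l / (b - a)    # value divided by interval length
    return total

# Hypothetical population: y = (2, 3), so the population total is 5.
units = [(2.0, (0.0, 0.5)), (3.0, (0.25, 0.75))]

# Midpoint-rule check that the intensity integrates to the population total.
m = 1000
approx_total = sum(intensity((k + 0.5) / m, units) for k in range(m)) / m
```

In this toy case the intensity equals 4 on (0, 0.25), 10 on (0.25, 0.5), 6 on (0.5, 0.75) and 0 elsewhere, so its integral over the baseline is 1 + 2.5 + 1.5 = 5, the population total.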
Hence, the strategies adopted for the Monte Carlo quadrature of an integral can be used in order to choose n sample points $(u_1, u_2, \ldots, u_n)$ on the baseline. When the n sample points are independently and randomly chosen, the crude Monte Carlo integration strategy is actually adopted. In this case, $(u_1, u_2, \ldots, u_n)$ are the realization of n independent random variables $(U_1, U_2, \ldots, U_n)$ uniformly distributed over the baseline. Hence, the Monte Carlo estimator is given by
$$\frac{1}{n} \sum_{i=1}^{n} y(U_i). \qquad (2)$$
It is at once apparent that the pooled estimator (2) is the mean of the n Horvitz-Thompson estimators corresponding to the n design replicates. Actually, this is the usual estimation procedure adopted in replicated environmental sampling designs. Indeed, once the sample points are positioned and the data are collected, n Horvitz-Thompson estimates are obtained, one for each design replicate, and they are subsequently averaged in order to achieve an overall estimate for $T_y$. However, the crude Monte Carlo strategy precludes the small variance rates for the pooled estimator which can be achieved when more refined Monte Carlo strategies are adopted [14,15]. Accordingly, the aim of the following section is to introduce an estimator with improved performance, even if the sample points are collected using the crude Monte Carlo strategy.
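The pooled estimator can be sketched in a few lines: the Horvitz-Thompson estimate is evaluated at each sample point and the results are averaged. The single-unit intensity `y` and the function name `pooled_estimate` are hypothetical and serve only as an example.

```python
import random

def pooled_estimate(y, points):
    """Simple mean of the Horvitz-Thompson estimates y(u_i)."""
    return sum(y(u) for u in points) / len(points)

def y(u):
    # Hypothetical population with one unit: value 3, inclusion interval
    # (0.2, 0.7) of length 0.5, so the population total is 3.
    return 3.0 / 0.5 if 0.2 < u < 0.7 else 0.0

rng = random.Random(1)                        # fixed seed for reproducibility
points = [rng.random() for _ in range(1000)]  # crude Monte Carlo placement
estimate = pooled_estimate(y, points)
```

For the four points 0.1, 0.3, 0.5 and 0.9 the individual estimates are 0, 6, 6 and 0, whose mean is 3.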

THE RIEMANN MONTE CARLO ESTIMATOR
In order to improve on estimator (2), it is worthwhile to consider a weighted estimator of type
$$\sum_{i=1}^{n} w_i\, y(U_i), \qquad (3)$$
where $(w_1, w_2, \ldots, w_n)$ are positive weights such that $\sum_{i=1}^{n} w_i = 1$. If the weights are constants, it is straightforward to prove that estimator (3) is unbiased and the corresponding variance is minimum when $w_i = 1/n$. In this case, estimator (3) actually reduces to estimator (2). Accordingly, the weights must be chosen as random functions in order to improve on estimator (2). First, it is at once apparent that estimator (3) may be expressed as […] (3) reduces to the usual Riemann Monte Carlo estimator [16] given by […] (4). Robert and Casella [16] have proven that (4) is biased with an $O(n^{-2})$ mean square error when y has a bounded derivative. However, in the present setting y does not satisfy this regularity condition. Indeed, it is at once apparent that the function y defined in (1) […]
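To illustrate the idea of random weights, the sketch below uses one common form of a Riemann-sum estimator, in which each sorted sample point is weighted by the spacing to the next one. The boundary convention (the largest point is paired with the upper end of the baseline) and the name `riemann_estimate` are assumptions of this sketch and may differ from the exact form of estimator (4).

```python
def riemann_estimate(y, points, upper=1.0):
    """Riemann-sum estimate built from the ordered sample points.

    Assumed form: sum over i of (u_(i+1) - u_(i)) * y(u_(i)), where
    u_(1) <= ... <= u_(n) are the sorted points and u_(n+1) is set to
    `upper`. This convention is an illustrative choice, not necessarily
    the one adopted in the paper.
    """
    ordered = sorted(points)
    following = ordered[1:] + [upper]   # next order statistic for each point
    return sum((b - a) * y(a) for a, b in zip(ordered, following))

# Hypothetical constant intensity with total 4 on the baseline (0, 1).
value = riemann_estimate(lambda u: 4.0, [0.5, 0.25, 0.75])
```

Under this convention the weights add up to one minus the smallest sample point, so in the toy call the three spacings of 0.25 give 0.75 × 4 = 3 rather than 4.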

A SIMULATION STUDY
In order to assess the small-sample properties of estimator (4) with respect to estimator (2), a simulated experiment dealing with line-intercept sampling has been considered. In this setting, it is worth noting that an interesting use of the line-intercept design is described by Thompson [8]: if the study region is covered with snow and the total of a certain animal species (such as wolverines or arctic wolves) is the target parameter, the selected transects are flown under appropriate weather conditions with observers in the aircraft looking for animal tracks in the snow. Once a track is encountered, it is followed in each direction and mapped. Hence, the animal total is estimated on the basis of the estimated track total [17].
The previous survey setting was mimicked by simulating three populations of twenty tracks on the unit square. Thus, the target parameter was given by the population total, i.e. $T_y = N = 20$. The three populations were set up in such a way that the first population consisted of lines randomly located, the second population consisted of lines positioned with a slight trend, while the third population consisted of lines positioned with a marked trend. The three simulated populations are displayed in Fig. 1. These populations of lines may be considered quite representative of real situations [8].
[…] (2) and (4) were computed. On the basis of the B estimates, the simulated bias, the simulated mean square error (MSE) and the simulated relative efficiency (RE), i.e. the ratio of the simulated mean square errors, were computed for the estimators (2) and (4). The corresponding results are reported in Table 1. From Table 1, it is at once apparent that estimator (4) always outperforms estimator (2), except for the second population and $N = 20$. The performance of the Riemann estimator obviously increases as n increases. The best performance is achieved for the third population of lines, i.e. for the most irregular y function. Further simulations (not reported here) seem to confirm that this behavior generally occurs, even if the superiority of (4) over (2) is not marked for very small n values (say n less than 10). Finally, it should be emphasized that the bias of (4) is always negative in the simulation, a result which is consistent with the findings in the Remark of the Appendix.
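The summary measures used in the simulation can be sketched as follows. The function names are hypothetical, the input numbers are toy values rather than results from the study, and the direction of the efficiency ratio (mean square error of the simple mean divided by that of the competing estimator, so that values above 1 favor the latter) is an assumption of the sketch.

```python
def summarize(estimates, true_total):
    """Simulated bias and mean square error of a list of estimates."""
    count = len(estimates)
    bias = sum(estimates) / count - true_total
    mse = sum((e - true_total) ** 2 for e in estimates) / count
    return bias, mse

def relative_efficiency(mse_pooled, mse_riemann):
    """Ratio of simulated MSEs; the direction is an assumed convention."""
    return mse_pooled / mse_riemann

# Toy inputs only (not results from the paper), with true total 20.
bias_a, mse_a = summarize([18.0, 22.0], 20.0)
bias_b, mse_b = summarize([19.0, 20.0], 20.0)
re = relative_efficiency(mse_a, mse_b)
```

In the toy example the first list is unbiased with a mean square error of 4, the second has a bias of -0.5 and a mean square error of 0.5, and the resulting ratio is 8.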

APPENDIX
For each $x \in (0,1)$, let us assume that […]. By using expression (7) in estimator (4), it follows that […], the first part of (9) follows. In addition, since the joint probability density function of […]. Thus, since […], the second part of (9) follows. Moreover, on the basis of (8) and (9), it turns out that […] converges almost surely to 0 on the basis of the Dvoretzky-Kiefer-Wolfowitz inequality [18].