#### <sup>a</sup> Department of Statistics, Computer Science, Applications "G. Parenti" **On the use of auxiliary information in spatial sampling**

, Emilia Rocco <sup>a</sup>

Chiara Boccia

On the use of auxiliary information in spatial sampling

University of Florence Chiara Bocci, Emilia Rocco

## 1. Introduction

In many fields of application it's common to be interested in spatially-related phenomena and in particular to deal with attributes which are defined on continuous spatial domains. In this framework, if the design-based approach is assumed, the attribute is usually expressed as a function y(s) taking values on a suitable subset s of the plane. In the simplest case y(s) represents the value of an attribute at the location s. As an example, in forestal surveys y(s) could be the amount of biomass measured in sampled sites over a forestal area; in environmental studies y(s) could be the quantity of plastic materials collected by net tows in sampled areas over seas; etc... .

Technology development has led to a growing availability of low-cost spatial data readyto-use, frequently derived from large scale observations (i.e. data from pervasive systems like GPS sensors, or remote sensing data from earth observation technologies). Oftentimes, these data can't directly answer specific questions posed by researchers and data users, or even if they can they are subject to measurement errors or self-selection bias. In both cases it is still necessary to rely, at least partially, on ad-hoc probabilistic surveys. On the other hand, the precision and quality of surveys estimates can be improved by using the data derived from these new sources as auxiliary information in the design phase and/or in the estimation phase.

Geographical data generally show a spatial pattern and an uneven spatial distribution over the population. In fact, usually spatial observations are not mutually independent and tend to be more similar to their neighbours. As stated by Tobler's first law of geography (Tobler, 1970): "everything is related to everything else, but near things are more related that distant things". This arises because nearby units interact with one another and tend to be influenced by the same set of natural and anthropogenic factors.

In such situations, it is well known that to estimate a mean or a total of a target variable selecting the units spatially best spread allows to collect more information and consequently provides better estimation. An important problem of sampling is thus to spread at best the sampled units in space. When, in addition to the spatial allocation, the value of one or more auxiliary variables is known for all the population units over the spatial domain, exploiting this information in the sampling design could further improve survey estimates.

A well-spread sample is usually said to be spatially balanced. Different types of spatially balanced sampling designs have been suggested in literature for sampling spatial population. Many, but not all of them, allow the use of auxiliary information, in a more or less simple way, during the units' selection procedure. For example various types of multi-phase systematic designs are used in different countries to produce National Forest Inventories for their forest monitoring programs. Tille (2020, Chapter 8), Till ´ e and Wilhelm (2017), Benedetti et al. (2012) ´ and Wang et al. (2012) give a review of the main spatial sampling methods. Since we are focusing on data that come from large scale observation (i.e. remote sensing data) to produce estimates at large scale, in the following we will focus on balanced sampling designs that can be easily implemented for big datasets.

We consider several sampling strategies, based on the spatially Balanced Sampling through Local Pivotal Method (LPM) introduced by Grafstrom et al. (2012), in order to identify the ¨

Chiara Bocci, University of Florence, Italy, chiara.bocci@unifi.it, 0000-0001-8189-4445

Emilia Rocco, University of Florence, Italy, emilia.rocco@unifi.it

Referee List (DOI 10.36253/fup\_referee\_list)

FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)

Chiara Bocci, Emilia Rocco, *On the use of auxiliary information in spatial sampling*, © Author(s), CC BY 4.0, DOI 10.36253/979- 12-215-0106-3.27, in Enrico di Bella, Luigi Fabbris, Corrado Lagazio (edited by), *ASA 2022 Data-Driven Decision Making. Book of short papers*, pp. 151-156, 2023, published by Firenze University Press and Genova University Press, ISBN 979-12-215-0106-3, DOI 10.36253/979-12-215-0106-3

one which exploits geographical location and other sources of information to produce estimates for a spatially-related phenomenon in a more cost-efficient way. A strategy which could be globally applied by accounting for different areas characteristics in both the study and auxiliary variables, as well as for the differences in their relation. In all but one of the strategies under evaluation the sampling scheme consists of a different variation of the LPM, and therefore a single-phase non-informative sampling design is implemented. In addition, we propose an informative design which is based on a sequential use of the LPM and draws the final sample in two (or more) steps: (i) in the first step we collect an initial sample of observations on the target variable, which is used to investigate the relation between the auxiliary and study variables; (ii) then, this relation is exploited to target and tailor the subsequent sampling step; (iii) additional steps can be included by applying the procedure iteratively; (iv) finally, observations on the target variable collected in all the steps are used in the estimation process of the population mean.

The performance of the different strategies is investigated through Monte Carlo experiments by considering several scenarios, which differ in the distributions of the auxiliary and study variables and in their relation.

### 2. Sampling methods

Usually, in a spatial setting, the population units are plots or cells of a grid overlapping an area of interest. A value, yi, of a variable of interest is associated with each unit i(i = 1, ..., N) of the population. Moreover for each unit the spatial location si, s ∈ R<sup>2</sup> is known. Here, in addition we assume to know the value x<sup>i</sup> of an auxiliary variable for each unit of the population.

For drawing a spatial sample from such a population we decided to consider as starting point the spatially Balanced Sampling through Local Pivotal Method (LPM) introduced by Grafstrom¨ et al. (2012) since it is a flexible spatially balanced design that can draw equal and unequal probability samples in multiple dimensions. Unequal probability sampling can be more efficient than equal probability sampling if there is a positive correlation between the inclusion probabilities and the response values. Additional dimensions could include any auxiliary information in addition to the spatial coordinates.

The basic idea of LPM is to avoid that units close in distance appear together in the sample. First an inclusion probability 0 < π<sup>i</sup> ≤ 1 is assigned to each unit so that their sum over the population is equal to the fixed sample size. The sample is then obtained in at most N steps, where N is the population size. At each step one unit i is selected randomly from the available population and another unit j is chosen among the remaining units in the population by minimizing a distance function among them. This can be a univariate or a multivariate function that measures the distance with respect to one or more auxiliary variables, among which we can include the spatial coordinates. When all the variables are continuous the Euclidean distance is commonly used. Moreover, when multiple auxiliary variables are used, they are usually standardized or scaled in order to balance the contribution of each variable. After the selection of the unit i and j their inclusion probabilities are updated by using the following rule:

$$\begin{aligned} \text{if } \pi\_i + \pi\_j < 1 \text{ then } \left(\pi\_i', \pi\_j'\right) = \begin{cases} \left(0, \pi\_i + \pi\_j\right) \text{with probability } \frac{\pi\_i}{\pi\_i + \pi\_j} \\\\ \left(\pi\_i + \pi\_j, 0\right) \text{with probability } \frac{\pi\_j}{\pi\_i + \pi\_j} \end{cases} \end{aligned} \tag{1}$$

$$\begin{aligned} \text{if } \pi\_i + \pi\_j \ge 1 \text{ then } \left(\pi\_i', \pi\_j'\right) = \begin{cases} \left(1, \pi\_i + \pi\_j - 1\right) \text{with probability } \frac{1 - \pi\_j}{2 - \pi\_i - \pi\_j} \\\\ \left(\pi\_i + \pi\_j - 1, 1\right) \text{with probability } \frac{1 - \pi\_i}{2 - \pi\_i - \pi\_j} \end{cases} \end{aligned} \tag{1}$$

As a result, in each step at least one unit is removed from the population frame, either because its probability becomes zero, and consequently it is definitely excluded from the sample, or because its probability becomes one and therefore is included in the sample. The procedure continues, updating at each step the probabilities of inclusion obtained in the previous step, until all units in the population are processed. LPM selects the units with the same probability πis initially assigned to them, therefore the population mean can be estimated with the usual Horvitz-Thompson estimator.

The following specific LPM based sampling designs, which differ in how they exploit location and auxiliary information, have been investigated:


A possible alternative to the LPM method could be the double balanced sampling of Grafstrom and Till ¨ e (2012), however this sampling design is highly computationally demanding ´ when applied to big datasets, and was unfeasible in our experiments. Conversely, LPM design has been optimized for large datasets using k-d trees (Lisic and Cruze, 2016), allowing to run our Monte Carlo experiments in a reasonable amount of time.

# 3. Simulation study

We investigate the performance of the different sampling designs through Monte Carlo experiments based on several synthetic datasets. In each of them the auxiliary (X) and response (Y ) variables are drawn from a stationary bivariate spatial process [X(s), Y (s)] with s ∈ [0, 10]<sup>2</sup> (1000×1000 grid). Following Diggle and Ribeiro (2007, Chapter 3), each bivariate spatial process in turn is obtained as:

$$\begin{aligned} X(\mathbf{s}) &= f(a \ast Z\_1(\mathbf{s}) + c \ast Z\_2(\mathbf{s})) + k\_1 \\ Y(\mathbf{s}) &= g(b \ast Z\_1(\mathbf{s}) + d \ast Z\_3(\mathbf{s})) + k\_2 \end{aligned} \tag{2}$$

where:


Figure 1: Variables X(s) and Y (s) simulated under settings A1, A7 and B7.

Overall, we present our results for 24 synthetic datasets which differ in the spatial distribution of both the study and auxiliary variables, as well as in their relation. The complete list of settings used to generate the synthetic datasets is presented in Table 1. To give a better idea of the different relations between X, Y and s that can be simulated in our data, Figure 1 shows


Table 1: Root mean square error, with 500 replications and sample size = 1000

variables X(s) and Y (s) generated under settings A1, A7 and B7: in scenario A1 we observe a weak correlation between X and Y (equal to 0.298), with both variables strongly related with space; in both scenarios A7 and B7 the correlation between X and Y is stronger (more than 0.7), but they differ with respect to the spatial structure of the data since in scenario B7 part of the co-variability (about 40%) is not spatially related.

We choose to simulate scenarios with the different settings discussed above because when the analysis concerns a phenomenon measured at global scale it is common to observe different pattern between different areas of the globe and our aim is to find a strategy which could be globally applied by accounting for the various areas characteristics.

Table 1 presents for each dataset the root mean square error (rmse) of the mean estimator for the sampling designs described above, in addition to the simple random sampling (SRS) which is included as a comparison. The results for the stratified designs are omitted for lack of space since they were in line with the other strategies but they were never the best.

Results confirm that, as expected, when we analyse spatial-related phenomena spreading the sample over the area of interest is always convenient: the SpatLPM strategy is always better than both SRS and AuxLPM. Nonetheless, the use of the auxiliary information can improve the efficiency of the estimates, in particular if it is used to calculate the inclusion probabilities in the unequal designs.

It is important to note that, in order to evaluate when it is more or less convenient to use the auxiliary variable in addition to the geographical location, it is not enough to consider the correlation between X and Y : given the same level of correlation, estimates' efficiency depends on the proportion of co-variability that has spatial structure. If the co-variability is all defined by a spatial structure (that is, when Z<sup>1</sup> has C = 50), the SpatLPM design (with equal selection probabilities) is enough; on the other hand when part (or all) of the co-variability in not spatially related (that is, when Z<sup>1</sup> has C = 30 or C = 0), the additional auxiliary variable improves the estimates' efficiency, especially if used to define the unequal inclusion probabilities (UneqLPM). Moreover, UneqLPM performs better than SpatLPM even when the relation between X and Y is not linear (but still positive).

Finally, SeqUneqLPM works better than UneqLPM when the performance of the latter is worse than that of SpatLPM but it does not always manage to reach or improve the performance of the UneqLPM when this is better than the SpatLPM. These last results are very preliminary, as the experiments are still ongoing. Investigation is required on the possibility to modify the sequential procedure in order to consider more phases in which to update the inclusion probability. Moreover, experiments with more additional explanatory variables are in plan.

# References

