Chapter 1 Introduction

A survey involves gathering data from an object of study to support statistical analysis. In this technical manual, our focus is on soil properties, which are the study variables measured over the target population. Examples of these properties are soil organic carbon, pH, and soil particle-size fractions such as clay, silt, and sand. The primary objective of soil sampling is either to create maps of these variables or to assess the accuracy of existing maps of them.

This technical manual does not aim to provide an exhaustive compilation of sampling design methodologies. For that purpose, we recommend the book by Brus (2022). Instead, our focus is on demonstrating the practical application of selected soil sampling techniques that can be used for soil mapping or for assessing the accuracy of existing soil maps. We emphasise the context of INSII member countries, where there is a pressing need for assistance in enhancing or establishing national soil databases and maps. In many of these cases, resource limitations, accessibility challenges, time constraints, and capacity issues pose significant hurdles. Our aim is to offer pragmatic solutions within these constraints to facilitate progress in digital soil mapping (Minasny and McBratney, 2006) efforts.

1.1 Basic concepts of soil sampling

Soil sampling involves observing a subset of the larger soil population. The challenge lies in accurately inferring the characteristics of the entire population based on this limited sample. Statistical methods enable us to quantify the uncertainty inherent in these inferences. For instance, if the estimated mean nutrient level in a soil sample is marginally below a crop’s requirement, there still exists a significant probability that the actual mean of the entire population is above this threshold. This distinction between calculating descriptive statistics (direct observations from the sample) and making inferences about the broader population is fundamental in soil sampling.
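
As a minimal illustration in R (with hypothetical nutrient measurements), the sample mean is a descriptive statistic, whereas a confidence interval is an inference about the population mean:

```r
# hypothetical available-phosphorus measurements (mg/kg) at 12 sampling locations
p_sample <- c(11.2, 9.8, 10.5, 12.1, 9.4, 10.9, 11.7, 10.2, 9.9, 11.4, 10.7, 10.1)

mean(p_sample)              # descriptive statistic: the sample mean (about 10.7 mg/kg)
t.test(p_sample)$conf.int   # inference: 95% confidence interval for the population mean

# even though the sample mean falls below a hypothetical crop requirement of 11 mg/kg,
# the upper confidence limit exceeds it, so the population mean may well be above the threshold
```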

According to Brus (2022), sampling methodologies can broadly be categorised into two approaches: design-based and model-based. This choice is closely tied to the choice between probability and non-probability sampling and determines the method of statistical inference.

In the design-based approach, samples are selected through probability sampling, and estimates are derived from the inclusion probabilities of the sampling units, as determined by the sampling design. No statistical model is needed to estimate population parameters. Conversely, the model-based approach relies on a statistical model for prediction. As the model itself incorporates randomness, probability sampling is not a prerequisite.
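
As a minimal sketch of design-based estimation (using simulated data and simple random sampling without replacement), the estimator below uses only the inclusion probabilities implied by the design; with equal inclusion probabilities n/N, the estimated mean reduces to the sample mean and its standard error follows from the design alone:

```r
set.seed(42)
N <- 1000                                  # hypothetical finite population of soil units
z <- rgamma(N, shape = 2, scale = 15)      # simulated study variable, e.g. SOC in g/kg

n     <- 50
units <- sample(N, n)                      # simple random sampling without replacement
pi_k  <- n / N                             # equal inclusion probabilities

mean_hat <- sum(z[units] / pi_k) / N       # pi estimator of the mean (equals mean(z[units]) here)
se_hat   <- sqrt((1 - n / N) * var(z[units]) / n)   # design-based standard error with fpc
c(mean_hat, se_hat)
```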

The choice of the most suitable approach depends on the survey's objectives. The aims of soil sampling can be categorised into estimating parameters for the entire population or for subpopulations (including evaluating the accuracy of a map), or mapping the study variable.

Mapping the study variable typically aligns with a model-based approach, where the focus is on predicting the variable across a fine grid that represents the study area. Both design-based and model-based approaches can be used for estimating parameters of the population or subpopulations. However, as the number of subpopulations increases, the model-based approach often becomes more appealing due to its potential for greater accuracy, depending on the quality of the model. Design-based estimates offer the advantage of an objective uncertainty assessment and confidence intervals with close to nominal coverage.

Interestingly, probability samples can be used in model-based inference as well, providing flexibility for dual aims like mapping and parameter estimation. If probability sampling is not used, design-based estimation becomes infeasible, leaving model-based prediction as the sole option.

In conclusion, understanding these fundamental concepts is essential for effective and accurate soil sampling, allowing for reliable conclusions about soil properties and their implications.

1.2 Soil sampling for mapping

In soil sampling, particularly for mapping purposes, there are specific scenarios where probability sampling may not be necessary. When utilising a statistical model containing an error term modelled by a probability distribution, the need for selecting sampling units through probability sampling diminishes. This is because statistical models facilitate making quantified statistical statements about the population without strictly needing probability-based selection of units. This approach allows for the optimization of sampling units to create the most accurate maps, such as those with the smallest squared prediction error averaged over all locations in the study area.

Example of a Statistical Model for Mapping

Consider a simple linear regression model:

\[ Z_k = \beta_0 + \beta_1 x_k + \epsilon_k, \]

where \(Z_k\) is the study variable for unit \(k\), \(\beta_0\) and \(\beta_1\) are regression coefficients, \(x_k\) is a covariate for unit \(k\) as a predictor, and \(\epsilon_k\) is the error at unit \(k\), assumed to be normally distributed with mean zero and constant variance \(\sigma^2\). The independence of errors is crucial, as it implies that \(\text{Cov}(\epsilon_k,\epsilon_j)=0\) for \(k \neq j\).

A comparison of a simple random sample without replacement and an optimized sample for mapping using this model demonstrates different sampling patterns and implications for the accuracy of map predictions. Optimal sampling often involves selecting units with extreme values of the covariate \(x\), leading to strong spatial clustering. This clustering is permissible under the assumption of independent residuals in simple linear regression models.
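
The contrast can be illustrated with a short simulation in R. This is only a sketch under the assumption that the simple linear regression model above holds over a hypothetical grid: a sample concentrated at extreme covariate values minimises the variance of the estimated coefficients and hence the average prediction variance, whereas a simple random sample spreads units over the whole range of the covariate.

```r
set.seed(1)
N <- 10000                                  # hypothetical fine grid of prediction locations
x <- runif(N, 0, 100)                       # covariate value at every grid node
z <- 20 + 0.5 * x + rnorm(N, sd = 5)        # study variable simulated from the model above

n        <- 20
srs      <- sample(N, n)                                    # simple random sample
extremes <- c(order(x)[1:(n / 2)],                          # optimised sample: units with the
              order(x, decreasing = TRUE)[1:(n / 2)])       # smallest and largest x values

avg_pred_var <- function(units) {
  fit <- lm(z[units] ~ x[units])            # fit the simple linear regression to the sample
  X   <- cbind(1, x)                        # design matrix of all grid nodes
  V   <- vcov(fit)                          # covariance matrix of the estimated coefficients
  mean(rowSums((X %*% V) * X))              # prediction variance of the regression line, averaged over the grid
}

avg_pred_var(srs)                           # larger average prediction variance
avg_pred_var(extremes)                      # smaller: extreme x values minimise the variance of the slope
```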

In situations where the goal is both to map the study variable and to estimate means or totals for the entire study area or subareas, combining probability sampling with model-based approaches can be advantageous. For instance, design-based and model-assisted estimation can offer more valid and objectively assessed uncertainties in mapping soil carbon stocks while also estimating total carbon stocks. These approaches do not rely on the assumptions inherent in spatial variation models, thus avoiding potential debates over the realism of these assumptions.

When selecting a suitable probability sampling design for these dual objectives, consider the following (a minimal sketch follows the list):

  1. Sampling designs with equal inclusion probabilities.
  2. Stratified random sampling using subareas as strata, particularly when aiming to estimate means or totals for these subareas.
  3. Geographical spreading of the sampling units, which also benefits model-based mapping methods such as kriging.
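
The sketch below illustrates points 1 and 2 in R (the grid, the subarea labels and the sample size are hypothetical): stratified simple random sampling with subareas as strata and sample sizes proportional to stratum size, which yields equal inclusion probabilities. Geographical spreading (point 3) is not addressed here.

```r
set.seed(123)
# hypothetical fine grid: one row per cell, with coordinates and a subarea (stratum) label
grid <- expand.grid(x = 1:100, y = 1:100)
grid$stratum <- ifelse(grid$x <= 60, "subarea_A", "subarea_B")

n_total <- 40
# proportional allocation of the sample size -> equal inclusion probabilities in all strata
n_h <- round(n_total * table(grid$stratum) / nrow(grid))

# simple random sampling without replacement within each stratum
sampled_rows <- unlist(lapply(names(n_h), function(h) {
  sample(which(grid$stratum == h), n_h[h])
}))
stratified_sample <- grid[sampled_rows, ]
table(stratified_sample$stratum)   # per-stratum sample sizes (24 and 16)
```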

1.3 Sampling strategies in this Technical Manual

This technical manual primarily focuses on two advanced soil sampling methodologies: Conditioned Latin Hypercube Sampling (cLHS) and Stratified Simple Random Sampling.

cLHS is a non-probability sampling method, suited to the model-based approach and adapted for observational studies, that is especially effective in representing multivariate data. It stands out for its ability to handle multiple covariates, ensuring that the sample reproduces the marginal distributions and the correlation structure of these covariates. The method defines intervals for each covariate and aims to balance the representation of raster cells within these intervals. This balance is achieved through a minimisation criterion that considers deviations in the sample sizes per interval, in the proportions of categorical covariates, and in the correlations of the sample compared with those of the overall population. cLHS is particularly recommended for environmental and soil studies where complex, interdependent variables must be sampled efficiently and where non-linear mapping methods are used.
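
A minimal sketch with the clhs R package is given below; the covariate table and its column names are hypothetical, and in practice the covariates would be extracted from raster layers of the study area.

```r
# install.packages("clhs")   # implementation of conditioned Latin hypercube sampling
library(clhs)

set.seed(2024)
# hypothetical table of raster cells (one row per cell) with continuous covariates
covariates <- data.frame(
  elevation = runif(5000, 100, 900),
  slope     = rexp(5000, rate = 0.2),
  ndvi      = rbeta(5000, 2, 2)
)

# select 50 cells so that the sample reproduces the covariate distributions of the area
idx <- clhs(covariates, size = 50, iter = 10000, progress = FALSE, simple = TRUE)
clhs_sample <- covariates[idx, ]
summary(clhs_sample)    # compare with summary(covariates)
```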

On the other hand, Stratified Simple Random Sampling is a design-based probability sampling technique. It enhances sampling precision by dividing a heterogeneous population into more homogeneous subgroups, or strata, such as soil types or land use categories, within each of which simple random sampling is conducted. Because the variation within strata is smaller than in the population as a whole, the variance of the resulting estimates is reduced and statistical efficiency increases. It is broadly applicable in many fields, notably in ecological and environmental studies, where it is used to sample distinct subareas with unique characteristics.
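
For the estimation side, a minimal sketch of the stratified estimator of a population mean and its standard error is shown below; the strata, their sizes and the sample statistics are hypothetical.

```r
# hypothetical strata (e.g. land-use classes) with stratum sizes and per-stratum sample statistics
strata <- data.frame(
  stratum = c("cropland", "grassland", "forest"),
  N_h     = c(6000, 3000, 1000),   # number of grid cells per stratum
  n_h     = c(18, 9, 3),           # sample size per stratum
  mean_h  = c(14.2, 21.5, 35.8),   # sample mean of, e.g., SOC (g/kg) per stratum
  var_h   = c(9.6, 16.3, 40.1)     # sample variance per stratum
)

W_h <- strata$N_h / sum(strata$N_h)   # stratum weights (relative sizes)

mean_st <- sum(W_h * strata$mean_h)   # stratified estimator of the population mean
se_st   <- sqrt(sum(W_h^2 * (1 - strata$n_h / strata$N_h) * strata$var_h / strata$n_h))
c(mean_st, se_st)
```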

We present implementations of these methods at a national scale for real case scenarios considering different constraints.

1.4 How to use this manual

The manual is structured in two parts. 'Part One' presents a methodology to evaluate the capacity of existing soil legacy data to represent the potential soil diversity within a given study area and to determine whether it is a valid set for digital soil mapping purposes. We use the Kullback–Leibler (KL) divergence to quantify the difference between the probability distributions of covariate values in the legacy sample set and in the whole area, and thus how much information is lost when the sample set is used to approximate the diversity of environmental conditions across the whole area.
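
A minimal sketch of the idea for a single covariate is given below (the data are simulated; the scripts in Part One extend this to the full covariate stack): both distributions are discretised with common bins and the KL divergence of the area distribution relative to the legacy-sample distribution is computed, a value of zero indicating that the legacy sample reproduces the covariate distribution of the whole area.

```r
set.seed(7)
# hypothetical covariate values: all raster cells of the area vs. the legacy sampling locations
covariate_area   <- rnorm(20000, mean = 500, sd = 120)   # e.g. elevation over the whole area
covariate_legacy <- rnorm(200,   mean = 560, sd = 80)    # e.g. elevation at the legacy samples

# common bins covering the range observed in the area
breaks <- seq(min(covariate_area), max(covariate_area), length.out = 26)
covariate_legacy <- pmin(pmax(covariate_legacy, min(breaks)), max(breaks))  # clamp into range

p <- hist(covariate_area,   breaks = breaks, plot = FALSE)$counts
q <- hist(covariate_legacy, breaks = breaks, plot = FALSE)$counts
p <- p / sum(p)
q <- (q + 1e-6) / sum(q + 1e-6)              # small constant avoids taking the log of empty bins

kl <- sum(ifelse(p > 0, p * log(p / q), 0))  # KL divergence of the area (p) from the sample (q)
kl
```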

In 'Part Two' we present several methods for creating soil sampling designs. We start with the determination of the minimum sample size required to describe most of the environmental diversity in the area and then proceed to the creation of sampling designs. We present examples of various sampling methods, ranging from traditional grid-based approaches to advanced statistical sampling strategies, including systematic, random and stratified sampling, and we evaluate their strengths and weaknesses in the context of DSM.

1.5 Training material

The manual exercises are written in the statistical environment R and run in the integrated development environment (IDE) RStudio for simplicity. Some scripts include modifications of the work of Malone and Brus (2022), which can be found in their respective repositories.

The training material for this book is located in the Sampling-Design-TM GitHub repository. To download the input files and R scripts, clone the repository or download it as a ZIP file and extract its content into a folder, preferably located close to the root of your system, such as "C:/GIT/". Raster data can also be downloaded from the Google Earth Engine repository of the FAO-GSP (digital-soil-mapping-gsp-fao). Script 2 in the Annexes can be used in the Google Earth Engine code editor to download the necessary environmental covariates.

We have used a common structure for file paths in the exercises. By default, the RStudio console points to the folder where the active file is located (defined by setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) in the code). With this structure, the R scripts sit in the root of the working directory and the data files in a 'data/' directory within that root, with .shp and .tif files located in the sub-folders 'data/shapes' and 'data/rasters', respectively. Following this recommendation simplifies the definition of paths and the execution of the scripts. Users who prefer a different storage location must adjust the data paths in the R scripts accordingly.
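
For illustration, the first lines of a script typically look like the sketch below; the file names are placeholders, and the terra and sf packages are only one possible choice for reading the rasters and shapefiles.

```r
library(terra)   # raster data
library(sf)      # vector data

# point the console to the folder containing the active script (requires RStudio)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# data paths relative to the working directory, following the recommended structure
aoi        <- st_read("data/shapes/aoi.shp")   # placeholder shapefile name
covariates <- rast(list.files("data/rasters", pattern = "\\.tif$", full.names = TRUE))
# the .tif files are assumed to share the same extent and resolution, so they stack into one object
```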

References

Brus, D.J. 2022. Spatial sampling with R. 1st edition. Boca Raton, Florida, Chapman & Hall/CRC. (also available at https://dickbrus.github.io/SpatialSamplingwithR/).
Minasny, B. & McBratney, A. 2006. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, 32: 1378–1388. https://doi.org/10.1016/j.cageo.2005.12.009