Annex III: Mapping without point coordinates
This chapter provides step-by-step instructions for the case in which the soil measurements do not have point coordinates but only come with information on the sampling region, i.e. the administrative unit in which they were taken. In this approach, the soil samples are repeatedly distributed at random across the respective area (see Figure 10.1).

Figure 10.1: Digital soil mapping approach for area-support data. Circles represent the steps.
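Before going through the full script, the core idea can be illustrated with a minimal sketch: a given number of samples is placed at random locations within a single administrative unit. Note that this is only a conceptual illustration using sf::st_sample() on a polygon (the one-polygon subset and the sample count of 25 are hypothetical); the script below instead draws pixels from a raster of strata with terra::spatSample().
library(sf)
# one administrative unit and the number of samples reported for it
# (first polygon and a count of 25 are purely illustrative)
district <- st_read("01-Data/AOI_Arg.shp")[1, ]
n_samples <- 25
# place the samples at random locations within the unit
pts <- st_sample(district, size = n_samples, type = "random")
plot(st_geometry(district))
plot(pts, add = TRUE, pch = 16, cex = 0.5)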
The steps are similar to the ones specified in Step 3: after merging the soil and environmental covariate data, the covariates with the most explanatory power are selected for model calibration. Model uncertainty is then assessed and finally the target soil property is predicted. This is followed by exporting the final map products.
First, the working directory is set, and the file path to the shapefile that contains the area of interest as well as the name of the target soil property are assigned to two objects. Then, the number of simulations, i.e. how often the samples are redistributed at random over the respective area, is specified (see Figure 10.1). Finally, the saved function for uncertainty assessment is loaded, together with all packages that are needed to run the script.
# 0 - Set working directory, soil attribute, and packages ======================
# Working directory
setwd('C:/GIT/GSNmap-TM/Digital-Soil-Mapping')
# Load Area of interest (shp)
AOI <- '01-Data/AOI_Arg.shp'
# Target soil attribute
soilatt <- 'p_bray'
# Repetitions (should be equal or greater than 50)
n_sim <- 5
# Function for Uncertainty Assessment
load(file = "03-Scripts/eval.RData")
#load packages
library(tidyverse)
library(data.table)
library(caret)
library(quantregForest)
library(terra)
library(sf)
library(doParallel)
library(mapview)
# install.packages("terra", repos = "https://rspatial.r-universe.dev")
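Note that the steps below write models, tiles, and maps into subfolders of 02-Outputs. Since writeRaster() and saveRDS() fail when the target folder does not exist, it can be convenient to create the folder tree up front. This is an optional helper, not part of the original script; the paths mirror the ones used later on.
# create output folders if they do not exist yet
dirs <- c("02-Outputs/models", "02-Outputs/tiles/soilatt_tiles",
          "02-Outputs/tiles/aggregated_tiles", "02-Outputs/maps")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)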
In the following lines of code, the validation data are loaded. They consist of soil data that have point coordinates. If no data with coordinates are available, this step can be skipped. In addition, the environmental covariate data are loaded and stacked into one R object. From this R object, "covs", values are extracted at the soil profile locations of the validation data. Finally, the data without coordinates are loaded.
# 1 - Load validation data (with coordinates) if available =====================
val_data <- read_csv("01-Data/data_with_coord.csv") %>%
  vect(geom = c("x", "y"), crs = "epsg:4326")
## 1.1 - Load covariates -------------------------------------------------------
# Single-band files
f <- list.files("01-Data/covs/", ".tif$", full.names = TRUE)
f <- f[f != "01-Data/covs/covs.tif"]
covs <- rast(f)
# Multi-band file
# covs <- rast("01-Data/covs/covs.tif")
# extract names of covariates
ncovs <- names(covs)
ncovs <- str_replace(ncovs, "-", "_")
names(covs) <- ncovs
## 1.2 - extract values from covariates for validation data --------------------
ex <- terra::extract(x = covs, y = val_data, xy = F)
val_data <- as.data.frame(val_data)
val_data <- cbind(val_data, ex)
val_data <- na.omit(val_data)
## 1.3 - Load the soil data without coordinates --------------------------------
dat <- read_csv("01-Data/Buenos_Aires_sur_P_Bray.csv")
The next step groups the samples by "stratum", i.e. district or another administrative unit level. The number of samples per stratum is counted and only strata with more than five samples are retained.
# Compute the number of samples per stratum
N <- dat %>%
  group_by(stratum) %>%
  summarise(N = n()) %>%
  na.omit()
# remove any stratum with few samples
N <- N[N$N > 5,]

# null dataframe to store model residuals
df <- NULL
In the following lines of code, the target soil attribute is modelled in each of the n_sim simulations. First, the strata raster (districts masked by a cropland layer) is loaded and the number of pixels within each stratum is counted. Then, max(N) pixels are randomly drawn from each stratum. Within an inner for loop, the number of sites that corresponds to the sample count of each stratum is selected, and the user can check the exact locations. The selected coordinates are then combined with the soil data without coordinates, and the values of the environmental covariates at these points are extracted with the extract() function of the terra package. After the model formula is defined, the model is calibrated using parallel computing. Note that cross-validation is not applied. The variable importance of the covariates is assessed and the model is saved. The uncertainty assessment uses the validation data (if available) as observed values and the predicted values to calculate the error indices and model fit parameters explained in the Section on Uncertainty assessment. Finally, observed versus predicted values are plotted as a scatterplot and the residuals are saved as a .csv file.
# 2 - Sequentially model the soil attribute ====================================
for (i in 1:n_sim) {
  # load tif file of strata (masked by croplands)
  stratum <- rast("01-Data/land cover/SE_districts_croplands.tif")
  # Compute the number of pixels within each stratum
  y <- freq(stratum)
  # randomly select max(N) pixels within each stratum
  x <- spatSample(x = stratum, max(N$N), method = "stratified",
                  as.df = T, xy = T, replace = F)
  table(x$stratum)
  # remove any stratum that is not present in the data (dat)
  x <- x[x$stratum %in% N$stratum,]
  # load data without coordinates again
  dat <- read_csv("01-Data/Buenos_Aires_sur_P_Bray.csv") %>%
    na.omit()
  # create a vector with strata
  strata <- unique(x$stratum)
  # select the corresponding number of sites per stratum
  d <- NULL
  for (j in seq_along(strata)) {
    z <- x[x$stratum == strata[j],]
    n <- N[N$stratum == strata[j], "N"][[1]]
    z <- z[sample(1:max(N$N), size = n),]
    if (strata[j] %in% dat$stratum) {
      z <- cbind(z, dat[dat$stratum == strata[j], soilatt])
      d <- rbind(d, z)
    }
  }
  # check the distribution of points
  d %>%
    st_as_sf(coords = c("x", "y"), crs = 4326) %>% # convert to spatial object
    mapview(zcol = soilatt, cex = 2, lwd = 0.1) #+ mapview(raster(stratum))
  dat <- vect(d, geom = c("x", "y"), crs = "epsg:4326")

  # Extract values from covariates
  pv <- terra::extract(x = covs, y = dat, xy = F)
  dat <- cbind(dat, pv)
  dat <- as.data.frame(dat)

  # Select from dat the soil attribute and the covariates and remove NAs
  d <- select(dat, soilatt, ncovs)
  d <- na.omit(d)

  # formula
  fm <- as.formula(paste(soilatt, "~", paste0(ncovs, collapse = "+")))
  # parallel processing
  cl <- makeCluster(detectCores() - 1)
  registerDoParallel(cl)
  # Set training parameters (CV is not applied)
  fitControl <- trainControl(method = "none",
                             # number = 10,  ## 10-fold CV
                             # repeats = 3,  ## repeated 3 times
                             savePredictions = TRUE)
  # Tune mtry hyperparameter
  mtry <- round(length(ncovs)/3)
  tuneGrid <- expand.grid(mtry = c(mtry))

  # Calibrate the QRF model
  print(paste("start fitting model", i))
  model <- caret::train(fm,
                        data = d,
                        method = "qrf",
                        trControl = fitControl,
                        verbose = TRUE,
                        tuneGrid = tuneGrid,
                        keep.inbag = T,
                        importance = TRUE)
  stopCluster(cl)
  gc()
  print(paste("end fitting", i))
  # Extract predictor importance
  x <- randomForest::importance(model$finalModel)
  model$importance <- x
  # Plot covariate importance
  (g2 <- varImpPlot(model$finalModel, main = soilatt, type = 1))
  # Save the plot if you like
  # png(filename = paste0("02-Outputs/importance_", soilatt, "_", i, ".png"),
  #     width = 15, height = 15, units = "cm", res = 600)
  # g2
  # dev.off()
  # Print and save model
  print(model)
  saveRDS(model, file = paste0("02-Outputs/models/model_", soilatt, "_", i, ".rds"))
  # Uncertainty assessment
  # extract observed and predicted values
  o <- val_data[, soilatt]
  p <- predict(model$finalModel, val_data, what = mean)
  df <- rbind(df, data.frame(o, p, model = i))
  # Print accuracy coefficients
  # https://github.com/AlexandreWadoux/MapQualityEvaluation
  print(paste("model", i))
  print(eval(p, o))
  # Plot and save scatterplot
  (g1 <- ggplot(df, aes(x = o, y = p)) +
      geom_point(alpha = 0.5) +
      geom_abline(slope = 1, intercept = 0, color = "red") +
      ylim(c(min(o), max(o))) + theme(aspect.ratio = 1) +
      labs(title = soilatt) +
      xlab("Observed") + ylab("Predicted"))
  # save the plots if you like
  # ggsave(g1, filename = paste0("02-Outputs/residuals_", soilatt, "_", i, ".png"),
  #        scale = 1, units = "cm", width = 12, height = 12)
}
write_csv(df, paste0("residuals_", soilatt, ".csv"))
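The accuracy coefficients printed above come from the eval() function loaded from 03-Scripts/eval.RData, whose body is not shown here (note that, once loaded, it masks base::eval()). As a point of reference, a minimal stand-in that computes three of the indices discussed in the Section on Uncertainty assessment could look as follows; this is a sketch only, as the actual function (based on Wadoux's MapQualityEvaluation) returns a larger set of indices.
# hypothetical stand-in for eval(): bias, RMSE, and squared correlation
eval_sketch <- function(pred, obs) {
  e <- obs - pred
  data.frame(ME   = mean(e, na.rm = TRUE),                  # mean error (bias)
             RMSE = sqrt(mean(e^2, na.rm = TRUE)),          # root mean squared error
             R2   = cor(obs, pred, use = "complete.obs")^2) # squared correlation
}
# eval_sketch(p, o)  # same argument order as eval(p, o) above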
The previous model fitting and calibration is repeated n_sim times (five in this example), as specified at the beginning of the script. In the following part of the script, the conditional mean and standard deviation are predicted for each simulation. To keep memory use manageable, the study area is divided into tiles. For each tile, the respective covariates and the models saved in the previous step are used to predict the mean and standard deviation, which are then stored in two separate raster files.
# 3 - Prediction ===============================================================
# Predict conditional mean and standard deviation for each n_sim
## 3.1 - Produce tiles ---------------------------------------------------------
r <- covs[[1]]
# adjust the number of tiles according to the size and shape of your study area
t <- rast(nrows = 5, ncols = 5, extent = ext(r), crs = crs(r))
tile <- makeTiles(r, t, overwrite = TRUE, filename = "02-Outputs/tiles/tiles.tif")
## 3.2 - Predict soil attributes per tiles -------------------------------------
# loop to predict on each tile for each simulation (n tiles x n simulation)
for (j in seq_along(tile)) {
  # where j is the number of the tile
  for (i in 1:n_sim) {
    # where i is the number of the simulation
    gc()
    # Read the tile j
    t <- rast(tile[j])
    # crop the covs with the tile extent
    covst <- crop(covs, t)
    # read the model i
    model <- readRDS(paste0("02-Outputs/models/model_", soilatt, "_", i, ".rds"))
    # predict mean and sd
    meanFunc <- function(x) mean(x, na.rm = TRUE)
    pred_mean <- terra::predict(covst, model = model$finalModel, na.rm = TRUE,
                                cpkgs = "quantregForest", what = meanFunc)
    sdFunc <- function(x) sd(x, na.rm = TRUE)
    pred_sd <- terra::predict(covst, model = model$finalModel, na.rm = TRUE,
                              cpkgs = "quantregForest", what = sdFunc)
    # save the predicted tiles
    writeRaster(pred_mean,
                filename = paste0("02-Outputs/tiles/soilatt_tiles/", soilatt,
                                  "_tile_", j, "_model_", i, ".tif"),
                overwrite = TRUE)
    writeRaster(pred_sd,
                filename = paste0("02-Outputs/tiles/soilatt_tiles/", soilatt,
                                  "_tileSD_", j, "_model_", i, ".tif"),
                overwrite = TRUE)
    rm(pred_mean)
    rm(pred_sd)
    print(paste(tile[j], "model", i))
  }
}
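The what argument passed via terra::predict() to the quantile regression forest is what turns the same model into a mean or standard deviation predictor. The identical mechanism can return arbitrary quantiles, for instance the bounds of a 90 percent prediction interval. The sketch below is an optional extension, not part of the workflow, and pred_q05/pred_q95 are illustrative names.
# lower and upper bound of a 90 % prediction interval for one tile
q05Func <- function(x) quantile(x, probs = 0.05, na.rm = TRUE)
pred_q05 <- terra::predict(covst, model = model$finalModel, na.rm = TRUE,
                           cpkgs = "quantregForest", what = q05Func)
q95Func <- function(x) quantile(x, probs = 0.95, na.rm = TRUE)
pred_q95 <- terra::predict(covst, model = model$finalModel, na.rm = TRUE,
                           cpkgs = "quantregForest", what = q95Func)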
Next, the predictions are aggregated: for each tile, the mean of the n_sim predictions is computed, separately for the predicted mean and the predicted standard deviation. The resulting aggregated tiles are saved as raster files.
## 3.3 - Average the n_sim predictions (mean and st.dev.) per tile -------------
for (j in seq_along(tile)) {
  # list the n_sim predictions of tile j
  f_mean <- list.files(path = "02-Outputs/tiles/soilatt_tiles/",
                       pattern = paste0(soilatt, "_tile_", j, "_model_"),
                       full.names = TRUE)
  f_sd <- list.files(path = "02-Outputs/tiles/soilatt_tiles/",
                     pattern = paste0(soilatt, "_tileSD_", j, "_model_"),
                     full.names = TRUE)
  # Estimate the mean of the n_sim predictions (both mean and sd) for each tile
  pred_mean <- rast(f_mean) %>% mean(na.rm = TRUE)
  pred_sd <- rast(f_sd) %>% mean(na.rm = TRUE)
  names(pred_sd) <- "sd"
  # save resulting tiles
  writeRaster(pred_sd,
              filename = paste0("02-Outputs/tiles/aggregated_tiles/", soilatt,
                                "_tilesd_", j, ".tif"),
              overwrite = TRUE)
  writeRaster(pred_mean,
              filename = paste0("02-Outputs/tiles/aggregated_tiles/", soilatt,
                                "_tileMean_", j, ".tif"),
              overwrite = TRUE)
  print(j)
}
In the following, the aggregated tiles are mosaicked into a single raster file per output (one for the mean, one for the standard deviation) and saved in the maps folder.
## 3.4 - Merge tiles -----------------------------------------------------------
name_tiles <- c("sd", "Mean")
for (i in seq_along(name_tiles)) {
  # list tiles
  f <- list.files(path = "02-Outputs/tiles/aggregated_tiles/",
                  pattern = paste0(soilatt, "_tile", name_tiles[i]),
                  full.names = TRUE)
  # read tiles
  r <- list()
  for (g in seq_along(f)) {
    r[[g]] <- rast(f[g])
    print(g)
  }
  # Mosaic tiles
  r <- sprc(r)
  r <- mosaic(r)
  # plot final map
  plot(r, main = paste("Predicted", name_tiles[i]))
  # save final map
  writeRaster(r, paste0("02-Outputs/maps/", soilatt, name_tiles[i], ".tif"),
              overwrite = TRUE)
  rm(r)
  gc()
}
If data with coordinates are available, the eval function is employed at the end to assess the deviation between the observed values (with coordinates) and the values predicted from the area-support data.
# 4 - Validate resulting map ===================================================
# Load data with coordinates
val_data <- read_csv("01-Data/data_with_coord.csv") %>%
  vect(geom = c("x", "y"), crs = "epsg:4326")
# Load predicted mean
pred_mean <- rast("02-Outputs/maps/p_brayMean.tif")
plot(pred_mean)
# extract values from predicted mean map
ex <- terra::extract(x = pred_mean, y = val_data, xy = F, ID = FALSE)
# Estimate accuracy indicators
val_data <- as.data.frame(val_data)
val_data <- cbind(val_data, ex)
val_data <- na.omit(val_data)
eval(val_data$mean, val_data[, soilatt])
At the end of the script, the prediction maps are again masked with the cropland mask layer and saved as raster files in the maps output folder.
# 5 - Export final maps ========================================================
## 5.1 - Mask croplands --------------------------------------------------------
msk <- rast("01-Data/mask_arg.tif")
plot(msk)
# Mask croplands in predicted mean
pred_mean <- rast("02-Outputs/maps/p_brayMean.tif")
pred_mean <- mask(pred_mean, msk)
plot(pred_mean)
# Mask croplands in predicted sd
pred_sd <- rast("02-Outputs/maps/p_braysd.tif")
pred_sd <- mask(pred_sd, msk)
plot(pred_sd)
## 5.2 - Save results ----------------------------------------------------------
writeRaster(pred_mean,
paste0("02-Outputs/maps/",soilatt,"_no_coord.tif"),
overwrite=TRUE)
writeRaster(pred_sd,
paste0("02-Outputs/maps/",soilatt,"_no_coord_SD.tif"),
overwrite=TRUE)