Chapter 5 The training dataset and basic computations in R

In this chapter, the datasets used in this technical manual are presented. They were adapted for educational purposes by the GSP Secretariat. The tutorial provided in this chapter is for demonstration purposes and is meant for users with no prior experience in R. Thus, the instructions also serve as a continuation of the basic introduction to the functioning of R and RStudio given in Chapter 3.

Instructions are given on how to:

  1. Generate user-defined variables,
  2. Set the working directory and load necessary packages,
  3. Import national data into RStudio.

Users with prior experience may skip this chapter and go directly to Chapters 6 to 9, which cover all the necessary steps from data preparation to mapping and reporting.

5.1 Study area and training material

The study area is located in the southeast of the Pampas Region in Argentina and extends from the foothills of the Ventania and Tandilia hill systems to the southern coast of Buenos Aires Province. To illustrate the different processes of this Technical Manual, we use three datasets from this region:

  • Georeferenced topsoil data
    • Chemical soil properties
    • Physical soil properties
  • Soil profile data

5.1.1 Georeferenced topsoil data

These data were collected in 2011 by the National Institute of Agricultural Technology and the Faculty of Agricultural Science of the National University of Mar del Plata (Unidad Integrada INTA-FCA) to map the status of soil nutrients in the Argentinian Pampas (Sainz Rozas et al., 2019). The modified dataset is derived from a subset of 118 locations and covers the target depth of 0-30 cm. It is stored in two spreadsheets, one containing soil chemical properties (soil_chem_data030.csv) and one containing soil physical properties (soil_phys_data030.csv). All datasets are located in the 01-Data folder inside the Digital-Soil-Mapping training material folder. The soil chemical properties spreadsheet provides laboratory data with point coordinates (x = longitude, y = latitude), together with available Phosphorus (p_bray, in ppm), available Potassium (k, in ppm), and total nitrogen (tn, in percent) (see Table 5.1).

Table 5.1: Dataset with coordinates for chemical soil properties.
LabID x y p_bray k tn
51 -61.51282 -37.37646 20.40 852.17 0.22
60 -57.84725 -37.85136 10.52 769.55 0.30
64 -58.87620 -38.54000 15.87 992.41 0.27
67 -60.30394 -38.45300 20.85 740.24 0.18
68 -60.39772 -38.51567 13.54 724.77 0.17
69 -60.41442 -38.52914 46.17 699.03 0.13
74 -60.00556 -38.76500 20.94 518.58 0.23
75 -60.10750 -38.76472 26.82 450.17 0.24
77 -60.17139 -38.79278 22.56 858.80 0.19
78 -60.03111 -38.74611 20.09 662.91 0.20
Only the first ten rows are shown.

The spreadsheet with soil physical data contains data on soil texture for clay (clay_0_30, in g/kg), silt (silt_0_30, in g/kg), and sand (sand_0_30, in g/kg) (see Table 5.2).

Table 5.2: Dataset with coordinates for physical soil properties.
ProfID x y clay_0_30 sand_0_30 silt_0_30
154 -58.67430 -38.20796 259.79 410 330.21
197 -60.45918 -38.36285 251.05 400 348.95
262 -58.86694 -38.42194 213.04 480 306.96
2702 -58.02222 -37.82167 259.71 430 310.29
2706 -57.91861 -37.95444 265.08 400 334.92
2709 -60.47222 -36.67778 323.92 310 366.08
2710 -60.22856 -36.69115 274.30 240 485.70
2711 -60.45076 -36.84394 234.67 540 225.33
2712 -60.42631 -36.94468 293.42 310 396.58
2714 -59.27717 -36.95655 262.34 460 277.66
Only the first ten rows are shown.
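
As a quick plausibility check on texture data of this kind, the three fractions of each sample should add up to roughly 1000 g/kg. The following minimal sketch uses the first five rows of Table 5.2, typed in by hand so that it runs without the CSV file. The column names `texture_sum`, `texture_ok` and `clay_pct` and the tolerance of 1 g/kg are illustrative assumptions, not part of the dataset.

```r
# First five rows of Table 5.2 (values in g/kg)
phys <- data.frame(
  ProfID    = c(154, 197, 262, 2702, 2706),
  clay_0_30 = c(259.79, 251.05, 213.04, 259.71, 265.08),
  sand_0_30 = c(410, 400, 480, 430, 400),
  silt_0_30 = c(330.21, 348.95, 306.96, 310.29, 334.92)
)

# Sum of the three texture fractions per sample
phys$texture_sum <- phys$clay_0_30 + phys$sand_0_30 + phys$silt_0_30

# Flag samples whose fractions add up to 1000 g/kg
# (the 1 g/kg tolerance is an assumption chosen for illustration)
phys$texture_ok <- abs(phys$texture_sum - 1000) < 1

# Convert clay from g/kg to g/100g, the unit used in the example format of Table 5.4
phys$clay_pct <- phys$clay_0_30 / 10

phys
```

The same lines can be applied to the full table once soil_phys_data030.csv has been imported.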

The following map shows the distribution of the sampling points, coloured by their available Phosphorus values. This dataset is used in Chapter 8 for mapping.

library(tidyverse)
library(sf)
library(mapview)

mapviewOptions(fgb = FALSE)

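# Read the topsoil chemical data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_chem_data030.csv")
# Convert to a spatial object (x = longitude, y = latitude, WGS84)
s <- st_as_sf(data, coords = c("x", "y"), crs = 4326)
# Interactive map of the points, coloured by available Phosphorus
mapview(s, zcol = "p_bray", cex = 2.5, lwd = 0)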

5.1.2 Soil profile data

Finally, the third dataset belongs to the Soil Information System of Argentina (SISINTA; Olmedo, Rodriguez and Angelini, 2017), which contains soil profiles collected from the 1960s to recent years for soil survey purposes. The data can be fetched using the SISINTAR package. Table 5.3 shows a subset of the data, and the map presents the distribution of soil profiles in the study area. The soil profile data consist of measurements of soil organic carbon (soc, in percent), soil pH (ph_h2o), available Potassium (k), bulk density (bd, in g/cm3), and cation exchange capacity (cec, in cmolc/100g). This dataset is used in this chapter to illustrate the preprocessing steps required for data that come from soil profiles.

Table 5.3: Soil profile dataset.
id_prof id_hor x y top bottom ph_h2o k soc bd
51 28706 -60.35188 -38.80600 0 15 6 1.5 2.6 NA
51 28707 -60.35188 -38.80600 15 25 6 1.7 2.5 NA
51 28708 -60.35188 -38.80600 25 52 6 0.8 1.3 NA
51 28709 -60.35188 -38.80600 52 57 NA NA NA NA
154 28425 -58.67430 -38.20796 0 14 6 2.2 3.6 NA
154 28426 -58.67430 -38.20796 14 26 6 1.9 2.8 NA
154 28427 -58.67430 -38.20796 26 44 6 2.5 1.1 NA
154 28428 -58.67430 -38.20796 44 56 7 2.2 0.5 NA
154 28429 -58.67430 -38.20796 56 105 6 1.8 0.2 NA
197 28588 -60.45918 -38.36285 0 13 6 2.8 3.4 NA
Only the first ten rows are shown.

library(tidyverse)
library(sf)
library(mapview)

mapviewOptions(fgb = FALSE)
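# Read the soil profile data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_profile_data.csv")
# Keep only the horizons that start at the surface (top = 0)
s <- data %>% filter(top == 0)
# Convert to a spatial object (x = longitude, y = latitude, WGS84)
s <- st_as_sf(s, coords = c("x", "y"), crs = 4326)
# Interactive map of the profiles, coloured by available Potassium
mapview(s, zcol = "k", cex = 2.5, lwd = 0)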

5.2 Format requirements of soil data

Soil data generally consists of measurements at a specific geographical location, time and soil depth. Therefore, it is necessary to arrange the data following the format shown in Table 5.4.

Table 5.4: Example format of a database.
Profile ID Horizon ID Lat Long Year Top Bottom cec ph clay silt sand soc bd
1 1_1 12.12346 1.123456 2018 0 20 15 6.5 35 58 7 3.4 1.31
1 1_2 12.12346 1.123456 2018 20 40 19 7.1 42 48 10 2.1 1.32
2 2_1 23.12346 2.123456 2019 0 30 14 5.5 12 53 35 2.9 1.39
Note:
Profile ID = unique profile identifier, Horizon ID = unique layer identifier, Lat = latitude in decimal degrees, Long = longitude in decimal degrees, Year = sampling year, Top = upper limit of the layer in cm, Bottom = lower limit of the layer in cm, cec = Cation Exchange Capacity (cmol_c/kg), ph = pH in water, clay = Clay (g/100g soil), silt = Silt (g/100g soil), sand = Sand (g/100g soil), soc = Soil Organic Carbon (g/100g soil), bd = Bulk Density (g/cm3).
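
To make the format more concrete, the sketch below rebuilds the three example rows of Table 5.4 as an R data frame and runs a few basic checks on them. It uses base R only. The lower-case column names and the names of the check objects are assumptions made for this illustration, and the checks are examples rather than a complete validation.

```r
# Example rows of Table 5.4 (column names simplified for use in R)
soil <- data.frame(
  profile_id = c(1, 1, 2),
  horizon_id = c("1_1", "1_2", "2_1"),
  lat    = c(12.12346, 12.12346, 23.12346),
  long   = c(1.123456, 1.123456, 2.123456),
  year   = c(2018, 2018, 2019),
  top    = c(0, 20, 0),
  bottom = c(20, 40, 30),
  cec    = c(15, 19, 14),
  ph     = c(6.5, 7.1, 5.5),
  clay   = c(35, 42, 12),
  silt   = c(58, 48, 53),
  sand   = c(7, 10, 35),
  soc    = c(3.4, 2.1, 2.9),
  bd     = c(1.31, 1.32, 1.39)
)

# Every layer needs its own identifier
ids_unique <- !any(duplicated(soil$horizon_id))

# Coordinates must be valid decimal degrees
coords_ok <- all(abs(soil$lat) <= 90 & abs(soil$long) <= 180)

# The upper limit of a layer must lie above its lower limit
depths_ok <- all(soil$top < soil$bottom)

# Texture fractions in g/100g should add up to about 100
# (the tolerance of 1 is an illustrative assumption)
texture_ok <- all(abs(soil$clay + soil$silt + soil$sand - 100) < 1)

c(ids_unique = ids_unique, coords_ok = coords_ok,
  depths_ok = depths_ok, texture_ok = texture_ok)
```

Any check that returns FALSE points to rows that need attention before the data are used further.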

5.3 Pre-processing steps

Soil data are often arranged differently, so specific pre-processing steps are needed to reach the format shown in Table 5.4. Along the way, common issues can be resolved, such as rearranging the data layout, fixing inconsistent soil horizon depths, and detecting unusual soil property measurements. This section gives examples of how to carry out some basic data handling steps in RStudio.
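
As a rough illustration of two of these issues, the sketch below checks horizon depth consistency and flags implausible values for the two complete profiles shown in Table 5.3 (typed in by hand, with a reduced set of columns). It uses base R only. The helper columns `depth_issue`, `ph_flag` and `soc_flag` and the plausibility limits are assumptions chosen for this example, not thresholds prescribed by this manual.

```r
# Horizons of profiles 51 and 154 from Table 5.3
prof <- data.frame(
  id_prof = c(51, 51, 51, 51, 154, 154, 154, 154, 154),
  id_hor  = c(28706, 28707, 28708, 28709, 28425, 28426, 28427, 28428, 28429),
  top     = c(0, 15, 25, 52, 0, 14, 26, 44, 56),
  bottom  = c(15, 25, 52, 57, 14, 26, 44, 56, 105),
  ph_h2o  = c(6, 6, 6, NA, 6, 6, 6, 7, 6),
  soc     = c(2.6, 2.5, 1.3, NA, 3.6, 2.8, 1.1, 0.5, 0.2)
)

# Sort horizons by profile and upper depth
prof <- prof[order(prof$id_prof, prof$top), ]

# Lower limit of the horizon directly above, within the same profile
# (NA for the first horizon of each profile)
prev_bottom <- ave(prof$bottom, prof$id_prof,
                   FUN = function(b) c(NA, head(b, -1)))

# A gap or overlap exists when a horizon does not start
# where the previous one ends
prof$depth_issue <- !is.na(prev_bottom) & prof$top != prev_bottom

# Flag values outside plausible limits (limits are illustrative assumptions)
prof$ph_flag  <- !is.na(prof$ph_h2o) & (prof$ph_h2o < 2 | prof$ph_h2o > 11)
prof$soc_flag <- !is.na(prof$soc) & (prof$soc < 0 | prof$soc > 60)

# Rows that need attention
prof[prof$depth_issue | prof$ph_flag | prof$soc_flag, ]

# Horizons without soil organic carbon measurements
prof[is.na(prof$soc), c("id_prof", "id_hor")]
```

Flagged rows are candidates for manual inspection, not automatic removal, since an unusual value can still be a valid measurement.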

5.3.1 Set the scene (set working directory, packages, load data)

Let’s open RStudio. Whenever you start working on a project or task, it is necessary to set the working directory (WD). The WD is the folder that R uses by default to read files and to save outputs, for instance a plot or a table generated while working in R. The WD is therefore central, since it determines where files and results can be found afterwards. There are multiple ways of setting the WD. One option is to click on the ‘Session’ menu > ‘Set Working Directory’ and select either ‘To Source File Location’ (the WD then corresponds to the folder where the script is saved) or ‘Choose Directory…’, which lets the user browse to the folder that should become the WD.

In this manual we propose an alternative that allows for more customization and flexibility, since multiple WDs are sometimes needed, for instance to save the final map in a different folder than the covariates. Because file paths differ depending on where you stored the files on your computer, it is crucial to identify the correct file path. This can be done in the file explorer: browse to your training material folder and then right-click on the bar highlighted in red in Figure 5.1.

Figure 5.1: Get file path from file explorer.

The file path will appear in the following format: C:\Users\GSNmap-TM\Digital-Soil-Mapping. For R to read this as a file path, every \ must be replaced by /. The resulting file path should look similar to this one: C:/Users/GSNmap-TM/Digital-Soil-Mapping. Once this is done, we can assign the WD file path to an R object. This is done by writing a character value (in this case the file path) on the right side of the arrow (<-) and the name of the R object (wd) on the left side (see code). We then use the function setwd() to set the WD to the file path stored in the object wd.

# 0 - User-defined variables ====================================

wd <- 'C:/Users/hp/Documents/GitHub/Digital-Soil-Mapping'
#wd <- "C:/GIT/Digital-Soil-Mapping"

# 1 - Set working directory and load necessary packages =========
setwd(wd) # change the path accordingly
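
The replacement of \ by / described above can also be done in R itself. The following minimal sketch uses the example path from the text. The object names `win_path` and `data_dir` are hypothetical, and the call to setwd() is guarded so that the code does not fail if the folder does not exist on your computer.

```r
# Path as copied from the Windows file explorer.
# Inside an R string, each backslash has to be written twice.
win_path <- "C:\\Users\\GSNmap-TM\\Digital-Soil-Mapping"

# Replace every backslash with a forward slash
wd <- gsub("\\", "/", win_path, fixed = TRUE)
wd

# Build the path to a sub-folder without typing separators by hand
data_dir <- file.path(wd, "01-Data")
data_dir

# Only change the working directory if the folder actually exists
if (dir.exists(wd)) {
  setwd(wd)
}
```

Storing folder paths in objects such as `wd` and `data_dir` makes it easier to read inputs from one folder and write outputs to another later on.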

An alternative and more automatic approach for setting the working directory makes use of the rstudioapi package, which provides functions for interacting with RStudio’s API (Application Programming Interface). The code below first sets the working directory to the directory containing the active document (in this case the script). The second line of code then changes the working directory to the parent directory using the “..” specification.

# Set the working directory automatically using rstudioapi.
# Note that the directory is set to the folder containing the script.
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# The ".." is used to go up one directory
# (in this case to our project folder)
setwd("..")

In addition to the built-in base R functions, there is a vast number of so-called packages that extend the functionality of R and allow it to be used for a broad range of purposes. For data handling and management, the tidyverse package and its dependencies are a great help. To load a package into the R session, the library() function is used. However, if the package is not yet installed, it is necessary to run the install.packages() function first.

# install.packages("tidyverse") # run once if the package is not installed
library(readxl)
library(tidyverse)
library(dplyr)

# load in data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_chem_data030.csv")
## Rows: 119 Columns: 6
## ── Column specification ──────────────────────────────────
## Delimiter: ","
## chr (1): LabID
## dbl (5): x, y, p_bray, k, tn
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 6
##   LabID     x     y p_bray     k    tn
##   <chr> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 51    -61.5 -37.4   20.4  852. 0.225
## 2 60    -57.8 -37.9   10.5  770. 0.301
## 3 64    -58.9 -38.5   15.9  992. 0.266
## 4 67    -60.3 -38.5   20.8  740. 0.179
## 5 68    -60.4 -38.5   13.5  725. 0.168
## 6 69    -60.4 -38.5   46.2  699. 0.129
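
The install-then-load step described above can be made conditional, so that packages are only installed when they are missing. The sketch below is one possible way to do this with base R. The helper `missing_packages()` is a hypothetical function written for this example, and the installation line is left commented out so that nothing is installed without the user deciding to do so.

```r
# Return the names of the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this chapter
to_install <- missing_packages(c("tidyverse", "sf", "mapview"))
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

After installation, the packages still have to be loaded with library() in every new R session.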

For further guidance and more in-depth techniques for managing and handling soil data in R, we recommend the GSP GitHub repository on soil database management: FAO-GSP Soil DB. It provides not only training data but also extensive example code. For now, we continue working with the example dataset and assume that the dataset you are using complies with the format specified in Section 5.2.

References

Olmedo, G., Rodriguez, D. & Angelini, M. 2017. Advances in digital soil mapping and soil information systems in Argentina. GlobalSoilMap, pp. 13–16. CRC Press.
Sainz Rozas, H.R., Eyherabide, M., Larrea, G.E., Martinez Cuesta, N., Angelini, H.P., Reussi Calvo, N.I. & Wyngaard, N. 2019. Relevamiento y determinación de propiedades químicas en suelos de aptitud agrícola de la región pampeana. Fertilizar, Argentina, 8(9): 12.