Chapter 5 The training dataset and basic computations in R

In this chapter, the datasets used in this technical manual are presented. They were adapted for educational purposes by the GSP Secretariat. The tutorial provided in this chapter is for demonstration purposes and is meant for users with no prior experience in R. Thus, the instructions also serve as a continuation of the basic introduction to the functioning of R and RStudio given in Chapter 3.

Instructions are given on how to:

Generate user-defined variables,
Set the working directory and load necessary packages,
Import national data into RStudio.

Users with prior experience may skip this chapter and go directly to Chapters 6 to 9, which cover all the necessary steps from data preparation to mapping and reporting.

5.1 Study area and training material

The study area is located in the southeast of the Pampas Region in Argentina, and extends from the foothills of the Ventania and Tandilia hill systems to the southern coast of Buenos Aires Province. To illustrate the different processes of this Technical Manual, we use three datasets from this region:

Georeferenced topsoil data
- Chemical soil properties
- Physical soil properties
Soil profile data

5.1.1 Georeferenced topsoil data

These data were collected in 2011 by the National Institute of Agricultural Technology and the Faculty of Agricultural Science of the National University of Mar del Plata (Unidad Integrada INTA-FCA) to map the status of soil nutrients in the Argentinian Pampas (Sainz Rozas et al., 2019). The modified dataset is derived from a subset of 118 locations and covers the target depth of 0-30 cm. It is structured in two spreadsheets that contain soil chemical properties (soil_chem_data030.csv) and soil physical properties (soil_phys_data030.csv). All datasets are located in the 01-Data folder of the training material folder Digital-Soil-Mapping. The soil chemical properties spreadsheet provides laboratory data with point coordinates (x = longitude, y = latitude) together with data on available Phosphorus (p_bray, in ppm), available Potassium (k, in ppm), and total nitrogen (tn, in percent) (see Table 5.1).

Table 5.1: Dataset with coordinates for chemical soil properties.
LabID	x	y	p_bray	k	tn
51	-61.51282	-37.37646	20.40	852.17	0.22
60	-57.84725	-37.85136	10.52	769.55	0.30
64	-58.87620	-38.54000	15.87	992.41	0.27
67	-60.30394	-38.45300	20.85	740.24	0.18
68	-60.39772	-38.51567	13.54	724.77	0.17
69	-60.41442	-38.52914	46.17	699.03	0.13
74	-60.00556	-38.76500	20.94	518.58	0.23
75	-60.10750	-38.76472	26.82	450.17	0.24
77	-60.17139	-38.79278	22.56	858.80	0.19
78	-60.03111	-38.74611	20.09	662.91	0.20
Only the first ten rows are shown.

The spreadsheet with soil physical data contains data on soil texture for clay (clay_0_30, in g/kg), silt (silt_0_30, in g/kg), and sand (sand_0_30, in g/kg) (see Table 5.2).

Table 5.2: Dataset with coordinates for physical soil properties.
ProfID	x	y	clay_0_30	sand_0_30	silt_0_30
154	-58.67430	-38.20796	259.79	410	330.21
197	-60.45918	-38.36285	251.05	400	348.95
262	-58.86694	-38.42194	213.04	480	306.96
2702	-58.02222	-37.82167	259.71	430	310.29
2706	-57.91861	-37.95444	265.08	400	334.92
2709	-60.47222	-36.67778	323.92	310	366.08
2710	-60.22856	-36.69115	274.30	240	485.70
2711	-60.45076	-36.84394	234.67	540	225.33
2712	-60.42631	-36.94468	293.42	310	396.58
2714	-59.27717	-36.95655	262.34	460	277.66
Only the first ten rows are shown.
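Because clay, silt and sand are reported in g/kg, the three fractions of each sample are expected to add up to about 1000 g/kg. The following minimal sketch checks this for the first three rows of Table 5.2 and converts clay to g/100g, the unit used in Table 5.4. The tolerance `tol` and the new column names (`texture_sum`, `sum_ok`, `clay_pct`) are assumptions made for illustration only.

```r
# First three rows of Table 5.2 (values copied from the table)
phys <- data.frame(
  ProfID    = c(154, 197, 262),
  clay_0_30 = c(259.79, 251.05, 213.04),
  sand_0_30 = c(410, 400, 480),
  silt_0_30 = c(330.21, 348.95, 306.96)
)

# Sum of the three texture fractions per sample, in g/kg
phys$texture_sum <- phys$clay_0_30 + phys$sand_0_30 + phys$silt_0_30

# Flag samples whose sum deviates from 1000 g/kg by more than a tolerance
tol <- 0.01  # assumed tolerance in g/kg
phys$sum_ok <- abs(phys$texture_sum - 1000) < tol

# Convert g/kg to g/100g by dividing by 10
phys$clay_pct <- phys$clay_0_30 / 10

print(phys)
```

Rows with `sum_ok == FALSE` would be candidates for a closer look before the data are used for mapping.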

The following map shows the distribution of the sampling points, coloured by their available Phosphorus values. This dataset is used in Chapter 8 for mapping.

library(tidyverse)
library(sf)
library(mapview)

mapviewOptions(fgb = FALSE)

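# Read the chemical topsoil data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_chem_data030.csv")
# Convert to a spatial object (x = longitude, y = latitude, EPSG:4326)
s <- st_as_sf(data, coords = c("x", "y"), crs = 4326)
# Plot the points, coloured by available Phosphorus
mapview(s, zcol = "p_bray", cex = 2.5, lwd = 0)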

5.1.2 Soil profile data

Finally, the third dataset comes from the Soil Information System of Argentina (SISINTA; Olmedo, Rodriguez and Angelini, 2017), which contains soil profiles collected from the 1960s to recent years for soil survey purposes. The data can be fetched using the package SISINTAR. Table 5.3 shows a subset of the data, and the map presents the distribution of soil profiles in the study area. The soil profile data consist of measurements of soil organic carbon (soc, in percent), soil pH (ph_h2o), available Potassium (k), bulk density (bd, in g/cm³), and cation exchange capacity (cec, in cmol_c/100g). This dataset is used in this chapter to illustrate the preprocessing steps required for data that come from soil profiles.

Table 5.3: Soil profile dataset.
id_prof	id_hor	x	y	top	bottom	ph_h2o	k	soc	bd
51	28706	-60.35188	-38.80600	0	15	6	1.5	2.6	NA
51	28707	-60.35188	-38.80600	15	25	6	1.7	2.5	NA
51	28708	-60.35188	-38.80600	25	52	6	0.8	1.3	NA
51	28709	-60.35188	-38.80600	52	57	NA	NA	NA	NA
154	28425	-58.67430	-38.20796	0	14	6	2.2	3.6	NA
154	28426	-58.67430	-38.20796	14	26	6	1.9	2.8	NA
154	28427	-58.67430	-38.20796	26	44	6	2.5	1.1	NA
154	28428	-58.67430	-38.20796	44	56	7	2.2	0.5	NA
154	28429	-58.67430	-38.20796	56	105	6	1.8	0.2	NA
197	28588	-60.45918	-38.36285	0	13	6	2.8	3.4	NA
Only ten rows are shown.

library(tidyverse)
library(sf)
library(mapview)

mapviewOptions(fgb = FALSE)
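# Read the soil profile data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_profile_data.csv")
# Keep only the horizons that start at the soil surface
s <- data %>% filter(top == 0)
s <- st_as_sf(s, coords = c("x", "y"), crs = 4326)
# Plot the profile locations, coloured by Potassium in the surface horizon
mapview(s, zcol = "k", cex = 2.5, lwd = 0)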

5.2 Format requirements of soil data

Soil data generally consists of measurements at a specific geographical location, time and soil depth. Therefore, it is necessary to arrange the data following the format shown in Table 5.4.

Table 5.4: Example format of a database.
Profile ID	Horizon ID	Lat	Long	Year	Top	Bottom	cec	ph	clay	silt	sand	soc	bd
1	1_1	12.12346	1.123456	2018	0	20	15	6.5	35	58	7	3.4	1.31
1	1_2	12.12346	1.123456	2018	20	40	19	7.1	42	48	10	2.1	1.32
2	2_1	23.12346	2.123456	2019	0	30	14	5.5	12	53	35	2.9	1.39
Note:
Profile ID = unique profile identifier, Horizon ID = unique layer identifier, Lat = latitude in decimal degrees, Long = longitude in decimal degrees, Year = sampling year, Top = upper limit of the layer in cm, Bottom = lower limit of the layer in cm, cec = Cation Exchange Capacity (cmol_c/kg), ph = pH in water, clay = Clay (g/100g soil), silt = Silt (g/100g soil), sand = Sand (g/100g soil), soc = Soil Organic Carbon (g/100g soil), bd = Bulk Density (g/cm³).
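A quick way to see whether a table follows this layout is to test it programmatically. The sketch below rebuilds the three rows of Table 5.4 as a data frame and runs a few basic checks. The snake_case column names and the helper `check_format()` are hypothetical and only serve this illustration; they are not a naming requirement of the manual.

```r
# Example database with the values of Table 5.4
# (column names are an assumption made for this sketch)
soil <- data.frame(
  profile_id = c(1, 1, 2),
  horizon_id = c("1_1", "1_2", "2_1"),
  lat    = c(12.12346, 12.12346, 23.12346),
  long   = c(1.123456, 1.123456, 2.123456),
  year   = c(2018, 2018, 2019),
  top    = c(0, 20, 0),
  bottom = c(20, 40, 30),
  cec  = c(15, 19, 14),
  ph   = c(6.5, 7.1, 5.5),
  clay = c(35, 42, 12),
  silt = c(58, 48, 53),
  sand = c(7, 10, 35),
  soc  = c(3.4, 2.1, 2.9),
  bd   = c(1.31, 1.32, 1.39)
)

required <- c("profile_id", "horizon_id", "lat", "long",
              "year", "top", "bottom")

# Hypothetical helper: returns a character vector describing the problems
# found (an empty vector means that all checks passed)
check_format <- function(df, required_cols) {
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    # Stop early: the remaining checks need these columns
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  if (anyDuplicated(df$horizon_id) > 0) {
    problems <- c(problems, "horizon_id is not unique")
  }
  if (any(df$lat < -90 | df$lat > 90, na.rm = TRUE)) {
    problems <- c(problems, "lat outside [-90, 90]")
  }
  if (any(df$long < -180 | df$long > 180, na.rm = TRUE)) {
    problems <- c(problems, "long outside [-180, 180]")
  }
  if (any(df$top >= df$bottom, na.rm = TRUE)) {
    problems <- c(problems, "top is not above bottom")
  }
  problems
}

print(check_format(soil, required))
```

The helper returns messages instead of stopping with an error, so all problems in a dataset can be listed in one pass.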

5.3 Pre-processing steps

Soil data is often arranged differently, so specific pre-processing steps are required to reach the format described above. On the way towards a formatted database, common issues can be solved, such as rearranging the data format, fixing inconsistencies in soil horizon depths, and detecting unusual soil property measurements. This section gives examples of how to carry out some basic data handling steps in RStudio.
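Two of the issues named above, horizon depth consistency and unusual measurements, can be sketched with base R using profiles 51 and 154 from Table 5.3. The pH limits used here (2 to 11) are an assumed plausibility range chosen for illustration, not a threshold prescribed by this manual, and the new column names are likewise hypothetical.

```r
# Horizons of profiles 51 and 154, copied from Table 5.3
prof <- data.frame(
  id_prof = c(51, 51, 51, 51, 154, 154, 154, 154, 154),
  id_hor  = c(28706, 28707, 28708, 28709,
              28425, 28426, 28427, 28428, 28429),
  top     = c(0, 15, 25, 52, 0, 14, 26, 44, 56),
  bottom  = c(15, 25, 52, 57, 14, 26, 44, 56, 105),
  ph_h2o  = c(6, 6, 6, NA, 6, 6, 6, 7, 6)
)

# Sort by profile and upper depth so that horizons are in order
prof <- prof[order(prof$id_prof, prof$top), ]

# Check 1: the upper limit must lie above the lower limit
prof$depth_ok <- prof$top < prof$bottom

# Check 2: each horizon should start where the previous horizon of the
# same profile ended (no gaps, no overlaps)
prev_bottom <- ave(prof$bottom, prof$id_prof,
                   FUN = function(b) c(NA, head(b, -1)))
prof$contiguous <- is.na(prev_bottom) | prof$top == prev_bottom

# Check 3: flag pH values outside an assumed plausible range;
# missing values are not flagged here
prof$ph_ok <- is.na(prof$ph_h2o) |
  (prof$ph_h2o >= 2 & prof$ph_h2o <= 11)

# Rows that fail at least one check (none for this subset)
print(prof[!(prof$depth_ok & prof$contiguous & prof$ph_ok), ])
```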

5.3.1 Set the scene (set working directory, packages, load data)

Let’s open RStudio. Whenever starting to work on a project or task, it is necessary to set the working directory (WD). The WD is the folder that R uses by default to read input files and to save output, for instance a plot or a table generated while working in R. The WD is therefore central, since it determines where files and results can be found afterwards. There are multiple ways of setting the WD. One option is to click on the ‘Session’ menu > ‘Set Working Directory’ and select either ‘To Source File Location’ (the WD then corresponds to the folder in which the script is saved) or ‘Choose Directory…’, which lets the user browse to the folder that should become the WD.

In this manual we propose an alternative that allows for more customization and flexibility, since sometimes multiple WDs are needed, for instance to save the final map in a different folder than the covariates. Since file paths differ depending on where the files are stored on your computer, it is crucial to identify the correct file path. This can be done in the file explorer: browse to your training material folder and then right-click on the bar highlighted in red in Figure 5.1.

Figure 5.1: Get file path from file explorer.

The file path will appear in the following format: C:\Users\GSNmap-TM\Digital-Soil-Mapping. To enable R to read this as a file path, each backslash (\) must be replaced by a forward slash (/). The resulting file path should look similar to this one: C:/Users/GSNmap-TM/Digital-Soil-Mapping. The file path can then be assigned to an R object: the character value (in this case the file path) is written on the right side of the assignment arrow (<-) and the name of the R object (wd) on the left side (see code). Finally, the function setwd() sets the WD to the file path stored in the object wd.

# 0 - User-defined variables ====================================

wd <- 'C:/Users/hp/Documents/GitHub/Digital-Soil-Mapping'
#wd <- "C:/GIT/Digital-Soil-Mapping"

# 1 - Set working directory and load necessary packages =========
setwd(wd) # change the path accordingly
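The replacement of backslashes can also be left to R. The sketch below converts the example path from the text and then builds the path to a data file with file.path(), which avoids typing separators by hand. The object names `path_windows`, `wd_example` and `chem_file` are chosen for illustration only.

```r
# Path as copied from the Windows file explorer. Inside a regular R
# string, each backslash has to be typed twice
path_windows <- "C:\\Users\\GSNmap-TM\\Digital-Soil-Mapping"

# Replace every backslash with a forward slash
wd_example <- gsub("\\", "/", path_windows, fixed = TRUE)
print(wd_example)

# Build the path to a file in the 01-Data folder
chem_file <- file.path(wd_example, "01-Data", "soil_chem_data030.csv")
print(chem_file)
```

Storing such paths in separate objects is one way to work with several folders, for example one for input data and one for outputs, without changing the WD repeatedly.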

An alternative and more automatic approach for setting the working directory makes use of the rstudioapi package, which provides functions for interacting with RStudio’s API (Application Programming Interface). The code below first sets the working directory to the directory containing the active document (in this case the script). Then, the second setwd() call changes the working directory to the parent directory using the “..” specification.

# Set the working directory automatically using rstudioapi.
# Note that the directory is first set to the folder
# containing the script
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# ".." moves up one directory
# (in this case to our project folder)
setwd("..")
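The automatic approach only works inside RStudio and with a script that has already been saved. The following sketch wraps the same calls in a hypothetical helper, `get_script_dir()`, which returns NA instead of failing when either condition is not met. The helper name and the fallback message are assumptions for illustration.

```r
# Hypothetical helper: returns the folder of the active script, or NA if
# it cannot be determined (rstudioapi not installed, R not running inside
# RStudio, or the script has not been saved yet)
get_script_dir <- function() {
  if (!requireNamespace("rstudioapi", quietly = TRUE)) {
    return(NA_character_)
  }
  if (!rstudioapi::isAvailable()) {
    return(NA_character_)
  }
  path <- rstudioapi::getActiveDocumentContext()$path
  if (!nzchar(path)) {
    return(NA_character_)
  }
  dirname(path)
}

script_dir <- get_script_dir()

if (is.na(script_dir)) {
  message("Script folder not found; working directory left at: ", getwd())
} else {
  # Parent of the script folder, equivalent to a subsequent setwd("..")
  setwd(dirname(script_dir))
}
```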

In addition to the built-in base R functions, there is a vast number of so-called packages that extend the functionality of R and allow it to be used for a broad range of purposes. For data handling and management, the tidyverse package and its dependencies are a great help. Packages are loaded into the R session with the library() function. If a package is not yet installed, it must first be installed with the install.packages() function.
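Whether a package still needs to be installed can be checked before loading it. The sketch below uses a hypothetical helper, `missing_packages()`, built on the base function requireNamespace(). The installation call is left commented out so that nothing is downloaded unintentionally.

```r
# Hypothetical helper: returns those packages from `pkgs` that are not
# installed on this computer
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this section
needed <- c("tidyverse", "readxl")

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```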

# install.packages("tidyverse")  # run once if the package is not installed
library(readxl)
library(tidyverse)
library(dplyr)

# load in data
data <- 
  read_csv("Digital-Soil-Mapping/01-Data/soil_chem_data030.csv")

## Rows: 119 Columns: 6
## ── Column specification ──────────────────────────────────
## Delimiter: ","
## chr (1): LabID
## dbl (5): x, y, p_bray, k, tn
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 6
##   LabID     x     y p_bray     k    tn
##   <chr> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 51    -61.5 -37.4   20.4  852. 0.225
## 2 60    -57.8 -37.9   10.5  770. 0.301
## 3 64    -58.9 -38.5   15.9  992. 0.266
## 4 67    -60.3 -38.5   20.8  740. 0.179
## 5 68    -60.4 -38.5   13.5  725. 0.168
## 6 69    -60.4 -38.5   46.2  699. 0.129
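The import message above reports that LabID was read as text (chr) and the other columns as numbers (dbl), and suggests specifying the column types. The idea can be sketched in base R, so that the example runs without additional packages: the column classes are declared at import time through the colClasses argument of read.csv(). With read_csv(), the corresponding argument is col_types. The temporary file and the object name `chem` are assumptions for illustration.

```r
# Write the first rows of Table 5.1 to a temporary CSV file so that the
# sketch runs without the training material
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "LabID,x,y,p_bray,k,tn",
  "51,-61.51282,-37.37646,20.40,852.17,0.22",
  "60,-57.84725,-37.85136,10.52,769.55,0.30",
  "64,-58.87620,-38.54000,15.87,992.41,0.27"
), tmp)

# Declare the identifier as text and the five remaining columns as numbers
chem <- read.csv(tmp, colClasses = c("character", rep("numeric", 5)))

# Inspect the structure and summarise one of the soil properties
str(chem)
summary(chem$p_bray)
```

Declaring the types explicitly makes the import fail early if a numeric column contains unexpected text, which is often a sign of a data entry problem.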

For further guidance and more in-depth techniques to manage and handle soil data in R, we recommend the GSP GitHub repository on soil database management: FAO-GSP Soil DB. It provides training data as well as extensive example code. For now, we continue working with the example dataset and assume that the dataset you are using complies with the format specified in Section 5.2.

References

Olmedo, G., Rodriguez, D. & Angelini, M. 2017. Advances in digital soil mapping and soil information systems in Argentina. GlobalSoilMap, pp. 13–16. CRC Press.

Sainz Rozas, H.R., Eyherabide, M., Larrea, G.E., Martinez Cuesta, N., Angelini, H.P., Reussi Calvo, N.I. & Wyngaard, N. 2019. Relevamiento y determinación de propiedades químicas en suelos de aptitud agrícola de la región pampeana. Fertilizar, Argentina, 8(9): 12.