Simulating different configurations of field plots for modeling aboveground biomass from airborne laser scanning (ALS)

Journals: Forest Ecology and Management? Biogeosciences? Methods in Ecology and Evolution?

Aim: This analysis aims to help GEO-TREES evaluate whether proposed sites have sufficient plot data, and to guide the placement of new plots, by investigating the following research questions:

(1) What is the minimum quantity of field plot data needed to represent landscape-level variation in ALS-derived forest structure?

(2) How does this minimum differ for random sampling compared to optimized sampling?

(3) What is the minimum quantity of field plot data needed to fit robust predictive models of aboveground biomass density within a subset of the landscape?

Methods: We analyze high-quality ALS datasets (> 1,000 ha) and very large field-plot measurement datasets (> 80 ha of stem measurements) at five data “hotspots” covering a range of tropical vegetation types:

  • Barro Colorado Island, Panama
  • Mpala, Kenya
  • Huai Kha Khaeng, Thailand
  • Paracou, French Guiana
  • Danum Valley, Malaysia

For question 1, we simulate two random sampling scenarios, representing clustered and distributed configurations of plot data:

  • A single, large, rectangular plot (common in ForestGEO)
  • Distributed, non-contiguous, 1-ha plots (common in ForestPlots.net)

For each of these scenarios, we vary the amount of plot data in the simulations from 3 to 80 ha.

Then we calculate the Mahalanobis distance of each pixel to the plots in relative structural space, using four structural metrics (an illustrative sketch follows the list):

  • Mean height of canopy returns,
  • Canopy cover above 2 m,
  • Variance in height of canopy returns,
  • The coefficient of variation of the leaf-area density profile.
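
The ALS metrics themselves come from the STAGE 1 processing pipeline. As a rough illustration only, approximate versions of the four metrics can be computed from a height-normalized point cloud with lidR; the voxel-based variants used in this analysis differ in detail, and the file path below is hypothetical.

# Illustrative only: approximate versions of the four structural metrics,
# computed with lidR at the 50-m (0.25-ha) pixel size used in this analysis.
library(lidR)

structural_metrics = function(z) {
    canopy = z[z > 2]                                # returns above the 2 m cutoff
    lad = LAD(z, dz = 1, z0 = 2)$lad                 # leaf-area density profile
    list(zmean = mean(canopy),                       # mean height of canopy returns
         pzabove2 = 100 * mean(z > 2),               # canopy cover above 2 m (%)
         zvar = var(canopy),                         # variance of canopy return heights
         lad_cv = sd(lad, na.rm = TRUE) /
                  mean(lad, na.rm = TRUE))           # CV of the LAD profile
}

las = readLAS("normalized_points.laz")               # hypothetical input file
metrics = pixel_metrics(las, ~structural_metrics(Z), res = 50)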

Mahalanobis distances are multidimensional Z-scores, so each pixel's distance can be expressed as a number of standard deviations from the sample distribution. To assess how well each plot simulation represents structure across the landscape, we quantify the proportion of pixels more than 2 or 3 standard deviations from the plot distribution, and plot this proportion against total plot area to identify the minimum viable plot size for each landscape, as sketched below.
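
In essence (a minimal sketch; `plots` and `landscape` are hypothetical stand-ins for the dataframes of structural metrics built later in this notebook):

# distance of every landscape pixel to the plot-pixel distribution, in SD units
vars = c("zmean", "pzabove2", "zvar", "lad_cv")      # the four structural metrics

m_d = sqrt(mahalanobis(x = landscape[vars],          # all pixels...
                       center = colMeans(plots[vars]), # ...relative to the plot mean
                       cov = cov(plots[vars])))        # and plot covariance

# proportion of the landscape poorly represented by the plots
prop_beyond = c("≥ 2 SD" = mean(m_d >= 2), "≥ 3 SD" = mean(m_d >= 3))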

For question 2, we compare the random sampling scenarios from question 1 with two realistic optimized sampling scenarios, in which plots are placed to maximize structural representativeness (a greedy sketch follows the list):

  • Iterative, optimized placement of 1-ha plots (realistic for establishing a new site),
  • A single, randomly placed rectangular plot (10 or 25 ha) with optimized placement of additional 1-ha plots (realistic for adding supplemental plots to established ForestGEO sites).
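
A sketch of one possible greedy implementation is below. It is an assumption, not the final method: `domain` stands for a dataframe of scaled structural metrics for all pixels, `candidates` for a list of row-index sets marking feasible 1-ha plot positions, and `selected` must start with enough pixels (e.g. an existing large plot) for a non-singular covariance matrix.

# Hedged sketch of iterative (greedy) optimized placement: at each step, add the
# candidate 1-ha plot that most reduces the landscape-mean Mahalanobis distance.
vars = c("zmean.scale", "pzabove2.scale", "zvar.scale", "lad_cv.scale")

mean_mdist = function(domain, sel) {
    ref = domain[sel, vars]
    mean(sqrt(mahalanobis(domain[vars], colMeans(ref), cov(ref))))
}

greedy_place = function(domain, candidates, n_new, selected) {
    for (k in seq_len(n_new)) {
        # score every remaining candidate by the landscape-mean distance
        # that would result from adding it to the selection
        scores = vapply(candidates,
                        function(idx) mean_mdist(domain, union(selected, idx)),
                        numeric(1))
        best = which.min(scores)
        selected = union(selected, candidates[[best]])
        candidates = candidates[-best]
    }
    selected
}

Minimizing the landscape-mean Mahalanobis distance at each step mirrors the representativeness criterion from question 1.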

For question 3, we use random spatial subsets of existing plot data with known AGBD estimates to fit biomass models.

As in question 1, we will repeat this analysis under two scenarios, firstly using a single large rectangular plot, and secondly using distributed 1-ha plots.

We vary the area of total simulated plot data from 3 ha to the maximum area that can fit within the plot data while still allowing for test data. For each random simulation of plots (training fold), we apply a buffer of 50 m (one subplot) and use the remaining unsampled subplots as the test data. We fit predictive models of AGBD with each training fold and assess each model with the testing fold. We then examine the relationship between the total area of 1-ha or single plots and the goodness-of-fit of the biomass model fitted to the plot data, including bias (mean residual error), RMSE, and MAE, computed as shown below.
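
These summaries are simple functions of the test-fold residuals (observed minus bias-corrected predicted AGBD), matching the calculations in df_eval_q3 below:

# goodness-of-fit summaries for one test fold; `residual` is a hypothetical
# stand-in for the residual column produced by the model-fitting code below
bias = mean(residual, na.rm = TRUE)          # mean residual error
mae = mean(abs(residual), na.rm = TRUE)      # mean absolute error
rmse = sqrt(mean(residual^2, na.rm = TRUE))  # root mean squared error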

Results:

We use changepoint analysis to identify sills in the relationships between plot area and structural representativeness, and between plot area and model performance, as sketched below.
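
A minimal sketch of the sill detection, assuming the changepoint package (not among the packages loaded in this notebook) and the df_eval_q1 summary built below:

library(changepoint)
library(dplyr)

# mean trend of one representativeness metric against simulated plot area
curve = df_eval_q1 |>
    filter(scenario == "1-ha plots", metric == "Pixels ≥ 2 SD (%)") |>
    group_by(size_ha) |>
    summarize(value = mean(value), .groups = "drop") |>
    arrange(size_ha)

# a single change in mean (at-most-one-change); the sill is the plot area
# at which the relationship levels off
cpt = cpt.mean(curve$value, method = "AMOC")
sill_ha = curve$size_ha[cpts(cpt)]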

Recommendations for minimum plot sizes under different scenarios.

Proof-of-concept of optimized plot placement for new or supplemental plots at AGBD calibration and validation sites.

# site-specific information, defined by the user:
# specify metadata names

#-------------------------------------------------------
site = "PanamaCanal"
type = "ALS"
platform = "Aircraft"
date = "20230526"
# set the coordinate reference system for the site
CRS_site = 32617
  • conda install r-ggh4x r-hmisc
  • conda install ./environments/r-lasr-0.17.0-r43hecca717_0.conda
# these packages couldn't be found by conda, so install them with R temporarily

# install.packages("remotes")
# remotes::install_github("njtierney/broomstick")
# install.packages("ggh4x")
# remotes::install_github("tidymodels/corrr")
# remotes::install_github("jrnold/resamplr")
# remotes::install_github("ptompalski/lidRmetrics")
# remotes::install_github("r-lidar/lasR", ref = "pdt")
# remotes::install_github("DavisVaughan/furrr")
# install.packages("anticlust")
# install.packages("numbers")
# load in the required R packages

suppressMessages({
    # library(resamplr)
    library(broom)
    # library(broomstick)
    library(ggpubr)
    library(sf)
    library(future)
    library(terra)
    library(lidR)
    library(RCSF)
    library(RStoolbox)
    library(BIOMASS)
    library(httr2)
    library(randomForest)
    library(tidyterra)
    library(ggh4x)
    library(exactextractr)
    # library(furrr)
    library(tidyverse)
    # library(lidRmetrics)
    # library(spatialsample)
    # library(anticlust)
    library(modelr)
    library(spatstat)
    library(lasR)

})
options(repr.matrix.max.cols = 400, repr.matrix.max.rows = 100)

# define some variables for mapping in ggplot

theme_map = theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_blank(),
        axis.text.x = element_text(angle = 90),
        axis.ticks = element_blank(),
        legend.position = "inside",
        legend.position.inside = c(.98, .02),
        legend.justification = c(1, 0),
        strip.background = element_rect(fill = "white"),
        legend.key = element_rect(fill = "white", color = "white"),
        legend.title = element_text(hjust = 0.5),
        legend.direction = "horizontal",
        legend.key.width = unit(.5, "in"),
        legend.key.height = unit(.1, "in"),
        legend.background = element_rect(color = "black", fill = "white", linewidth = .3),
        legend.margin = ggplot2::margin(7, 12, 7, 12))

cols_gradient <- c(low = "#002070", mid = "#DD6677", high = "#F1F1A1")
cols_gradient_greyscale <- c(low = "#111111", mid = "#666666", high = "#FFFFFF")
cols_gradient_biomass <- c(low = "#002070", mid = "#309040", high = "#D0E358")
cols_gradient_metrics <- c(low = "#00333A", mid = "#637F8A", high = "beige")

Table 1. Potential study sites

| Site | ALS acquisition date | Plot | Size (ha) | Census date | DBH cutoff (cm) | Forest type |
| --- | --- | --- | --- | --- | --- | --- |
| Barro Colorado Nature Monument | May 2023 | ForestGEO plot | 50 | 2022 | 1 | Old forest |
| | | Gigante fertilization plot | 38.4 | 2023 | 20 (10, 1 in places) | Old forest, partially fertilized |
| | | P12 | 1 | Nov 2023 | 1 | Old forest |
| | | P14 | 1 | Oct/Nov 2023 | 1 | Old forest |
| | | Secondary forest plots | ? | 2024? | ? | Young forest |
| Mpala | Feb 2025 | ForestGEO plot | 100 | 2025 | 1 | Structurally distinct savannas on black cotton soil, transition zone, and red soil |
| Paracou | ? | 25-ha plot | 25 | Every 5 years | 10 | Old forest, mix of upland and lowland |
| | | Biodiversity plots | 18.75 | Every 2 years? | 10 | Old forest |
| | | Disturbance experiment | 75 (12 × 6.25) | Every 2 years | 10 | Selective logging, thinning, and control plots |
| Huai Kha Khaeng | Oct 2025? | ForestGEO plot | 50 | ? | 1 | Seasonal evergreen and deciduous forest |
| | | Kapook Kapieng | 16 | ? | 1 | Seasonal deciduous forest |
| | | Huai Krading | 16 | ? | 1 | Seasonal evergreen and deciduous forest |

resample_1ha_plots = function (data, size_ha, drop_buffer) 
{

    data = data |>
        mutate(p_x = as.numeric(str_split_i(subplot_id, "_", -2)),
              p_y = as.numeric(str_split_i(subplot_id, "_", -1))) |>
        group_by(plot_id) |>
        mutate(plot_size_ha = n_distinct(subplot_id) / 4,
               # the max dimensions of each plot, needed later to prevent failure
               # add one because we label from 0 
              plot_max_y = max(p_y + 1),
              plot_max_x = max(p_x + 1))
    
    n <- nrow(data)
    idx = seq_len(n)

    # subplots are 0.25 ha (50 m × 50 m), so 4 subplots per hectare
    n_subplots = size_ha * 4

    dim_x = 2
    dim_y = 2

    # empty vectors to start the loop
    train_merge = c()
    buffer_merge = c()
    exclude_merge = c()

    n_restart = 1
    
    while (length(train_merge) < n_subplots) {
    
        subplots_p = which(data$p_x >= (dim_x - 1) & 
                           data$p_y >= (dim_y - 1) & 
                           data$plot_size_ha >= 1) |>
            setdiff(exclude_merge)

        # if subplots_p is empty, it means that there is no solution to achieve the desired number of plots
        # in this case, start over
        # if size_ha is too large, it may be impossible; allow 100,000 restarts
        
        if (length(subplots_p) == 0) {

            if (n_restart > 100000) {
                stop("100000 failed attempts; size_ha is probably too large for the plot data provided")
                }
            
                n_restart = n_restart + 1
                train_merge = c()
                buffer_merge = c()
                next 
        }
        
        # random seed points
        # sample does not produce a random sample from a single element
        if (length(subplots_p) == 1) {
            seed = subplots_p
            } else {
            seed = sample(subplots_p, 1)
            }
        
        seed_x = data$p_x[seed]
        seed_y = data$p_y[seed]
        seed_plot_id = data$plot_id[seed]
    
        # the index of the selected subplot
        train = which((data$p_x > (seed_x - dim_x) & data$p_x <= seed_x) &
                     (data$p_y > (seed_y - dim_y) & data$p_y <= seed_y) & 
                      data$plot_id == seed_plot_id)

        # the 2 × 2 window ran off the edge of the plot; try another seed
        if (length(train) < 4) {
            next
        }
        
        buffer = which((data$p_x > (seed_x - dim_x - 1) & data$p_x <= (seed_x + 1)) &
                     (data$p_y > (seed_y - dim_y - 1) & data$p_y <= (seed_y + 1)) & 
                      data$plot_id == seed_plot_id)
        
        exclude = which((data$p_x > (seed_x - dim_x - 1) & data$p_x <= (seed_x + 2)) &
                     (data$p_y > (seed_y - dim_y - 1) & data$p_y <= (seed_y + 2)) & 
                      data$plot_id == seed_plot_id)

    # merge with past cycles
    train_merge = c(train_merge, train) 
    buffer_merge = c(buffer_merge, buffer)
    exclude_merge = c(exclude_merge, exclude)
      
    }
        
    if (isFALSE(drop_buffer)) {
        test = resample(data, setdiff(idx, train_merge))
        } else {
        test = resample(data, setdiff(idx, buffer_merge))
        }
    
    train = resample(data, train_merge)
    
    list(test = test, train = train)
}



crossv_1ha_plots = function (data, n, size_ha, id = ".id", drop_buffer = TRUE) 
{
    if (!is.numeric(n) || length(n) != 1) {
        stop("`n` must be a single integer.", call. = FALSE)
    }
    if (!is.numeric(size_ha) || size_ha < 1) {
        stop("`size_ha` must be a value between 0 and 1.", call. = FALSE)
    }
    
    runs <- purrr::map(seq_len(n), ~resample_1ha_plots(data = data, size_ha = size_ha, drop_buffer = drop_buffer))
    cols <- purrr::transpose(runs)
    cols[[id]] <- vctrs::vec_group_id(seq_len(n))
    tibble::as_tibble(cols) |>
        mutate(size_ha = size_ha,
              scenario = "1-ha plots")
}
# set.seed(1)

resample_plot = function (data, size_ha, drop_buffer) 
{

    data = data |>
        mutate(p_x = as.numeric(str_split_i(subplot_id, "_", -2)),
              p_y = as.numeric(str_split_i(subplot_id, "_", -1))) |>
        group_by(plot_id) |>
        mutate(plot_size_ha = n_distinct(subplot_id) / 4,
               # the max dimensions of each plot, needed later to prevent failure
               # add one because we label from 0 
              plot_max_y = max(p_y + 1),
              plot_max_x = max(p_x + 1))
    
    n <- nrow(data)
    idx = seq_len(n)

    # subplots are 0.25 ha (50 m × 50 m), so 4 subplots per hectare
    n_subplots = size_ha * 4

    # get the integers by which the number of subplots is divisible
    divisors = numbers::divisors(n_subplots)
    
    # plots must be at least 100 m wide, so drop division by 1
    divisors = head(tail(divisors, -1), -1)

    # avoid super long, skinny plots; drop division by 2 if there are other options
    if (length(divisors) > 2) {
        divisors = head(tail(divisors, -1), -1)}

    # plots also can't be longer than either of the dimensions of the plot
    divisors = divisors[divisors <= max(data$plot_max_x) & 
                       (n_subplots / divisors) <= max(data$plot_max_y)]

    # empty vector to start the while loop
    train = c()
    
    while (length(train) < n_subplots) {
    
    if (length(divisors) == 0) {
                stop("impossible to fit such a large plot in the given dataset")
                }
    
    # x dimension of the plot
    # sample does not produce a random sample from a single element
    # for a plot size of 1 ha, there is only a single divisor, 2
    if (length(divisors) == 1) {
        dim_x = divisors
        } else {
        dim_x = sample(divisors, 1)
        }
        
    # y dimension of the plot
    dim_y = n_subplots / dim_x
    
    subplots_p = which(data$p_x >= (dim_x - 1) & 
                       data$plot_max_x >= dim_x & 
                       data$p_y >= (dim_y - 1) & 
                       data$plot_max_y >= dim_y & 
                       data$plot_size_ha >= size_ha)
    
    # random seed points
    # sample does not produce a random sample from a single element
    if (length(subplots_p) == 1) {
            seed = subplots_p
    } else {
            seed = sample(subplots_p, 1)
            }

    seed_x = data$p_x[seed]
    seed_y = data$p_y[seed]
    seed_plot_id = data$plot_id[seed]
    
    train = which((data$p_x > (seed_x - dim_x) & data$p_x <= seed_x) &
                 (data$p_y > (seed_y - dim_y) & data$p_y <= seed_y) & 
                  data$plot_id == seed_plot_id)
    }
        
    
    exclude = which((data$p_x > (seed_x - dim_x - 1) & data$p_x <= (seed_x + 1)) &
                     (data$p_y > (seed_y - dim_y - 1) & data$p_y <= (seed_y + 1)) & 
                      data$plot_id == seed_plot_id)
    
    if (isFALSE(drop_buffer)) {
        test = resample(data, setdiff(idx, train))
        } else {
        test = resample(data, setdiff(idx, exclude))
        }
    
    train = resample(data, train)
    
    list(test = test, train = train)
}



crossv_plot = function (data, n, size_ha, id = ".id", drop_buffer = TRUE) 
{
    if (!is.numeric(n) || length(n) != 1) {
        stop("`n` must be a single integer.", call. = FALSE)
    }
    if (!is.numeric(size_ha) || size_ha < 1) {
        stop("`size_ha` must be a value between 0 and 1.", call. = FALSE)
    }
    
    runs <- purrr::map(seq_len(n), ~resample_plot(data = data, size_ha = size_ha, drop_buffer = drop_buffer))
    cols <- purrr::transpose(runs)
    cols[[id]] <- vctrs::vec_group_id(seq_len(n))
    tibble::as_tibble(cols) |>
        mutate(size_ha = size_ha,
              scenario = "Single plot")
}


# test = subs |>
#     # filter(plot_id == "Panama Canal:BCI 50 ha plot") |>
#     mutate(p_x = as.numeric(str_split_i(subplot_id, "_", -2)),
#           p_y = as.numeric(str_split_i(subplot_id, "_", -1))) |>
#     crossv_test()

# ggplot() +
#     geom_sf(data = subs, fill = "grey95"
#            ) +
#     geom_sf(data = as_tibble(test$train[[1]]), fill = "red4", alpha = .7) +
#     geom_sf(data = as_tibble(test$test[[1]]), fill = "steelblue", alpha = .5) +
#     # geom_sf(data = as_tibble(test$train[[3]]), fill = "forestgreen", alpha = .5) +
#     # geom_sf(data = as_tibble(test$train[[4]]), fill = "orchid4", alpha = .5) +
#     theme_bw()
# resample_random = function (data, size_ha) 
# {
#     # if (!is.numeric(p) || length(p) < 2 ) {
#     #     stop("`p` must be a named numeric vector with at least two values.")
#     # }
#     # if (abs(sum(p) - 1) > 1e-06) {
#     #     message("Rescaling `p` to sum to 1.")
#     # }

#     data = data |>
#         mutate(p_x = as.numeric(str_split_i(subplot_id, "_", -2)),
#               p_y = as.numeric(str_split_i(subplot_id, "_", -1))) |>
#         group_by(plot_id) |>
#         mutate(plot_size_ha = n_distinct(subplot_id) / 4,
#                # the max dimensions of each plot, needed later to prevent failure
#                # add one because we label from 0 
#               plot_max_y = max(p_y + 1),
#               plot_max_x = max(p_x + 1))
    
#     n <- nrow(data)
#     idx = seq_len(n)

#     n_subplots = size_ha * 4

#     # empty vectors to start the loop
#     train_merge = c()
#     buffer_merge = c()
    
#     while (length(train_merge) < n_subplots) {
    
#         subplots_p = idx |>
#             setdiff(train_merge)
        
#         # random seed points
#         # sample does not produce a random sample from a single element
#         if (length(subplots_p) == 1) {
#             seed = subplots_p
#             } else {
#             seed = sample(subplots_p, 1)
#             }
        
#         seed_x = data$p_x[seed]
#         seed_y = data$p_y[seed]
#         seed_plot_id = data$plot_id[seed]
    
#         # the index of the selected subplot
#         # requires adjacent plots to be merged!

#         train = which(data$p_x == seed_x & 
#                       data$p_y == seed_y &
#                      data$plot_id == seed_plot_id)
        
#         buffer = which(data$p_x %in% c(seed_x, seed_x - 1, seed_x + 1) & 
#                        data$p_y %in% c(seed_y, seed_y - 1, seed_y + 1) &
#                      data$plot_id == seed_plot_id)

#     # merge with past cycles
#     train_merge = c(train_merge, train) 
#     buffer_merge = c(buffer_merge, buffer)
      
#     }
        
#     # train = seed + seq_len(n_subplots)
#     list(test = resample(data, setdiff(idx, buffer_merge)),
#          train = resample(data, train_merge))
# }



# crossv_random = function (data, n, size_ha, id = ".id") 
# {
#     if (!is.numeric(n) || length(n) != 1) {
#         stop("`n` must be a single integer.", call. = FALSE)
#     }
#     if (!is.numeric(size_ha) || size_ha < 1) {
#         stop("`size_ha` must be a value between 0 and 1.", call. = FALSE)
#     }
    
#     runs <- purrr::map(seq_len(n), ~resample_random(data = data, size_ha = size_ha))
#     cols <- purrr::transpose(runs)
#     cols[[id]] <- vctrs::vec_group_id(seq_len(n))
#     tibble::as_tibble(cols) |>
#         mutate(size_ha = size_ha,
#               scenario = "Random")
# }

Prepare the data for Question 1:

  • metrics_landscape 0.25-ha ALS metrics raster for all usable pixels of the acquisition, from STAGE 1.
  • domain_landscape ALS landscape metrics, as a dataframe, with predictors scaled and cell coordinates as integers.
  • runs_q1 a nested dataframe with randomly generated plots as train-test splits, as list-columns.
  • mdist_q1 a nested dataframe with Mahalanobis distances and thresholds for each split.
  • df_plot_q1 one random plot selected as an example to be plotted in the figures.

metrics_landscape = rast("../ALS_output/PanamaCanal/ALS/Aircraft_20230526/BCI_Gigante/output/_06_metrics_raster/predictors_50m.tif") 

# ALS_coords = as_coordinates(ALS_metrics, as.raster = TRUE)

domain_landscape = metrics_landscape |>
    as_tibble(xy = TRUE) |>
    mutate(plot_id = "BCI") |>
    mutate(p_x = x / 50,
          p_y = y / 50) |>
    unite("subplot_id", p_x:p_y, remove = FALSE) |>
    drop_na(zmean.canopy.vox) |>
    filter(zq99.canopy.vox < 100 & lad_cv.aboveground.vox < 3) |>
    # cap canopy cover just below 100% so that gap fraction (Pf) stays positive
    mutate(pzabove2.aboveground.vox = if_else(pzabove2.aboveground.vox < 100, pzabove2.aboveground.vox, (n.canopy.vox / (n.canopy.vox + 1)) * 100),
           Pf.aboveground.vox = 100 - pzabove2.aboveground.vox) %>%
    mutate(zmean.scale = scale(zmean.canopy.vox),
          zq99.scale = scale(zq99.canopy.vox),
          pzabove2.scale = scale(pzabove2.aboveground.vox),
          zvar.scale = scale(zvar.canopy.vox),
          lad_cv.scale = scale(lad_cv.aboveground.vox))
# ALS data

set.seed(2)

run_list = list()
x_ha = seq(3, 80, 1)
n_repeat = 12


for (i in seq_along(x_ha)) {

    x = x_ha[i]

    subs_plot = domain_landscape |> crossv_plot(n = n_repeat, size_ha = x, drop_buffer = FALSE)

    subs_1ha_plots = domain_landscape |> crossv_1ha_plots(n = n_repeat, size_ha = x, drop_buffer = FALSE)

    df = bind_rows(subs_plot, subs_1ha_plots)

    run_list[[i]] = df
}


runs_q1 = do.call("rbind", run_list)
# compute Mahalanobis distances for every train-test split

# the four scaled structural metrics used for the distance calculation
# (zq99.scale is excluded)
vars_structure = c("zmean.scale", "pzabove2.scale", "zvar.scale", "lad_cv.scale")

# distance of each row of `data` to the distribution of `reference`,
# expressed in standard deviations, with 2 and 3 SD threshold classes
add_mdist = function(data, reference) {
    ref = dplyr::select(reference, all_of(vars_structure))
    data |>
        mutate(m_d = sqrt(mahalanobis(x = dplyr::select(data, all_of(vars_structure)),
                                      center = colMeans(ref), # mean vector of the reference
                                      cov = cov(ref))),       # covariance matrix of the reference
               threshold = case_when(m_d < 2 ~ "< 2 SD",
                                     m_d < 3 ~ "< 3 SD",
                                     m_d >= 3 ~ "≥ 3 SD"))
}

mdist_q1 = runs_q1 %>%
    mutate(train = map(train, ~ungroup(as_tibble(.x))),
           test = map(test, ~ungroup(as_tibble(.x)))) %>%
    # training pixels are compared to their own distribution; test (landscape)
    # pixels are compared to the training (plot) distribution
    mutate(train = map(train, ~add_mdist(.x, .x))) %>%
    mutate(test = map2(test, train, ~add_mdist(.x, .y)))
df_plot_q1 = mdist_q1 |>
    filter(size_ha == 16 & .id == 11) |>
    mutate(train = map(train, ~mutate(as_tibble(.x), subset = "train"))) |>
    mutate(test = map(test, ~mutate(as_tibble(.x), subset = "test"))) |>
    mutate(data = map2(train, test, ~bind_rows(.x, .y))) |>
    unnest(data) |>
    dplyr::select(-.id, -test, -train)
# data frame for plotting trends in structural representativeness
df_eval_q1 = mdist_q1 %>%
        dplyr::select(-train) %>%
        unnest(test) %>%
        group_by(.id, size_ha, scenario) |>
        # add the total number of pixels, to calculate percentages later
        mutate(n_total = n(),
              `Mean Mahalanobis distance` = mean(m_d)) |>
        group_by(.id, size_ha, scenario, threshold, n_total, `Mean Mahalanobis distance`) |>
        tally() |>
        mutate(n = n / n_total) |>
        pivot_wider(names_from = "threshold", values_from = "n") |>
        # change to percent pixels greater than 2 SD
        mutate(`Pixels ≥ 2 SD (%)` = 1 - `< 2 SD`,
              `Pixels ≥ 3 SD (%)` = `≥ 3 SD`) |>
        pivot_longer(c(`Mean Mahalanobis distance`, `Pixels ≥ 2 SD (%)`, `Pixels ≥ 3 SD (%)`),
                    values_to = "value", names_to = "metric") |>
        dplyr::select(.id:n_total, metric, value)

Prepare the data for Question 3:

  • subplots_summ 0.25-ha summary metrics from field measurements, from John's script.
  • metrics_subplots combined field metrics and ALS metrics for all subplots, with predictors scaled.
  • runs_q3 a nested dataframe with randomly generated plots as train-test splits, as list-columns.
  • models_q3 a nested dataframe with predictions and residuals from AGBD models for each split.
  • df_plot_q3 one random plot selected as an example to be plotted in the figures.
  • df_eval_q3 a dataframe with each split summarized as the mean Mahalanobis distance and percent of pixels past the threshold.

# read in the ALS metrics for subplots

# path for the subplot-level field measurement outputs
subplots_summ = read.csv("/projects/my-private-bucket/GEO-TREES/data/PanamaCanal/plots/processed/sub_summ.csv") 

# paths for the subplot-level ALS metrics
dir_subplots_list = Sys.glob(paste0("../ALS_output/", site, "/", type, "/", platform, "_", date, "/*/output/_08_metrics_subplots_dataframe/subplots_metrics.rds"))

# subplot metrics as a single data frame
subplots_df = dir_subplots_list |>
    map(~readRDS(.x) |> mutate(subplot_id = as.character(subplot_id))) |>
    bind_rows()

# join the field measurements to the ALS metrics
metrics_subplots = subplots_df %>%
    # create a new variable, gap fraction, Pf
    mutate(pzabove2.aboveground.vox = if_else(pzabove2.aboveground.vox < 100, pzabove2.aboveground.vox, (n.canopy.vox / (n.canopy.vox + 1)) * 100),
           Pf.aboveground.vox = 100 - pzabove2.aboveground.vox) %>%
    left_join(subplots_summ,
             by = c("plot_id", "subplot_id")) |>
    st_sf() |>
    drop_na(agb_sum_ha) |>
    # select only plots within BCI_Gigante, for now
    filter(plot_id %in% c("Panama Canal:BCI 50 ha plot", "Panama Canal:Gigante fertilization plot",  "Panama Canal:P12",  "Panama Canal:P14"))
set.seed(5)
run_list = list()
x_ha = 3:25
x_ha = x_ha[!(x_ha %in% c(23, 29, 31, 37, 41))]
n_repeat = 50


for (i in seq_along(x_ha)) {

    x = x_ha[i]

    subs_plot = metrics_subplots |> crossv_plot(n = n_repeat, size_ha = x)

    subs_1ha_plots = metrics_subplots |> crossv_1ha_plots(n = n_repeat, size_ha = x)

    # subs_random = subs |> crossv_random(n = n_repeat, size_ha = x)

    df = bind_rows(subs_plot, subs_1ha_plots)

    run_list[[i]] = df
}


runs_q3 = do.call("rbind", run_list)
# fit AGBD models for every train-test split

models_q3 = runs_q3 %>%
    # multiple models to compare
    mutate(model_zmean = purrr::map(train, 
                               ~lm(log(agb_sum_ha) ~ 
                                              log(zmean.canopy.vox),
                              data = .)),
          model_Coomes = purrr::map(train, 
                               ~lm(log(agb_sum_ha) ~ 
                                            log(zmean.canopy.vox) + 
                                              log(Pf.aboveground.vox),
                              data = .)),
          model_Bouvier = purrr::map(train, 
                               ~lm(log(agb_sum_ha) ~ 
                                              log(zmean.canopy.vox) +
                                              log(Pf.aboveground.vox) +
                                              log(zvar.canopy.vox) + 
                                              log(lad_cv.aboveground.vox),
                              data = .))) %>%
        pivot_longer(model_zmean:model_Bouvier, names_to = "model_name", values_to = "model") %>%
        # extract the parameters, for the full models
        mutate(tidy = map(model, broom::tidy),
                    B0 = tidy %>% map_dbl(function(x) x$estimate[1]),
                    B1 = tidy %>% map_dbl(function(x) x$estimate[2]),
                    B2 = tidy %>% map_dbl(function(x) x$estimate[3]),
                    B3 = tidy %>% map_dbl(function(x) x$estimate[4]),
                    B4 = tidy %>% map_dbl(function(x) x$estimate[5]),
                    p = tidy %>% map_dbl(function(x) nrow(x))                     
                                          ) %>%  
        # convert from a resample object to a data frame
        mutate(train = map(train, ~as_tibble(.x))) %>%
        # add predictions for each model
        mutate(train = map2(train, model, ~modelr::add_predictions(.x, .y))) %>%
        # add residuals for each model
        mutate(train = map2(train, model, ~modelr::add_residuals(.x, .y))) %>%
        # back-transform and bias-correct predictions from the log-log model:
        # pred is on the log scale, so pred2 = exp(pred) * exp(SEE^2 / 2)
        # extract p, the number of parameters
        mutate(train = map(train, ungroup)) %>%
        mutate(train = map2(train, p, ~mutate(.x, 
                                              resid2 = resid^2,
                                              p = .y,
                                              SEE = sqrt((sum(resid2)) / (length(resid2) - p)),
                                              pred2 = exp(pred) * (exp(SEE^2 / 2)),
                                              residual = agb_sum_ha - pred2))) %>%
        # extract the SEE for each model, to be applied to the test data
        mutate(SEE = map(train, ~first(.$SEE))) %>%
        # apply each model to the testing data and produce predictions and residuals
        mutate(test = map2(test, model, ~modelr::add_predictions(as_tibble(.x), .y))) %>%
        mutate(test = map2(test, SEE, ~mutate(.x, 
                                              SEE = .y,
                                              pred2 = exp(pred) * (exp(SEE^2 / 2)),
                                              residual = agb_sum_ha - pred2)))
df_plot_q3 = runs_q3 |>
    filter(size_ha == 16 & .id == 8) |>
    mutate(train = map(train, ~mutate(as_tibble(.x), subset = "train"))) |>
    mutate(test = map(test, ~mutate(as_tibble(.x), subset = "test"))) |>
    mutate(data = map2(train, test, ~bind_rows(.x, .y))) |>
    unnest(data) |>
    dplyr::select(-.id, -test, -train) |>
    st_as_sf()
# data frame for plotting trends in model accuracy 
df_eval_q3 = models_q3 %>%
        dplyr::select(-train, -model, -tidy, -SEE) %>%
        unnest(test) %>%
        group_by(.id, size_ha, scenario, model_name) |>
        summarize(Bias = mean(residual, na.rm = TRUE),
               MAE = mean(abs(residual), na.rm = TRUE),
               RMSE = sqrt(mean(residual^2, na.rm = TRUE)),
               .groups = "drop") |>
    pivot_longer(Bias:RMSE,
                    values_to = "value", names_to = "metric")

Figures.


options(repr.plot.width = 12, repr.plot.height = 12)

fig_landscape = ggplot() +
    geom_tile(data = domain_landscape, aes(x = x, y = y), color = "grey70", fill = NA) +
    geom_tile(data = df_plot_q1, aes(x = x, y = y, fill = subset), alpha = .8) +
    scale_fill_manual(values = c("steelblue3", "indianred")) +
    theme_bw(base_size = 15) +
    scale_x_continuous(expand = expansion(mult = c(.01, .01))) + 
    scale_y_continuous(expand = expansion(mult = c(.01, .01))) +
    theme_map +
    coord_equal() +
    theme(legend.position.inside = c(.02, .98),
          plot.background = element_blank(),
        legend.justification = c(0, 1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.key = element_rect(fill = "white", color = "white"),
        legend.title = element_blank(),
        axis.text = element_blank(),
        axis.text.x = element_blank(),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in")) +
    facet_grid(. ~ scenario)

fig_plot = ggplot() +
    geom_tile(data = domain_landscape, aes(x = x, y = y), fill = "grey75", color = NA) +
    geom_sf(data = metrics_subplots, fill = "grey75") +
    geom_sf(data = df_plot_q3, aes(fill = subset)) +
    scale_fill_manual(values = c("steelblue3", "indianred")) +
    theme_bw(base_size = 15) +
    scale_x_continuous(expand = expansion(mult = c(.01, .01))) + 
    scale_y_continuous(expand = expansion(mult = c(.01, .01))) +
    theme_map +
    # coord_equal() +
    theme(legend.position.inside = c(.02, .98),
          plot.background = element_blank(),
        legend.justification = c(0, 1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.key = element_rect(fill = "white", color = "white"),
        legend.title = element_blank(),
        axis.text = element_blank(),
        axis.text.x = element_blank(),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in")) +
    facet_grid(. ~ scenario)

fig = ggarrange(fig_landscape, fig_plot,
               ncol = 1)

ggsave(plot = fig, 
       filename = "./figures/Figure_2.jpeg", 
       device = jpeg,
       width = 12,
       height = 24,
       units = "in",
       dpi = 150,
       bg = "white",
      create.dir = TRUE)

IRdisplay::display_jpeg(file = "./figures/Figure_2.jpeg")

Figure 2. Conceptual framework for this paper: an example of random 1-ha plots and a single random large plot used to sample (a) variation across the landscape (Questions 1 and 2) and (b) real field-plot data (Question 3).

pred_names = c("zmean.canopy.vox", "pzabove2.aboveground.vox", "lad_cv.aboveground.vox", "zvar.canopy.vox")
metrics_list = list()

cols_gradient_greyscale <- c(low = "#122222", mid = "#666666", high = "snow2")

title_list = c("Mean of canopy heights (m)", 
               "Canopy cover (%)",
               "CV of the LAD profile",
               "Variance of canopy heights (m)")

df_plot_i = df_plot_q1 |>
    arrange(subset) |>
    filter(scenario == "Single plot")

for (i in seq_along(pred_names)) {
        
    metrics_list[[i]] = ggplot() +
    geom_tile(data = df_plot_q1, aes(x = x, y = y, fill = .data[[pred_names[i]]])) +
    geom_tile(data = filter(df_plot_i, subset == "train"), aes(x = x, y = y), fill = "indianred", alpha = .8) +
        coord_equal() +
            scale_fill_gradientn(colors = cols_gradient_greyscale,
                                na.value = "grey70",
                                    name = title_list[[i]]) +
                scale_x_continuous(expand = expansion(mult = c(.02, .02))) + 
               scale_y_continuous(expand = expansion(mult = c(.15, .01))) +
                theme_bw(base_size = 12) +
                theme_map +
                theme(panel.grid.major = element_blank(),
                      panel.grid.minor = element_blank(),
                      plot.title = element_text(hjust = 0.5),
                     axis.text.x = element_blank(),
                     axis.text.y = element_blank()) +
                guides(fill = guide_colourbar(title.position="top"))
    
}

fig_cor1 = df_plot_i |>
    ggplot(aes(x = .data[[pred_names[1]]], y = .data[[pred_names[2]]])) +
    geom_point(alpha = .2, 
               color = cols_gradient_greyscale[2],
               size = 3,
              shape = 19) +
    geom_point(data = filter(df_plot_i, subset == "train"),
               alpha = .9, 
               aes(color = subset),
               size = 3,
              shape = 19) +
    stat_boxplot(aes(x = .data[[pred_names[1]]],
                     fill = subset, 
                     y = (max(.data[[pred_names[2]]]) - min(.data[[pred_names[2]]])) *.1 + max(.data[[pred_names[2]]])),
                 width = (max(df_plot_q1[[pred_names[2]]]) - min(df_plot_q1[[pred_names[2]]])) * .1,
                 position = position_dodge(width = (max(df_plot_i[[pred_names[2]]]) - min(df_plot_i[[pred_names[2]]])) * .12),
               outlier.shape = NA) +
    stat_boxplot(aes(y = .data[[pred_names[2]]],
                     fill = subset, 
                     x = (max(.data[[pred_names[1]]]) - min(.data[[pred_names[1]]])) *.1 + max(.data[[pred_names[1]]])),
                    width = (max(df_plot_q1[[pred_names[1]]]) - min(df_plot_q1[[pred_names[1]]])) * .1,
                 position = position_dodge(width = (max(df_plot_i[[pred_names[1]]]) - min(df_plot_i[[pred_names[1]]])) * .12),
               outlier.shape = NA) +
    coord_cartesian(clip = "off") +
    labs(x = title_list[[1]], y = title_list[[2]]) +
    scale_x_continuous(expand = expansion(mult = c(.02, .02)),
                      breaks = seq(0, 100, 10)) + 
    scale_y_continuous(expand = expansion(mult = c(.02, .02)),
                      breaks = seq(0, 100, 25)) +
    scale_color_manual(values = c("test" = cols_gradient_greyscale[2], "train" = "indianred")) +
    scale_fill_manual(values = c("test" = cols_gradient_greyscale[2], "train" = "indianred")) +
    theme_bw(base_size = 15) +
    theme(aspect.ratio = 1,
         legend.position = "none")

fig_cor2 = df_plot_i |>
    ggplot(aes(x = .data[[pred_names[3]]], y = .data[[pred_names[4]]])) +
    geom_point(alpha = .2, 
               color = cols_gradient_greyscale[2],
               size = 3,
              shape = 19) +
    geom_point(data = filter(df_plot_i, subset == "train"),
               alpha = .9, 
               aes(color = subset),
               size = 3,
              shape = 19) +
    stat_boxplot(aes(x = .data[[pred_names[3]]],
                     fill = subset, 
                     y = (max(.data[[pred_names[4]]]) - min(.data[[pred_names[4]]])) *.1 + max(.data[[pred_names[4]]])),
                 width = (max(df_plot_q1[[pred_names[4]]]) - min(df_plot_q1[[pred_names[4]]])) * .1,
                 position = position_dodge(width = (max(df_plot_i[[pred_names[4]]]) - min(df_plot_i[[pred_names[4]]])) * .12),
               outlier.shape = NA) +
    stat_boxplot(aes(y = .data[[pred_names[4]]],
                     fill = subset, 
                     x = (max(.data[[pred_names[3]]]) - min(.data[[pred_names[3]]])) *.1 + max(.data[[pred_names[3]]])),
                    width = (max(df_plot_q1[[pred_names[3]]]) - min(df_plot_q1[[pred_names[3]]])) * .1,
                 position = position_dodge(width = (max(df_plot_i[[pred_names[3]]]) - min(df_plot_i[[pred_names[3]]])) * .12),
               outlier.shape = NA) +
    coord_cartesian(clip = "off") +
    labs(x = title_list[[3]], y = title_list[[4]]) +
    scale_x_continuous(expand = expansion(mult = c(.02, .02))) + 
    scale_y_continuous(expand = expansion(mult = c(.02, .02))) +
    scale_color_manual(values = c("test" = cols_gradient_greyscale[2], "train" = "indianred")) +
    scale_fill_manual(values = c("test" = cols_gradient_greyscale[2], "train" = "indianred")) +
    theme_bw(base_size = 15) +
    theme(aspect.ratio = 1,
         legend.position = "none")

fig_1 = ggarrange(metrics_list[[1]],
                metrics_list[[2]],
                  fig_cor1,
                  widths = c(.5, .5, 1),
        ncol = 3, nrow = 1)
    

fig_2 = ggarrange(metrics_list[[3]],
                metrics_list[[4]],
                  fig_cor2,
                widths = c(.5, .5, 1),
        ncol = 3, nrow = 1)

fig = ggarrange(fig_1, fig_2,
               ncol = 1, nrow = 2)

ggsave(plot = fig, 
       filename = "./figures/Figure_3.jpeg", 
       device = jpeg,
       width = 15,
       height = 15,
       units = "in",
       dpi = 150,
       bg = "white",
      create.dir = TRUE)

IRdisplay::display_jpeg(file = "./figures/Figure_3.jpeg")

Figure 3. Landscape-level variation in forest structural metrics on BCI. (a & b, d & e) Mean height of canopy returns, canopy cover, coefficient of variation of the leaf-area density profile, and variance of canopy return heights at BCI, with one random 16-ha plot shown. (c & f) Scatter plots showing the distribution of all pixels and of the example plot in “structural space”.


fig_hist = ggplot() +
    #geom_tile(data = df_plot, aes(x = x, y = y), color = "grey70", fill = NA) +
    geom_histogram(data = df_plot_q1, 
                   aes(x = m_d, fill = factor(threshold)), 
                   alpha = .8,
                  bins = 100) +
    scale_fill_manual(values = c("steelblue3", "darkseagreen", "grey80")) +
    scale_y_continuous(expand = expansion(mult = c(.01, .03))) +
    theme_bw(base_size = 14) +
    labs(x = "Mahalanobis distance") +
    # theme_map +
    # coord_equal() +
  scale_x_log10(breaks = c(.2, 2, 20, 200)) +
    theme(aspect.ratio = .4,
          legend.position = "none",
        legend.justification = c(0, 1),
        panel.background = element_rect(fill = "white"),
        legend.key = element_rect(fill = "white", color = "white"),
        legend.title = element_blank(),
          axis.text.y = element_blank(),
          axis.title.y = element_blank(),
          axis.ticks.y = element_blank(),
          strip.background = element_rect(fill = "white"),
          # plot.background = element_rect(fill = "white", color = "blue"),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in")) +
    facet_grid(. ~ scenario)

fig_map = ggplot() +
    #geom_tile(data = df_plot, aes(x = x, y = y), color = "grey70", fill = NA) +
    geom_tile(data = df_plot_q1, aes(x = x, y = y, fill = factor(threshold)), alpha = .8) +
    geom_tile(data = filter(df_plot_q1, subset == "train"), 
                aes(x = x, y = y),
                # color = NULL,
                fill = "indianred",
                alpha = .8) +
    scale_fill_manual(values = c("steelblue3", "darkseagreen", "grey80")) +
    scale_x_continuous(expand = expansion(mult = c(.02, .02))) + 
    scale_y_continuous(expand = expansion(mult = c(.01, .01))) +
    theme_bw(base_size = 14) +
    theme_map +
    coord_equal() +
    theme(legend.position.inside = c(.02, .985),
        legend.justification = c(0, 1),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in"),
        panel.background = element_rect(fill = "white"),
          panel.grid.major = element_blank(),
                      panel.grid.minor = element_blank(),
        legend.key = element_rect(fill = "white", color = "white"),
        legend.title = element_blank(),
        # axis.text = element_text(size = 8),
          axis.text.x = element_blank(),
          axis.text.y = element_blank()) +
    facet_grid(. ~ scenario)

fig = ggarrange( fig_hist, fig_map, ncol = 1,
                heights = c(.3075, 1))
               # widths = c(1, 1))

ggsave(plot = fig, 
       filename = "./figures/Figure_4.jpeg", 
       device = jpeg,
       width = 10,
       height = 12,
       units = "in",
       dpi = 150,
       bg = "white",
      create.dir = TRUE)

IRdisplay::display_jpeg(file = "./figures/Figure_4.jpeg")

Figure 4. Mahalanobis distance and thresholding for plot representativeness. (a) The distribution of Mahalanobis distances for one random 16-ha selection of plot data. (b) Mahalanobis distances across the landscape. (c) Pixels below the threshold of acceptable distance (2 SD) from the distribution of plot data.

options(repr.plot.width = 12, repr.plot.height = 5)

fig = df_eval_q1 |>
    ggplot(aes(x = size_ha, y = value, color = scenario, fill = scenario)) +
    labs(x = "Area of simulated plot data (ha)") +
    stat_summary(fun.data = "mean_cl_normal", 
                 geom = "ribbon", 
                 color = NA,
                 alpha = 0.5,
                 (aes(group = scenario))) +
    stat_summary(fun = "mean", 
                 geom = "line", 
                 alpha = 0.8,
                 (aes(group = scenario))) +
    scale_color_manual(values = c("forestgreen", "orchid4")) +
    scale_fill_manual(values = c("forestgreen", "orchid4")) +
    theme_bw(base_size = 15) +
    theme_map +
    theme(legend.position.inside = c(.88, .95),
        legend.justification = c(0, 1),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in"),
          aspect.ratio = 1,
         axis.title.x = element_text()) +
    facet_grid2(. ~ metric,
                independent = "all",
                scales = "free"
              ) +
  facetted_pos_scales(
    y = list(
      metric == "Mean Mahalanobis distance" ~ scale_y_continuous(limits = c(1, 7)),
      metric == "Pixels ≥ 2 SD (%)" ~ scale_y_continuous(limits = c(0, 1), 
                                                         breaks = seq(0, 1, .2),
                                                        labels = seq(0, 100, 20)),
              metric == "Pixels ≥ 3 SD (%)" ~ scale_y_continuous(limits = c(0, 1), 
                                                         breaks = seq(0, 1, .2),
                                                        labels = seq(0, 100, 20))
    )
  )

ggsave(plot = fig, 
       filename = "./figures/Figure_5.jpeg", 
       device = jpeg,
       width = 14,
       height = 7,
       units = "in",
       dpi = 150,
       bg = "white",
      create.dir = TRUE)

IRdisplay::display_jpeg(file = "./figures/Figure_5.jpeg")

Figure 5. Relationship between plot area (ha) and the percent of pixels outside the acceptable distance for (a) Question 1 and (b) Question 2.

options(repr.plot.width = 12, repr.plot.height = 5)

fig = df_eval_q3 |>
    ggplot(aes(x = size_ha, y = value, color = scenario, fill = scenario)) +
    labs(x = "Area of simulated plot data (ha)") +
    stat_summary(fun.data = "mean_cl_normal", 
                 geom = "ribbon", 
                 color = NA,
                 alpha = 0.5,
                 (aes(group = scenario))) +
    stat_summary(fun = "mean", 
                 geom = "line", 
                 alpha = 0.8,
                 (aes(group = scenario))) +
    scale_color_manual(values = c("forestgreen", "orchid4")) +
    scale_fill_manual(values = c("forestgreen", "orchid4")) +
    theme_bw(base_size = 15) +
    theme_map +
    theme(legend.position.inside = c(.88, .95),
        legend.justification = c(0, 1),
        legend.direction = "vertical",
        legend.key.width = unit(.1, "in"),
        legend.key.height = unit(.1, "in"),
         legend.key.spacing.y = unit(.05, "in"),
          aspect.ratio = 1,
         axis.title.x = element_text()) +
    facet_grid2(. ~ metric,
                independent = "all",
                scales = "free"
              ) #+
  # facetted_pos_scales(
  #   y = list(
  #     metric == "Mean Mahalanobis distance" ~ scale_y_continuous(limits = c(1, 7)),
  #     metric == "Pixels ≥ 2 SD (%)" ~ scale_y_continuous(limits = c(0, 1), 
  #                                                        breaks = seq(0, 1, .2),
  #                                                       labels = seq(0, 100, 20)),
  #             metric == "Pixels ≥ 3 SD (%)" ~ scale_y_continuous(limits = c(0, 1), 
  #                                                        breaks = seq(0, 1, .2),
  #                                                       labels = seq(0, 100, 20))
  #   )
  # )

ggsave(plot = fig, 
       filename = "./figures/Figure_6.jpeg", 
       device = jpeg,
       width = 14,
       height = 7,
       units = "in",
       dpi = 150,
       bg = "white",
      create.dir = TRUE)

IRdisplay::display_jpeg(file = "./figures/Figure_6.jpeg")

Figure 6. Relationship between plot area (ha) and AGBD model goodness-of-fit (bias, MAE, RMSE) for Question 3.