The objective of this practical is to simulate plot data using FieldSimR which captures spatial variation with realistic structure and complexity. Genetic values will be simulated using AlphaSimR based on an additive genetic trait representing grain yield.


Summary

The simulation can be summarised by the following steps:

  1. Simulating plot errors with FieldSimR
    1. Simulation parameters
    2. field_trial_error() and plot_effects() functions

  1. Simulating genetic values with AlphaSimR
    1. Simulation parameters
    2. Simulating an additive genetic trait

  1. Generating phenotypes by combining the plot errors with the genetic values
    1. make_phenotypes() function
  2. Model fitting


Preliminaries

Clean the working environment.

rm(list = ls())

Load required packages.

library(FieldSimR)
library(AlphaSimR)
#> Loading required package: R6
library(asreml)
#> Loading required package: Matrix
#> Loading ASReml-R version 4.2
library(ggplot2)

The results in this practical can be reproduced by initially setting the following seed.

set.seed(123)


Simulation details

Consider a scenario in which 400 inbred wheat genotypes are evaluated for grain yield (t/ha) in a single field trial. The genotypes are simulated with additive genetic relatedness using AlphaSimR. The trial comprises 30 columns and 40 rows for 1200 plots in total. There are three complete blocks of all genotypes aligned in the column direction (side-by-side). The trial is 240 m long in the column direction and 80 m wide in the row direction, with plots 8m long by 2m wide. The trait mean is 3 t/ha and the trait variance is 0.3 t2/ha2. The plot-level heritability is 0.1, which reflects the ratio of genetic variance to total phenotypic variance. The error variance is given by 0.3/0.1-0.3 = 2.7 t2/ha2.


The simulation parameters are:

n_genos <- 400 # no. genotypes
n_reps <- n_blocks <- 3 # no. replicates/blocks
block_dir <- "col" # block direction
n_cols <- 30   # no. columns
n_rows <- 40   # no. rows
n_plots <- n_cols * n_rows # no. plots == n_genos * n_reps
plot_length <- 8
plot_width <- 2 
mu <- 3        # trait mean
sigma2 <- 0.3  # trait variance 
H2 <- 0.1      # plot-level heritability
sigma2e <- sigma2/H2 - sigma2 # error variance
round(sigma2e, digits = 2)
#> [1] 2.7


The phenotypes are generated based on the linear mixed model:

\(\mathbf{y} = \mathbf{Z}\mathbf{gv} + \mathbf{e}\),

where \(\mathbf{gv}\) is the vector of genetic values with design matrix \(\mathbf{Z}\) and \(\mathbf{e}\) is the vector of plot errors. The inclusion of additional non-genetic effects is straightforward.

The simulation involves three steps:

  1. Simulating the plot errors with FieldSimR
  2. Simulating the genetic values with AlphaSimR
  3. Generating the phenotypes by combining the plot errors with the genetic values.


1. Simulating the plot errors with FieldSimR

The plot errors are simulated to capture spatial variation, which can be broadly categorised as either global and local trend (smooth spatial trend), random error (noise) or extraneous variation (systematic error).

The plot errors are constructed as the sum of three terms:

\(\mathbf{e} = \mathbf{se} + \mathbf{re} + \mathbf{ee}\),

where \(\mathbf{se}\) is a vector of spatial errors that capture global and local trend, \(\mathbf{re}\) is a vector of random errors that capture random measurement error/noise and \(\mathbf{ee}\) is a vector of extraneous errors that capture extraneous variation.

Simulating the plot errors involves two parts:

  1. Setting the simulation parameters for the spatial, extraneous and random error terms
  2. Simulating the plot errors with the field_trial_error() function and displaying them with the plot_effects() function.


a. Setting the simulation parameters

Simulate the plot errors based on a proportion of spatial error variance of 0.6 and a proportion of extraneous error variance of 0.1. FieldSimR will then allocate the remaining 0.3 to the random error variance by default.

The initial simulation parameters are:

round(sigma2e, digits = 2) # error variance
#> [1] 2.7
prop_spatial <- 0.6 # proportion of spatial error variance
prop_ext <- 0.1 # proportion of extraneous error variance
1 - (prop_spatial + prop_ext) # proportion of random error variance
#> [1] 0.3

The simulation parameters for the spatial and extraneous errors are defined in the following. No further parameters are required for the random errors.


Spatial errors

FieldSimR has the capacity to simulate spatial errors based on a separable first order autoregressive (AR1) process or bivariate interpolation. These options are set with spatial_model = "AR1:AR1" or spatial_model = "Bivariate".

Simulate spatial errors based on a separable AR1 process with a column autocorrelation of 0.3 and row autocorrelation of 0.9. The simulation parameters are:

spatial_model1 <- "AR1:AR1"
prop_spatial # proportion of spatial error variance
#> [1] 0.6
col_cor <- 0.3 # column autocorrelation
row_cor <- 0.9 # row autocorrelation

Note that the plot lengths and widths are not required when spatial_model = "AR1:AR1".


Also simulate spatial errors based on bivariate interpolation with 200 additional knots points. This option is set with complexity = 200. The simulation parameters are:

spatial_model2 <- "Bivariate"
prop_spatial # proportion of spatial error variance
#> [1] 0.6
complexity <- 200 # number of additional knot points
plot_length
#> [1] 8
plot_width
#> [1] 2

Note that increasing the complexity will generate more local trend relative to global trend, up to a point where the spatial errors capture minimal or no trend.


Extraneous errors

FieldSimR has the capacity to simulate extraneous errors based on random or zig-zag ordering. These options are set with ext_ord = "random" or ext_ord = "zig-zag". The zig-zag ordering is achieved by alternating the positive and negative simulated values between neighbouring columns and/or rows.

Extraneous errors can be simulated in both directions, or either the column or row directions individually. These options are set with ext_dir = "both", ext_dir = "col" or ext_dir = "row".

Simulate extraneous errors using zig-zag ordering between neighbouring rows. The simulation parameters are:

prop_ext # proportion of extraneous error variance
#> [1] 0.1
ext_ord <- "zig-zag" # ordering of extraneous variation
ext_dir <- "row" # direction of extraneous variation


b. Simulating the plot errors

The plot errors are simulated using FieldSimR’s function field_trial_error(). The functionality will be demonstrated separately for the simulations involving the AR1 process and bivariate interpolation.


Autoregressive process

Simulate plot errors using the field_trial_error() function with terms appropriate to the AR1 process, i.e. col_cor and row_cor. A data frame is returned containing the environment, block, column and row numbers as well as the simulated plot error. When return_effects = TRUE, a list is returned with an additional data frame containing the spatial, random and extraneous error terms.

error_df1 <- field_trial_error(n_traits = 1,
                               n_envs = 1,
                               n_blocks = n_blocks, # 3
                               block_dir = block_dir, # "col"
                               n_cols = n_cols, # 30
                               n_rows = n_rows, # 40
                               var_R = sigma2e, # 0.3
                               prop_spatial = prop_spatial, # 0.6
                               spatial_model = spatial_model1, # "AR1:AR1"
                               col_cor = col_cor, # 0.3
                               row_cor = row_cor, # 0.9
                               prop_ext = prop_ext, # 0.1
                               ext_ord = ext_ord, # "zig-zag"
                               ext_dir = ext_dir, # "row"
                               return_effects = TRUE)
head(error_df1$plot_df) # data frame with simulated plot errors
#>   env block col row   e.Trait1
#> 1   1     1   1   1 -0.4280387
#> 2   1     1   1   2 -0.6974782
#> 3   1     1   1   3  0.4924385
#> 4   1     1   1   4 -0.2846024
#> 5   1     1   1   5  0.3706938
#> 6   1     1   1   6  2.3101299
head(error_df1$Trait1) # data frame with simulated spatial, random and extraneous errors
#>   env block col row     e.spat     e.rand e.ext.col   e.ext.row
#> 1   1     1   1   1 -0.9304495  0.5692075         0 -0.06679673
#> 2   1     1   1   2 -1.0088154 -0.7027582         0  1.01409551
#> 3   1     1   1   3  0.1044457  0.7831546         0 -0.39516177
#> 4   1     1   1   4  0.1215683 -0.6939111         0  0.28774045
#> 5   1     1   1   5  0.1758757  0.5788023         0 -0.38398418
#> 6   1     1   1   6  1.2741363  1.0095339         0  0.02645974


The separate error terms can be displayed using FieldSimR’s function plot_effects(). Display the spatial errors.

plot_effects(error_df1$Trait1, effect = "e.spat", labels = TRUE)


Display the total plot errors.

plot_effects(error_df1$plot_df, effect = "e.Trait1", labels = TRUE)


The theoretical and sample variograms can be displayed using FieldSimR’s functions theoretical_variogram() and sample_variogram(). Display the sample variogram for the total plot errors:

sample_variogram(error_df1$plot_df, effect = "e.Trait1")

Bivariate interpolation

Also simulate plot errors using the field_trial_error() function with terms appropriate to bivariate interpolation, i.e. plot_length, plot_width and complexity.

error_df2 <- field_trial_error(n_traits = 1,
                               n_envs = 1,
                               n_blocks = n_blocks, # 3
                               block_dir = block_dir, # "col"
                               n_cols = n_cols, # 30
                               n_rows = n_rows, # 40
                               var_R = sigma2e, # 0.3
                               prop_spatial = prop_spatial, # 0.6
                               spatial_model = spatial_model2, # "Bivariate"
                               plot_length = plot_length, # 8
                               plot_width = plot_width, # 2
                               complexity = complexity, # 200
                               prop_ext = prop_ext, # 0.1
                               ext_dir = ext_dir, # "row"
                               ext_ord = ext_ord, # "zig-zag"
                               return_effects = TRUE)
head(error_df2$plot_df) # data frame with simulated plot errors
#>   env block col row   e.Trait1
#> 1   1     1   1   1 -1.7257294
#> 2   1     1   1   2  0.3068376
#> 3   1     1   1   3 -0.4149513
#> 4   1     1   1   4 -1.9735581
#> 5   1     1   1   5 -2.9815700
#> 6   1     1   1   6  0.7566291
head(error_df2$Trait1) # data frame with simulated spatial, random and extraneous errors
#>   env block col row     e.spat     e.rand e.ext.col   e.ext.row
#> 1   1     1   1   1 -0.8949584 -0.1965031         0 -0.63426787
#> 2   1     1   1   2 -0.7921239  0.7974286         0  0.30153298
#> 3   1     1   1   3 -1.0598630  0.6476575         0 -0.00274582
#> 4   1     1   1   4 -1.3276021 -0.6769702         0  0.03101411
#> 5   1     1   1   5 -1.3607886 -1.4868072         0 -0.13397418
#> 6   1     1   1   6 -1.1889768  1.7521228         0  0.19348312


Display the spatial errors:

plot_effects(error_df2$Trait1, effect = "e.spat", labels = TRUE)


Display the total plot errors:

plot_effects(error_df2$plot_df, effect = "e.Trait1", labels = TRUE)


2. Simulating the genetic values with AlphaSimR

Simulating the genetic values involves two parts:

  1. Setting the simulation parameters for the founder population and trait architecture
  2. Simulating an additive genetic trait for a population of inbred wheat genotypes.


a. Setting the simulation parameters

Set the number of homogeneous genotypes in the founder population to 20, the number of chromosomes to 20 and the number of segregating sites (biallelic QTN) per chromosome to 300.

n_founders <- 20 # Number of genotypes in the founder population
n_chr <- 20 # Number of chromosomes
n_seg_sites <- 300 # Number of QTN per chromosome


The trait mean and variance are:

mu # trait mean
#> [1] 3
sigma2 # trait variance
#> [1] 0.3


b. Simulating an additive genetic trait

Simulate the founder population using the runMacs() function.

# Simulating a founder population using AlphaSimR's "WHEAT" presets to mimic it's evolutionary history
founders <- runMacs(nInd = n_founders,
                    nChr = n_chr,
                    segSites = n_seg_sites,
                    inbred = TRUE,
                    species = "WHEAT",
                    nThreads = 2)
founders # MapPop object
#> An object of class "MapPop" 
#> Ploidy: 2 
#> Individuals: 20 
#> Chromosomes: 20 
#> Loci: 6000
SP <- SimParam$new(founders)


Simulate an additive genetic trait representing grain yield using the addTraitA() function.

SP$addTraitA(nQtlPerChr = n_seg_sites,
             mean = mu,
             var = sigma2)
founders <- newPop(founders) # create pop-class object


Simulate the inbred wheat genotypes by randomly crossing the founders using the randCross() function, and then made into doubled haploid (DH) lines using the makeDH() function.

f1s <- randCross(pop = founders, 
                 nCrosses = 40, # no. crosses
                 nProgeny = 10) # no. progeny per cross
DHs <- makeDH(pop = f1s, nDH = 1)
DHs
#> An object of class "Pop" 
#> Ploidy: 2 
#> Individuals: 400 
#> Chromosomes: 20 
#> Loci: 6000 
#> Traits: 1


Obtain the genetic values for the DH population.

gv <- c(DHs@gv)
round(mean(gv), digits = 2) # trait mean in the DH population
#> [1] 3.04
round(var(gv), digits = 2) # trait variance in the DH population
#> [1] 0.33
gv_df <- data.frame(env = factor(1),
                    rep = factor(rep(1:n_reps, each = n_genos)),
                    id = factor(DHs@id),
                    gv = gv)
head(gv_df)
#>   env rep  id       gv
#> 1   1   1 421 2.727611
#> 2   1   1 422 1.962928
#> 3   1   1 423 3.250774
#> 4   1   1 424 3.427477
#> 5   1   1 425 3.244724
#> 6   1   1 426 2.134860


3. Generating the phenotypes

Generate the phenotypes using FieldSimR’s function make_phenotypes() for plot errors simulated with a separable AR1 process and bivariate interpolation. When randomise = TRUE, genotypes are allocated to plots using a randomised complete block design.

trial_df1 <- make_phenotypes(gv_df = gv_df, error_df = error_df1, randomise = TRUE)
trial_df2 <- make_phenotypes(gv_df = gv_df, error_df = error_df2, randomise = TRUE)
head(trial_df1)
#>   env block col row  id phe.Trait1
#> 1   1     1   1   1 517   2.405821
#> 2   1     1   1   2 582   2.640356
#> 3   1     1   1   3 695   4.456456
#> 4   1     1   1   4 732   2.047950
#> 5   1     1   1   5 787   2.924540
#> 6   1     1   1   6 679   6.395116
head(trial_df2)
#>   env block col row  id phe.Trait1
#> 1   1     1   1   1 580  1.5582814
#> 2   1     1   1   2 704  3.7000911
#> 3   1     1   1   3 682  3.6144583
#> 4   1     1   1   4 560  0.9547222
#> 5   1     1   1   5 748 -0.5755032
#> 6   1     1   1   6 609  3.7004799


Display the phenotypes.

plot_effects(trial_df1, effect = "phe.Trait1", labels = TRUE)

plot_effects(trial_df2, effect = "phe.Trait1", labels = TRUE)


4. Model fitting

Run a linear mixed model analysis in ASReml-R and compare the prediction accuracies between the models.

# Complete block analysis, ID residual
asr1 <- asreml(phe.Trait1 ~ 1,
               random = ~ id + block, # random genotype and block effects
               residual = ~ units,
               data = trial_df1)
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block, residual = ~units, :
#> Some components changed by more than 1% on the last iteration
asr1 <- update(asr1) # update the model
summary(asr1)$varcom # check the estimated variance parameters
#>          component   std.error    z.ratio bound %ch
#> block   0.00184438 0.008772788  0.2102388     P   0
#> id      0.25081576 0.095081936  2.6378908     P   0
#> units!R 2.76880424 0.138613592 19.9749838     P   0
main_g <- gv_df$gv[gv_df$rep == 1]
names(main_g) <- gv_df$id[gv_df$rep == 1]
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs1 <- asr1$coefficients$random[grep("id", rownames(asr1$coefficients$random)),] # predicted genetic effects 
cor(BLUPs1, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.5230161
trial_df1$resid1 <- c(asr1$residuals)
sample_variogram(trial_df1, effect = "resid1") # check the sample variogram

# Complete block analysis, AR1 + ID residual 
asr2 <- asreml(phe.Trait1 ~ 1,
               random = ~ id + block + units, # random genotype and block effects plus ID error (units)
               residual = ~ ar1(col):ar1(row), # separable AR1 process
               data = trial_df1)
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block + units, residual =
#> ~ar1(col):ar1(row), : Log-likelihood not converged
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block + units, residual =
#> ~ar1(col):ar1(row), : Some components changed by more than 1% on the last
#> iteration
asr2 <- update(asr2) # update the model
#> Warning in asreml(fixed = phe.Trait1 ~ 1, random = ~id + block + units, :
#> Log-likelihood not converged
summary(asr2)$varcom
#>                    component  std.error   z.ratio bound %ch
#> block           2.544192e-07         NA        NA     B  NA
#> id              2.595857e-01 0.05912774  4.390253     P 0.4
#> units           9.724167e-01 0.09011289 10.791095     P 0.5
#> col:row!R       2.514201e+00 0.39349608  6.389392     P 0.0
#> col:row!col!cor 5.567283e-01 0.05795197  9.606720     U 0.3
#> col:row!row!cor 8.750148e-01 0.02469724 35.429663     U 0.0
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs2 <- asr2$coefficients$random[grep("id", rownames(asr2$coefficients$random)),] # predicted genetic effects 
cor(BLUPs2, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.6205295
trial_df1$resid3 <- c(asr2$residuals) 
sample_variogram(trial_df1, effect = "resid3") # check the sample variogram

# Complete block analysis, AR1 + ID residual + random row
asr3 <- asreml(phe.Trait1 ~ 1,
               random = ~ id + block + units + row, # random genotype and block effects plus ID error (units) and row effects
               residual = ~ ar1(col):ar1(row), # separable AR1 process
               data = trial_df1)
asr3 <- update(asr3) # update the model
summary(asr3)$varcom
#>                    component  std.error   z.ratio bound %ch
#> block           1.749078e-07         NA        NA     B  NA
#> row             2.903215e-01 0.07966877  3.644106     P   0
#> id              2.550782e-01 0.05314363  4.799790     P   0
#> units           8.051866e-01 0.07719777 10.430179     P   0
#> col:row!R       1.728460e+00 0.21352920  8.094723     P   0
#> col:row!col!cor 1.889194e-01 0.07568503  2.496126     U   0
#> col:row!row!cor 8.737678e-01 0.02293748 38.093451     U   0
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs3 <- asr3$coefficients$random[grep("id", rownames(asr3$coefficients$random)),] # predicted genetic effects 
cor(BLUPs3, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.6540424
trial_df1$resid4 <- c(asr3$residuals) 
sample_variogram(trial_df1, effect = "resid4") # check the sample variogram


Compare the predicted genetic effects from each model with the true genetic values.

Cors <- round(cor(cbind(main_g,
                        BLUPs1,
                        BLUPs2,
                        BLUPs3)), 3)
colnames(Cors) <- rownames(Cors) <- c("True gvs", "Model 1", "Model 2", "Model 3")
Cors
#>          True gvs Model 1 Model 2 Model 3
#> True gvs    1.000   0.523   0.621   0.654
#> Model 1     0.523   1.000   0.821   0.769
#> Model 2     0.621   0.821   1.000   0.945
#> Model 3     0.654   0.769   0.945   1.000

What model produces the highest prediction accuracies for trial_df1? What terms have resulted in this increase?

What is the expected line-mean heritability and prediction accuracy? (Hint: see tutorial)


Compare the top 20 genotypes selected from each model.

true_gv <- gv_df[gv_df$rep == 1,][rev(order(main_g)),][1:20,]  # sort based on true genotype effects and take top 20 genos
top1 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs1)),][1:20,] # sort based on model 1 BLUPs and take top 20 genos
top2 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs2)),][1:20,] # sort based on model 2 BLUPs and take top 20 genos
top3 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs3)),][1:20,] # sort based on model 3 BLUPs and take top 20 genos
rank_list <- cbind(true_gv$id, top1$id, top2$id, top3$id) # top 20 genotypes selected from each model
colnames(rank_list) <- c("True rank", "Model 1", "Model 2", "Model 3")
rownames(rank_list) <- paste("Ind", 1:20)
rank_list
#>        True rank Model 1 Model 2 Model 3
#> Ind 1         78      55      55     251
#> Ind 2        266     386     251     275
#> Ind 3        265     251     266     255
#> Ind 4         72     222     386     266
#> Ind 5        251     315     255      55
#> Ind 6        112     296     260     386
#> Ind 7        254     111     161     296
#> Ind 8         55     266      72     161
#> Ind 9        124     388     296      89
#> Ind 10       276     325     275      72
#> Ind 11        53     161     111     262
#> Ind 12       150     260     222     342
#> Ind 13       208      87     192      87
#> Ind 14       280      89      89     222
#> Ind 15       246     347     388     260
#> Ind 16       151      72     108      73
#> Ind 17       274      92     265     379
#> Ind 18       261      57     123     317
#> Ind 19       182      10      74     152
#> Ind 20       259     292     342      35
tt <- data.frame(model = factor(rep(1:4, each = 20)),
                 id = c(true_gv$id, top1$id, top2$id, top3$id))
tt <- with(tt, table(id, model))
t(tt) %*% tt # top 20 genotypes in common between pairs of models
#>      model
#> model  1  2  3  4
#>     1 20  4  5  4
#>     2  4 20 12 11
#>     3  5 12 20 13
#>     4  4 11 13 20
Reduce(intersect, list(true_gv$id, top1$id, top2$id, top3$id)) # top 20 genotypes in common between all three models
#> [1] "686" "492" "671" "475"


Compare the mean genetic value (genetic gain) and genetic variance of the top 20 genotypes selected from each model.

round(mean(true_gv$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 4.29
round(mean(top1$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.61
round(mean(top2$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.77
round(mean(top3$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.78
round(var(true_gv$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.04
round(var(top1$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.36
round(var(top2$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.36
round(var(top3$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.26

Which model has produced the highest genetic gain and lowest genetic variance? Is this what you would expect? If not, why might this be the case?


Extra: Also simulate plot errors using the field_trial_error() function with complexity of 100 and 1,000 for bivariate interpolation. Label these trial_df3 and trial_df4.

Now go and re-run models 1,2 and 3 for the other datasets (trial_df3 and trial_df4). What is the effect of increasing the complexity of simulated spatial variation? In particular, what happens to the prediction accuracies, selection and genetic gain. Note that the heritability is the same for each dataset, so the observed differences are coming purely from the differences in simulatd complexity.