The objective of this practical is to simulate plot data using FieldSimR which captures spatial variation with realistic structure and complexity. Genetic values will be simulated using AlphaSimR based on an additive genetic trait representing grain yield.
Summary
The simulation can be summarised by the following steps:
field_trial_error()
and plot_effects()
functionsmake_phenotypes()
function
Clean the working environment.
rm(list = ls())
Load required packages.
library(FieldSimR)
library(AlphaSimR)
#> Loading required package: R6
library(asreml)
#> Loading required package: Matrix
#> Loading ASReml-R version 4.2
library(ggplot2)
The results in this practical can be reproduced by initially setting the following seed.
set.seed(123)
Consider a scenario in which 400 inbred wheat genotypes are evaluated for grain yield (t/ha) in a single field trial. The genotypes are simulated with additive genetic relatedness using AlphaSimR. The trial comprises 30 columns and 40 rows for 1200 plots in total. There are three complete blocks of all genotypes aligned in the column direction (side-by-side). The trial is 240 m long in the column direction and 80 m wide in the row direction, with plots 8m long by 2m wide. The trait mean is 3 t/ha and the trait variance is 0.3 t2/ha2. The plot-level heritability is 0.1, which reflects the ratio of genetic variance to total phenotypic variance. The error variance is given by 0.3/0.1-0.3 = 2.7 t2/ha2.
The simulation parameters are:
n_genos <- 400 # no. genotypes
n_reps <- n_blocks <- 3 # no. replicates/blocks
block_dir <- "col" # block direction
n_cols <- 30 # no. columns
n_rows <- 40 # no. rows
n_plots <- n_cols * n_rows # no. plots == n_genos * n_reps
plot_length <- 8
plot_width <- 2
mu <- 3 # trait mean
sigma2 <- 0.3 # trait variance
H2 <- 0.1 # plot-level heritability
sigma2e <- sigma2/H2 - sigma2 # error variance
round(sigma2e, digits = 2)
#> [1] 2.7
The phenotypes are generated based on the linear mixed model:
\(\mathbf{y} = \mathbf{Z}\mathbf{gv} + \mathbf{e}\),
where \(\mathbf{gv}\) is the vector of genetic values with design matrix \(\mathbf{Z}\) and \(\mathbf{e}\) is the vector of plot errors. The inclusion of additional non-genetic effects is straightforward.
The simulation involves three steps:
The plot errors are simulated to capture spatial variation, which can be broadly categorised as either global and local trend (smooth spatial trend), random error (noise) or extraneous variation (systematic error).
The plot errors are constructed as the sum of three terms:
\(\mathbf{e} = \mathbf{se} + \mathbf{re} + \mathbf{ee}\),
where \(\mathbf{se}\) is a vector of spatial errors that capture global and local trend, \(\mathbf{re}\) is a vector of random errors that capture random measurement error/noise and \(\mathbf{ee}\) is a vector of extraneous errors that capture extraneous variation.
Simulating the plot errors involves two parts:
field_trial_error()
function and displaying them with the plot_effects()
function.Simulate the plot errors based on a proportion of spatial error variance of 0.6 and a proportion of extraneous error variance of 0.1. FieldSimR will then allocate the remaining 0.3 to the random error variance by default.
The initial simulation parameters are:
round(sigma2e, digits = 2) # error variance
#> [1] 2.7
prop_spatial <- 0.6 # proportion of spatial error variance
prop_ext <- 0.1 # proportion of extraneous error variance
1 - (prop_spatial + prop_ext) # proportion of random error variance
#> [1] 0.3
The simulation parameters for the spatial and extraneous errors are defined in the following. No further parameters are required for the random errors.
FieldSimR has the capacity to simulate spatial errors based on a
separable first order autoregressive (AR1) process or bivariate
interpolation. These options are set with
spatial_model = "AR1:AR1"
or
spatial_model = "Bivariate"
.
Simulate spatial errors based on a separable AR1 process with a column autocorrelation of 0.3 and row autocorrelation of 0.9. The simulation parameters are:
spatial_model1 <- "AR1:AR1"
prop_spatial # proportion of spatial error variance
#> [1] 0.6
col_cor <- 0.3 # column autocorrelation
row_cor <- 0.9 # row autocorrelation
Note that the plot lengths and widths are not required when
spatial_model = "AR1:AR1"
.
Also simulate spatial errors based on bivariate interpolation with
200 additional knots points. This option is set with
complexity = 200
. The simulation parameters are:
spatial_model2 <- "Bivariate"
prop_spatial # proportion of spatial error variance
#> [1] 0.6
complexity <- 200 # number of additional knot points
plot_length
#> [1] 8
plot_width
#> [1] 2
Note that increasing the complexity will generate more local trend relative to global trend, up to a point where the spatial errors capture minimal or no trend.
FieldSimR has the capacity to simulate extraneous errors based on
random or zig-zag ordering. These options are set with
ext_ord = "random"
or ext_ord = "zig-zag"
. The
zig-zag ordering is achieved by alternating the positive and negative
simulated values between neighbouring columns and/or rows.
Extraneous errors can be simulated in both directions, or either the
column or row directions individually. These options are set with
ext_dir = "both"
, ext_dir = "col"
or
ext_dir = "row"
.
Simulate extraneous errors using zig-zag ordering between neighbouring rows. The simulation parameters are:
prop_ext # proportion of extraneous error variance
#> [1] 0.1
ext_ord <- "zig-zag" # ordering of extraneous variation
ext_dir <- "row" # direction of extraneous variation
The plot errors are simulated using FieldSimR’s function
field_trial_error()
. The functionality will be demonstrated
separately for the simulations involving the AR1 process and bivariate
interpolation.
Simulate plot errors using the field_trial_error()
function with terms appropriate to the AR1 process,
i.e. col_cor
and row_cor
. A data frame is
returned containing the environment, block, column and row numbers as
well as the simulated plot error. When
return_effects = TRUE
, a list is returned with an
additional data frame containing the spatial, random and extraneous
error terms.
error_df1 <- field_trial_error(n_traits = 1,
n_envs = 1,
n_blocks = n_blocks, # 3
block_dir = block_dir, # "col"
n_cols = n_cols, # 30
n_rows = n_rows, # 40
var_R = sigma2e, # 0.3
prop_spatial = prop_spatial, # 0.6
spatial_model = spatial_model1, # "AR1:AR1"
col_cor = col_cor, # 0.3
row_cor = row_cor, # 0.9
prop_ext = prop_ext, # 0.1
ext_ord = ext_ord, # "zig-zag"
ext_dir = ext_dir, # "row"
return_effects = TRUE)
head(error_df1$plot_df) # data frame with simulated plot errors
#> env block col row e.Trait1
#> 1 1 1 1 1 -0.4280387
#> 2 1 1 1 2 -0.6974782
#> 3 1 1 1 3 0.4924385
#> 4 1 1 1 4 -0.2846024
#> 5 1 1 1 5 0.3706938
#> 6 1 1 1 6 2.3101299
head(error_df1$Trait1) # data frame with simulated spatial, random and extraneous errors
#> env block col row e.spat e.rand e.ext.col e.ext.row
#> 1 1 1 1 1 -0.9304495 0.5692075 0 -0.06679673
#> 2 1 1 1 2 -1.0088154 -0.7027582 0 1.01409551
#> 3 1 1 1 3 0.1044457 0.7831546 0 -0.39516177
#> 4 1 1 1 4 0.1215683 -0.6939111 0 0.28774045
#> 5 1 1 1 5 0.1758757 0.5788023 0 -0.38398418
#> 6 1 1 1 6 1.2741363 1.0095339 0 0.02645974
The separate error terms can be displayed using FieldSimR’s function
plot_effects()
. Display the spatial errors.
plot_effects(error_df1$Trait1, effect = "e.spat", labels = TRUE)
Display the total plot errors.
plot_effects(error_df1$plot_df, effect = "e.Trait1", labels = TRUE)
The theoretical and sample variograms can be displayed using
FieldSimR’s functions theoretical_variogram()
and
sample_variogram()
. Display the sample variogram for the
total plot errors:
sample_variogram(error_df1$plot_df, effect = "e.Trait1")
Also simulate plot errors using the field_trial_error()
function with terms appropriate to bivariate interpolation,
i.e. plot_length
, plot_width
and
complexity
.
error_df2 <- field_trial_error(n_traits = 1,
n_envs = 1,
n_blocks = n_blocks, # 3
block_dir = block_dir, # "col"
n_cols = n_cols, # 30
n_rows = n_rows, # 40
var_R = sigma2e, # 0.3
prop_spatial = prop_spatial, # 0.6
spatial_model = spatial_model2, # "Bivariate"
plot_length = plot_length, # 8
plot_width = plot_width, # 2
complexity = complexity, # 200
prop_ext = prop_ext, # 0.1
ext_dir = ext_dir, # "row"
ext_ord = ext_ord, # "zig-zag"
return_effects = TRUE)
head(error_df2$plot_df) # data frame with simulated plot errors
#> env block col row e.Trait1
#> 1 1 1 1 1 -1.7257294
#> 2 1 1 1 2 0.3068376
#> 3 1 1 1 3 -0.4149513
#> 4 1 1 1 4 -1.9735581
#> 5 1 1 1 5 -2.9815700
#> 6 1 1 1 6 0.7566291
head(error_df2$Trait1) # data frame with simulated spatial, random and extraneous errors
#> env block col row e.spat e.rand e.ext.col e.ext.row
#> 1 1 1 1 1 -0.8949584 -0.1965031 0 -0.63426787
#> 2 1 1 1 2 -0.7921239 0.7974286 0 0.30153298
#> 3 1 1 1 3 -1.0598630 0.6476575 0 -0.00274582
#> 4 1 1 1 4 -1.3276021 -0.6769702 0 0.03101411
#> 5 1 1 1 5 -1.3607886 -1.4868072 0 -0.13397418
#> 6 1 1 1 6 -1.1889768 1.7521228 0 0.19348312
Display the spatial errors:
plot_effects(error_df2$Trait1, effect = "e.spat", labels = TRUE)
Display the total plot errors:
plot_effects(error_df2$plot_df, effect = "e.Trait1", labels = TRUE)
Simulating the genetic values involves two parts:
Set the number of homogeneous genotypes in the founder population to 20, the number of chromosomes to 20 and the number of segregating sites (biallelic QTN) per chromosome to 300.
n_founders <- 20 # Number of genotypes in the founder population
n_chr <- 20 # Number of chromosomes
n_seg_sites <- 300 # Number of QTN per chromosome
The trait mean and variance are:
mu # trait mean
#> [1] 3
sigma2 # trait variance
#> [1] 0.3
Simulate the founder population using the runMacs()
function.
# Simulating a founder population using AlphaSimR's "WHEAT" presets to mimic it's evolutionary history
founders <- runMacs(nInd = n_founders,
nChr = n_chr,
segSites = n_seg_sites,
inbred = TRUE,
species = "WHEAT",
nThreads = 2)
founders # MapPop object
#> An object of class "MapPop"
#> Ploidy: 2
#> Individuals: 20
#> Chromosomes: 20
#> Loci: 6000
SP <- SimParam$new(founders)
Simulate an additive genetic trait representing grain yield using the
addTraitA()
function.
SP$addTraitA(nQtlPerChr = n_seg_sites,
mean = mu,
var = sigma2)
founders <- newPop(founders) # create pop-class object
Simulate the inbred wheat genotypes by randomly crossing the founders
using the randCross()
function, and then made into doubled
haploid (DH) lines using the makeDH()
function.
f1s <- randCross(pop = founders,
nCrosses = 40, # no. crosses
nProgeny = 10) # no. progeny per cross
DHs <- makeDH(pop = f1s, nDH = 1)
DHs
#> An object of class "Pop"
#> Ploidy: 2
#> Individuals: 400
#> Chromosomes: 20
#> Loci: 6000
#> Traits: 1
Obtain the genetic values for the DH population.
gv <- c(DHs@gv)
round(mean(gv), digits = 2) # trait mean in the DH population
#> [1] 3.04
round(var(gv), digits = 2) # trait variance in the DH population
#> [1] 0.33
gv_df <- data.frame(env = factor(1),
rep = factor(rep(1:n_reps, each = n_genos)),
id = factor(DHs@id),
gv = gv)
head(gv_df)
#> env rep id gv
#> 1 1 1 421 2.727611
#> 2 1 1 422 1.962928
#> 3 1 1 423 3.250774
#> 4 1 1 424 3.427477
#> 5 1 1 425 3.244724
#> 6 1 1 426 2.134860
Generate the phenotypes using FieldSimR’s function
make_phenotypes()
for plot errors simulated with a
separable AR1 process and bivariate interpolation. When
randomise = TRUE
, genotypes are allocated to plots using a
randomised complete block design.
trial_df1 <- make_phenotypes(gv_df = gv_df, error_df = error_df1, randomise = TRUE)
trial_df2 <- make_phenotypes(gv_df = gv_df, error_df = error_df2, randomise = TRUE)
head(trial_df1)
#> env block col row id phe.Trait1
#> 1 1 1 1 1 517 2.405821
#> 2 1 1 1 2 582 2.640356
#> 3 1 1 1 3 695 4.456456
#> 4 1 1 1 4 732 2.047950
#> 5 1 1 1 5 787 2.924540
#> 6 1 1 1 6 679 6.395116
head(trial_df2)
#> env block col row id phe.Trait1
#> 1 1 1 1 1 580 1.5582814
#> 2 1 1 1 2 704 3.7000911
#> 3 1 1 1 3 682 3.6144583
#> 4 1 1 1 4 560 0.9547222
#> 5 1 1 1 5 748 -0.5755032
#> 6 1 1 1 6 609 3.7004799
Display the phenotypes.
plot_effects(trial_df1, effect = "phe.Trait1", labels = TRUE)
plot_effects(trial_df2, effect = "phe.Trait1", labels = TRUE)
Run a linear mixed model analysis in ASReml-R and compare the prediction accuracies between the models.
# Complete block analysis, ID residual
asr1 <- asreml(phe.Trait1 ~ 1,
random = ~ id + block, # random genotype and block effects
residual = ~ units,
data = trial_df1)
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block, residual = ~units, :
#> Some components changed by more than 1% on the last iteration
asr1 <- update(asr1) # update the model
summary(asr1)$varcom # check the estimated variance parameters
#> component std.error z.ratio bound %ch
#> block 0.00184438 0.008772788 0.2102388 P 0
#> id 0.25081576 0.095081936 2.6378908 P 0
#> units!R 2.76880424 0.138613592 19.9749838 P 0
main_g <- gv_df$gv[gv_df$rep == 1]
names(main_g) <- gv_df$id[gv_df$rep == 1]
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs1 <- asr1$coefficients$random[grep("id", rownames(asr1$coefficients$random)),] # predicted genetic effects
cor(BLUPs1, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.5230161
trial_df1$resid1 <- c(asr1$residuals)
sample_variogram(trial_df1, effect = "resid1") # check the sample variogram
# Complete block analysis, AR1 + ID residual
asr2 <- asreml(phe.Trait1 ~ 1,
random = ~ id + block + units, # random genotype and block effects plus ID error (units)
residual = ~ ar1(col):ar1(row), # separable AR1 process
data = trial_df1)
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block + units, residual =
#> ~ar1(col):ar1(row), : Log-likelihood not converged
#> Warning in asreml(phe.Trait1 ~ 1, random = ~id + block + units, residual =
#> ~ar1(col):ar1(row), : Some components changed by more than 1% on the last
#> iteration
asr2 <- update(asr2) # update the model
#> Warning in asreml(fixed = phe.Trait1 ~ 1, random = ~id + block + units, :
#> Log-likelihood not converged
summary(asr2)$varcom
#> component std.error z.ratio bound %ch
#> block 2.544192e-07 NA NA B NA
#> id 2.595857e-01 0.05912774 4.390253 P 0.4
#> units 9.724167e-01 0.09011289 10.791095 P 0.5
#> col:row!R 2.514201e+00 0.39349608 6.389392 P 0.0
#> col:row!col!cor 5.567283e-01 0.05795197 9.606720 U 0.3
#> col:row!row!cor 8.750148e-01 0.02469724 35.429663 U 0.0
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs2 <- asr2$coefficients$random[grep("id", rownames(asr2$coefficients$random)),] # predicted genetic effects
cor(BLUPs2, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.6205295
trial_df1$resid3 <- c(asr2$residuals)
sample_variogram(trial_df1, effect = "resid3") # check the sample variogram
# Complete block analysis, AR1 + ID residual + random row
asr3 <- asreml(phe.Trait1 ~ 1,
random = ~ id + block + units + row, # random genotype and block effects plus ID error (units) and row effects
residual = ~ ar1(col):ar1(row), # separable AR1 process
data = trial_df1)
asr3 <- update(asr3) # update the model
summary(asr3)$varcom
#> component std.error z.ratio bound %ch
#> block 1.749078e-07 NA NA B NA
#> row 2.903215e-01 0.07966877 3.644106 P 0
#> id 2.550782e-01 0.05314363 4.799790 P 0
#> units 8.051866e-01 0.07719777 10.430179 P 0
#> col:row!R 1.728460e+00 0.21352920 8.094723 P 0
#> col:row!col!cor 1.889194e-01 0.07568503 2.496126 U 0
#> col:row!row!cor 8.737678e-01 0.02293748 38.093451 U 0
var(main_g) # true genetic variance
#> [1] 0.3329084
var(error_df1$plot_df$e.Trait1) # true error variance
#> [1] 2.7
BLUPs3 <- asr3$coefficients$random[grep("id", rownames(asr3$coefficients$random)),] # predicted genetic effects
cor(BLUPs3, main_g) # prediction accuracy = correlation between true and predicted effects
#> [1] 0.6540424
trial_df1$resid4 <- c(asr3$residuals)
sample_variogram(trial_df1, effect = "resid4") # check the sample variogram
Compare the predicted genetic effects from each model with the true genetic values.
Cors <- round(cor(cbind(main_g,
BLUPs1,
BLUPs2,
BLUPs3)), 3)
colnames(Cors) <- rownames(Cors) <- c("True gvs", "Model 1", "Model 2", "Model 3")
Cors
#> True gvs Model 1 Model 2 Model 3
#> True gvs 1.000 0.523 0.621 0.654
#> Model 1 0.523 1.000 0.821 0.769
#> Model 2 0.621 0.821 1.000 0.945
#> Model 3 0.654 0.769 0.945 1.000
What model produces the highest prediction accuracies for trial_df1? What terms have resulted in this increase?
What is the expected line-mean heritability and prediction accuracy? (Hint: see tutorial)
Compare the top 20 genotypes selected from each model.
true_gv <- gv_df[gv_df$rep == 1,][rev(order(main_g)),][1:20,] # sort based on true genotype effects and take top 20 genos
top1 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs1)),][1:20,] # sort based on model 1 BLUPs and take top 20 genos
top2 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs2)),][1:20,] # sort based on model 2 BLUPs and take top 20 genos
top3 <- gv_df[gv_df$rep == 1,][rev(order(BLUPs3)),][1:20,] # sort based on model 3 BLUPs and take top 20 genos
rank_list <- cbind(true_gv$id, top1$id, top2$id, top3$id) # top 20 genotypes selected from each model
colnames(rank_list) <- c("True rank", "Model 1", "Model 2", "Model 3")
rownames(rank_list) <- paste("Ind", 1:20)
rank_list
#> True rank Model 1 Model 2 Model 3
#> Ind 1 78 55 55 251
#> Ind 2 266 386 251 275
#> Ind 3 265 251 266 255
#> Ind 4 72 222 386 266
#> Ind 5 251 315 255 55
#> Ind 6 112 296 260 386
#> Ind 7 254 111 161 296
#> Ind 8 55 266 72 161
#> Ind 9 124 388 296 89
#> Ind 10 276 325 275 72
#> Ind 11 53 161 111 262
#> Ind 12 150 260 222 342
#> Ind 13 208 87 192 87
#> Ind 14 280 89 89 222
#> Ind 15 246 347 388 260
#> Ind 16 151 72 108 73
#> Ind 17 274 92 265 379
#> Ind 18 261 57 123 317
#> Ind 19 182 10 74 152
#> Ind 20 259 292 342 35
tt <- data.frame(model = factor(rep(1:4, each = 20)),
id = c(true_gv$id, top1$id, top2$id, top3$id))
tt <- with(tt, table(id, model))
t(tt) %*% tt # top 20 genotypes in common between pairs of models
#> model
#> model 1 2 3 4
#> 1 20 4 5 4
#> 2 4 20 12 11
#> 3 5 12 20 13
#> 4 4 11 13 20
Reduce(intersect, list(true_gv$id, top1$id, top2$id, top3$id)) # top 20 genotypes in common between all three models
#> [1] "686" "492" "671" "475"
Compare the mean genetic value (genetic gain) and genetic variance of the top 20 genotypes selected from each model.
round(mean(true_gv$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 4.29
round(mean(top1$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.61
round(mean(top2$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.77
round(mean(top3$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 3.78
round(var(true_gv$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.04
round(var(top1$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.36
round(var(top2$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.36
round(var(top3$gv), digits = 2) # obtain genetic values for top 20 genos
#> [1] 0.26
Which model has produced the highest genetic gain and lowest genetic variance? Is this what you would expect? If not, why might this be the case?
Extra: Also simulate plot errors using the
field_trial_error()
function with complexity of 100 and
1,000 for bivariate interpolation. Label these trial_df3 and
trial_df4.
Now go and re-run models 1,2 and 3 for the other datasets (trial_df3 and trial_df4). What is the effect of increasing the complexity of simulated spatial variation? In particular, what happens to the prediction accuracies, selection and genetic gain. Note that the heritability is the same for each dataset, so the observed differences are coming purely from the differences in simulatd complexity.