## If a package is installed, it will be loaded. If any
## are not, the missing package(s) will be installed
## from CRAN and then loaded.
## First specify the packages of interest
= c("MCMCglmm", "parallel")
packages
## Now load or install&load all
<- lapply(
package.check
packages,FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
} )
Parallel processing for MCMCglmm in R (Windows-friendly)
I set up a virtual cluster and then use the parallel::parLapply() function to run iterations of MCMCglmm() in parallel for computers running Windows.
Lately, I have been using the MCMCglmm package to run linear mixed-models in a Bayesian framework. The documentation is generally very good but there seems to be relatively little support for using parallel processing (here: using multiple cores on your machine) to efficiently run large volumes of mcmc runs. This is especially true for Windows users, who cannot use functions like parallel::mclapply()
.
I’m happy to share that I have worked out a solution using the parallel package. Basically, I set up a virtual cluster and then use the parallel::parLapply()
function to run iterations of MCMCglmm()
in parallel.
Data
I’ll use “Example 2” from the MCMCglmm() function help. You can skip ahead to the next section if instead you’d like to tailor this to your own data & analysis.
First load (or install&load) the MCMCglmm
and parallel
packages:
With the packages loaded, we’ll prep our data set. Lifting this directly from the MCMCglmm()
help page:
data(bird.families)
<- rbv(bird.families, 1, nodes = "TIPS")
phylo.effect <- phylo.effect + rnorm(dim(phylo.effect)[1], 0, 1)
phenotype
# simulate phylogenetic and residual effects
# with unit variance
<- data.frame(phenotype = phenotype,
test.data taxon = row.names(phenotype))
<- inverseA(bird.families)$Ainv
Ainv
# inverse matrix of shared phyloegnetic history
<- list(R = list(V = 1, nu = 0.002),
prior G = list(G1 = list(V = 1, nu = 0.002)))
<- MCMCglmm(
model2 ~ 1,
phenotype random = ~ taxon,
ginverse = list(taxon = Ainv),
data = test.data,
prior = prior,
verbose = FALSE,
nitt = 1300,
burnin = 300,
thin = 1
)summary(model2)
Iterations = 301:1300
Thinning interval = 1
Sample size = 1000
DIC: 375.0159
G-structure: ~taxon
R-structure: ~units
Location effects: phenotype ~ 1
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.1630 -0.5899 0.8938 1000 0.654
Of course, the example provided sets nitt
to only 1300, yielding an ESS of only ~800 for the fixed effect. I am guessing this is intended to make sure the example is quick to run.
Boosting this to nitt = 100000
, burnin = 10000
, and thin = 10
gives a more healthy ESS > 8000. But please note that this will take a lot longer to finish (I’ll leave it up to you to use the Sys.time()
function to time it yourself).
Run MCMC chains in parallel
Whenever conducting MCMC-based analyses, it’s advisable to conduct multiple runs (different chains) and then assess convergence. I’ll leave the convergence assessments for another day (but here’s a good StackExchange post). For now we’ll just conduct 10 runs of this model, each using nitt = 100000
, using parallel processing.
PLEASE NOTE: I am setting this up to use only 80% of your machine’s total logical processors. You can certainly harness all of your CPUs if you’d like, although I advise against doing so if any of your MCMC runs take more than a few minutes. It also doesn’t make sense to set the number of logical processors to be greater than the number of runs (chains), but more on that later. Anyway, treat your silicon well!
# use detectCores() by itself if you want all CPUs
<- round(detectCores() * 0.8)
setCores
# make the cluster
<- makeCluster(getOption("cl.cores", setCores))
cl # EDIT ON 2020-07-27: I have been informed that Mac users
# may have better luck using:
# cl <- parallel::makeCluster(getOption("cl.cores", setCores),
# setup_strategy = "sequential")
# This is due to an apparent issue in RStudio.
# See this stackoverflow page for details:
# https://stackoverflow.com/questions/61700586/r-makecluster-command-used-to-work-but-now-fails-in-rstudio
# load the MCMCglmm package within the cluster
<- clusterEvalQ(cl, library(MCMCglmm))
cl.pkg
# import each object that's necessary to run the function
clusterExport(cl, "prior")
clusterExport(cl, "test.data")
clusterExport(cl, "Ainv")
# use parLapply() to execute 10 runs of MCMCglmm()
# each with nitt=100000
<- parLapply(cl = cl, 1:10, function(i) {
model2_10runs MCMCglmm(
~ 1,
phenotype random = ~ taxon,
ginverse = list(taxon = Ainv),
data = test.data,
prior = prior,
verbose = FALSE,
nitt = 100000,
burnin = 10000,
thin = 10
)
})
# once it's finished, use stopCluster() to stop running
# the parallel cluster
stopCluster(cl)
The model2_10runs
object is a list that contains each of the 10 mcmc models. You can perform all the usual summarization, plotting…etc, but just be sure to specify models within the list, e.g.: summary(model2_10runs[[3]])
to summarize the third model out of the 10
Iterations = 10001:99991
Thinning interval = 10
Sample size = 9000
DIC: 109.7491
G-structure: ~taxon
post.mean l-95% CI u-95% CI eff.samp
taxon 1.782 0.3085 2.989 178.6
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.4437 0.0001843 1.224 181.1
Location effects: phenotype ~ 1
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.1697 -0.5989 0.9841 9000 0.666
As I mentioned above, we’ll leave convergence and other fun topics like autocorrelation for another day.
That’s all!
🐢