posts by vikram

vbaliga.github.io

Sporadically updated, (somewhat) useful stuff.

vikram-baliga.com back to vikram-baliga.com

mailto:vbaliga@zoology.ubc.ca Email

orcid.org/0000-0002-9367-8974 ORCID

github.com/vbaliga GitHub

vbaliga.github.io/feed.xml RSS feed



or click here for the archive



Parallel processing for MCMCglmm in R (Windows-friendly)


Lately, I have been using the MCMCglmm package to run linear mixed-models in a Bayesian framework. The documentation is generally very good but there seems to be relatively little support for using parallel processing (here: using multiple cores on your machine) to efficiently run large volumes of mcmc runs. This is especially true for Windows users, who cannot use functions like parallel::mclapply().

I’m happy to share that I have worked out a solution using the parallel package. Basically, I set up a virtual cluster and then use the parallel::parLapply() function to run iterations of MCMCglmm() in parallel.

(This is a re-post of an entry that appeared on my old blog - see here).

Data

I’ll use “Example 2” from the MCMCglmm() function help. You can skip ahead to the next section if instead you’d like to tailor this to your own data & analysis.

First load (or install\&load) the MCMCglmm and parallel packages:

packages = c("MCMCglmm", "parallel")

## Now load or install&load all
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)
## Loading required package: MCMCglmm

## Loading required package: Matrix

## Loading required package: coda

## Loading required package: ape

## Loading required package: parallel

With the packages loaded, we’ll prep our data set. Lifting this directly from the MCMCglmm() help page:

data(bird.families)
phylo.effect <- rbv(bird.families, 1, nodes = "TIPS")
phenotype <- phylo.effect + rnorm(dim(phylo.effect)[1], 0, 1)

# simulate phylogenetic and residual effects
# with unit variance
test.data <- data.frame(phenotype = phenotype,
                        taxon = row.names(phenotype))
Ainv <- inverseA(bird.families)$Ainv

# inverse matrix of shared phyloegnetic history
prior <- list(R = list(V = 1, nu = 0.002), 
              G = list(G1 = list(V = 1, nu = 0.002)))

model2 <- MCMCglmm(
  phenotype ~ 1,
  random =  ~ taxon,
  ginverse = list(taxon = Ainv),
  data = test.data,
  prior = prior,
  verbose = FALSE,
  nitt = 1300,
  burnin = 300,
  thin = 1
)
summary(model2)
## 
##  Iterations = 301:1300
##  Thinning interval  = 1
##  Sample size  = 1000 
## 
##  DIC: 468.589 
## 
##  G-structure:  ~taxon
## 
##       post.mean l-95% CI u-95% CI eff.samp
## taxon    0.4607  0.04937    1.424    4.904
## 
##  R-structure:  ~units
## 
##       post.mean l-95% CI u-95% CI eff.samp
## units     1.568   0.8002    2.134    19.49
## 
##  Location effects: phenotype ~ 1 
## 
##             post.mean l-95% CI u-95% CI eff.samp pMCMC
## (Intercept)   -0.1574  -0.6576   0.3099      812 0.488

Of course, the example provided sets nitt to only 1300, yielding an ESS of only ~800 for the fixed effect. I am guessing this is intended to make sure the example is quick to run.

Boosting this to nitt = 100000, burnin = 10000, and thin = 10 gives a more healthy ESS > 8000. But please note that this will take a lot longer to finish (I’ll leave it up to you to use the Sys.time() function to time it yourself).

Run MCMC chains in parallel

Whenever conducting MCMC-based analyses, it’s advisable to conduct multiple runs (different chains) and then assess convergence. I’ll leave the convergence assessments for another day (but here’s a good StackExchange post). For now we’ll just conduct 10 runs of this model, each using nitt = 100000, using parallel processing.

PLEASE NOTE: I am setting this up to use only 80% of your machine’s total logical processors. You can certainly harness all of your CPUs if you’d like, although I advise against doing so if any of your MCMC runs take more than a few minutes. It also doesn’t make sense to set the number of logical processors to be greater than the number of runs (chains), but more on that later. Anyway, treat your silicon well!

# use detectCores() by itself if you want all CPUs
setCores <- round(detectCores() * 0.8)

# make the cluster
cl <- makeCluster(getOption("cl.cores", setCores))

# load the MCMCglmm package within the cluster
cl.pkg <- clusterEvalQ(cl, library(MCMCglmm))

# import each object that's necessary to run the function
clusterExport(cl, "prior")
clusterExport(cl, "test.data")
clusterExport(cl, "Ainv")

# use parLapply() to execute 10 runs of MCMCglmm()
# each with nitt=100000
model2_10runs <- parLapply(cl = cl, 1:10, function(i) {
  MCMCglmm(
    phenotype ~ 1,
    random =  ~ taxon,
    ginverse = list(taxon = Ainv),
    data = test.data,
    prior = prior,
    verbose = FALSE,
    nitt = 100000,
    burnin = 10000,
    thin = 10
  )
})

# once it's finished, use stopCluster() to stop running
# the parallel cluster
stopCluster(cl)

The model2_10runs object is a list that contains each of the 10 mcmc models. You can perform all the usual summarization, plotting…etc, but just be sure to specify models within the list, e.g.: summary(model2_10runs[[3]]) to summarize the third model out of the 10

As I mentioned above, we’ll leave convergence and other fun topics like autocorrelation for another day.

That’s all!

🐢

Written on May 3, 2019 by Vikram B. Baliga (vbaliga).