posts by vikram

Sporadically updated, (somewhat) useful stuff.

The motivation often comes from trying to google a question or topic only to find little-to-no answer. Should I have luck in answering it myself, I'll post it here with the hope that it helps other people with similar questions 🐢 back to Email GitHub

@labrichthys Twitter ORCID RSS feed

or click here to see a list of all posts

Replace text within all cells of a specific column of a data frame or tibble

Recently, I needed to find a way to rename specific cells within one column of a tibble without affecting cells in other columns. I knew that stringr::str_replace() is awesome for this sort of thing, but I hadn’t quite grasped how I could target specific columns with it.

Fortunately,dplyr::mutate_at(), and newer mechanisms via dplyr::across(), seem to fit the bill. I’ll run through a few examples in this post.

Read more

Searching for sequences on GenBank - a few tips and tricks

This will be a “low-tech” post, i.e. one that doesn’t showcase code.

Over the past few years, I’ve done a lot of searches for genetic sequences on GenBank with the endgame of building phylogenetic trees. Here’s a list of things that have & haven’t worked for me when it has come to the task of finding specific genes for specific taxa.

A few good ways to set up the search term for nuclear genes are (ordered by specificity):

  • (Genus_species) NOT (whole genome) NOT predicted
  • (Genus_species) NOT (whole genome) NOT predicted NOT mitochondri*
  • (Genus_species) NOT (whole genome) NOT predicted NOT mitochondri* AND (gene1 OR gene2)

Note 1: I always avoid including sequences that have the word “predicted” in the title; I’d rather rely on sequences that have established identity. That said, including NOT predicted can be dangerous as the word “predicted” may appear e.g. in the title of the corresponding paper, but the sequence itself may be known with more certainity.

Note 2: For similar reasons, NOT (whole genome) may fare better than NOT genome

Note 3: OR needs to be nested within parenthetical statements. Taking the third bullet point as an example, a search without the the parentheses surrounding the OR statement such as (Genus_species) NOT (whole genome) NOT predicted NOT mitochondri* AND gene1 OR gene2 would amount to searching for (Genus_species) NOT (whole genome) NOT predicted NOT mitochondri* AND gene1 OR gene2. Therefore, all cases of gene2 for every species ever would appear in the search results!

Perhaps this list of notes will grow more someday, but that’s all for now


Read more

GenBank in R, part II - Pulling multiple sequences for multiple genes (with help from purrr)

Building off my previous post, I have now devised a way to not only batch download GenBank sequences for a given gene, but also across multiple genes. This post will give a worked-out example using the sets of genes I used to build a phylogeny of 220 birds (available here) as part of Baliga et al. (2019).

One wrinkle I had encountered was that the GenBank API enforces limits to batch requests (as it should!). Luckily, purrr::slowly() provides a way around this by introducing pauses between successive calls.

The code, which is available at vbaliga/genbank_downloadR, currently works but is also in a developmental stage – I hope to soon add in functionality for getting proteins and other sequence types. A stable version of the code that works for DNA sequences can be found in this release, as well as in the text of this post.

Read more

GenBank in R - Download DNA or protein sequences using the ape and rentrez packages

I realized the other day that using ape::read.Genbank() does not work for downloading protein sequences in batch from Genbank.

unaligned sequences from genbank

This post will cover how to use the rentrez package to download protein sequences from GenBank while also recapping how read.Genbank() can do a similar thing for a set of DNA seqs. I’ll actually start with the DNA example because I suspect it’s the more common use case.

Both pipelines generate FASTA files, which can then be used for multiple sequence alignment etc etc.

Read more

Introducing workloopR - an R package for muscle physiology data

I haven’t written here in a few months, largely because I have been writing & sharing code elsewhere. It is my pleasure to share that one of these projects has now been published.


workloopR is an R package that is now available through rOpenSci. As I explain in this post on the rOpenSci Blog, studies of muscle physiology have long relied on closed-source, proprietary software for data analyses. This raises the same issues as any other black-box situation: it’s hard to know how analyses are performed or whether things are working properly. To give muscle physiologists a set of tools to help perform reproducible research, workloopR is designed to handle the import, transfomation, and analysis of data from three types of experiments: twitch, tetanus, and of course, work loops.

After writing the package with my lab mate, Shree Senthivasan, all code and documentation was peer-reviewed via rOpenSci. We also wrote a short accompanying paper that has now been published in the Journal of Open Source Software, available here:

The package is available on GitHub via rOpenSci:

I also recommend seeing our documentation site, which includes vignettes:


Read more

PCA and inferring ancestral states - some observations

There is a healthy debate on when we should use phylogenetic principal component analysis (pPCA; Revell 2009). The technique can be valuable because it provides a rotation of multivariate data that accounts for the effects of phylogeny. ancestral states via PCA techniques But unlike 'vanilla' PCA, pPCA results in species' scores being correlated across axes. Summed eigenvalues also don’t match the total variance in the original data. Accordingly, some researchers prefer to use vanilla PCA even if phylogenetic signal in the data is strong.

I started to wonder how ancestral state estimation could be affected by choice of PCA technique. Surely, inferring ancestral states directly from scores from vanilla PC axes would lead to faulty results? What are the best practices to avoid this issue? And does the software we use already account for all this?

In this post, I’ll show how inferring ancestral states from scores on vanilla PC axes indeed leads to biased estimates. But if ancestral states are estimated beforehand and then rotated during PCA, this problem goes away. Inferring ancestral states from phylogenetic PCA scores, however, seems to be fine.

Read more

Images of Brachyistius aletes, courtesy of the California Academy of Sciences

I am happy to share that a study in which we determined how ecological specialization drives convergence in body shape and size among cleaner fishes is now available. Brachyistius aletes image 2

To trace the evolution of body shape, I collected morphological data from ~300 species of fishes. As part of this project, I got the chance to visit several museum collections across the US, including the Smithsonian National Museum of Natural History, the LA County Museum of Natural History, and the California Academy of Sciences.

After these visits, I realized I had missed one species: Brachyistius aletes, a type of surfperch. To my surprise, there were no clear photographs of this species available online - at least my frantic Google Images search yielded nothing.

Fortunately, Dave Catania (Senior Collections Manager of Ichthyology at the CAS) was able to hunt down a set of preserved specimens and snap a few photos for me. Here I’ll post these photos of Brachyistius aletes to make them available to others. Thanks again for sharing these, Dave!

Read more

Run phytools' make.simmap() in parallel

In macroevolutionary studies, we often use stochastic character mapping to infer how a discrete trait may have evolved. simmap parallel anoledata

I am grateful that the phytools package allows easy implementation of character mapping via the make.simmap() function. This method uses a Markovian process where we sample character histories in proportion to their posterior probabilities under a given model. So we need to simulate many, many (hundreds, thousands…) of potential histories to get meaningful results.

As with any other algorithm that we’d like to run repeatedly, it makes sense to see if parallelization can help us.

Here I provide code to run make.simmap() in parallel. It’s a Windows-friendly approach and similar to my code from another blog post, I make use of parLapply().

Read more

Sample variance at small sample sizes II - distributions

sample variance ggridges In my previous post (see here), I showed that although sample variance on average gives an unbiased estimate of population variance, it is highly unreliable at extremely small sample sizes. This time, I will focus more closely on the distributions of sample variance. Does sample size affect the distributions of sample variance? And how might this inform how we determine which sample sizes are too small? I'll use one of my favorite new(ish) packages, ggridges, to plot the sets of distributions from one example simulation.

Read more

Dangers of sample variance at small sample size

sample variance vs sample size Sample variance generally gives an unbiased estimate of the true population variance, but that does not mean it provides a reliable estimate of population variance. Here, I show that sample variance itself has high variance at low sample sizes. I run through a variety of empirical simulations that vary population size and population variance to see what general patterns emerge.

Read more

Parallel processing for MCMCglmm in R (Windows-friendly)

Lately, I have been using the MCMCglmm package to run linear mixed-models in a Bayesian framework. The documentation is generally very good but there seems to be relatively little support for using parallel processing (here: using multiple cores on your machine) to efficiently run large volumes of mcmc runs. This is especially true for Windows users, who cannot use functions like parallel::mclapply().

I’m happy to share that I have worked out a solution using the parallel package. Basically, I set up a virtual cluster and then use the parallel::parLapply() function to run iterations of MCMCglmm() in parallel.

(This is a re-post of an entry that appeared on my old blog - see here).

Read more

Check if packages are installed (and install if not) in R

Say you have an R script that you share with others. You may not be sure that each user has installed all the packages the script will require. Using install.packages() would be unnessary for users who already have the packages and simply need to load them.

Here’s some code that provides an easy way to check whether specific packages are in the default Library. If they are, they’re simply loaded via library(). If any packages are missing, they’re installed (with dependencies) into the default Library and are then loaded.

(This is a re-post of an entry that appeared on my old blog - see here).

Read more

Set max DLLs in R (Windows)

On occasion, you may need adjust the max number of .dll files that R can handle. I first encountered this need when using a high number of packages together.

I’ve had trouble finding this info in the past, so I decided to create this post for others. This works if you are using Windows.

Read more

test post

#tags: #test  #R 

I am using this post to modify an R Markdown template provided by the rmarkdown package. This will be mostly futzing around so I can get familiar with how things work. Some light plagiarism will be involved for the things I don’t feel like changing.

Read more