Using Rcpp and C++ to Count Genotypes

I had a matrix (88662 loci x 2060 genotypes) of maize chromosome 1 genotypes, encoded as 0, 1, 2 (e.g. the number of alternate alleles). I needed genotype counts per row, which at first glance is quite easy to solve: just use apply and table:

 counts <- apply(chr1g, 1, table)

What’s the problem with this approach? First, technical: if a row only has genotypes 0 and 2, we don’t get counts for 1, which makes merging into a matrix later on a total nightmare (evident because apply is smart enough to return a list). Second, it ignores NA, which is not good. We can fix this with exclude=NULL:

counts <- apply(chr1g, 1, function(x) table(x, exclude=NULL))

This doesn’t solve our first problem, and it’s a bit slower. apply(chr1g, 1, table) took:

  user  system elapsed
84.638   3.110  87.865

Where as apply(chr1g, 1, function(x) table(x, exclude=NULL)) takes:

   user  system elapsed
108.718   5.708 114.609

(and no, passing exclude=NULL to directly to apply doesn’t make it faster). The way around the first technical issue is to use factors. as.factor removes dimensions, so we’re out of luck converting the whole matrix at once (plus, this would require a copy in memory and these objects are moderately large). So, we could so something like:

count <- apply(chr1g, 1, function(x) table(factor(x, levels=c(0, 1, 2, NA)), exclude=NULL))})

This works well, but is also not too fast (chromosome 1 is 15% of our dataset):

  user  system elapsed
95.918   5.746 101.861

Rcpp stands out as a simple solution here: this is very easy to code up (it took me literally five minutes). Looking at the timing first:

  user  system elapsed
 8.427   1.573  10.027

We see we have a clear winner. The whole dataset is 573,392 rows. Overall, this would take (95.918 / 88662)* 573392 = 620.3178 seconds or about 10 minutes to complete on all data. Chances are, I’ll have to run this code a few times as the analysis changes. In contrast, the Rcpp method takes (8.427 / 88662)* 573392 = 54.49882 seconds. That’s under 6 minutes to code and implement a working, faster solution in Rcpp!

The code is quite easy to understand too:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector countGenotypes(IntegerVector x) {
    // This method is a specialized version of R's table that counts genotypes
    // encoded as 0, 1, 2 in a vector (and also returns NA) always of length 4,
    // always as numbers of 0, 1, 2, NA. This allows faster usage with apply, as
    // we don't need to convert to factor to get all genotype counts, even if
    // none are present.
    CharacterVector names = CharacterVector::create("0", "1", "2", "NA");
    IntegerVector genocounts(4);
    genocounts.fill(0);
    for (int i = 0; i < x.length(); i++) {
            if (!IntegerVector::is_na(x[i])) {
                    genocounts[x[i]] += 1;
            } else {
                    genocounts[3] += 1;
            }
    }
    genocounts.attr("names") = names;
    return genocounts;
}
Recent Notes » archive

Notes includes my technical writing on population and evolutionary genetics, evolution, statistics and probability, and computing.