Day to day bioinformatics involves interfacing and executing many programs to process data. We end up with some refinement of the data from which we extract biological meaning through data analysis. Given how much interfacing bioinformatics involves, this process undergoes very little thought or design optimization.
Much more attention is needed on the design of interfaces in bioinformatics, to improve their ease of use, robustness, and scalability. Interfacing is a low-level problem that we shouldn’t be wasting time on when there are much better high-level problems out there.
More bluntly, interfacing is currently an inconvenient glue that full time bioinformaticians waste too many hours on. There is no better illustration of this than by looking at how much time we waste in file parsing tasks. Parsers are most commonly employed in bioinformatics as crappy interfaces to non-standard formats. We need better designed interfaces and cleaner interface patterns to help.
I’m hardly the only one to complain about this. Fred Ross had this sadly accurate description:
Estimating from my own experience and observation of my colleagues, most bioinformaticists today spend between 90% and 100% of their time stuck in cruft. Methods are chosen because the file formats are compatible, not because of any underlying suitability. Second, algorithms vanish from the field…. I’m worried about the number of bioinformaticists who don’t understand the difference between an O(n) and an O(n^2) algorithm, and don’t realize that it matters.
He’s parting with bioinformatics, leaving our field with one less person to fix things. However, if practices are suboptimal and frustrating now, it’s not because people are unprepared to implement better approaches, it’s because they’re content with the status quo because it does work. But, as I’ll argue, we shouldn’t be wasting our time on this and much more elegant solutions exist.
The current practice is characterized by system calls from a scripting language to execute larger bioinformatics programs, parse any output files created, and repeat. In my mind, I see this as a paradigm in which state) is stored on the file system, and execution is achieved by passing a stringified set of command-line arguments to a system call.
This practice and paradigm has been the standard for bioinformatics for ages. It’s clunky and inelegant, but it works for routine bioinformatics tasks. However, the current practice isn’t well suited for large, embarrassingly parallel tasks that will grow increasingly common as the number of samples increases in genomics projects. Portability of these pipelines is usually terrible, and involves awkward tasks like ensuring all called programs are in a user’s
$PATH (good luck with version differences too). Program state is stored on the disk, the slowest component apart from the network. Here’s a better look at how to really understand how costly storing state on a disk can be (zoom into this image).
Storing state on a slow, highly-mutable, non-concurrent component is only acceptable if it’s too big to store in memory. Bioinformatics certainly has tasks that produce files too large to store in memory. However, if a user had a task with little memory overhead, the complete lack of interfaces other than the command-line to all aligners, mappers, assemblers, etc would require the user to write to the disk. If they’re clever, they can invoke a subprocess from a script and capture standard out, but then they’re back to parsing text as a sloppy interface, rather than handling higher-level models or objects. This needs to change.
While I’m maybe being a little harsh on the current paradigm, I should say some parts are elegant. Unix piping is extremely elegant, and avoids any latency due to writing to the disk between execution steps. Workflows like this example from the samtools mpileup man page are clear and powerful:
samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
The real drawback of the current system occurs when a user wants to leverage the larger command-line programs for something novel. Most aligners assume you want to align a few sequences, output the results, and then stop. A user that wants an aligner to align a few sequences, and then proceed down different paths depending on the output has the hassle of writing the sequences to a FASTA file or serializing the sequences in the FASTA format, invoking a subprocess, and then either passing it a filename, or (if the tool supports this), passing the serialized string to the subprocess through standard in. If the command-line tool has overheard for starting up, this is incurred during each subprocess call (even if it could be shared and the overhead amortized across subprocess calls).
To make matters worse, many bioinformatics formats like FASTA, FASTQ, and GTF are either ill-defined or implemented differently across libraries, making them risky interface formats. In contrast, consider the elegance of Google’s protocol buffers. This allow users to write their data structures in the protocol buffer interface description language, and compile interfaces for C++, Java, and Python. This is the type of high-level functionality bioinformatics needs to interface incredibly complex data structures, yet we’re still stuck in the text parsing stone age.
One way to avoid unnecessary clunky system calls is through foreign foreign interfaces (FFIs) to shared libraries, using thin wrappers in a user’s scripting language of choice. Large bioinformatics programs like aligners, assemblers, and mappers are most commonly written in C or C++, which allows code to be compiled as a shared library relatively painlessly.
FFIs could solve a few problems for bioinformaticians. First, they would allow much more code recycling in low-level projects: rather than writing your own highly-optimized C FASTA/FASTQ parser, you can ejust link against a shared library with that routine. Additionally, that shared library can be separately developed and improved.
Second, FFIs allow modularity and high-level access to low-level routines. Genome assemblers are packed to the gills with useful functions. So are aligners. Yet unless the developer took the time to separate this out via subcommands or an API (like git or samtools), you’re unlikely to ever be able to access this functionality. Developers with an eye for better program design can write higher level functions that could be utilized through a FFI. Now, novel bioinformatics tasks that may require some sequence assembly, or a few parallel calls to an aligner can be tackled without the system call rubbish, or re-implementing all the low-level algorithms.
For higher-level functionality with FFIs and shared libraries, wrappers work beautifully. Rather than wrapping entire programs through the command line (as BioPython does), scripting language libraries could interact more directly with low-level programs. In cases in which the current paradigm just doesn’t fit, we’d have the option to avoid it by calling routines directly. Tools like samtools are very successful because they have a powerful API that allow programs like pysam to call their routines.
Imagine now that you could also load adapter and quality trimmers wrappers around shared libraries. Rather than using Unix pipes or bash scripts to write quality control pipelines, and have every program in the execution chain read, parse, and then write FASTA formatted files, it could be done once, using object abstractions of data.
import sys import biolib import sickle import scythe for read in biolib.read_fasta(open(sys.argv)): read_tmd = scythe.trim(read.seq, read.qual, prior=0.3) biolib.write_fasta(sickle.trim(read_tmd.seq, read_tmd.qual, qual=20), stdout)
In this artificial example, reads could then be sent directly to aligners rather than to standard out. Working with higher-level models of data (in this case, a read) allows easier real-time debugging, statistics gathering, and parallelization. Imagine being able to put this entire block in a
try statement, and have exceptions handled at a higher level. An error could invoke a debugger, and a bioinformatician could inspect the culprit interactively in real-time. This is impossible in the old paradigm (and we’ve all spent ages using streaming tools to track down such bugs).
Note that I’m not arguing that your average biologist should suddenly start trying to understand position-independent code and compile shared libraries to avoid making systems calls in their scripts. Sometimes a system call is the right tool for the job. But bioinformatics software developers should reach for a system call not because it’s the only interface, but because it’s the best interface for a particular task. Maybe someday we’ll even see thin wrappers coming packaged with bioinformatics tools (even if under
contrib/ and written by other developers) — I can dream, right?
Here’s a real FFI example: I needed to assemble many sets of sequences for a pet project. I really wanted to avoid needlessly writing FASTA files of these sets of sequences to disk, as this is quite costly. However, most assemblers were designed solely to be interfaced through the command-line. The inelegance of writing thousands of files, making thousands of system calls to an assembler, then reading and parsing the results not appealing, so I wrote pyfermi, a simple Python interface to Heng Li’s Fermi assembler (note that this software is experimental, so use it with caution). First off, I couldn’t have done this without Heng’s excellent assembler and code, so I owe him a debt of gratitude. The Fermi source has a beautiful example of high-level functions that can be interfaced with relative ease: see the
fm6_api_* functions used in example.c.
I wrote a few extra C functions in pyfermi (mostly to deal with
* necessary because Python’s ctypes can’t handle foreign types AFAIK), and compile Fermi as a shared library. I was able to do all this in far less time than it would have taken me to go down the route of writing thousands of files, making syscalls, and handing file parsing.
Overall, bioinformaticians need to be more conscious of the design and interfacing of our tools. I strongly believe tools and methods shape how we approach and solve problems. A data programmer only trained in Perl will likely at some point wage a messy war trying to produce decent statistical graphics. Likewise, a statistician only trained in t-tests and ANOVA, will only see normally distributed data (and apply every transformation under the sun to force their data into this shape). I’m hardly the first person to argue that this occurs: this idea is known as the law of the instrument. Borrowing a 1964 quote from Abraham Kaplan (and Wikipedia):
I call it the law of the instrument, and it may be formulated as follows: Give a small boy a hammer, and he will find that everything he encounters needs pounding.
Earlier in our bioinformatics careers, we’re trained in file formats, writing scripts, and calling large programs via the command-line. This becomes a hammer, and all problems start looking like nails. However, the old practice and paradigms are breaking down as the scale of our data and the complexity of our problems increase. We need new school bioinformatics with a focus on the bigger picture, so let’s start talking about how we do it.