[BioPython] [Popgen] a binary format for genotypes

Mon Dec 15 22:53:29 UTC 2008

A lot of the headaches of dealing with large-scale data sets in a
performance-conscious way (self-describing formats, platform-
independent binary files) have already been worked out in other
fields of science that have been dealing with large-scale data sets
for a lot longer than bioinformatics has (e.g. astronomy and
climatology).

I've only used it a little, so I can't say whether there are other
formats that are worthy contenders, but the HDF5 format is well
established for working with large-scale data sets:

http://www.hdfgroup.org/HDF5/

There are libraries for accessing this format for many languages. With  
Python there is PyTables, which is a very good library:

http://www.pytables.org/
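
To give a rough idea of what this could look like for genotype data,
here is a minimal sketch. It assumes a recent PyTables (3.x) and NumPy
are installed. The file name, the node name "genotypes", the 0/1/2
coding and the random data are all made up for illustration; none of
this is an existing BioPython or PopGen format.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical data: 100 individuals x 1000 SNPs, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 1000), dtype=np.int8)

# Hypothetical file location, in a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "genotypes.h5")

# Write a compressed, chunked array and attach a descriptive attribute,
# so the file carries its own metadata (self-describing).
with tables.open_file(path, mode="w", title="Example genotype store") as h5:
    filters = tables.Filters(complevel=5, complib="zlib")
    arr = h5.create_carray(
        h5.root,
        "genotypes",
        atom=tables.Int8Atom(),
        shape=genotypes.shape,
        filters=filters,
    )
    arr[:] = genotypes
    arr.attrs.coding = "0/1/2 = copies of the minor allele"

# Read back a single SNP column without loading the whole matrix.
with tables.open_file(path, mode="r") as h5:
    node = h5.root.genotypes
    snp_42 = node[:, 42]
    coding = node.attrs.coding
```

The point of the sketch is that byte order, compression and metadata
are handled by the HDF5 library rather than by hand-written code, and
slices can be read from disk on demand.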

I haven't heard of anyone using this in bioinformatics, but I've seen
it demonstrated in a very high-traffic financial application written
in Python, where the performance of this library was impressive. The
developer ported to PyTables after PostgreSQL became a bottleneck and
found that PyTables was an order of magnitude faster. Of course, this
isn't an entirely fair comparison, since PyTables gives up
transactions, concurrency and referential integrity in favor of pure
speed. But in most data analysis pipelines each data set can be
produced independently of the others, so those features of an RDBMS
aren't usually needed.

A number of other bioinformatics tools and libraries use custom
binary file formats to deal with the ever-increasing size of
bioinformatics data sets. From a sysadmin and developer perspective
this is a big headache, since these custom formats can be
platform-sensitive and require compiling and installing binaries to
deal with each data format. Bleh!
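
As a small illustration of the platform sensitivity, here is a sketch
using only Python's standard struct module. The record layout (a
32-bit position followed by four 8-bit genotype codes) is invented for
the example and does not correspond to any real tool's format.

```python
import struct
import sys

# Hypothetical record: SNP position (uint32) plus four genotype codes (int8).
record = (123456, 0, 1, 2, 1)

# The same record packed with native, little-endian and big-endian byte order.
native = struct.pack("=I4b", *record)
little = struct.pack("<I4b", *record)
big = struct.pack(">I4b", *record)

# Reading with the byte order used for writing recovers the record.
round_trip = struct.unpack("<I4b", little)

# Reading with the wrong byte order silently yields a different position.
misread_position = struct.unpack(">I4b", little)[0]

print(sys.byteorder, round_trip, misread_position)
```

A format that writes "whatever the machine uses" ends up like the
native case above: the bytes on disk depend on where the file was
written. A self-describing format records the byte order in the file,
so the reader doesn't have to guess.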

I have yet to see a "custom bioinformatic binary file format" which
had to be developed to account for the shortcomings of an already
existing binary file format ...