[BioPython] [Popgen] a binary format for genotypes

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Mon Dec 15 23:49:29 UTC 2008


On Mon, Dec 15, 2008 at 11:53 PM, Kevin Teague <kteague at bcgsc.ca> wrote:
> A lot of the headaches of dealing with large scale data sets in a
> performance-optimizing manner (self-describing format, platform-independent
> binary files) have been worked out in other fields of science that have been
> dealing with large scale data sets for a lot longer than the field of
> bioinformatics (e.g. astronomy and climatology).
>
> While I've only used it a little bit, so I can't comment if there are any
> other formats that are worthy contenders, the HDF5 format is well
> established for working with large scale data sets:
>
> http://www.hdfgroup.org/HDF5/

I had already heard of this format, but for some reason I thought
that it couldn't be more efficient than a database.
I have to deal with a table of ~10^7 entries, related to another
one of ~10^3 entries, so, if I organize it in a certain way, it will
have ~10^10 entries.
Do you think that this binary format would be more efficient than a
database for handling all this? Does it support relationships? (ok, I
will read the documentation!! :) ).
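
To give a rough idea of the sizes involved, here is a minimal sketch of the
arithmetic for a 10^7 x 10^3 table. The 2-bit genotype encoding (codes 0-3,
four per byte) and the function names are assumptions made only for
illustration; they are not taken from HDF5, PyTables or any existing format.

```python
# Back-of-the-envelope sizes for the table described in this message.
N_ROWS = 10**7   # entries in the large table (from the message)
N_COLS = 10**3   # entries in the smaller, related table (from the message)


def matrix_size_bytes(n_rows, n_cols, bits_per_entry):
    """Bytes needed to store n_rows * n_cols fixed-width entries."""
    total_bits = n_rows * n_cols * bits_per_entry
    return (total_bits + 7) // 8  # round up to whole bytes


def pack_genotypes(codes):
    """Pack genotype codes into bytes, four 2-bit codes per byte.

    ASSUMPTION: each genotype is one of four codes (0..3), e.g.
    0/1/2 copies of an allele and 3 for "missing". Hypothetical encoding.
    """
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("genotype code must be in 0..3")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Inverse of pack_genotypes: recover the first n 2-bit codes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


if __name__ == "__main__":
    one_byte = matrix_size_bytes(N_ROWS, N_COLS, 8)
    two_bits = matrix_size_bytes(N_ROWS, N_COLS, 2)
    print("1 byte per entry: %.1f GB" % (one_byte / 1e9))
    print("2 bits per entry: %.1f GB" % (two_bits / 1e9))
```

Under these assumptions the full table takes about 10 GB at one byte per
entry and about 2.5 GB at two bits per entry, before any compression or
indexing overhead.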

>
> There are libraries for accessing this format for many languages. With
> Python there is PyTables, which is a very good library:
>
> http://www.pytables.org/

Thanks for the link

> I haven't heard of anyone using this in bioinformatics, but I've seen it
> demonstrated in a very high-traffic financial application written in Python
> where the performance of this library was impressive. The developer ported to
> PyTables after PostgreSQL became a bottleneck and found that PyTables was
> an order of magnitude faster. Of course, this isn't a purely fair
> comparison, since PyTables gives up transactions, concurrency and
> referential integrity in favor of pure speed. But in most data analysis
> pipelines, each data set can be produced independently of the others, so
> those features of an RDBMS aren't usually needed.
>
> There have been a number of other bioinformatics tools and libraries that
> have been using custom binary file formats to deal with the ever increasing
> size of bioinformatic data sets. From a sysadmin and developer perspective
> it's a big headache since these custom formats can be platform-sensitive and
> require compiling and installing binaries to deal with each data format.
> Bleh!

> I have yet to see a "custom bioinformatic binary file format" which had to
> be developed to account for shortcomings of an already existing binary file
> format ...
>
>



-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


