[Bioperl-l] Bio::DB::Fasta fails for files over 4GB
Charles Tilford
charles.tilford at bms.com
Mon Aug 7 16:05:56 UTC 2006
I just found out that Bio::DB::Fasta has an inherent 4GB file size limit.
This is due to how indexing information is stored. The module
pack()s information using this format:
use constant STRUCT =>'NNnnCa*';
... where the first token is the file offset. N is a 32-bit unsigned
integer, so it rolls over when the file position passes the 4GB mark,
and the index entries past that point come back as garbage. Changing the
packing format to:
use constant STRUCT =>'QNnnCa*';
...solves the problem (Q = 64-bit unsigned int). We have several genomic
files (ensembl dumps) where this is an issue:
-rw-rw-r-- 1 kirovs bioinfo 7.2G Jul 13 12:28 pan_troglodytes.genome.CHIMP1A.fa
-rw-rw-r-- 1 kirovs bioinfo 6.8G Jul 13 12:25 monodelphis_domestica.genome.BROADO3.fa
-rw-rw-r-- 1 kirovs bioinfo 5.0G Jul 13 12:26 mus_musculus.genome.NCBIM36.fa
-rw-rw-r-- 1 kirovs bioinfo 4.6G Aug  2 15:31 bos_taurus.genome.Btau2.fa
-rw-rw-r-- 1 kirovs bioinfo 4.1G Jul 13 12:22 danio_rerio.genome.ZFISH6.fa
These are not really large genomes, but have a fair number of
unassembled (redundant) fragments in them, which bump up the file
size. Some fully assembled genomes will probably eventually top the 4GB
mark, anyway.
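
To make the rollover concrete, here is a minimal sketch (not the module's actual code). It assumes a perl built with 64-bit integer support, which the Q template requires. Only the first two fields (offset and sequence length) are described in this message; the remaining values and the 'chr1' name are made-up placeholders.

```perl
use strict;
use warnings;

# The two layouts discussed above; the first field is the file offset.
use constant STRUCT32 => 'NNnnCa*';
use constant STRUCT64 => 'QNnnCa*';    # Q needs 64-bit integer support

# Hypothetical index entry whose offset lies 10 bytes past the 4GB mark.
# Everything after the second field is a placeholder for illustration.
my @entry = (4_294_967_296 + 10, 1_000, 60, 61, 1, 'chr1');

# Round-trip the entry through each layout and look at the offset only.
my $wrapped = (unpack(STRUCT32, pack(STRUCT32, @entry)))[0];
my $intact  = (unpack(STRUCT64, pack(STRUCT64, @entry)))[0];

print "offset stored:     $entry[0]\n";
print "read back via N:   $wrapped\n";    # truncated to the low 32 bits
print "read back via Q:   $intact\n";
```

One thing to keep in mind: N is always big-endian, while a plain Q uses the machine's native byte order, so an index written with Q may not be portable between architectures.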
Unfortunately, this raises a backward compatibility issue, since an
index packed with 'N' will fail when unpacked with 'Q'. Perhaps the
module could dynamically bifurcate the packing structure based on a file
size test?
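
A rough sketch of what such a file size test might look like follows. The names STRUCTBIG, MAX32, struct_for_size and struct_for_file are hypothetical and do not exist in Bio::DB::Fasta; this only illustrates the idea, assuming a perl with 64-bit integers.

```perl
use strict;
use warnings;

use constant STRUCT    => 'NNnnCa*';        # existing layout, 32-bit offset
use constant STRUCTBIG => 'QNnnCa*';        # proposed layout, 64-bit offset
use constant MAX32     => 4_294_967_295;    # largest value N can hold

# Hypothetical helper: choose a pack template from a size in bytes.
sub struct_for_size {
    my ($bytes) = @_;
    if ($bytes > MAX32) {
        return STRUCTBIG;
    }
    return STRUCT;
}

# Hypothetical helper: same choice, driven by the FASTA file on disk.
sub struct_for_file {
    my ($path) = @_;
    my $bytes = (stat $path)[7];    # element 7 of stat() is the size
    die "Cannot stat $path: $!\n" unless defined $bytes;
    return struct_for_size($bytes);
}

# Usage: perl thisscript.pl some_genome.fa
if (@ARGV) {
    print "$ARGV[0]: ", struct_for_file($ARGV[0]), "\n";
}
```

Indexes for files under 4GB would keep the old layout and stay readable. The catch is that a file which later grows past the threshold would need its index rebuilt with the new layout, since the two are not interchangeable.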
The second token is the sequence length. I can't imagine a single
sequence exceeding 4Gb, so it's probably safe - yes? Or should it also be Q
in case biology someday exceeds our current imagination?
Thanks,
CAT
--
Charles Tilford, Bioinformatics-Applied Genomics
Bristol-Myers Squibb PRI, Hopewell 3A039
P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
charles.tilford at bms.com