[Bioperl-l] Bio::DB::Fasta fails for files over 4GB

Mon Aug 7 16:05:56 UTC 2006

I just found out that Bio::DB::Fasta has an inherent 4GB file size limit 
in it. This is due to how indexing information is stored. The module 
pack()s information using this format:

use constant STRUCT =>'NNnnCa*';

... where the first token is the file offset. N is a 32-bit unsigned 
integer, which rolls over when the file position passes the 4GB mark, 
so the stored offsets for those entries are garbage. Changing the 
packing format to:

use constant STRUCT =>'QNnnCa*';

...solves the problem (Q = 64-bit unsigned int). We have several genomic 
files (ensembl dumps) where this is an issue:

-rw-rw-r--  1 kirovs   bioinfo 7.2G Jul 13 12:28 pan_troglodytes.genome.CHIMP1A.fa
-rw-rw-r--  1 kirovs   bioinfo 6.8G Jul 13 12:25 monodelphis_domestica.genome.BROADO3.fa
-rw-rw-r--  1 kirovs   bioinfo 5.0G Jul 13 12:26 mus_musculus.genome.NCBIM36.fa
-rw-rw-r--  1 kirovs   bioinfo 4.6G Aug  2 15:31 bos_taurus.genome.Btau2.fa
-rw-rw-r--  1 kirovs   bioinfo 4.1G Jul 13 12:22 danio_rerio.genome.ZFISH6.fa

These are not really large genomes, but they contain a fair number of 
unassembled (duplicated) fragments, which bump up the file size. Some 
fully assembled genomes will probably top the 4GB mark eventually 
anyway.
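
For reference, the rollover described above can be reproduced with a 
few lines of Perl. This is only a sketch: the offset value is made up 
for illustration, and it assumes a Perl built with 64-bit integer 
support (which 'Q' requires). Note also that 'Q' packs in native byte 
order, whereas 'N' is always big-endian.

```perl
use strict;
use warnings;

# Hypothetical offset just past the 4GB boundary: 2**32 + 10.
my $offset = 4_294_967_296 + 10;

# 'N' keeps only the low 32 bits, so the stored offset wraps around.
my ($from_n) = unpack 'N', pack('N', $offset);

# 'Q' keeps all 64 bits (needs a Perl built with 64-bit integers).
my ($from_q) = unpack 'Q', pack('Q', $offset);

printf "N: %d\nQ: %d\n", $from_n, $from_q;
```

The value that comes back from 'N' points near the start of the file 
rather than past the 4GB mark, which is why lookups for those entries 
return the wrong sequence.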

Unfortunately, this raises a backward compatibility issue, since an 
index packed with 'N' will fail when unpacked with 'Q'. Perhaps the 
module could dynamically bifurcate the packing structure based on a file 
size test?
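
A file size test along those lines might look like the sketch below. 
The names STRUCT_32, STRUCT_64, MAX_32, struct_for_size and 
struct_for_file are hypothetical and not part of Bio::DB::Fasta; the 
sketch also assumes a 64-bit Perl and that the FASTA file has not 
changed size since the index was built.

```perl
use strict;
use warnings;

use constant STRUCT_32 => 'NNnnCa*';    # current format, 32-bit offset
use constant STRUCT_64 => 'QNnnCa*';    # proposed format, 64-bit offset
use constant MAX_32    => 0xFFFF_FFFF;  # largest value 'N' can hold

# Hypothetical helper: choose a template from a file size in bytes.
sub struct_for_size {
    my ($size) = @_;
    return ($size > MAX_32) ? STRUCT_64 : STRUCT_32;
}

# Hypothetical helper: same choice, reading the size from disk.
sub struct_for_file {
    my ($path) = @_;
    my $size = -s $path;
    die "Cannot determine size of $path\n" unless defined $size;
    return struct_for_size($size);
}
```

With this approach, existing indexes for files under 4GB would still 
unpack with the old template, and only the larger files would use 'Q'. 
An alternative would be to record the template in the index itself, so 
the reader does not have to infer it from the file size.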

The second token is the sequence length. I can't imagine a single 
sequence exceeding 4Gb, so it's probably safe - yes? Or should it also 
be Q, in the event that biology someday exceeds our current imagination?

Thanks,
CAT

-- 
Charles Tilford, Bioinformatics-Applied Genomics
Bristol-Myers Squibb PRI, Hopewell 3A039
P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
charles.tilford at bms.com