[Bioperl-l] Bio::DB::Fasta fails for files over 4GB
Chris Fields
cjfields at uiuc.edu
Mon Aug 7 18:46:37 UTC 2006
Why don't you submit your patch to Bugzilla?
http://bugzilla.open-bio.org/
http://www.bioperl.org/wiki/HOWTO:SubmitPatch
Lincoln could take a look at it when he gets back from vacation and
comment on it. He may have some other possibilities we haven't
thought of.
Chris
On Aug 7, 2006, at 1:30 PM, Charles Tilford wrote:
> Chris Fields wrote:
>> Dynamically determining the packing based on file size is probably
>> the way to go; it would be nice to see how this affects speed.
>> Chris
> Ok, I seem to have a functioning patch. I was also concerned about
> performance; I assume Lincoln's using a constant for the pack
> format because it optimizes compilation of the _pack() and _unpack
> () methods. So rather than make the format a variable, I made the
> methods themselves variant. The code looks at the file(s) being
> indexed, and if any file exceeds 4Gb it will use 64-bit packing
> (for both file offset and sequence length - being paranoid on the
> later, in case we get Martian genomes with chromosomes over 4Gb in
> length). I have not tested it for a directory of multiple files,
> but it should still work. Two packing formats are defined as
> constants, and the pack / unpack methods are bifurcated to use one
> or the other. I do not have a good feeling for the performance
> difference in calling a method directly or by de-referencing a
> method reference, but I assume it is minuscule.
>
> -s is used for the file size test - I assume that is nuclear-hard
> portable across platforms?
>
> I didn't figure out a good reason for the _pack/_unpack calls to
> remain object methods, so they're not in the patch.
>
>
> The patch below is against:
> # $Id: Fasta.pm,v 1.44 2006/07/17 10:39:37 sendu Exp $
>
>
> --- /net/thegeneral/home/tilfordc/Fasta.pm 2006-08-07
> 14:01:27.000000000 -0400
> +++ /stf/biocgi/tilfordc/patch_lib/Bio/DB/Fasta.pm 2006-08-07
> 14:07:52.033844532 -0400
> @@ -418,7 +418,8 @@
> *ids = \&get_all_ids;
> *get_seq_by_primary_id = *get_Seq_by_acc = \&get_Seq_by_id;
> -use constant STRUCT =>'NNnnCa*';
> +use constant STRUCT =>'NNnnCa*';
> +use constant STRUCTBIG =>'QQnnCa*'; # 64-bit file offset and seq
> length
> use constant DNA => 1;
> use constant RNA => 2;
> use constant PROTEIN => 3;
> @@ -568,6 +569,7 @@
> # get the most recent modification time of any of the contents
> my $modtime = 0;
> my %modtime;
> + $self->set_pack_method( @files );
> foreach (@files) {
> my $m = (stat($_))[9];
> $modtime{$_} = $m;
> @@ -612,6 +614,32 @@
> return Bio::PrimarySeq::Fasta->new($self,$id);
> }
> +=head2 set_pack_method
> +
> + Title : set_pack_method
> + Usage : $db->set_pack_method( @files )
> + Function: Determines whether data packing uses 32 or 64 bit integers
> + Returns :
> + Args : one or more file paths
> +
> +=cut
> +
> +sub set_pack_method {
> + my $self = shift;
> + # Find the maximum file size:
> + my ($maxsize) = sort { $b <=> $a } map { -s $_ } @_;
> + my $fourGB = (2 ** 32) - 1;
> +
> + if ($maxsize > $fourGB) {
> + # At least one file exceeds 4Gb - we will need to use 64 bit
> ints
> + $self->{packmeth} = \&_packBig;
> + $self->{unpackmeth} = \&_unpackBig;
> + } else {
> + $self->{packmeth} = \&_pack;
> + $self->{unpackmeth} = \&_unpack;
> + }
> +}
> +
> =head2 index_file
> Title : index_file
> @@ -629,6 +657,7 @@
> my $file = shift;
> my $force_reindex = shift;
> + $self->set_pack_method( $file );
> my $index = $self->index_name($file);
> # if caller has requested reindexing, then unlink the index
> unlink $index if $force_reindex;
> @@ -716,9 +745,9 @@
> if ($id) {
> my $seqlength = $pos - $offset - length($_);
> $seqlength -= $termination_length * $seq_lines;
> - $offsets->{$id} = $self->_pack($offset,$seqlength,
> - $linelength,$firstline,
> - $type,$base);
> + $offsets->{$id} = &{$self->{packmeth}}($offset,$seqlength,
> + $linelength,
> $firstline,
> + $type,$base);
> }
> $id = ref($self->{makeid}) eq 'CODE' ? $self->{makeid}->
> ($_) : $1;
> ($offset,$firstline,$linelength) = ($pos,length($_),0);
> @@ -746,10 +775,10 @@
> }
> $seqlength -= $termination_length * $seq_lines;
> };
> - $offsets->{$id} = $self->_pack($offset,$seqlength,
> - $linelength,$firstline,
> - $type,$base);
> - }
> + $offsets->{$id} = &{$self->{packmeth}}($offset,$seqlength,
> + $linelength,$firstline,
> + $type,$base);
> +}
> $offsets->{__termination_length} = $termination_length;
> return \%offsets;
> }
> @@ -770,35 +799,35 @@
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - ($self->_unpack($offset))[0];
> + (&{$self->{unpackmeth}}($offset))[0];
> }
> sub length {
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - ($self->_unpack($offset))[1];
> + (&{$self->{unpackmeth}}($offset))[1];
> }
> sub linelen {
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - ($self->_unpack($offset))[2];
> + (&{$self->{unpackmeth}}($offset))[2];
> }
> sub headerlen {
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - ($self->_unpack($offset))[3];
> + (&{$self->{unpackmeth}}($offset))[3];
> }
> sub alphabet {
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - my $type = ($self->_unpack($offset))[4];
> + my $type = (&{$self->{unpackmeth}}($offset))[4];
> return $type == DNA ? 'dna'
> : $type == RNA ? 'rna'
> : 'protein';
> @@ -818,7 +847,7 @@
> my $self = shift;
> my $id = shift;
> my $offset = $self->{offsets}{$id} or return;
> - $self->fileno2path(($self->_unpack($offset))[5]);
> + $self->fileno2path((&{$self->{unpackmeth}}($offset))[5]);
> }
> sub fileno2path {
> @@ -899,7 +928,7 @@
> my $self = shift;
> my $id = shift;
> my ($offset,$seqlength,$linelength,$firstline,$type,$file)
> - = $self->_unpack($self->{offsets}{$id}) or return;
> + = &{$self->{unpackmeth}}($self->{offsets}{$id}) or return;
> $offset -= $firstline;
> my $data;
> my $fh = $self->fh($id) or return;
> @@ -914,7 +943,7 @@
> my $self = shift;
> my $id = shift;
> my $a = shift()-1;
> - my ($offset,$seqlength,$linelength,$firstline,$type,$file) =
> $self->_unpack($self->{offsets}{$id});
> + my ($offset,$seqlength,$linelength,$firstline,$type,$file) = &
> {$self->{unpackmeth}}($self->{offsets}{$id});
> $a = 0 if $a < 0;
> $a = $seqlength-1 if $a >= $seqlength;
> my $tl = $self->{offsets}{__termination_length};
> @@ -940,15 +969,21 @@
> }
> sub _pack {
> - shift;
> pack STRUCT, at _;
> }
> +sub _packBig {
> + pack STRUCTBIG, at _;
> +}
> +
> sub _unpack {
> - shift;
> unpack STRUCT,shift;
> }
> +sub _unpackBig {
> + unpack STRUCTBIG,shift;
> +}
> +
> sub _type {
> shift;
> local $_ = shift;
>
>
>
>
>
>>
>> On Aug 7, 2006, at 11:05 AM, Charles Tilford wrote:
>>
>>> I just found out that Bio::DB::Fasta has an inherit 4GB file size
>>> limit in it. This is due to how indexing information is stored.
>>> The module pack()s information using this format:
>>>
>>> use constant STRUCT =>'NNnnCa*';
>>>
>>> ... where the first token is the file offset. N = 32-bit unsigned
>>> integer, and rolls-over when the file position passes the 4GB
>>> mark, resulting in garbage out for those entries. Changing the
>>> packing format to:
>>>
>>> use constant STRUCT =>'QNnnCa*';
>>>
>>> ...solves the problem (Q = 64-bit unsigned int). We have several
>>> genomic files (ensembl dumps) where this is an issue:
>>>
>>> -rw-rw-r-- 1 kirovs bioinfo 7.2G Jul 13 12:28
>>> pan_troglodytes.genome.CHIMP1A.fa
>>> -rw-rw-r-- 1 kirovs bioinfo 6.8G Jul 13 12:25
>>> monodelphis_domestica.genome.BROADO3.fa
>>> -rw-rw-r-- 1 kirovs bioinfo 5.0G Jul 13 12:26
>>> mus_musculus.genome.NCBIM36.fa
>>> -rw-rw-r-- 1 kirovs bioinfo 4.6G Aug 2 15:31
>>> bos_taurus.genome.Btau2.fa
>>> -rw-rw-r-- 1 kirovs bioinfo 4.1G Jul 13 12:22
>>> danio_rerio.genome.ZFISH6.fa
>>>
>>> These are not really large genomes, but have a fair number of
>>> unassembled (duplicitous) fragments in them, which bump up the
>>> file size. Some fully assembled genomes will probably eventually
>>> top the 4GB mark, anyway.
>>>
>>> Unfortunately, this raises a backward compatibility issue, since
>>> an index packed with 'N' will fail when unpacked with 'Q'.
>>> Perhaps the module could dynamically bifurcate the packing
>>> structure based on a file size test?
>>>
>>> The second token is for the sequence length, I can't imagine a
>>> single sequence exceeding 4Gb, so it's probably safe - yes?
>>> Should it also be Q in the event that biology someday exceeds our
>>> current imagination?
>>>
>>> Thanks,
>>> CAT
>>>
>>> --
>>> Charles Tilford, Bioinformatics-Applied Genomics
>>> Bristol-Myers Squibb PRI, Hopewell 3A039
>>> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
>>> charles.tilford at bms.com <mailto:charles.tilford at bms.com>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org <mailto:Bioperl-l at lists.open-bio.org>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>
>
> --
> Charles Tilford, Bioinformatics-Applied Genomics
> Bristol-Myers Squibb PRI, Hopewell 3A039
> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
> charles.tilford at bms.com
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list