[Bioperl-l] Bio::DB::Fasta fails for files over 4GB

Mon Aug 7 18:46:37 UTC 2006

Why don't you submit your patch to Bugzilla?

http://bugzilla.open-bio.org/
http://www.bioperl.org/wiki/HOWTO:SubmitPatch

Lincoln could take a look at it when he gets back from vacation and  
comment on it.  He may have some other possibilities we haven't  
thought of.

Chris

On Aug 7, 2006, at 1:30 PM, Charles Tilford wrote:

> Chris Fields wrote:
>> Dynamically determining the packing based on file size is probably  
>> the way to go; it would be nice to see how this affects speed.
>> Chris
> Ok, I seem to have a functioning patch. I was also concerned about  
> performance; I assume Lincoln's using a constant for the pack  
> format because it optimizes compilation of the _pack() and _unpack 
> () methods. So rather than make the format a variable, I made the  
> methods themselves variant. The code looks at the file(s) being  
> indexed, and if any file exceeds 4Gb it will use 64-bit packing  
> (for both file offset and sequence length - being paranoid on the  
> later, in case we get Martian genomes with chromosomes over 4Gb in  
> length). I have not tested it for a directory of multiple files,  
> but it should still work. Two packing formats are defined as  
> constants, and the pack / unpack methods are bifurcated to use one  
> or the other. I do not have a good feeling for the performance  
> difference in calling a method directly or by de-referencing a  
> method reference, but I assume it is minuscule.
>
> -s is used for the file size test - I assume that is nuclear-hard  
> portable across platforms?
>
> I didn't figure out a good reason for the _pack/_unpack calls to  
> remain object methods, so they're not in the patch.
>
>
> The patch below is against:
> # $Id: Fasta.pm,v 1.44 2006/07/17 10:39:37 sendu Exp $
>
>
> --- /net/thegeneral/home/tilfordc/Fasta.pm    2006-08-07  
> 14:01:27.000000000 -0400
> +++ /stf/biocgi/tilfordc/patch_lib/Bio/DB/Fasta.pm    2006-08-07  
> 14:07:52.033844532 -0400
> @@ -418,7 +418,8 @@
> *ids = \&get_all_ids;
> *get_seq_by_primary_id = *get_Seq_by_acc  = \&get_Seq_by_id;
> -use constant STRUCT =>'NNnnCa*';
> +use constant STRUCT    =>'NNnnCa*';
> +use constant STRUCTBIG =>'QQnnCa*'; # 64-bit file offset and seq  
> length
> use constant DNA     => 1;
> use constant RNA     => 2;
> use constant PROTEIN => 3;
> @@ -568,6 +569,7 @@
>   # get the most recent modification time of any of the contents
>   my $modtime = 0;
>   my %modtime;
> +  $self->set_pack_method( @files );
>   foreach (@files) {
>     my $m = (stat($_))[9];
>     $modtime{$_} = $m;
> @@ -612,6 +614,32 @@
>   return Bio::PrimarySeq::Fasta->new($self,$id);
> }
> +=head2 set_pack_method
> +
> + Title   : set_pack_method
> + Usage   : $db->set_pack_method( @files )
> + Function: Determines whether data packing uses 32 or 64 bit integers
> + Returns :
> + Args    : one or more file paths
> +
> +=cut
> +
> +sub set_pack_method {
> +  my $self = shift;
> +  # Find the maximum file size:
> +  my ($maxsize) = sort { $b <=> $a } map { -s $_ } @_;
> +  my $fourGB    = (2 ** 32) - 1;
> +
> +  if ($maxsize > $fourGB) {
> +      # At least one file exceeds 4Gb - we will need to use 64 bit  
> ints
> +      $self->{packmeth}   = \&_packBig;
> +      $self->{unpackmeth} = \&_unpackBig;
> +  } else {
> +      $self->{packmeth}   = \&_pack;
> +      $self->{unpackmeth} = \&_unpack;
> +  }
> +}
> +
> =head2 index_file
>  Title   : index_file
> @@ -629,6 +657,7 @@
>   my $file = shift;
>   my $force_reindex = shift;
> +  $self->set_pack_method( $file );
>   my $index = $self->index_name($file);
>   # if caller has requested reindexing, then unlink the index
>   unlink $index if $force_reindex;
> @@ -716,9 +745,9 @@
>       if ($id) {
>     my $seqlength    = $pos - $offset - length($_);
>     $seqlength      -= $termination_length * $seq_lines;
> -    $offsets->{$id}  = $self->_pack($offset,$seqlength,
> -                    $linelength,$firstline,
> -                    $type,$base);
> +    $offsets->{$id}  = &{$self->{packmeth}}($offset,$seqlength,
> +                                                $linelength, 
> $firstline,
> +                                                $type,$base);
>       }
>       $id = ref($self->{makeid}) eq 'CODE' ? $self->{makeid}-> 
> ($_) : $1;
>       ($offset,$firstline,$linelength) = ($pos,length($_),0);
> @@ -746,10 +775,10 @@
>       }
>       $seqlength -= $termination_length * $seq_lines;
>     };
> -    $offsets->{$id} = $self->_pack($offset,$seqlength,
> -                   $linelength,$firstline,
> -                   $type,$base);
> -  }
> +    $offsets->{$id} = &{$self->{packmeth}}($offset,$seqlength,
> +                                           $linelength,$firstline,
> +                                           $type,$base);
> +}
>   $offsets->{__termination_length} = $termination_length;
>   return \%offsets;
> }
> @@ -770,35 +799,35 @@
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  ($self->_unpack($offset))[0];
> +  (&{$self->{unpackmeth}}($offset))[0];
> }
> sub length {
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  ($self->_unpack($offset))[1];
> +  (&{$self->{unpackmeth}}($offset))[1];
> }
> sub linelen {
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  ($self->_unpack($offset))[2];
> +  (&{$self->{unpackmeth}}($offset))[2];
> }
> sub headerlen {
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  ($self->_unpack($offset))[3];
> +  (&{$self->{unpackmeth}}($offset))[3];
> }
> sub alphabet {
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  my $type = ($self->_unpack($offset))[4];
> +  my $type = (&{$self->{unpackmeth}}($offset))[4];
>   return $type == DNA ? 'dna'
>          : $type == RNA ? 'rna'
>          : 'protein';
> @@ -818,7 +847,7 @@
>   my $self = shift;
>   my $id   = shift;
>   my $offset = $self->{offsets}{$id} or return;
> -  $self->fileno2path(($self->_unpack($offset))[5]);
> +  $self->fileno2path((&{$self->{unpackmeth}}($offset))[5]);
> }
> sub fileno2path {
> @@ -899,7 +928,7 @@
>   my $self = shift;
>   my $id   = shift;
>   my ($offset,$seqlength,$linelength,$firstline,$type,$file)
> -    = $self->_unpack($self->{offsets}{$id}) or return;
> +    = &{$self->{unpackmeth}}($self->{offsets}{$id}) or return;
>   $offset -= $firstline;
>   my $data;
>   my $fh = $self->fh($id) or return;
> @@ -914,7 +943,7 @@
>   my $self = shift;
>   my $id   = shift;
>   my $a    = shift()-1;
> -  my ($offset,$seqlength,$linelength,$firstline,$type,$file) =  
> $self->_unpack($self->{offsets}{$id});
> +  my ($offset,$seqlength,$linelength,$firstline,$type,$file) = & 
> {$self->{unpackmeth}}($self->{offsets}{$id});
>   $a = 0            if $a < 0;
>   $a = $seqlength-1 if $a >= $seqlength;
>   my $tl = $self->{offsets}{__termination_length};
> @@ -940,15 +969,21 @@
> }
> sub _pack {
> -  shift;
>   pack STRUCT, at _;
> }
> +sub _packBig {
> +  pack STRUCTBIG, at _;
> +}
> +
> sub _unpack {
> -  shift;
>   unpack STRUCT,shift;
> }
> +sub _unpackBig {
> +  unpack STRUCTBIG,shift;
> +}
> +
> sub _type {
>   shift;
>   local $_ = shift;
>
>
>
>
>
>>
>> On Aug 7, 2006, at 11:05 AM, Charles Tilford wrote:
>>
>>> I just found out that Bio::DB::Fasta has an inherit 4GB file size  
>>> limit in it. This is due to how indexing information is stored.  
>>> The module pack()s information using this format:
>>>
>>> use constant STRUCT =>'NNnnCa*';
>>>
>>> ... where the first token is the file offset. N = 32-bit unsigned  
>>> integer, and rolls-over when the file position passes the 4GB  
>>> mark, resulting in garbage out for those entries. Changing the  
>>> packing format to:
>>>
>>> use constant STRUCT =>'QNnnCa*';
>>>
>>> ...solves the problem (Q = 64-bit unsigned int). We have several  
>>> genomic files (ensembl dumps) where this is an issue:
>>>
>>> -rw-rw-r--  1 kirovs   bioinfo 7.2G Jul 13 12:28  
>>> pan_troglodytes.genome.CHIMP1A.fa
>>> -rw-rw-r--  1 kirovs   bioinfo 6.8G Jul 13 12:25  
>>> monodelphis_domestica.genome.BROADO3.fa
>>> -rw-rw-r--  1 kirovs   bioinfo 5.0G Jul 13 12:26  
>>> mus_musculus.genome.NCBIM36.fa
>>> -rw-rw-r--  1 kirovs   bioinfo 4.6G Aug  2 15:31  
>>> bos_taurus.genome.Btau2.fa
>>> -rw-rw-r--  1 kirovs   bioinfo 4.1G Jul 13 12:22  
>>> danio_rerio.genome.ZFISH6.fa
>>>
>>> These are not really large genomes, but have a fair number of  
>>> unassembled (duplicitous) fragments in them, which bump up the  
>>> file size. Some fully assembled genomes will probably eventually  
>>> top the 4GB mark, anyway.
>>>
>>> Unfortunately, this raises a backward compatibility issue, since  
>>> an index packed with 'N' will fail when unpacked with 'Q'.  
>>> Perhaps the module could dynamically bifurcate the packing  
>>> structure based on a file size test?
>>>
>>> The second token is for the sequence length, I can't imagine a  
>>> single sequence exceeding 4Gb, so it's probably safe - yes?  
>>> Should it also be Q in the event that biology someday exceeds our  
>>> current imagination?
>>>
>>> Thanks,
>>> CAT
>>>
>>> -- 
>>> Charles Tilford, Bioinformatics-Applied Genomics
>>> Bristol-Myers Squibb PRI, Hopewell 3A039
>>> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
>>> charles.tilford at bms.com <mailto:charles.tilford at bms.com>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org <mailto:Bioperl-l at lists.open-bio.org>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>
>
> -- 
> Charles Tilford, Bioinformatics-Applied Genomics
> Bristol-Myers Squibb PRI, Hopewell 3A039
> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
> charles.tilford at bms.com

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign