[Bioperl-l] Bio::DB::Fasta check for ragged line widths

Don Gilbert gilbertd at cricket.bio.indiana.edu
Sat Apr 7 03:31:29 UTC 2007


Dear Bioperlers,

There is a hidden issue with Bio::DB::Fasta: it assumes that FASTA
files have fixed line widths, but that isn't a requirement of the FASTA
format. The documentation notes this package requirement, but I was
bitten by it anyway, and I'd guess not many people check their data
(esp. if it comes from someone else) to see that it meets this
requirement.

Simple tools can easily produce fasta with ragged line formatting:
e.g. genome assemblers that paste together contig fasta with spacers
to make assemblies.
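
For anyone who wants to test their own files in the meantime, here is a
minimal standalone sketch of such a pre-check. It is not part of
Bio::DB::Fasta; the name find_ragged_entries and the message format are
made up for illustration. It assumes headers start with '>' and counts
blank lines inside an entry as sequence lines, so a blank line after a
short final line will be reported.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::DB::Fasta.
# Returns one message per entry in which a non-final sequence line
# differs in length from the first sequence line of that entry.
sub find_ragged_entries {
    my ($fh) = @_;
    my (@problems, $id, $width, $prev_len, $reported);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # strip LF or CRLF terminator
        if ($line =~ /^>(\S*)/) {          # header line starts a new entry
            ($id, $width, $prev_len, $reported) = ($1, undef, undef, 0);
            next;
        }
        next unless defined $id;           # ignore text before the first header
        # Reading another line proves the previous one was not the last
        # line of the entry, so it has to match the entry's width.
        if (defined $prev_len) {
            $width = $prev_len unless defined $width;
            if ($prev_len != $width && !$reported) {
                push @problems,
                    "$id: $prev_len chars just before input line $., expected $width";
                $reported = 1;             # report each entry only once
            }
        }
        $prev_len = length $line;
    }
    return @problems;
}

# Usage: perl check_fasta.pl file.fa
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @problems = find_ragged_entries($fh);
    print "$_\n" for @problems;
    exit(@problems ? 1 : 0);
}
```

The final line of each entry is never compared, which matches the rule
that only the last line may differ in length.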

It would be nice if Bio::DB::Fasta checked for ragged input and died
when it can't handle it.  Here is a suggested addition:

  package Bio::DB::Fasta;

=head1 DESCRIPTION
  
  Entries may have any line length up to 65,536 characters, and
  different line lengths are allowed in the same file.  However, within
  a sequence entry, all lines must be the same length except for the
  last.  
+ An error will be thrown if this is not the case.

=cut

  use constant DIE_ON_MISSMATCHED_LINES => 1; # optional: set to 0 to disable the check
  
  sub calculate_offsets {
  
     my ($offset,$id,$linelength,$type,$firstline,$count,$termination_length,%offsets);
  +  my ($l3_len,$l2_len,$l_len)=(0,0,0);
  
         $self->_check_linelength($linelength);
  +      ($l3_len,$l2_len,$l_len)=(0,0,0);
       } else {
  +      $l3_len= $l2_len; $l2_len= $l_len; $l_len= length($_); # need to check every line :(
  +      if(DIE_ON_MISSMATCHED_LINES &&
  +        $l3_len>0 && $l2_len>0 && $l3_len!=$l2_len) {
  +         my $fap= substr($_,0,20)."..";
  +         $self->throw("Each line of the fasta entry must be the same length except the last.
  +  Line above #$. '$fap' is $l2_len != $l3_len chars.");
  +         }
  
         $linelength ||= length($_);
  
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/


