[Bioperl-l] Bio::DB::Fasta check for ragged line widths

Hilmar Lapp hlapp at gmx.net
Sat Apr 7 16:42:13 UTC 2007


Wouldn't it be easier (and more robust) to just reformat the file to  
meet the constant line width requirement? The code required to do  
that should be fewer lines than your addition below, I think.

For example, one could do a fast first-pass through the file simply  
checking that all sequence lines not followed by a description line  
or eof have the same length, stopping at the first line that fails  
the test. If unequal lengths, use Bio::SeqIO to read and write back  
out the fasta file, then continue as for well-formatted files.

	-hilmar

On Apr 6, 2007, at 11:31 PM, Don Gilbert wrote:

>
> Dear Bioperlers,
>
> There is a hidden issue with Bio::DB::Fasta in that it assumes Fasta
> files have fixed line widths, but that isn't a requirement of Fasta
> format. The documentation notes this package requirement, but I was
> bitten by this, and I'd guess not many people check their data (esp.
> if from someone else) to see it meets this requirement.
>
> Simple tools can easily produce fasta with ragged line formatting:
> e.g. genome assemblers that paste together contig fasta with spacers
> to make assemblies.
>
> It would be nice if B:D:Fasta would check and die when it can't handle
> this ragged input.  Here is a suggested addition:
>
>   package Bio::DB::Fasta;
>
> =head1 DESCRIPTION
>
>   Entries may have any line length up to 65,536 characters, and
>   different line lengths are allowed in the same file.  However,  
> within
>   a sequence entry, all lines must be the same length except for the
>   last.
> + An error will be thrown if this is not the case.
>
> =cut
>
>   use constant DIE_ON_MISSMATCHED_LINES => 1; # if you want
>
>   sub calculate_offsets {
>
>      my ($offset,$id,$linelength,$type,$firstline,$count, 
> $termination_length,%offsets);
>   +  my ($l3_len,$l2_len,$l_len)=(0,0,0);
>
>          $self->_check_linelength($linelength);
>   +      ($l3_len,$l2_len,$l_len)=(0,0,0);
>        } else {
>   +      $l3_len= $l2_len; $l2_len= $l_len; $l_len= length($_); #  
> need to check every line :(
>   +      if(DIE_ON_MISSMATCHED_LINES &&
>   +        $l3_len>0 && $l2_len>0 && $l3_len!=$l2_len) {
>   +         my $fap= substr($_,0,20)."..";
>   +         $self->throw("Each line of the fasta entry must be the  
> same length except the last.
>   +  Line above #$. '$fap' is $l2_len != $l3_len chars.");
>   +         }
>
>          $linelength ||= length($_);
>
> -- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
> -- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================








More information about the Bioperl-l mailing list