[Bioperl-l] Found bug in fasta.pm

Aaron J Mackey ajm6q@virginia.edu
Thu, 9 Nov 2000 09:02:39 -0500 (EST)


We have found the most reliable way to parse fasta format is via this
idiom:

{
  local $/ = "\n>";
  while(<>) {
    chomp;                 # remove trailing "\n>"
    my ($id, $desc, $seq) = 
	$_ =~ m/^>?        # beginning >, only 1st seq.
                (\S+)\s+   # identifier
                ([^\n]+)\n # description line
                (.*)$      # sequence
               /sox;       # multiline, compile-once, ignore-whitespace
    $seq =~ s/\W//g;       # get rid of newlines, spaces, etc.

    # ok, now do something with this ...

  }
}

Note that no line-caching is necessary - the input-record-seperator $/
takes care of stopping before a new sequence is read.  This works nicely
when your filehandle is not seekable (pipes, etc).

I'm not sure why SeqIO/fasta.pm was migrated to a read-line-by-line
architecture (thus necessitating the nitty-gritty mentioned below), but
I'd be happy to review any proposed solution (we've faced most of the
pathological cases before - '>'s in descriptions, in the sequence, on
lines by themselves (prefaced with whitespace, so not truly a new record),
etc. etc. - it's amazing what people will try to get away with).

-Aaron

On Wed, 8 Nov 2000, Elia Stupka wrote:

> I found a bug, which was introduced in main trunk, and then backported to
> branch-06. I have commented out these lines, and I leave it up to others
> if they want to do something about this.
> 
> Elia Stupka (EnsEMBL)
> 
> these are the lines:
>  # Jason applying HL's patch from 25/05/2000 
>       # (on 02/10/2000)
>       # a greater sign not preceded by a newline indicates that there is a
>       # '>' within the description, so we need more to complete the record
>       return unless defined($next_rec = $self->_readline());
>       $entry .= $next_rec;  
> 
> What shall we do?
> 
> Elia
> 
> **************************
> tel:    +44 1223 49 44 31
> mobile: +44 7971 59 03 69
> fax:    +44 1223 49 44 68
> **************************
> 
> 
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 

-- 
 o ~   ~   ~   ~   ~   ~  o
/ Aaron J Mackey           \
\  Dr. Pearson Laboratory  / 
 \ University of Virginia  \     
 /  (804) 924-2821          \
 \  amackey@virginia.edu    /
  o ~   ~   ~   ~   ~   ~  o