[Bioperl-l] Found bug in fasta.pm

Thu, 9 Nov 2000 11:41:07 -0500 (EST)

I've tried to implement this approach in our current SeqIO::fasta.  It
appears to work fairly simply, I'll test it some more and check it in to 
the head if I feel that it works.  Elia, is there any way you could send
me the offending fasta file so I can try out the improved code.  If anyone
else is attacking this, please speak up so I don't step on toes.

On Thu, 9 Nov 2000, Aaron J Mackey wrote:

> 
> We have found the most reliable way to parse fasta format is via this
> idiom:
> 
> {
>   local $/ = "\n>";
>   while(<>) {
>     chomp;                 # remove trailing "\n>"
>     my ($id, $desc, $seq) = 
> 	$_ =~ m/^>?        # beginning >, only 1st seq.
>                 (\S+)\s+   # identifier
>                 ([^\n]+)\n # description line
>                 (.*)$      # sequence
>                /sox;       # multiline, compile-once, ignore-whitespace
>     $seq =~ s/\W//g;       # get rid of newlines, spaces, etc.
> 
>     # ok, now do something with this ...
> 
>   }
> }
> 
> Note that no line-caching is necessary - the input-record-seperator $/
> takes care of stopping before a new sequence is read.  This works nicely
> when your filehandle is not seekable (pipes, etc).
> 
> I'm not sure why SeqIO/fasta.pm was migrated to a read-line-by-line
> architecture (thus necessitating the nitty-gritty mentioned below), but
> I'd be happy to review any proposed solution (we've faced most of the
> pathological cases before - '>'s in descriptions, in the sequence, on
> lines by themselves (prefaced with whitespace, so not truly a new record),
> etc. etc. - it's amazing what people will try to get away with).
> 
> -Aaron
> 
> On Wed, 8 Nov 2000, Elia Stupka wrote:
> 
> > I found a bug, which was introduced in main trunk, and then backported to
> > branch-06. I have commented out these lines, and I leave it up to others
> > if they want to do something about this.
> > 
> > Elia Stupka (EnsEMBL)
> > 
> > these are the lines:
> >  # Jason applying HL's patch from 25/05/2000 
> >       # (on 02/10/2000)
> >       # a greater sign not preceded by a newline indicates that there is a
> >       # '>' within the description, so we need more to complete the record
> >       return unless defined($next_rec = $self->_readline());
> >       $entry .= $next_rec;  
> > 
> > What shall we do?
> > 
> > Elia
> > 
> > **************************
> > tel:    +44 1223 49 44 31
> > mobile: +44 7971 59 03 69
> > fax:    +44 1223 49 44 68
> > **************************
> > 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> > 
> 
> -- 
>  o ~   ~   ~   ~   ~   ~  o
> / Aaron J Mackey           \
> \  Dr. Pearson Laboratory  / 
>  \ University of Virginia  \     
>  /  (804) 924-2821          \
>  \  amackey@virginia.edu    /
>   o ~   ~   ~   ~   ~   ~  o
> 
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 

Jason Stajich
jason@chg.mc.duke.edu
http://galton.mc.duke.edu/~jason/
(919)684-1806 (office) 
(919)684-2275 (fax) 
Center for Human Genetics - Duke University Medical Center
http://wwwchg.mc.duke.edu/