[Bioperl-l] Found bug in fasta.pm

Thu, 09 Nov 2000 09:34:57 -0800

Jason Stajich wrote:
> 
> I've tried to implement this approach in our current SeqIO::fasta.  It
> appears to work fairly simply, I'll test it some more and check it in to
> the head if I feel that it works.  Elia, is there any way you could send
> me the offending fasta file so I can try out the improved code.  If anyone
> else is attacking this, please speak up so I don't step on toes.
> 
> On Thu, 9 Nov 2000, Aaron J Mackey wrote:
> 
> >
> > We have found the most reliable way to parse fasta format is via this
> > idiom:
> >
> > {
> >   local $/ = "\n>";
> >   while(<>) {
> >     chomp;                 # remove trailing "\n>"
> >     my ($id, $desc, $seq) =
> >       $_ =~ m/^>?        # beginning >, only 1st seq.
> >                 (\S+)\s+   # identifier
> >                 ([^\n]+)\n # description line
> >                 (.*)$      # sequence
> >                /sox;       # multiline, compile-once, ignore-whitespace

To be honest, I'm not so happy with having the above expression literally
in the code because IMHO it is too strict: it requires both an id and a
description to be present.

Fasta seqs frequently come without a description, and seqs submitted
through web interfaces often lack even an id.

In general, the existing expressions for dissecting the seq record have
suited everyone so far, and fasta.pm is probably one of the most heavily
used modules in BioPerl. We're going through tens of thousands of seqs in
one file with that module without any problem. So I actually suspect that
there was something weird with the source files Elia used.

Anyway, the code should be modified to take advantage of setting $/ to
"\n>".
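
To illustrate what I mean by a less strict version, here is a minimal
sketch that keeps the $/ = "\n>" idea but treats both the id and the
description as optional. Note that parse_fasta and the hash layout are
names made up for this example; this is not the current SeqIO::fasta code
and has not been run against Elia's files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SeqIO::fasta): read FASTA records from
# a filehandle, tolerating a missing description and a missing id.
sub parse_fasta {
    my ($fh) = @_;
    my @records;
    local $/ = "\n>";                 # read one whole record at a time
    while (my $entry = <$fh>) {
        chomp $entry;                 # remove trailing "\n>", if present
        $entry =~ s/^>//;             # leading ">" occurs on the 1st record only
        # Everything up to the first newline is the header line,
        # the rest is sequence.
        my ($header, $seq) = split /\n/, $entry, 2;
        next unless defined $header;  # skip completely empty entries
        $seq = '' unless defined $seq;
        $seq    =~ s/\s+//g;          # drop newlines and stray whitespace
        $header =~ s/\s+$//;          # drop trailing whitespace, e.g. "\r"
        # Both captures may be empty: id is the first run of non-whitespace,
        # description is whatever follows on the header line.
        my ($id, $desc) = $header =~ /^(\S*)\s*(.*)$/;
        push @records, { id => $id, desc => $desc, seq => $seq };
    }
    return @records;
}
```

Called as parse_fasta(\*STDIN), this returns one hash per record, with
empty strings where the id or description is absent. Splitting off the
header line first, rather than matching the whole record with one
expression, keeps a missing description from swallowing the first line of
sequence.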

	Hilmar

-- 
-----------------------------------------------------------------
Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757
-----------------------------------------------------------------