[Bioperl-l] Found bug in fasta.pm
Jason Stajich
jason@chg.mc.duke.edu
Thu, 9 Nov 2000 11:41:07 -0500 (EST)
I've tried to implement this approach in our current SeqIO::fasta. It
appears to work fairly simply, I'll test it some more and check it in to
the head if I feel that it works. Elia, is there any way you could send
me the offending fasta file so I can try out the improved code. If anyone
else is attacking this, please speak up so I don't step on toes.
On Thu, 9 Nov 2000, Aaron J Mackey wrote:
>
> We have found the most reliable way to parse fasta format is via this
> idiom:
>
> {
> local $/ = "\n>";
> while(<>) {
> chomp; # remove trailing "\n>"
> my ($id, $desc, $seq) =
> $_ =~ m/^>? # beginning >, only 1st seq.
> (\S+)\s+ # identifier
> ([^\n]+)\n # description line
> (.*)$ # sequence
> /sox; # multiline, compile-once, ignore-whitespace
> $seq =~ s/\W//g; # get rid of newlines, spaces, etc.
>
> # ok, now do something with this ...
>
> }
> }
>
> Note that no line-caching is necessary - the input-record-seperator $/
> takes care of stopping before a new sequence is read. This works nicely
> when your filehandle is not seekable (pipes, etc).
>
> I'm not sure why SeqIO/fasta.pm was migrated to a read-line-by-line
> architecture (thus necessitating the nitty-gritty mentioned below), but
> I'd be happy to review any proposed solution (we've faced most of the
> pathological cases before - '>'s in descriptions, in the sequence, on
> lines by themselves (prefaced with whitespace, so not truly a new record),
> etc. etc. - it's amazing what people will try to get away with).
>
> -Aaron
>
> On Wed, 8 Nov 2000, Elia Stupka wrote:
>
> > I found a bug, which was introduced in main trunk, and then backported to
> > branch-06. I have commented out these lines, and I leave it up to others
> > if they want to do something about this.
> >
> > Elia Stupka (EnsEMBL)
> >
> > these are the lines:
> > # Jason applying HL's patch from 25/05/2000
> > # (on 02/10/2000)
> > # a greater sign not preceded by a newline indicates that there is a
> > # '>' within the description, so we need more to complete the record
> > return unless defined($next_rec = $self->_readline());
> > $entry .= $next_rec;
> >
> > What shall we do?
> >
> > Elia
> >
> > **************************
> > tel: +44 1223 49 44 31
> > mobile: +44 7971 59 03 69
> > fax: +44 1223 49 44 68
> > **************************
> >
> >
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> --
> o ~ ~ ~ ~ ~ ~ o
> / Aaron J Mackey \
> \ Dr. Pearson Laboratory /
> \ University of Virginia \
> / (804) 924-2821 \
> \ amackey@virginia.edu /
> o ~ ~ ~ ~ ~ ~ o
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
Jason Stajich
jason@chg.mc.duke.edu
http://galton.mc.duke.edu/~jason/
(919)684-1806 (office)
(919)684-2275 (fax)
Center for Human Genetics - Duke University Medical Center
http://wwwchg.mc.duke.edu/