[Bioperl-l] Bio::SeqIO::fasta.pm
Peter Kos
kos@rite.or.jp
Sun, 25 Aug 2002 17:21:29 +0900
Hi,
> Any votes from the core or from people on the list? I don't have a
> strong view here. I recall having seen somewhere that indeed the
> presence of the name is optional, but I'm not sure.
>
> One could argue for tolerant reader, strict writer ...
Hi,
There seem to be two or three different things under the name "fasta
format":
- In an abstract way, I think the only thing that is mandatory in a
fasta file is the ">" as the first character of the first line, plus
the absence from the sequence of any characters outside the given
alphabet, except for white space.
Therefore the description is not mandatory, and even the length of the
sequence may be 0.
- another question is what makes sense and what does not. It seems to
be solely the user's business what he/she wants and whether it is
necessary to identify the sequence(s).
By the way, there is no such thing as a difference between ID and
description in a fasta file as such. There may be a gift from EMBL or
SwissProt at the beginning of the defline, but there are other sources
of sequence files and nobody should ask for any specific structure of
that line. There are other file formats for an exact structure of the
ID and other annotations, and it is nice if one tries to keep as much
info as possible when converting to fasta files, but that line should
be considered basically free of format.
- the third question is how a fasta file is defined and/or
interpreted in bioperl. If the decision is made that a
non-whitespace string is mandatory in the first line, then that is
completely OK as long as the users are aware of it, so that at
least a dummy ID can be generated when there is nothing else to
use, either in the program that produces the file or in a filter
that reads the file and fills the empty ">" with something.
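A minimal sketch of such a filter in plain Perl, without bioperl. The
function name fill_empty_deflines and the numbered UNDEFINED_n
placeholders are assumptions made for illustration; the numbering is
only there to keep the generated IDs distinct.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the fasta text in which every
# defline that is just ">" (optionally followed by spaces or tabs) gets
# a generated placeholder ID. All other lines are left untouched.
sub fill_empty_deflines {
    my ($text, $prefix) = @_;
    $prefix = 'UNDEFINED' unless defined $prefix;
    my $count = 0;
    # /m lets ^ and $ match at every line; /e evaluates the replacement.
    $text =~ s/^>[ \t]*$/'>' . $prefix . '_' . ++$count/mge;
    return $text;
}

my $input = ">\nACACACACA\n>seq2 some description\nGGTT\n> \nTTGA\n";
print fill_empty_deflines($input);
```

Deflines that already carry any non-whitespace text pass through
unchanged, so only the empty ">" lines differ from the input.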
By the way, would it be a problem to do this within blast.pm?
For example, after the regex that parses the defline,
# or a space or a tab in place of UNDEFINED
$id = "UNDEFINED" if ($id eq "");
or something similar, so that downstream programs cannot complain and
the module does not crash either.
Of course writing back the file read this way would create a file
different from the original, so it may be a completely stupid idea.
On the other hand, the necessity to write a (non-bioperl) script for
repairing the fasta file with missing descriptions (which is not a
big challenge) would diminish the beauty of the
$seq = $seqio -> next_seq();
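As a sketch of the in-parser default discussed above (this is not the
regex or code of any bioperl module), a tolerant defline parser could
look like the following. The name parse_defline and the default
"UNDEFINED" are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a defline into ID and description,
# tolerating ">" with nothing, or only whitespace, after it.
sub parse_defline {
    my ($line, $default_id) = @_;
    $default_id = 'UNDEFINED' unless defined $default_id;
    # Optional whitespace after ">", then an optional ID (first run of
    # non-whitespace), then an optional free-form description.
    my ($id, $desc) = $line =~ /^>\s*(\S*)\s*(.*?)\s*$/
        or return;    # not a defline at all
    $id = $default_id if $id eq '';
    return ($id, $desc);
}

for my $line ('>', '> ', '>seq1 some protein') {
    my ($id, $desc) = parse_defline($line);
    print "[$line] id=$id desc=[$desc]\n";
}
```

A writer that emits ">$id $desc" from these values would produce
">UNDEFINED" where the input had a bare ">", which is exactly the
round-trip difference mentioned above.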
Cheers
Peter
>
> -hilmar
>
> > -----Original Message-----
> > From: Wiepert, Mathieu [mailto:Wiepert.Mathieu@mayo.edu]
> > Sent: Friday, August 23, 2002 9:27 AM
> > To: 'bioperl-l@bioperl.org'
> > Subject: FW: [Bioperl-l] Bio::SeqIO::fasta.pm
> >
> >
> > I was just testing something out, and it was accepting a
> > blank in the header. If I actually wanted raw sequence, you
> > are correct of course.
> >
> > I found an old thread that made me think a > with a space or
> > nothing after it was valid, which was why I asked. I hadn't
> > found anything recent, though I could have missed it. The
> > thread never said what was the agreed upon standard?
> >
> > http://bioperl.org/pipermail/bioperl-guts-l/1999-November/001311.html
> >
> > Mathieu Wiepert
> > Medical Information Resources
> > Mayo Foundation
> > (507) 266-2317 Fax (507)-284-0360
> > wiepert.mathieu@mayo.edu
> >
> >
> > > -----Original Message-----
> > > From: Hilmar Lapp [mailto:hlapp@gnf.org]
> > > Sent: Friday, August 23, 2002 11:23 AM
> > > To: Wiepert, Mathieu
> > > Cc: Bioperl
> > > Subject: RE: [Bioperl-l] Bio::SeqIO::fasta.pm
> > >
> > >
> > > Didn't someone post a FASTA definition document or link a
> > > while ago? AFAIK the ID is mandatory, you can't just have an
> > > empty line.
> > >
> > > Have you tried reading your seqs as format raw? Not sure
> > > whether that one strips all non-seq characters, and whether
> > > it can handle multiple seqs, but generally speaking 'raw'
> > > format is what you would do if you really don't care about
> > > anything except the sequence itself.
> > >
> > > -hilmar
> > >
> > > > -----Original Message-----
> > > > From: Wiepert, Mathieu [mailto:Wiepert.Mathieu@mayo.edu]
> > > > Sent: Friday, August 23, 2002 7:50 AM
> > > > Cc: Bioperl
> > > > Subject: [Bioperl-l] Bio::SeqIO::fasta.pm
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Is it valid to have a fasta file with no header info (other
> > > > than the >, with no spaces after). I.e. something like
> > > > >
> > > > ACACACACA
> > > >
> > > >
> > > > Would lead to blank primary_id, not sure what effect that
> > > > would have down the line.
> > > >
> > > > I ask because fasta.pm throws the "Can't parse fasta header"
> > > > error when there *is* a space after the >, but goes merrily
> > > > on if there is a \n. And then dies later. Just wasn't sure
> > > > what the expected agreed upon behavior is.
> > > >
> > > > I ran into this because I was testing a few sequence blasts,
> > > > and don't need a header really. I figure it's not much use
> > > > to have a fasta file with a bunch of "empty" header lines,
> > > > but fasta only has a comment
> > > >
> > > > # FIX incase no space between > and name \AE
> > > >
> > > > So I wasn't sure what the intent is.
> > > >
> > > > -Mat
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > >
> > >
> >
............................................................................
Peter B. Kos
(RITE)
E-mail: kos@rite.or.jp