Bioperl: Re: Bio::Tools::Blast
Lincoln Stein
lstein@cshl.org
Fri, 28 Aug 1998 09:26:34 -0400
Hi Steve, et al.,
It's news to me that the FASTA format is defined in the FASTA package.
I've searched the docs several times in the past, and I've searched them
again just now and still can't find it. Steve, can you point me to
the correct file? .c and .h files DON'T count!
There is some documentation on FASTA at GenBank's site, but it is
totally underspecified. It basically says what characters are valid
in DNA and Peptide sequences.
In any case, what annoys me is that many people are using the
description field to encapsulate meta-data, but nobody is doing it in
the same way. Even at the NCBI, I see different conventions. For
example, Greg Schuler has a simple tag=value notation, but FASTA files
produced by other NCBI scientists use the | symbol to delineate
positional parameters. A real mess.
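
To make the mess concrete, here is a rough sketch of what a reader has to do today just to guess which convention a description line follows. Everything in it is an assumption for illustration: parse_description is a made-up name, and the exact tag=value and | syntaxes are my guesses at the two styles, not anything NCBI documents.

```perl
use strict;
use warnings;

# Hypothetical helper: guess which convention a FASTA description uses.
# Returns a hash ref for tag=value text, an array ref for |-delimited
# positional fields, or the unchanged string when neither marker is found.
sub parse_description {
    my ($desc) = @_;

    # Try tag=value pairs first, since "=" is the more specific marker.
    my %tags = $desc =~ /(\w+)=(\S+)/g;
    return \%tags if %tags;

    # Otherwise treat "|" as a separator between positional fields.
    return [ split /\|/, $desc ] if $desc =~ /\|/;

    # Plain free text: nothing to decode.
    return $desc;
}
```

The caller still has to check ref() on the result to find out what it got back, which is exactly the kind of guesswork a real standard would remove.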
My original comments stand.
Is there an ASN.1 parser in BioPerl?
Lincoln
Steven E. Brenner writes:
>
> I'm not sure why people seem to think that the FASTA format isn't defined.
> Bill Pearson does define it in his FASTA package (albeit not as precisely
> as one might like). It consists of the following items in sequence:
>
> 1) '>' character
> 2) identifier string without whitespace
> 3) whitespace other than CR/LF
> 4) description -- free text. optional
> 5) CR and/or LF
> 6) sequence, possibly including CR/LF
>
>
> In practice, there are some additional conventions, which aren't documented:
> 1) whitespace often is permitted (but not desired) between the '>' and the
> identifier
> 2) the sequence is broken into 60 characters per line
> 3) one can terminate the sequence with a '*' character, but this is
> not desirable
> 4) one can store multiple alignments by putting dashes in the
> sequences
>
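
Taken literally, the six items and the four conventions above could be read by something like the following. This is only a sketch under those stated rules: read_fasta is a made-up name, not part of Bio::PreSeq or the FASTA package, and the header pattern is the one quoted further down in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records with id, desc and seq.
sub read_fasta {
    my ($text) = @_;
    my @records;

    # Items 5 and 6: CR and/or LF end the header and may wrap the sequence.
    for my $line (split /[\r\n]+/, $text) {
        # Items 1-4, plus the convention that whitespace may follow '>'.
        if ($line =~ /^>[ \t]*(\S*)[ \t]*(.*)$/) {
            push @records, { id => $1, desc => $2, seq => '' };
        }
        elsif (@records) {
            $line =~ s/\s+//g;              # drop stray padding in sequence lines
            $records[-1]{seq} .= $line;     # rejoin the wrapped sequence
        }
        # Lines before the first header are silently ignored.
    }

    # Convention 3: strip an optional terminating '*'.
    $_->{seq} =~ s/\*$// for @records;

    # Convention 4: dashes are left in place so alignments survive.
    return @records;
}
```

Note that it makes no attempt to interpret the description, which is where the trouble starts.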
>
> NCBI has added additional information to the FASTA format, by using
> structured identifiers, which indicate what source database an entry came
> from. However, this format is entirely backwards compatible with the
> standard FASTA format. I have seen this structure documented somewhere on
> their web site.
>
> I think that there are more than enough file formats out there, and I
> would be highly reluctant to introduce a new one, unless it served a very
> specific need. For uses where little information is needed besides the
> sequence and identifier, I think that FASTA has the benefits of simplicity
> and convenience. Moreover, it is ubiquitous: virtually every database I
> know of is available in compliant FASTA format, in part because it is so
> easy to accurately produce. (PIR is a notable exception).
>
> Steve
>
> On Thu, 27 Aug 1998, Lincoln Stein wrote:
>
> > Why don't we come up with a better sequence file standard that
> > encapsulates these ideas of identifier, description, source database,
> > etc., rather than relying on ad hoc conventions in the FASTA format?
> > To deal with legacy FASTA files we could have a little reusable
> > conversion filter to do the up-front work.
> >
> > Lincoln
> >
> > Georg Fuellen writes:
> > >
> > > Eitan wrote,
> > > > I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta
> > >
> > > Unless I'm mistaken, there is no reason for any warning, see below.
> > >
> > > > databases (such as RepBase, if I'm not mistaken) are using non \S letters
> > > > in their naming scheme. Most fasta parsers fail when they see
> > > > >gb|AC000254 blah blah blah
> > > > >AC000254_1 blash blah blah
> > >
> > > I get:
> > > DB<54> $head = ">gb|AC000254 blah blah blah"
> > >
> > > DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> > >
> > > DB<56> x ($id, $desc)
> > > 0 'gb|AC000254'
> > > 1 'blah blah blah'
> > >
> > > I think this works as it should ?!
> > > Non \S letters (i.e. whitespace letters, since \S matches
> > > a non-whitespace character) are [ \t\n\r\f]. (that's not the same as \W)
> > > When I worked on the fasta pattern-matchings, the expression
> > > used above seemed to be the most general I could come up with--
> > > and it seems to me that the unusual case of ``identifiers with
> > > space'' can only be dealt with by using a special flag that
> > > says ``use the whole header as an id'' which means that we
> > > assume the description is part of the id.
> > > However, I'd suggest the user should take care of this case.
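
The "special flag" idea could look roughly like this. It is a sketch only: split_header and its second argument are hypothetical names, not part of Bio::PreSeq; the default branch is the pattern quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header line into (id, description).
# With a true second argument the whole header becomes the id.
sub split_header {
    my ($head, $whole_header_as_id) = @_;

    if ($whole_header_as_id) {
        # Keep embedded spaces; trim only trailing whitespace.
        my ($id) = $head =~ /^>[ \t]*(.*?)\s*$/;
        return ($id, '');
    }

    # Default: the id ends at the first whitespace character.
    # Call in list context to receive the two captures.
    return $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
}

my ($id, $desc) = split_header(">notch4 exon #1");     # id is 'notch4'
($id, $desc)    = split_header(">notch4 exon #1", 1);  # id is 'notch4 exon #1'
```

With the flag set, the description comes back empty, since the header text is assumed to be all identifier.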
> > >
> > > > In my case I overcome this with sed 's/^>gb|//' etc. or with perl
> > > > scripts. It may pose a serious problem though if you want the
> > > > Bio::PreSeq package to be universal.
> > >
> > > Please respond if I'm overlooking the problem that you seem to see.
> > >
> > > best wishes,
> > > Georg Fuellen,
> > > Univ. Bielefeld, Research Group in Practical Comp. Science
> > > http://www.techfak.uni-bielefeld.de/bcd/welcome.html
> > >
> > > >
> > > > Eitan.
> > > >
> > > >
> > > > ======================================================================
> > > > Eitan Rubin,
> > > > Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.
> > > > EMail: bcrubin@dapsas1.weizmann.ac.il
> > > > Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
> > > > EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the GCG help desk
> > > >
> > > > in vivo -> in vitro -> in silico
> > > > ======================================================================
> > > >
> > > > On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
> > > >
> > > > >
> > > > > Lincoln,
> > > > >
> > > > > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
> > > > > files I've seen, a space is used to separate the identifier from the
> > > > > description line. Here's how Bio::PreSeq::parse_fasta() grabs the
> > > > > identifier and description:
> > > > >
> > > > > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> > > > >
> > > > > BTW, I just updated the Blast distribution (now 0.061). It includes
> > > > > an important memory management fix that helps when crunching lots of
> > > > > reports.
> > > > >
> > > > > Steve Chervitz
> > > > > sac@genome.stanford.edu
> > > > >
> > > > >
> > > > > On 26 Aug 1998, Lincoln Stein wrote:
> > > > >
> > > > > > Hi Steve,
> > > > > >
> > > > > > Does Blast.pm not deal correctly with sequence identifiers that
> > > > > > contain spaces? I just tried to blast a database made from
> > > > > > identifiers like this:
> > > > > >
> > > > > > >notch4 exon #1
> > > > > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
> > > > > > agag
> > > > > >
> > > > > > >notch4 exon #2
> > > > > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
> > > > > > ctcggggacaagggatctgcca
> > > > > >
> > > > > > >notch4 exon #3
> > > > > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
> > > > > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
> > > > > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
> > > > > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
> > > > > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
> > > > > >
> > > > > > but I only got "notch4" as the hit. When I changed the spaces to
> > > > > > dots, I got the full identifier.
> > >
> > >
> > --
> > ========================================================================
> > Lincoln D. Stein Cold Spring Harbor Laboratory
> > lstein@cshl.org Cold Spring Harbor, NY
> > ========================================================================
> > =========== Bioperl Project Mailing List Message Footer =======
> > Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> > For info about how to (un)subscribe, where messages are archived, etc:
> > http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> > ====================================================================
> >
>
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
========================================================================