Bioperl: Re: Bio::Tools::Blast (fwd)
Steven E. Brenner
brenner@hyper.stanford.edu
Tue, 1 Sep 1998 18:02:20 -0700
---------- Forwarded message ----------
Date: Mon, 31 Aug 1998 09:10:17 -0400 (EDT)
From: Tom Madden <madden@corin.nlm.nih.gov>
To: brenner@hyper.stanford.edu,
lstein@cshl.org
Cc: francis@corin.nlm.nih.gov
Subject: Re: Bioperl: Re: Bio::Tools::Blast
Steve, Lincoln:
Francis forwarded your correspondence about FASTA to some of
us at the NCBI. The document describing the usage of "|" in the
identifier is:
ftp://ncbi.nlm.nih.gov/blast/db/README
This is the convention used by the BLAST databases at the NCBI, and NCBI toolkit
routines can map the token to C-structures reliably. I can't promise that
someone (even at the NCBI) isn't using "|" differently. I once had a
discussion with someone about the usage of "|" and whether any tokens
should be allowed there. I won't make that mistake again soon...
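
As a rough illustration of the "|" convention mentioned above, an identifier can be split into a leading database tag and the fields that follow it. This is only a sketch: the helper name split_ncbi_id is hypothetical, it applies no tag-specific rules, and the README remains the authority on which tags exist and what each field means. It does not mirror the NCBI toolkit code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a "|"-delimited identifier into a
# database tag and the remaining fields. No tag-specific rules are
# applied; the README defines what each field actually means.
sub split_ncbi_id {
    my ($id) = @_;
    # A limit of -1 keeps trailing empty fields (e.g. "gb|AC000254|").
    my ($tag, @fields) = split /\|/, $id, -1;
    return ($tag, @fields);
}

my ($tag, @fields) = split_ncbi_id('gb|AC000254');
print "tag=$tag fields=@fields\n";
```

Because the whole identifier contains no whitespace, it can be pulled off the header line first and split afterwards.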
cheers,
Tom
> ----- Begin Included Message -----
>
> >From owner-vsns-bcd-perl@lists.uni-bielefeld.de Fri Aug 28 09:48:45 1998
> Date: Fri, 28 Aug 1998 09:26:34 -0400
> From: Lincoln Stein
> To: "Steven E. Brenner" <brenner@hyper.stanford.edu>
> Cc: vsns-bcd-perl@lists.uni-bielefeld.de
> Subject: Re: Bioperl: Re: Bio::Tools::Blast
> X-URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> X-Administrativa: see footer; if necessary, email to
>   vsns-bcd-perl-owner@lists.uni-bielefeld.de
>
> Hi Steve, et al.,
>
> It's news to me that the FASTA format is defined in the FASTA package.
> I've searched the docs several times in the past, and I've searched them
> again just now and still can't find it. Steve, can you point me to
> the correct file? .c and .h files DON'T count!
>
> There is some documentation on FASTA at GenBank's site, but it is
> totally underspecified. It basically says what characters are valid
> in DNA and Peptide sequences.
>
> In any case, what annoys me is that many people are using the
> description field to encapsulate meta-data, but nobody is doing it in
> the same way. Even at the NCBI, I see different conventions. For
> example, Greg Schuler has a simple tag=value notation, but FASTA files
> produced by other NCBI scientists use the | symbol to delineate
> positional parameters. A real mess.
>
> My original comments stand.
>
> Is there an ASN.1 parser in BioPerl?
>
> Lincoln
>
> Steven E. Brenner writes:
> >
> > I'm not sure why people seem to think that the FASTA format isn't defined.
> > Bill Pearson does define it in his FASTA package (albeit not as precisely
> > as one might like). It consists of the following items in sequence:
> >
> > 1) '>' character
> > 2) identifier string without whitespace
> > 3) whitespace other than CR/LF
> > 4) description -- free text. optional
> > 5) CR and/or LF
> > 6) sequence, possibly including CR/LF
> >
> >
> > In practice, there are some additional conventions, which aren't documented:
> > 1) whitespace is often permitted (but not desired) between the '>' and the
> > identifier
> > 2) the sequence is broken into lines of 60 characters
> > 3) one can terminate the sequence with a '*' character, but this is
> > not desirable
> > 4) one can store multiple alignments by putting dashes in the
> > sequences
> >
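
The six items above, together with the undocumented conventions, can be sketched as a small reader. This is an illustration only and not the code in Bio::PreSeq; the sub name read_fasta_records and the hash keys id, desc and seq are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical reader: returns a list of { id, desc, seq } hashes
# parsed from a string holding one or more FASTA records.
sub read_fasta_records {
    my ($text) = @_;
    my @records;
    for my $line (split /\r\n|\r|\n/, $text) {    # item 5: CR and/or LF
        if ($line =~ /^>[ \t]*(\S*)[ \t]*(.*)$/) {
            # items 1-4: '>', identifier, whitespace, optional description
            push @records, { id => $1, desc => $2, seq => '' };
        }
        elsif (@records) {
            # item 6: sequence lines are joined; embedded whitespace dropped
            (my $chunk = $line) =~ s/\s+//g;
            $records[-1]{seq} .= $chunk;
        }
    }
    # Convention 3: drop an optional trailing '*' terminator.
    $_->{seq} =~ s/\*$// for @records;
    return @records;
}

my @recs = read_fasta_records(">gb|AC000254 blah blah blah\nacgt\nac-gt*\n");
print "$_->{id}\t$_->{desc}\t$_->{seq}\n" for @recs;
```

Dashes are left in the sequence, so aligned records (convention 4) pass through unchanged.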
> >
> > NCBI has added additional information to the FASTA format, by using
> > structured identifiers, which indicate what source database an entry came
> > from. However, this format is entirely backwards compatible with the
> > standard FASTA format. I have seen this structure documented somewhere on
> > their web site.
> >
> > I think that there are more than enough file formats out there, and I
> > would be highly reluctant to introduce a new one, unless it served a very
> > specific need. For uses where little information is needed besides the
> > sequence and identifier, I think that FASTA has the benefits of simplicity
> > and convenience. Moreover, it is ubiquitous: virtually every database I
> > know of is available in compliant FASTA format, in part because it is so
> > easy to accurately produce. (PIR is a notable exception).
> >
> > Steve
> >
> >
> >
> >
> >
> > On Thu, 27 Aug 1998, Lincoln Stein wrote:
> >
> > > Why don't we come up with a better sequence file standard that
> > > encapsulates these ideas of identifier, description, source database,
> > > etc., rather than relying on ad hoc conventions in the FASTA format?
> > > To deal with legacy FASTA files we could have a little reusable
> > > conversion filter to do the up-front work.
> > >
> > > Lincoln
> > >
> > > Georg Fuellen writes:
> > > >
> > > > Eitan wrote,
> > > > > I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta
> > > >
> > > > Unless I'm mistaken, there is no reason for any warning, see below.
> > > >
> > > > > databases (such as RepBase, if I'm not mistaken) are using non \S letters
> > > > > in their naming scheme. Most fasta parsers fail when they see
> > > > > >gb|AC000254 blah blah blah
> > > > > >AC000254_1 blash blah blah
> > > >
> > > > I get:
> > > > DB<54> $head = ">gb|AC000254 blah blah blah"
> > > >
> > > > DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> > > >
> > > > DB<56> x ($id, $desc)
> > > > 0 'gb|AC000254'
> > > > 1 'blah blah blah'
> > > >
> > > > I think this works as it should ?!
> > > > Non \S letters (i.e. whitespace letters, since \S matches
> > > > a non-whitespace character) are [ \t\n\r\f]. (That's not the same as \W.)
> > > > When I worked on the fasta pattern-matchings, the expression
> > > > used above seemed to be the most general I could come up with,
> > > > and it seems to me that the unusual case of ``identifiers with
> > > > space'' can only be dealt with by using a special flag that
> > > > says ``use the whole header as an id'' which means that we
> > > > assume the description is part of the id.
> > > > However, I'd suggest the user should take care of this case.
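
One way a user could handle this case is sketched below. The sub name parse_header and its second argument are hypothetical and are not part of Bio::PreSeq; the default branch reuses the expression quoted above, and the flag simply treats everything after the '>' as the identifier.

```perl
use strict;
use warnings;

# Hypothetical helper; $whole_as_id plays the role of the
# "use the whole header as an id" flag described above.
sub parse_header {
    my ($head, $whole_as_id) = @_;
    if ($whole_as_id) {
        # Everything after '>' (minus surrounding blanks) is the id.
        my ($id) = $head =~ /^>[ \t]*(.*?)[ \t]*$/;
        return ($id, '');
    }
    # Default: the id is the first run of non-whitespace characters.
    return $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
}

my ($id, $desc) = parse_header('>notch4 exon #1', 1);
print "id='$id' desc='$desc'\n";
```

Without the flag, the same header yields "notch4" as the id and "exon #1" as the description.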
> > > >
> > > > > In my case I overcome this with sed 's/^>gb|//' etc. or with perl
> > > > > scripts. It may pose a serious problem though if you want the
> > > > > Bio::PreSeq package to be universal.
> > > >
> > > > Please respond if I'm overlooking the problem that you seem to see.
> > > >
> > > > best wishes,
> > > > Georg Fuellen,
> > > > Univ. Bielefeld, Research Group in Practical Comp. Science
> > > > http://www.techfak.uni-bielefeld.de/bcd/welcome.html
> > > >
> > > > >
> > > > > Eitan.
> > > > >
> > > > >
> > > > >
> > > > > ======================================================================
> > > > > Eitan Rubin,
> > > > > Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.
> > > > > EMail: bcrubin@dapsas1.weizmann.ac.il
> > > > > Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
> > > > > EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the
> > > > > GCG help desk
> > > > >
> > > > > in vivo -> in vitro -> in silico
> > > > >
> > > > > ======================================================================
> > > > >
> > > > > On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
> > > > >
> > > > > >
> > > > > > Lincoln,
> > > > > >
> > > > > > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
> > > > > > files I've seen, a space is used to separate the identifier from the
> > > > > > description line. Here's how Bio::PreSeq::parse_fasta() grabs the
> > > > > > identifier and description:
> > > > > >
> > > > > > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> > > > > >
> > > > > > BTW, I just updated the Blast distribution (now 0.061). It includes
> > > > > > an important memory management fix that helps when crunching lots of
> > > > > > reports.
> > > > > >
> > > > > > Steve Chervitz
> > > > > > sac@genome.stanford.edu
> > > > > >
> > > > > >
> > > > > > On 26 Aug 1998, Lincoln Stein wrote:
> > > > > >
> > > > > > > Hi Steve,
> > > > > > >
> > > > > > > Does Blast.pm not deal correctly with sequence identifiers that
> > > > > > > contain spaces? I just tried to blast a database made from
> > > > > > > identifiers like this:
> > > > > > >
> > > > > > > >notch4 exon #1
> > > > > > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
> > > > > > > agag
> > > > > > >
> > > > > > > >notch4 exon #2
> > > > > > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
> > > > > > > ctcggggacaagggatctgcca
> > > > > > >
> > > > > > > >notch4 exon #3
> > > > > > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
> > > > > > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
> > > > > > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
> > > > > > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
> > > > > > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
> > > > > > >
> > > > > > > but I only got "notch4" as the hit. When I changed the spaces to
> > > > > > > dots, I got the full identifier.
> > > >
> > > >
> > > --
> > > ========================================================================
> > > Lincoln D. Stein Cold Spring Harbor Laboratory
> > > lstein@cshl.org Cold Spring Harbor, NY
> > > ========================================================================
> > > =========== Bioperl Project Mailing List Message Footer =======
> > > Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> > > For info about how to (un)subscribe, where messages are archived, etc:
> > > http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> > > ====================================================================
> > >
> >
> --
> ========================================================================
> Lincoln D. Stein Cold Spring Harbor Laboratory
> lstein@cshl.org Cold Spring Harbor, NY
> ========================================================================
>
>
> ----- End Included Message -----
>