Bioperl: Re: Bio::Tools::Blast (fwd)

Steven E. Brenner brenner@hyper.stanford.edu
Tue, 1 Sep 1998 18:02:20 -0700



---------- Forwarded message ----------
Date: Mon, 31 Aug 1998 09:10:17 -0400 (EDT)
From: Tom Madden <madden@corin.nlm.nih.gov>
To: brenner@hyper.stanford.edu,
    lstein@cshl.org
Cc: francis@corin.nlm.nih.gov
Subject: Re: Bioperl: Re: Bio::Tools::Blast

Steve, Lincoln:

	Francis forwarded your correspondence about FASTA to some of
us at the NCBI.  The document describing the usage of "|" in the
identifier is:

ftp://ncbi.nlm.nih.gov/blast/db/README

This is the convention used by the BLAST databases at the NCBI, and NCBI toolkit
routines can map the token to C-structures reliably.  I can't promise that
someone (even at the NCBI) isn't using "|" differently. I once had a
discussion with someone about the usage of "|" and whether any tokens 
should be allowed there.  I won't make that mistake again soon...

cheers,

Tom
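
As a rough illustration of the "|" usage described above, the following
Python sketch splits an identifier such as gb|AC000254 into its tokens. The
function name split_bar_identifier is hypothetical, and reading the first
token as a source-database tag is an assumption drawn from the example in
this thread; the README cited above remains the authority on what each
field means.

```python
def split_bar_identifier(identifier):
    """Split a FASTA identifier on "|" into (tag, fields).

    Assumption: when at least one "|" is present, the first token is a
    source-database tag (as in "gb|AC000254"). Identifiers without a "|"
    are returned unchanged with a tag of None.
    """
    tokens = identifier.split("|")
    if len(tokens) < 2:
        return None, [identifier]
    return tokens[0], tokens[1:]


if __name__ == "__main__":
    print(split_bar_identifier("gb|AC000254"))
    print(split_bar_identifier("AC000254_1"))
```

The sketch deliberately does not interpret the fields after the tag, since
their number and meaning depend on the database.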


> ----- Begin Included Message -----
> 
> From owner-vsns-bcd-perl@lists.uni-bielefeld.de Fri Aug 28 09:48:45 1998
> Date: Fri, 28 Aug 1998 09:26:34 -0400
> From: Lincoln Stein
> To: "Steven E. Brenner" <brenner@hyper.stanford.edu>
> Cc: vsns-bcd-perl@lists.uni-bielefeld.de
> Subject: Re: Bioperl: Re: Bio::Tools::Blast
> X-URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> X-Administrativa: see footer; if necessary, email to vsns-bcd-perl-owner@lists.uni-bielefeld.de
> 
> Hi Steve, et al.,
> 
> It's news to me that the FASTA format is defined in the FASTA package.
> I've searched the docs several times in the past, and I've searched it
> again just now and still can't find it.  Steve, can you point me to
> the correct file?  .c and .h files DON'T count!
> 
> There is some documentation on FASTA at GenBank's site, but it is
> totally underspecified.  It basically says what characters are valid
> in DNA and Peptide sequences.
> 
> In any case, what annoys me is that many people are using the
> description field to encapsulate meta-data, but nobody is doing it in
> the same way.  Even at the NCBI, I see different conventions.  For
> example, Greg Schuler has a simple tag=value notation, but FASTA files
> produced by other NCBI scientists use the | symbol to delineate
> positional parameters.  A real mess.
> 
> My original comments stand.  
> 
> Is there an ASN.1 parser in BioPerl?
> 
> Lincoln
> 
> Steven E. Brenner writes:
>  > 
>  > I'm not sure why people seem to think that the FASTA format isn't defined.
>  > Bill Pearson does define it in his FASTA package (albeit not as precisely
>  > as one might like). It consists of the following items in sequence:
>  > 
>  > 1) '>' character
>  > 2) identifier string without whitespace
>  > 3) whitespace other than CR/LF
>  > 4) description -- free text.  optional
>  > 5) CR and/or LF
>  > 6) sequence, possibly including CR/LF
>  > 
>  > 
>  > In practice, there are some additional conventions, which aren't documented:
>  > 1) whitespace often is permitted (but not desired) between the '>' and the
>  >    identifier
>  > 2) the sequence is broken into 60 characters per line
>  > 3) one can terminate the sequence with a '*' character, but this is
>  >    not desirable
>  > 4) one can store multiple alignments by putting dashes in the
>  >    sequences
>  > 
>  > 
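
The six items and the additional conventions listed above can be sketched
as a small reader. This is a minimal Python illustration, not the
Bio::PreSeq implementation: the name parse_fasta is reused only for
familiarity, and tolerating whitespace after '>', keeping dashes, and
dropping a single trailing '*' are assumptions taken from the list above.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    ident, desc, chunks = None, "", []

    def finish():
        seq = "".join(chunks)
        # Convention 3: drop one optional terminating '*'.
        return ident, desc, (seq[:-1] if seq.endswith("*") else seq)

    for raw in lines:
        line = raw.rstrip("\r\n")        # item 5: CR and/or LF ends the line
        if line.startswith(">"):         # item 1: '>' opens a record
            if ident is not None:
                yield finish()
            # Items 2-4: identifier, whitespace, optional description.
            # split(None, 1) also skips whitespace right after '>'.
            parts = line[1:].split(None, 1)
            ident = parts[0] if parts else ""
            desc = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif ident is not None:
            # Item 6: sequence lines are concatenated; dashes are kept.
            chunks.append(line.strip())
    if ident is not None:
        yield finish()


if __name__ == "__main__":
    demo = [">notch4 exon #1\n", "atgc\n", "agag\n"]
    for record in parse_fasta(demo):
        print(record)
```

Lines before the first '>' are ignored in this sketch; a stricter reader
might treat them as an error instead.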
>  > NCBI has added additional information to the FASTA format, by using
>  > structured identifiers, which indicate what source database an entry came
>  > from. However, this format is entirely backwards compatible with the
>  > standard FASTA format.  I have seen this structure documented somewhere on
>  > their web site.
>  > 
>  > I think that there are more than enough file formats out there, and I
>  > would be highly reluctant to introduce a new one, unless it served a very
>  > specific need.  For uses where little information is needed besides the
>  > sequence and identifier, I think that FASTA has the benefits of simplicity
>  > and convenience.  Moreover, it is ubiquitous: virtually every database I
>  > know of is available in compliant FASTA format, in part because it is so
>  > easy to accurately produce. (PIR is a notable exception).
>  > 
>  > Steve
>  > 
>  > 
>  > 
>  > 
>  > 
>  > On Thu, 27 Aug 1998, Lincoln Stein wrote:
>  > 
>  > > Why don't we come up with a better sequence file standard that
>  > > encapsulates these ideas of identifier, description, source database,
>  > > etc., rather than relying on ad hoc conventions in the FASTA format?
>  > > To deal with legacy FASTA files we could have a little reusable
>  > > conversion filter to do the up-front work.
>  > > 
>  > > Lincoln
>  > > 
>  > > Georg Fuellen writes:
>  > >  > 
>  > >  > Eitan wrote,
>  > >  > > I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta
>  > >  > 
>  > >  > Unless I'm mistaken, there is no reason for any warning, see below.
>  > >  > 
>  > >  > > databases (such as RepBase, if I'm not mistaken) are using non \S letters
>  > >  > > in their naming scheme. Most fasta parsers fail when they see
>  > >  > > >gb|AC000254 blah blah blah
>  > >  > > >AC000254_1 blash blah blah
>  > >  > 
>  > >  > I get:
>  > >  >   DB<54> $head = ">gb|AC000254 blah blah blah"          
>  > >  > 
>  > >  >   DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
>  > >  > 
>  > >  >   DB<56> x ($id, $desc)                                     
>  > >  > 0  'gb|AC000254'
>  > >  > 1  'blah blah blah'
>  > >  > 
>  > >  > I think this works as it should ?!
>  > >  > Non \S letters (i.e. whitespace letters, since \S matches
>  > >  > a non-whitespace character) are [ \t\n\r\f]. (That's not the same as \W.)
>  > >  > When I worked on the fasta pattern-matchings, the expression
>  > >  > used above seemed to be the most general I could come up with,
>  > >  > and it seems to me that the unusual case of ``identifiers with
>  > >  > space'' can only be dealt with by using a special flag that
>  > >  > says ``use the whole header as an id'', which means that we
>  > >  > assume the description is part of the id.
>  > >  > However, I'd suggest the user should take care of this case.
>  > >  > 
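
The pattern above, together with the ``use the whole header as an id''
flag, might look like this in Python. It is only a sketch: the regular
expression is copied from the Perl line quoted in this thread, while the
function name split_header and the keyword whole_header_as_id are
hypothetical and not part of Bio::PreSeq.

```python
import re

# Same pattern as the Perl expression quoted in this thread.
HEADER_RE = re.compile(r"^>[ \t]*(\S*)[ \t]*(.*)$")


def split_header(head, whole_header_as_id=False):
    """Return (identifier, description) for a FASTA header line.

    With whole_header_as_id=True (a hypothetical flag), everything after
    '>' is treated as the identifier and the description is left empty.
    """
    if not head.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    if whole_header_as_id:
        return head[1:].strip(), ""
    match = HEADER_RE.match(head)
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(split_header(">gb|AC000254 blah blah blah"))
    print(split_header(">notch4 exon #1", whole_header_as_id=True))
```

Without the flag, a header such as ">notch4 exon #1" yields the identifier
"notch4", which matches the behaviour reported at the start of this thread.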
>  > >  > > In my case I overcome this with sed 's/^>gb|//' etc. or with perl 
>  > >  > > scripts. It may pose a serious problem though if you want the 
>  > >  > > Bio::PreSeq package to be universal.
>  > >  > 
>  > >  > Please respond if I'm overlooking the problem that you seem to see.
>  > >  > 
>  > >  > best wishes,
>  > >  > Georg Fuellen,
>  > >  > Univ. Bielefeld, Research Group in Practical Comp. Science
>  > >  > http://www.techfak.uni-bielefeld.de/bcd/welcome.html
>  > >  > 
>  > >  > > 
>  > >  > >        Eitan.  
>  > >  > > 
>  > >  > > 
>  > >  > > ======================================================================
>  > >  > > Eitan Rubin,
>  > >  > > Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.  
>  > >  > > EMail: bcrubin@dapsas1.weizmann.ac.il
>  > >  > > Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
>  > >  > > EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the GCG help desk
>  > >  > > 
>  > >  > > in vivo -> in vitro -> in silico
>  > >  > > ======================================================================
>  > >  > > 
>  > >  > > On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
>  > >  > > 
>  > >  > > > 
>  > >  > > > Lincoln, 
>  > >  > > > 
>  > >  > > > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
>  > >  > > > files I've seen, a space is used to separate the identifier from the
>  > >  > > > description line. Here's how Bio::PreSeq::parse_fasta() grabs the
>  > >  > > > identifier and description:
>  > >  > > > 
>  > >  > > > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
>  > >  > > > 
>  > >  > > > BTW, I just updated the Blast distribution (now 0.061). It includes
>  > >  > > > an important memory management fix that helps when crunching lots of
>  > >  > > > reports.
>  > >  > > > 
>  > >  > > > Steve Chervitz
>  > >  > > > sac@genome.stanford.edu
>  > >  > > > 
>  > >  > > > 
>  > >  > > > On 26 Aug 1998, Lincoln Stein wrote:
>  > >  > > > 
>  > >  > > > > Hi Steve,
>  > >  > > > > 
>  > >  > > > > Does Blast.pm not deal correctly with sequence identifiers that
>  > >  > > > > contain spaces?  I just tried to blast a database made from
>  > >  > > > > identifiers like this:
>  > >  > > > > 
>  > >  > > > > >notch4 exon #1
>  > >  > > > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
>  > >  > > > > agag
>  > >  > > > > 
>  > >  > > > > >notch4 exon #2
>  > >  > > > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
>  > >  > > > > ctcggggacaagggatctgcca
>  > >  > > > > 
>  > >  > > > > >notch4 exon #3
>  > >  > > > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
>  > >  > > > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
>  > >  > > > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
>  > >  > > > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
>  > >  > > > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
>  > >  > > > > 
>  > >  > > > > but I only got "notch4" as the hit.  When I changed the spaces to
>  > >  > > > > dots, I got the full identifier.
>  > >  > 
>  > >  > 
>  > > -- 
>  > > ========================================================================
>  > > Lincoln D. Stein                           Cold Spring Harbor Laboratory
>  > > lstein@cshl.org			                  Cold Spring Harbor, NY
>  > > ========================================================================
>  > > =========== Bioperl Project Mailing List Message Footer =======
>  > > Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
>  > > For info about how to (un)subscribe, where messages are archived, etc:
>  > > http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
>  > > ====================================================================
>  > > 
>  > 
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein@cshl.org			                  Cold Spring Harbor, NY
> ========================================================================
> 
> 
> ----- End Included Message -----
> 
