Bioperl: Re: Bio::Tools::Blast

Georg Fuellen fuellen@dali.Mathematik.Uni-Bielefeld.DE
Thu, 27 Aug 1998 10:34:01 +0000 (GMT)


Eitan wrote,
> I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta 

Unless I'm mistaken, there is no reason for any warning, see below.

> databases (such as RepBase, if I'm not mistaken) are using non \S letters 
> in their naming scheme. Most fasta parsers fail when they see
> >gb|AC000254 blah blah blah
> >AC000254_1 blash blah blah

I get:
  DB<54> $head = ">gb|AC000254 blah blah blah"          

  DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;

  DB<56> x ($id, $desc)                                     
0  'gb|AC000254'
1  'blah blah blah'

I think this works as it should.
Non-\S characters (i.e. whitespace, since \S matches any
non-whitespace character) are [ \t\n\r\f]; that is not the same as \W,
so the "|" in "gb|AC000254" is matched by \S and stays in the id.
When I worked on the fasta pattern matching, the expression
used above was the most general one I could come up with,
and it seems to me that the unusual case of ``identifiers with
spaces'' can only be handled by a special flag that says
``use the whole header as an id'', i.e. we assume the
description is part of the id.
However, I'd suggest that the user take care of this case.
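
To make the two behaviours concrete, here is a minimal sketch. The
function name split_fasta_header and its second argument are
hypothetical, chosen only for illustration; they are not part of
Bio::PreSeq. The default branch uses the same pattern as above, and the
flagged branch shows one way a ``whole header as id'' option might look.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::PreSeq): split a fasta header
# line into (id, desc). If $whole_header_as_id is true, everything
# after ">" becomes the id and the description is left empty.
sub split_fasta_header {
    my ($head, $whole_header_as_id) = @_;

    if ($whole_header_as_id) {
        # Drop ">" plus surrounding blanks; keep inner spaces in the id.
        my ($id) = $head =~ /^>[ \t]*(.*?)[ \t]*$/;
        return ($id, '');
    }

    # Same pattern as above: the id is the first run of non-whitespace
    # characters, the rest of the line is the description.
    my ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
    return ($id, $desc);
}

for my $head ('>gb|AC000254 blah blah blah', '>notch4 exon #1') {
    my ($id, $desc) = split_fasta_header($head, 0);
    my ($whole)     = split_fasta_header($head, 1);
    printf "default: id=[%s] desc=[%s]   flag: id=[%s]\n",
        $id, $desc, $whole;
}
```

With the default pattern, ">notch4 exon #1" yields the id "notch4" and
the description "exon #1"; with the flag set, the id is the whole
"notch4 exon #1". Whether such a flag belongs in the parser or in user
code is a design question this sketch does not settle.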

> In my case I overcome this with sed 's/^>gb|//' etc. or with perl 
> scripts. It may pose a serious problem though if you want the 
> Bio::PreSeq package to be universal.

Please respond if I'm overlooking the problem that you seem to see.

best wishes,
Georg Fuellen,
Univ. Bielefeld, Research Group in Practical Comp. Science
http://www.techfak.uni-bielefeld.de/bcd/welcome.html

> 
>        Eitan.  
> 
> 
> ======================================================================
> Eitan Rubin,
> Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.  
> EMail: bcrubin@dapsas1.weizmann.ac.il
> Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
> EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the
> GCG help desk
> 
> in vivo -> in vitro -> in silico
> ======================================================================
> 
> On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
> 
> > 
> > Lincoln, 
> > 
> > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
> > files I've seen, a space is used to separate the identifier from the 
> > description line. Here's how Bio::PreSeq::parse_fasta() grabs the 
> > identifier and description:
> > 
> > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> > 
> > BTW, I just updated the Blast distribution (now 0.061). It includes   
> > an important memory management fix that helps when crunching lots of 
> > reports. 
> > 
> > Steve Chervitz
> > sac@genome.stanford.edu
> > 
> > 
> > On 26 Aug 1998, Lincoln Stein wrote:
> > 
> > > Hi Steve,
> > > 
> > > Does Blast.pm not deal correctly with sequence identifiers that
> > > contain spaces?  I just tried to blast a database made from
> > > identifiers like this:
> > > 
> > > >notch4 exon #1
> > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
> > > agag
> > > 
> > > >notch4 exon #2
> > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
> > > ctcggggacaagggatctgcca
> > > 
> > > >notch4 exon #3
> > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
> > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
> > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
> > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
> > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
> > > 
> > > but I only got "notch4" as the hit.  When I changed the spaces to
> > > dots, I got the full identifier.

