Bioperl: Re: Bio::Tools::Blast
Georg Fuellen
fuellen@dali.Mathematik.Uni-Bielefeld.DE
Thu, 27 Aug 1998 10:34:01 +0000 (GMT)
Eitan wrote,
> I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta
Unless I'm mistaken, there is no reason for any warning, see below.
> databases (such as RepBase, if I'm not mistaken) are using non \S letters
> in their naming scheme. Most fasta parsers fail when they see
> >gb|AC000254 blah blah blah
> >AC000254_1 blash blah blah
I get:
DB<54> $head = ">gb|AC000254 blah blah blah"
DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
DB<56> x ($id, $desc)
0 'gb|AC000254'
1 'blah blah blah'
I think this works as it should?!
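For anyone who wants to repeat the check outside the debugger, here is a minimal sketch that applies the same pattern to the two headers quoted above and to one of the ``notch4'' headers from Lincoln's message. The name split_header is only an illustrative helper of mine, not a function in Bio::PreSeq.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative helper (hypothetical name, not part of Bio::PreSeq).
# Uses the same pattern as the debugger session: the id is the first run of
# non-whitespace characters after '>', the description is the rest of the line.
sub split_header {
    my ($head) = @_;
    my ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
    return ($id, $desc);
}

for my $head ('>gb|AC000254 blah blah blah',
              '>AC000254_1 blash blah blah',
              '>notch4 exon #1') {
    my ($id, $desc) = split_header($head);
    printf "id=[%s] desc=[%s]\n", $id, $desc;
}
```

The first two headers split at the first blank, with the ``|'' kept inside the id; the third gives ``notch4'' as the id and ``exon #1'' as the description, which matches what Lincoln reported.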
Non-\S characters (i.e. whitespace characters, since \S matches
any non-whitespace character) are [ \t\n\r\f]. That's not the same as \W,
which matches any non-word character, so ``|'' is \W but still \S.
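The difference between \S and \W can be checked directly. This is a small sketch; the two helper names are made up for the illustration. (Newer Perl versions count a few more characters as whitespace, vertical tab for instance, but that doesn't change the point.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# True if the string is non-empty and contains no whitespace at all,
# i.e. (\S*) in the header pattern would capture it whole.
sub is_single_token {
    my ($s) = @_;
    return $s =~ /^\S+$/ ? 1 : 0;
}

# True if the string contains at least one non-word character
# (anything other than letters, digits and underscore).
sub has_nonword_char {
    my ($s) = @_;
    return $s =~ /\W/ ? 1 : 0;
}

for my $s ('gb|AC000254', 'AC000254_1', 'notch4 exon #1') {
    printf "%-16s single token: %d  contains \\W: %d\n",
        $s, is_single_token($s), has_nonword_char($s);
}
```

``gb|AC000254'' contains a \W character (the ``|'') and is still a single \S token, so the pattern keeps it intact; only the identifier with blanks fails the single-token test.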
When I worked on the fasta pattern matching, the expression
used above seemed to be the most general I could come up with,
and it seems to me that the unusual case of ``identifiers with
spaces'' can only be dealt with by using a special flag that
says ``use the whole header as an id'', which means that we
assume the description is part of the id.
However, I'd suggest the user should take care of this case.
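To make the flag idea concrete, here is a rough sketch of what a user-side wrapper might look like. Both the function name parse_header and the option whole_header_as_id are my assumptions for this example; nothing like this exists in Bio::PreSeq as far as this thread goes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper, NOT part of Bio::PreSeq.
# With whole_header_as_id set, everything after '>' (minus surrounding
# blanks) becomes the id and the description is left empty.
# Without it, the usual split at the first whitespace applies.
sub parse_header {
    my ($head, %opt) = @_;
    if ($opt{whole_header_as_id}) {
        my ($id) = $head =~ /^>[ \t]*(.*?)[ \t]*$/;
        return ($id, '');
    }
    my ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
    return ($id, $desc);
}

my $head = '>notch4 exon #1';

my ($id, $desc) = parse_header($head);
printf "default:      id=[%s] desc=[%s]\n", $id, $desc;

($id, $desc) = parse_header($head, whole_header_as_id => 1);
printf "whole header: id=[%s] desc=[%s]\n", $id, $desc;
```

The price of the flag is that the description is lost as a separate field, which is why I'd rather leave the decision to the user than make it the default.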
> In my case I overcome this with sed 's/^>gb|//' etc. or with perl
> scripts. It may pose a serious problem though if you want the
> Bio::PreSeq package to be universal.
Please respond if I'm overlooking the problem that you seem to see.
best wishes,
Georg Fuellen,
Univ. Bielefeld, Research Group in Practical Comp. Science
http://www.techfak.uni-bielefeld.de/bcd/welcome.html
>
> Eitan.
>
>
> ======================================================================
> Eitan Rubin,
> Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.
> EMail: bcrubin@dapsas1.weizmann.ac.il
> Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
> EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the
> GCG help desk
>
> in vivo -> in vitro -> in silico
> ======================================================================
>
> On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
>
> >
> > Lincoln,
> >
> > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
> > files I've seen, a space is used to separate the identifier from the
> > description line. Here's how Bio::PreSeq::parse_fasta() grabs the
> > identifier and description:
> >
> > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
> >
> > BTW, I just updated the Blast distribution (now 0.061). It includes
> > an important memory management fix that helps when crunching lots of
> > reports.
> >
> > Steve Chervitz
> > sac@genome.stanford.edu
> >
> >
> > On 26 Aug 1998, Lincoln Stein wrote:
> >
> > > Hi Steve,
> > >
> > > Does Blast.pm not deal correctly with sequence identifiers that
> > > contain spaces? I just tried to blast a database made from
> > > identifiers like this:
> > >
> > > >notch4 exon #1
> > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
> > > agag
> > >
> > > >notch4 exon #2
> > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
> > > ctcggggacaagggatctgcca
> > >
> > > >notch4 exon #3
> > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
> > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
> > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
> > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
> > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
> > >
> > > but I only got "notch4" as the hit. When I changed the spaces to
> > > dots, I got the full identifier.