[Bioperl-l] parsing blast output and obtaining seq info

Gopinath Ganji gopi@bioinfo.sickkids.on.ca
Tue, 31 Jul 2001 23:27:41 -0400


Thanks a ton, Jason! That works :-) Thought web retrieval would be perfect
:-)  Yeah, I am sure the network entrez option would be the way to go.
Here's another petty question : Once I do download and index the sequences,
how do I match a seq id(from my indexed database) to, say, a certain hit from
the blast output for further sequence manipulations (extract subseq, revcomp,
etc.).

Thanks again,
Gopi
[PS: I am using the Bio::Tools::Blast for parsing.]
Jason Stajich wrote:

> Which blast paring module are you using?  Bio::Tools::Blast or
> Bio::Tools::BPlite?
>
> DB::GenBank is not going to work if you are retrieving one sequence at a
> time over the course of a huge blast file - NCBI will shut off access to
> you because they'll think you are spamming their webserver.   Also the web
> retrieval of a sequence is notoriously problematic - it dies on everyone
> at intermittent points w/o good reason.
>
> We are working on ways to use ncbi's network entrez in perl to alleviate
> this problem, but might take a little while.  I'm also not sure that the
> latency here associated with fetching a sequence remotely is worth the
> 'up-to-dateness' implied when retrieving sequences from their server.  If
> you did a blast search on a db, you really want the actual sequence that
> was in that db anyways.
>
> You might be better served downloading the blast db which you are
> searching against, building an index using Bio::Index::Fasta and doing all
> your sequence retrieval locally - assuming you have enough disk space for
> this.
>
> I've suggested this to other people and haven't gotten major complaints
> about implementing it, perhaps a setup script for doing this should
> make its way into the scripts directory some time?
>
> -jason
> On Mon, 30 Jul 2001, Gopinath Ganji wrote:
>
> > Hey everyone,
> > Been teaching myself some bioperl to automate a few tasks in our project
> > for the past couple of weeks. The tasks at hand may be summarized as
> > follows:
> > - parse blast output for frac_aligned_query which is within the filter()
> >
> > - for each filtered hit, obtain seq (by using the
> > get_by_acc($hit->name()) , obtain subseq and for - (anti) sense strands,
> > revcom() the string obtained from subseq
> >
> > The blast output is in a multi-report format. For testing purposes, a
> > shorter version has been used which works problem free. However, upon
> > applying the script to the original file, I have frequently encountered
> > error messages like "subseq() not defined" or "revcom() not defined"
> > each time halting at a different line in the script. So when I actually
> > looked at the output generated by the script, the script seems to halt
> > at a different hit each time. Appears that the program crashes for some
> > reason - runs out of memory, perhaps? I do know that parse method is
> > known for memory leaks. However, the same scenario is duplicated for
> > merely obtaining seq using the Blat::DB::GenBank module (using the
> > get_by_acc method (see above)). Is this module known to suffer from
> > leaks or crashes?  Please clarify. Any help/suggestions/pointers would
> > be much appreciated.
> >
> > Thanks in advance,
> > Gopi
> >
> > [PS: Kindly let me know if I am not making any sense :-)]
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center
> http://www.chg.duke.edu/