[Bioperl-l] parsing blast output and obtaining seq info

Jason Stajich jason@chg.mc.duke.edu
Mon, 30 Jul 2001 17:45:11 -0400 (EDT)


Which blast paring module are you using?  Bio::Tools::Blast or
Bio::Tools::BPlite?
 
DB::GenBank is not going to work if you are retrieving one sequence at a
time over the course of a huge blast file - NCBI will shut off access to
you because they'll think you are spamming their webserver.   Also the web
retrieval of a sequence is notoriously problematic - it dies on everyone
at intermittent points w/o good reason. 

We are working on ways to use ncbi's network entrez in perl to alleviate
this problem, but might take a little while.  I'm also not sure that the
latency here associated with fetching a sequence remotely is worth the
'up-to-dateness' implied when retrieving sequences from their server.  If
you did a blast search on a db, you really want the actual sequence that
was in that db anyways.
 
You might be better served downloading the blast db which you are
searching against, building an index using Bio::Index::Fasta and doing all
your sequence retrieval locally - assuming you have enough disk space for
this.

I've suggested this to other people and haven't gotten major complaints
about implementing it, perhaps a setup script for doing this should
make its way into the scripts directory some time?

-jason
On Mon, 30 Jul 2001, Gopinath Ganji wrote:

> Hey everyone,
> Been teaching myself some bioperl to automate a few tasks in our project
> for the past couple of weeks. The tasks at hand may be summarized as
> follows:
> - parse blast output for frac_aligned_query which is within the filter()
> 
> - for each filtered hit, obtain seq (by using the
> get_by_acc($hit->name()) , obtain subseq and for - (anti) sense strands,
> revcom() the string obtained from subseq
> 
> The blast output is in a multi-report format. For testing purposes, a
> shorter version has been used which works problem free. However, upon
> applying the script to the original file, I have frequently encountered
> error messages like "subseq() not defined" or "revcom() not defined"
> each time halting at a different line in the script. So when I actually
> looked at the output generated by the script, the script seems to halt
> at a different hit each time. Appears that the program crashes for some
> reason - runs out of memory, perhaps? I do know that parse method is
> known for memory leaks. However, the same scenario is duplicated for
> merely obtaining seq using the Blat::DB::GenBank module (using the
> get_by_acc method (see above)). Is this module known to suffer from
> leaks or crashes?  Please clarify. Any help/suggestions/pointers would
> be much appreciated.
> 
> Thanks in advance,
> Gopi
> 
> [PS: Kindly let me know if I am not making any sense :-)]
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 

Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center 
http://www.chg.duke.edu/