[Bioperl-l] getting database hit sequences

Tobias Thierer bioperl-contact@computational-biology.net
Mon, 21 Oct 2002 03:49:12 +0200


Hi,

I am currently using StandAloneBlast for running and parsing Blast queries 
on a local database (Unigene). The parser is BPLite, the standard one for 
StandAloneBlast.

I would like to manually align (Needleman-Wunsch) my query sequences with 
the database hits that Blast found, so somehow I need to get the sequence 
of the database hits (Subjects). I want the entire sequence that is in the 
database for this subject, not only the portion that formed an HSP with 
some part of my query sequence.

After extensive reading of the documentation I concluded that there is no 
direct way to do so. Looking at the Blast reports that BPLite parses, I 
found that they do not even contain this sequence, so there is no way for 
the parser to get it. But it must be possible somehow (aren't other people 
interested in the sequences that scored hits?). I'd like to verify that 
the Blast HSPs really cover all sequence similarities and that Blast 
didn't just fail to extend them further.

What is the best (or at least some) way to get the entire sequence? The 
annotation of the database hits ($hsp->query->name or so) contains 
substrings of the form "/gi=some_gi_number", so I could perhaps extract 
the GI with a regexp. But how do I efficiently get the right sequence for 
a specific GI number? Scanning the entire database for every hit costs 
O(num_hits * num_sequences), which approaches O(num_sequences^2) and is 
therefore much too slow to be feasible.
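
To make the question concrete, the kind of lookup I have in mind could be 
sketched in plain Perl roughly as follows. This is only a sketch under 
assumptions: the database is available as a plain FASTA file whose header 
lines carry the "/gi=..." tag, and the helper names (extract_gi, 
build_gi_index, fetch_sequence) are made up for illustration; they are not 
Bioperl functions. One pass over the file records the byte offset of each 
record, so every later lookup is a seek instead of a rescan.

```perl
use strict;
use warnings;

# Pull the GI number out of a description containing "/gi=12345".
# Returns undef if no such tag is present.
sub extract_gi {
    my ($desc) = @_;
    return $desc =~ m{/gi=(\d+)} ? $1 : undef;
}

# Single pass over a FASTA file: remember the byte offset of each
# header line, keyed by the GI found in that header.
sub build_gi_index {
    my ($fasta_path) = @_;
    my %offset_of;
    open my $fh, '<', $fasta_path or die "Cannot open $fasta_path: $!";
    my $pos = tell $fh;
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            my $gi = extract_gi($line);
            $offset_of{$gi} = $pos if defined $gi;
        }
        $pos = tell $fh;    # offset of the next line to be read
    }
    close $fh;
    return \%offset_of;
}

# Jump to the stored offset and collect sequence lines up to the next
# header. Returns undef if the GI is not in the index.
sub fetch_sequence {
    my ($fasta_path, $index, $gi) = @_;
    return unless exists $index->{$gi};
    open my $fh, '<', $fasta_path or die "Cannot open $fasta_path: $!";
    seek $fh, $index->{$gi}, 0;
    my $header = <$fh>;     # skip the header line itself
    my $seq = '';
    while (my $line = <$fh>) {
        last if $line =~ /^>/;
        $line =~ s/\s+//g;  # drop newlines and stray whitespace
        $seq .= $line;
    }
    close $fh;
    return $seq;
}
```

With the index built once, fetching the subject for each hit costs one 
seek plus reading that single record. What I do not know is whether 
Bioperl already offers an equivalent indexed lookup.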

Is there an easy way to get the entire sequences that formed HSPs with my 
query sequence, preferably with Bioperl?

Any help would be greatly appreciated!

Regards,

	Tobias