[Bioperl-l] getting database hit sequences
Tobias Thierer
bioperl-contact@computational-biology.net
Mon, 21 Oct 2002 03:49:12 +0200
Hi,
I am currently using StandAloneBlast for running and parsing Blast queries
on a local database (Unigene). The Parser is BPLite, the standard one for
StandAloneBlast.
I would like to manually align (Needleman-Wunsch) my query sequences with
the database hits that Blast found, so somehow I need to get the sequence
of the database hits (Subjects). I want the entire sequence that is in the
database for this subject, not only the portion that formed a HSP with
some part of my query sequence.
After extensive reading of the documentation I concluded that there is no
direct way to do so. Looking at the blast reports that BPLite
parses, I found that they do not even contain this sequence, so there is
no way for the parser to get it. But it must somehow be possible (aren't
other people interested in the sequences that scored hits???). I'd like
to verify that the Blast HSPs really cover all sequence similarities and
that Blast didn't just fail to extend them further.
What is the best (or at least some) way to get the entire sequence? The
annotation of the database hits ($hsp->query->name or so) contains
substrings of the form "/gi=some_gi_number", so I could perhaps extract
the GI with a regexp. But how do I get the right sequence for a specific
GI number efficiently? Scanning the entire database for every hit costs
O(num_hits * num_sequences), up to O(num_sequences^2) in the worst case,
and is therefore much too slow to be feasible.
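
To make the lookup problem concrete, here is a minimal sketch (in Python,
not Bioperl) of one way to avoid rescanning the database: read the FASTA
file once, remember the byte offset of every record keyed by its GI, and
then seek directly to a record when a hit needs it. It assumes that each
header line carries a "/gi=<digits>" substring as described above; the
function names and the file layout are hypothetical and only serve the
illustration. Bioperl's Bio::Index::Fasta module appears to be built for
the same kind of indexed retrieval, but this sketch does not use it.

```python
import re

# Assumption: headers contain a token such as "/gi=12345".
GI_PATTERN = re.compile(r"/gi=(\d+)")


def extract_gi(description):
    """Return the GI as a string, or None if no '/gi=' token is present."""
    match = GI_PATTERN.search(description)
    return match.group(1) if match else None


def build_gi_index(fasta_path):
    """Scan the FASTA file once; map each GI to the offset of its header."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                gi = extract_gi(line.decode("ascii", "replace"))
                if gi is not None:
                    index[gi] = offset
    return index


def fetch_sequence(fasta_path, index, gi):
    """Seek to the indexed record and return its complete sequence."""
    parts = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[gi])
        handle.readline()  # skip the header line itself
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            parts.append(line.strip().decode("ascii"))
    return "".join(parts)
```

Building the index is a single pass over the database, and each later
fetch reads only the one record it needs, so the total cost grows with
num_sequences + num_hits rather than with their product. The index lives
in memory here; a persistent on-disk index would avoid rebuilding it for
every run.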
Is there any possibility to easily get the entire sequences that formed
HSPs with my query sequence, preferably with Bioperl?
Any help would be greatly appreciated!
Regards,
Tobias