[Biopython-dev] SearchIO HSP indexing

Wibowo Arindrarto w.arindrarto at gmail.com
Sun Feb 10 15:31:51 UTC 2013


Hi Colin,

>> As for your point about the alignment code:
>>
>> > I was wondering if there was any code in SearchIO to align high-scoring
>> > segment pairs against the same hit? I see the fragmentation code but
>> > that seems specific to BLAT results and when I look at the
>> > HSPFragments in the QueryResult object it does not seem to combine
>> > multiple HSPs against the same hit even if they are not overlapping.
>>
>> SearchIO relies on BLAST to do this: BLAST has already grouped all
>> HSPs aligning to the same database sequence into one group (all of
>> which are accessible through the Hit object). I've always assumed that
>> if two HSPs came from the same database entry (Hit), they are grouped
>> into one Hit by BLAST, regardless of whether they overlap or not. Have
>> you seen any results from BLAST that show otherwise?
>>
>
> I have a couple of examples where BLAST doesn't combine the HSPs as you
> would expect. It seems to mainly occur because the HSP alignments overlap
> and to combine them would mean including more gaps in each HSP. For example,
> ftsK in E. coli (ftsK.blast) or aceF in E. coli (aceF.blast). In the second
> case, the first HSP spans the entire query and there are two additional HSPs
> that are overlapped by it.
>
> I know that BioPerl tries to align/tile (in Bio::Search::BlastUtils) the
> HSPs somewhat when required but some people are hesitant to use their method
> in certain situations (e.g., with tblastn results that overestimate some of
> the metrics). They also implement additional functionality so that the user
> could do a complete Smith-Waterman alignment if they wanted to.

Thanks for including the files!

At the moment, no, SearchIO doesn't have any code to 'assemble'/'tile'
overlapping HSPs. The 'fragment' you're seeing in the BLAT parser is
simply the name we use for the noncontiguous blocks inside a single
reported HSP; it is not a mechanism for combining separate HSPs.

We may be able to add some functions to return the intervals for such
overlapping HSPs, given a Hit object. But I'm a bit hesitant to go
further than that (i.e. to the point where we merge the statistics of
each HSP and assign them to the assembled HSP). This is mostly because
such assembly seems very specific to each program's statistics and
output format (BLAST's merge would probably differ from BLAT's, and a
merge based on BLAST XML may differ from one based on tabular BLAST).
If anything, perhaps these functions deserve their own space in
SearchUtils (taking parallels from Bio.SeqIO and Bio.SeqUtils)?
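
To make the interval idea concrete, here is a rough sketch of what such
a function might look like. Nothing like this exists in SearchIO right
now; the names merge_intervals and tiled_query_intervals are made up for
illustration, and it assumes query_start/query_end are 0-based,
half-open coordinates on each HSP. It only merges coordinates and does
not touch any of the HSP statistics.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or directly abuts) the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Gap before this interval: start a new block.
            merged.append((start, end))
    return merged


def tiled_query_intervals(hit):
    """Hypothetical helper: query intervals covered by a Hit's HSPs.

    Assumes `hit.hsps` is a list of HSP objects, each exposing
    `query_start` and `query_end`. Only coordinates are combined;
    scores, e-values and other statistics are left alone.
    """
    return merge_intervals(
        (hsp.query_start, hsp.query_end) for hsp in hit.hsps
    )
```

With made-up coordinates similar to the aceF case (one HSP spanning the
whole query and two shorter ones inside it), say (0, 630), (100, 200)
and (300, 400), merge_intervals would return a single block, (0, 630).
The same approach would work on the hit side by swapping in hit_start
and hit_end.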

regards,
Bow


