[Bioperl-l] interesting blastxml issue

Jason Stajich jason@cgt.mc.duke.edu
Thu, 13 Dec 2001 14:06:03 -0500 (EST)


So when parsing NCBI blastx output, the query coordinates come out
slightly differently in the XML than in the plain text output.

Normally we infer the strand of the query sequence in a blastx run from
the start/end positions: if start > end, we swap them so that start is
always less than end and set the strand to -1.
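
To make that convention concrete, here is a minimal sketch of the
normalization. The sub name normalize_coords is made up for illustration
and is not an actual BioPerl method; it only assumes integer start/end
values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the pair ordered so
# that start <= end, plus a strand of -1 if the input was reversed.
sub normalize_coords {
    my ($start, $end) = @_;
    my $strand = 1;
    if ( $start > $end ) {
        ($start, $end) = ($end, $start);
        $strand = -1;
    }
    return ($start, $end, $strand);
}

my ($s, $e, $strand) = normalize_coords(621, 400);
print "$s..$e strand $strand\n";   # prints "400..621 strand -1"
```

Note that this only works if a minus-strand hit is reported with
start > end in the first place.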

However, in the NCBI XML output from blastall (2.1.3) we get the following:

[only relevant stuff shown from the actual blastx run]
<Hsp_query-from>400</Hsp_query-from>
<Hsp_query-to>621</Hsp_query-to>
<Hsp_query-frame>-3</Hsp_query-frame>

while in plain text we get (from bl2seq):

 Score = 53.5 bits (127), Expect(2) = 3e-12
 Identities = 27/74 (36%), Positives = 40/74 (53%)
 Frame = -3

Yet if I look at the alignment itself in the original blast text output,
the query runs from 621 down to 400:
Query: 621 YVVDSYANVAASAISAKNMTRSLIGASVPLWITQLFHNLGFQYGGLLLALVSVVXXXXXX 442
           Y+++SY  +AASA++A    RS  GA  PL+   +F  +G  + GLLL L +
Sbjct: 508 YIIESYLLLAASAVAANTFMRSAFGACFPLFAGYMFRGMGIGWAGLLLGLFAAAMIPVPL 567

Query: 441 XXXYKGASVRKRSK 400
                G S+RK+SK
Sbjct: 568 LFLKYGESIRKKSK 581


So... I've dealt with it with the following bit of logic in
Bio::SearchIO::SearchResultEventBuilder,
at the top of the 'end_hsp' method:

   # make the order of query start/end agree with the sign of the frame
   if( defined $data->{'queryframe'} && # this is here to protect from undefs
       ( ( $data->{'queryframe'} < 0 &&
           $data->{'querystart'} < $data->{'queryend'} ) ||
         ( $data->{'queryframe'} > 0 &&
           $data->{'querystart'} > $data->{'queryend'} ) )
     )
   {
       # swap
       ($data->{'querystart'},
        $data->{'queryend'}) = ($data->{'queryend'},
                                $data->{'querystart'});
   }
   # same check for the subject coordinates
   if( defined $data->{'subjectframe'} && # this is here to protect from undefs
       ( ( $data->{'subjectframe'} < 0 &&
           $data->{'subjectstart'} < $data->{'subjectend'} ) ||
         ( $data->{'subjectframe'} > 0 &&
           $data->{'subjectstart'} > $data->{'subjectend'} ) )
     )
   {
       # swap
       ($data->{'subjectstart'},
        $data->{'subjectend'}) = ($data->{'subjectend'},
                                  $data->{'subjectstart'});
   }
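
Since the query and subject checks are identical apart from the key
prefix, they could be folded into one helper. This is only a rough
sketch: orient_by_frame is a made-up name, not something in the module,
and it assumes the same hash keys ('queryframe', 'querystart',
'queryend', and so on) as the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: swap <prefix>start and <prefix>end in place when
# their order disagrees with the sign of <prefix>frame.
sub orient_by_frame {
    my ($data, $prefix) = @_;
    my $frame = $data->{"${prefix}frame"};

    # nothing to do for an undefined or zero frame
    return unless defined $frame && $frame != 0;

    my ($start, $end) = @{$data}{ "${prefix}start", "${prefix}end" };
    if (   ( $frame < 0 && $start < $end )
        || ( $frame > 0 && $start > $end ) )
    {
        # swap
        @{$data}{ "${prefix}start", "${prefix}end" } = ($end, $start);
    }
    return;
}

# values taken from the XML fragment shown earlier
my %hsp = ( queryframe => -3, querystart => 400, queryend => 621 );
orient_by_frame( \%hsp, 'query' );
print "$hsp{querystart}..$hsp{queryend}\n";   # prints "621..400"
```

Calling it once with 'query' and once with 'subject' would then stand in
for both if-blocks.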


I'm going to commit it - but I wanted to throw it out there and explain
where this ugliness came from and see if anyone has issues with it.


-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu