[Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Dec 28 20:25:25 UTC 2006


http://bugzilla.open-bio.org/show_bug.cgi?id=2051





------- Comment #10 from jmjoseph at andrew.cmu.edu  2006-12-28 15:25 -------
Thanks for your work.  I have had a chance to work with the CVS
NCBIXML parser.  I have not tested with output other than that from
blastall 2.2.13 and 2.2.15, so many of my suggestions may not be
acceptable for compatibility reasons.  Regardless, I do still see a
number of inconsistencies.  I did address some of these in my previous
patches, but it's not worth the effort of starting from them.  Here is
what I see now:

In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
defined as (None,None) tuples.  However, in NCBIXML.py, these
variables are set as integers.  I don't see the point of a tuple at
all, at least for NCBIXML.  (I realize it is used in
NCBIStandalone.py.)  Most importantly, the inconsistency makes it
difficult to handle cases where the parameter is not set.  It seems
easiest, though, to just retain the tuple format.
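
A minimal sketch of why the mismatch is awkward for calling code.  The
HSP class here is a hypothetical stand-in, not the real Record.py
class, and identities_count() is a made-up helper; the tuple is assumed
to hold (count, total):

```python
class HSP:
    """Hypothetical stand-in for the HSP class in Record.py."""

    def __init__(self):
        # Defaults as described above: (None, None) tuples.
        self.identities = (None, None)
        self.positives = (None, None)
        self.gaps = (None, None)


def identities_count(hsp):
    """Return the identity count, or None if it was never set.

    Has to cope with both conventions: a plain int (as the XML parser
    is described as setting it) or a tuple (the documented default).
    """
    value = hsp.identities
    if isinstance(value, tuple):
        return value[0]
    return value


unset = HSP()            # parser never touched this one
parsed = HSP()
parsed.identities = 42   # what assigning a bare integer leaves behind
```

Every consumer needs a type check like the one in identities_count()
until the two modules agree on one representation.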

In the past, I worried that the order of tuple building for
self._blast.gap_penalties or ka_params could cause the tuple to have
an incorrect ordering.  I seem to remember hitting an issue where the
tuple was built with the wrong length, but I can't be specific.  In
general, it remains odd to me to not just use a list and set each
element respectively.  If necessary, one could convert to a tuple when
finished or use some other approach that does not rely upon order.
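
As a rough sketch of the list-based approach, assuming two gap penalty
values (open and extend).  The class name, method names, and slot
positions below are hypothetical and not taken from NCBIXML.py:

```python
# Hypothetical slot positions; the real order in NCBIXML.py may differ.
GAP_OPEN, GAP_EXTEND = 0, 1


class GapPenaltyBuilder:
    """Collects gap penalties into fixed slots, in any arrival order."""

    def __init__(self):
        self._slots = [None, None]

    def set_open(self, text):
        self._slots[GAP_OPEN] = int(text)

    def set_extend(self, text):
        self._slots[GAP_EXTEND] = int(text)

    def as_tuple(self):
        # Convert only when finished, so the length is always 2.
        return tuple(self._slots)


builder = GapPenaltyBuilder()
builder.set_extend("1")   # arrives first
builder.set_open("11")    # arrives second
gap_penalties = builder.as_tuple()
```

Because each value is written to its own slot, the result has the same
order and length no matter which element the parser sees first, and a
missing element simply stays None.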

Why not use query_len, as defined in the XML file, or query_length
instead of query_letters as a variable name?  In
BlastParser._end_Iteration, self._blast.query_letters is set.  This is
not defined/documented in the Parameters class in Record.py.  Rather,
query_length is defined there.  In the Header class, though, the name
query_letters is used.  There also seems to be some confusion between
num_letters_in_database, num_sequences_in_database, database_letters,
and database_sequences.  Note that even if this naming is not
corrected, NCBIXML.py:186 is wrong: it has "self._blast_query_letters"
where "self._blast.query_letters" is intended.
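
To illustrate why the underscore-for-dot typo goes unnoticed, here is a
small sketch.  It assumes the line in question is an assignment; the
classes and method names are hypothetical stand-ins, not the NCBIXML.py
code:

```python
class BlastRecord:
    """Hypothetical stand-in for the record the parser fills in."""

    query_letters = None


class Parser:
    """Hypothetical stand-in for the XML parser class."""

    def __init__(self):
        self._blast = BlastRecord()

    def end_iteration_buggy(self, value):
        # Underscore instead of dot: silently creates a new attribute
        # on the parser and leaves the record untouched.
        self._blast_query_letters = value

    def end_iteration_fixed(self, value):
        self._blast.query_letters = value


buggy = Parser()
buggy.end_iteration_buggy(250)

fixed = Parser()
fixed.end_iteration_fixed(250)
```

Python raises no error in the buggy case, so the record's value just
stays unset.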

Similarly, why store the bit score and E-value as 'bits' and
'_hsp.expect'/'descr.e' rather than just using bit_score and
evalue, as in the blast XML output?

I make use of <Hsp_align-len> in 2.2.13.  This value is missing
entirely.
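
A standalone sketch of reading that element, using ElementTree rather
than the parser's own machinery.  The XML fragment and the value 120
are made up for illustration, and align_length() is a hypothetical
helper:

```python
import xml.etree.ElementTree as ET

# Made-up fragment; the value 120 is illustrative only.
HSP_XML = """<Hsp>
  <Hsp_num>1</Hsp_num>
  <Hsp_align-len>120</Hsp_align-len>
</Hsp>"""


def align_length(hsp_element):
    """Return <Hsp_align-len> as an int, or None if it is absent."""
    text = hsp_element.findtext("Hsp_align-len")
    return int(text) if text is not None else None


hsp = ET.fromstring(HSP_XML)
length = align_length(hsp)
```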

The parsing of <Hit_id> and <Hit_def> is confusing.  For example,
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
  ...
results in _hit.title set to "gnl|BL_ORD_ID|0 3377250".  I would
rather they remain separate (or both methods be used).  
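
A sketch of keeping both forms, again using ElementTree on its own
rather than the NCBIXML.py code.  The fragment reuses the values from
the example above; hit_fields() and its key names are hypothetical:

```python
import xml.etree.ElementTree as ET

HIT_XML = """<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
</Hit>"""


def hit_fields(hit_element):
    """Return the id and definition separately, plus the joined title."""
    hit_id = hit_element.findtext("Hit_id")
    hit_def = hit_element.findtext("Hit_def")
    return {
        "hit_id": hit_id,
        "hit_def": hit_def,
        # Joined form, matching the current _hit.title behaviour.
        "title": "%s %s" % (hit_id, hit_def),
    }


fields = hit_fields(ET.fromstring(HIT_XML))
```

Callers that only want the definition line can then read it directly
instead of splitting the title on whitespace.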

This is certainly not an exhaustive list.  I'm happy to provide
another patch correcting many of these inconsistencies.  At the
very least, the variable names defined in Record.py should be
used in NCBIXML.py.  May I modify at least the above names to
correspond more closely to the names used in the XML?  I know
I've found this particularly confusing.

-Jacob

