[Biopython-dev] my latest trick

Fri Jan 18 07:36:31 EST 2002

The FormatIO system now supports BLAST and WU-BLAST, although it
isn't yet finished (missing a few details, like locations and the
sequence name).  Here's what it looks like

First, autodetection of the file format

>>> import Bio
>>> format = Bio.formats["search"].identify("blastp.2.wu")
>>> format.name
'wu-blastp'
>>>

It isn't figuring things out from the extension - it's testing
the contents of the file.  Here's proof.

>>> import os
>>> os.symlink("blastp.2.wu", "unknown.dat")
>>> Bio.formats["search"].identify("unknown.dat").name
'wu-blastp'
>>>

And I can load this into memory as a 'Search' object.

>>> from Bio import Search
>>> search = Search.io.read("unknown.dat").next()
      # The 'next' is because the FormatIO system assumes multiple
      # records.  I have an idea of how to fix that.
>>> search.query.description
'YEL060C vacuolar protease B'
>>> search.algorithm.name
'BLASTP'
>>> len(search.hits)
12
>>>

>>> for hit in search.hits:
...     print hit.name, hit.description
...     print len(hit.hsps), "hsps"
...     for hsp in hit.hsps:
...             print hsp.query.seq
...             print hsp.homology.seq
...             print hsp.subject.seq
...

  (Lots printed out)

Many of the fields are stored, like

>>> for k, v in search.statistics.items():
...     print k, "=", repr(v)
...
total_time = ' 0.95u 0.08s 1.03t  Elapsed: 00:00:01'
num_threads = 1
posted_date = ' 3:27 PM PST Mar 9, 1998'
start_time = 'Mon Mar  9 15:57:59 1998'
num_dfa_states = ' 569 (112 KB)'
total_dfa_size = ' 358 KB (384 KB)'
database = ' /tmpblast/PDBUNIQ'
release_date = ' unknown'
format = ' BLAST'
num_sequences_in_database = 2335
num_sequences_satisying_E = 12
end_time = 'Mon Mar  9 15:58:00 1998'
neighborhood_generation_time = ' 0.01u 0.00s 0.01t  Elapsed: 00:00:00'
num_letters_in_database = 479690
search_cpu_time = ' 0.90u 0.05s 0.95t  Elapsed: 00:00:01'
database_title = ' PDBUNIQ'
>>>

>>> hsp.query.positives
56
>>> hsp.query.frac_positives
0.45528455284552843
>>>

Normalization is still a problem, as you can see from the untrimmed
strings.  And I don't quite get everything .. it's pretty tedious.
(There's a few fields in the hit I also don't handle, like what do
I do with 'P(2)' compared to 'P'?  Someone with a better technical
understanding of the details of the algorithms needs to help me.
Perhaps in Tuscon.)

Biggest thing missing is failure cases.  The data files I found
were all for successful runs.

The format expressions get hairy.  The biggest problems occur when
formatY is almost like formatX except for a small change in the
bottom of the expression tree.  Then the whole tree needs to be
reconstructed, which is noisy.  I'm thinking about possibilities
like

  blastn = Martel.replace_group(blastp, "hsp_info", blastn_hsp_info)

All these tree editing methods are ad hoc.  I keep wondering what
it would be like to convert the tree to a DOM then use the DOM
methods to manipulate the structure.  That'll have to wait for
Martel version 2.

                    Andrew
                    dalke at dalkescientific.com