[Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython

Sun Apr 29 11:00:42 UTC 2012

Hi Bow,

Thanks for updating the list. I'm replying just on the dev list
as I'm focusing on implementation discussion in this reply.

On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> 1. My main biopython branch for development:
> https://github.com/bow/biopython/tree/searchio. Since I will be building on
> top of Peter's SearchIO branch (
> https://github.com/peterjc/biopython/tree/search-io-test), right now it
> only contains Peter's branch rebased against the latest master.

Just to be clear - you don't have to start from that branch ;)
http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

As I said before, that may not be the best approach. The idea
behind that code was to focus on the HSPs (in BLAST terms),
and for the low level parsers to iterate over each HSP. Higher
level wrappers can then batch these up by query/subject, or
into the larger grouping of all the results for one query -
which was the exposed high level Bio.SearchIO.parse
function.
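As a rough sketch of that layering (not the code on the branch:
the HSP tuple and both function names here are made up for
illustration), a flat stream of HSPs can be batched with
itertools.groupby, assuming the low level parser yields HSPs in
file order so that related HSPs are adjacent:

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical minimal HSP record; the field names are illustrative only.
HSP = namedtuple("HSP", ["query_id", "match_id", "evalue"])


def group_by_match(hsps):
    """Yield (query_id, match_id, [HSPs]) for adjacent HSPs sharing both IDs."""
    for (query_id, match_id), batch in groupby(
        hsps, key=lambda h: (h.query_id, h.match_id)
    ):
        yield query_id, match_id, list(batch)


def group_by_query(hsps):
    """Yield (query_id, [HSPs]) for adjacent HSPs sharing a query ID."""
    for query_id, batch in groupby(hsps, key=lambda h: h.query_id):
        yield query_id, list(batch)


# Stand-in for the output of a low level HSP iterator.
hsps = [
    HSP("q1", "s1", 1e-50),
    HSP("q1", "s1", 1e-10),
    HSP("q1", "s2", 1e-5),
    HSP("q2", "s1", 1e-20),
]

for query_id, match_id, batch in group_by_match(hsps):
    print(query_id, match_id, len(batch))
```

Note groupby only merges adjacent items, so this relies on the
file keeping each query's HSPs together.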

That branch introduced a SearchResult object which was
essentially something like a list or dict (like an OrderedDict
in some ways), with some (unnecessary?) error checking for
consistent contents (all from the same query). It also introduced
a TopMatches object which was essentially a list (again,
with some error checking for consistent contents).

The advantage of using simple objects (OrderedDict
and list) is simplicity and hopefully performance. But
specific classes have the advantage of allowing a more
user-friendly str/repr etc.
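To make the trade-off concrete, here is a minimal sketch of the
"specific class" option. It reuses the SearchResult name, but
the add_hsp method and the attribute names are assumptions for
illustration, not what the branch actually does. It wraps an
OrderedDict, checks that everything is from the same query, and
gives a readable str:

```python
from collections import OrderedDict


class SearchResult:
    """Illustrative only: ordered mapping of match ID -> HSPs for one query."""

    def __init__(self, query_id):
        self.query_id = query_id
        self._matches = OrderedDict()

    def add_hsp(self, query_id, match_id, hsp):
        # The consistency check a plain OrderedDict would not give us.
        if query_id != self.query_id:
            raise ValueError(
                "Expected query %r, got %r" % (self.query_id, query_id)
            )
        self._matches.setdefault(match_id, []).append(hsp)

    def __getitem__(self, match_id):
        return self._matches[match_id]

    def __iter__(self):
        return iter(self._matches)

    def __len__(self):
        return len(self._matches)

    def __str__(self):
        n_hsps = sum(len(hsps) for hsps in self._matches.values())
        return "SearchResult for %s: %i matches, %i HSPs" % (
            self.query_id, len(self), n_hsps)


result = SearchResult("q1")
result.add_hsp("q1", "s1", "hsp-a")
result.add_hsp("q1", "s2", "hsp-b")
result.add_hsp("q1", "s1", "hsp-c")
print(result)
```

The cost is an extra layer of method calls on every access,
which is where the performance question comes in.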

The idea on this branch of focusing on iteration over the
HSPs at the low level was that it allowed a lot of flexibility:
the low level parser could be used in conjunction with
indexing to seek to a particular HSP and parse it, or go to
the results for a particular query+match and parse its
HSPs (not implemented on my old branch, but that was
the plan).
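Since this was never implemented, the following is only a
sketch of what I mean, assuming BLAST tabular output with the
default twelve columns (query ID, subject ID, ..., e-value,
bit score). The function names and the dict returned are
invented for the example:

```python
import io


def index_hsp_offsets(handle):
    """Map (query_id, match_id) -> list of byte offsets, one per HSP line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue  # skip comment lines and blank lines
        fields = line.split(b"\t")
        key = (fields[0].decode(), fields[1].decode())
        index.setdefault(key, []).append(offset)
    return index


def parse_hsp_at(handle, offset):
    """Seek to one HSP line and parse just that line."""
    handle.seek(offset)
    fields = handle.readline().decode().rstrip("\n").split("\t")
    # Column 11 (index 10) is the e-value in the default tabular layout.
    return {"query_id": fields[0], "match_id": fields[1],
            "evalue": float(fields[10])}


# In-memory stand-in for a tabular output file opened in binary mode.
data = (
    b"q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
    b"q1\ts1\t90.0\t50\t5\t0\t210\t259\t300\t349\t1e-10\t80\n"
    b"q1\ts2\t85.0\t40\t6\t0\t1\t40\t1\t40\t1e-5\t50\n"
)
handle = io.BytesIO(data)
index = index_hsp_offsets(handle)

# Jump straight to the only HSP for query q1 against subject s2.
print(parse_hsp_at(handle, index[("q1", "s2")][0]))
```

Going to a query+match is then just a lookup of the key,
followed by parsing each offset in its list.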

However, while this makes perfect sense for, say, the BLAST
tabular output, it isn't such a good match for all the
possible file formats.

For instance, BLAST plain text/html includes an e-value for
a query/subject combination which is calculated from all the
HSPs for that query/subject (taking into account order etc -
I'd have to check the O'Reilly BLAST book for the details).
This isn't in the tabular output, but the point is that it isn't a
property of the individual HSPs, but of the match (group of
HSPs).

I think we need to consider the other main formats, and if
all their important information lies at the HSP level or not.
Perhaps iteration at the query+match level (groups of
HSPs) would be best overall?
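For comparison, a sketch of what query+match level iteration
might look like. The Match and HSP classes, their fields, and
the shape of the input records are all hypothetical. The point
is only that the match gets somewhere to hold its own values,
such as that combined e-value, separate from its HSPs:

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class HSP:
    query_start: int
    query_end: int
    bitscore: float


@dataclass
class Match:
    query_id: str
    match_id: str
    # Match-level value; None when the file format does not report one.
    evalue: Optional[float] = None
    hsps: List[HSP] = field(default_factory=list)


def iter_matches(records) -> Iterator[Match]:
    """Group (query_id, match_id, match_evalue, HSP) tuples into Match objects.

    The tuple layout is an assumed low level parser output, not a real API.
    """
    current = None
    for query_id, match_id, evalue, hsp in records:
        key = (query_id, match_id)
        if current is None or (current.query_id, current.match_id) != key:
            if current is not None:
                yield current
            current = Match(query_id, match_id, evalue)
        current.hsps.append(hsp)
    if current is not None:
        yield current


records = [
    ("q1", "s1", 1e-60, HSP(1, 200, 370.0)),
    ("q1", "s1", 1e-60, HSP(210, 259, 80.0)),
    ("q1", "s2", None, HSP(1, 40, 50.0)),
]
for match in iter_matches(records):
    print(match.query_id, match.match_id, match.evalue, len(match.hsps))
```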

Bow - If some of that doesn't make sense, I can try to clarify
by email on the list, and/or we can talk about it at our next
video chat. Also see if you can get the BLAST book from
your library - it will probably be quite useful in this project
even though it describes the 'legacy' BLAST suite:

"BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
Publisher: O'Reilly Media, Released: July 2003

Regards,

Peter