[Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython

Wibowo Arindrarto w.arindrarto at gmail.com
Mon Apr 30 10:08:52 UTC 2012


>> What I have in mind now is actually closer to iteration on the
>> query+subject level. To be clear first, the hierarchy of the objects
>> that I propose is this:
>>
>> * Search object, to represent the entire search session.
>> * Result object, to represent a search with one query against the
>> database. Depending on the number of queries, we could have one to
>> several Result objects contained in a Search.
>> * Hit object, to represent a sequence hit. Depending on the search, we
>> could also have multiple Hits in one Result object.
>> * and finally, HSP object, to represent individual alignments.
>>
>> Iteration is done on the Results level, so the information is parsed
>> on the search query level, not just for a single HSP (I wrote a very
>> short description of what I'm planning the objects to be here as
>> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
>> information parsing over performance and simplicity of the
>> format-specific parsers, this is the way to go. There are other
>> formats, too, that contain sequence-level search information not
>> present in the alignment (e.g. HMMER text output). What do you think
>> about this?
>
> That sounds good.
>
> If iteration is done on the Results level, when/how would your
> Search object be used?
>
> Peter

I'm thinking of using the Search object as the object returned by
SearchIO.parse or SearchIO.read. That way, we can store attributes
common to the different search queries in it. For example:

>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>> search.format
'blast-xml'
>>> search.algorithm
'blastx'
>>> search.version
'2.2.26+'
>>> search.database
'refseq_protein'
>>> search.results
<generator object results at ....>

And iteration over the results would be done like this (for example):

>>> for result in search.results:
...     print result.query, len(result)

Additionally, we can also define __iter__ and next for Search so we can
just do the following:

>>> for result in search:
...     print result.query, len(result)
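
To make the idea a bit more concrete, here is a rough sketch of how
Search could wrap the result generator. It is only an illustration,
not existing Biopython code: the Result stand-in, the constructor
arguments, and the fake_results helper are all made up, and it is
written so that it also runs on Python 3 (print as a function).

```python
class Result(object):
    """Hypothetical stand-in for one query's worth of hits."""

    def __init__(self, query, hits):
        self.query = query
        self.hits = list(hits)

    def __len__(self):
        # Number of hits found for this query.
        return len(self.hits)


class Search(object):
    """Hypothetical container for a whole search session."""

    def __init__(self, results, format=None, algorithm=None,
                 version=None, database=None):
        # Attributes shared by every query in the session.
        self.format = format
        self.algorithm = algorithm
        self.version = version
        self.database = database
        # Keep the results lazy: store an iterator, not a list.
        self.results = iter(results)

    def __iter__(self):
        # Delegate iteration to the underlying results iterator,
        # so "for result in search" works directly.
        return self.results


def fake_results():
    # Made-up data in place of a real format-specific parser.
    yield Result('query_1', ['hit_a', 'hit_b'])
    yield Result('query_2', ['hit_c'])


search = Search(fake_results(), format='blast-xml', algorithm='blastx',
                version='2.2.26+', database='refseq_protein')
summary = [(result.query, len(result)) for result in search]
print(summary)
```

Because __iter__ hands back the generator itself, a separate next
method is not strictly needed in this version. The trade-off is that
the results can only be walked once; a second loop over the same
Search object would find the generator already exhausted.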

What do you think?


Bow



