[Biopython-dev] GSoC SearchIO project

Peter Cock p.j.a.cock at googlemail.com
Wed Mar 21 15:27:31 UTC 2012


Hello all,

I'm pleased to see that the GSoC SearchIO project idea I put up
has sparked some interest:

http://biopython.org/wiki/Google_Summer_of_Code

So far three students have inquired about it - and I will try to help
them all with feedback on their GSoC proposals if they decide
to apply. However, the review process is competitive (including
the other OBF projects like BioPerl etc), and we would not want
to select multiple students to work on the very same area, i.e. at
most one GSoC student would be funded to work on a Biopython
SearchIO module.

Remember that the outline ideas we put up on the wiki are just
suggestions - GSoC applicants can and are encouraged to
come up with their own ideas. If you can think of something else
please get in touch - on this mailing list if you like, or directly
with an existing Biopython developer who you think might be
able to mentor you. The best projects will be those linked to the
kind of analysis you are already doing - e.g. in your degree
project.

Now for some more detail about what I had in mind for SearchIO.
I may be writing too much at this point, so I should stress that
you can suggest different approaches/ideas.

-------------------------------------------------------------

Grouping of SearchIO results: Terminology differs between
tools (which is going to complicate documenting this new code),
so here I will try to use the terms from BLAST.

BLAST output files can be very large - it is not uncommon to
search 20,000 predicted genes against the NCBI NR
(non-redundant) database to perform some crude annotation
transfer. This means it is naive, and in general impractical,
to try to load an entire file into memory. Instead, following the
API of Bio.SeqIO, Bio.AlignIO, Bio.Phylo, etc we expect to
use an iterator approach.

In normal BLAST you compare one or more query sequences
(usually in a FASTA file) against one or more subject sequences
(in a BLAST database or a FASTA file). Depending on the score
thresholds, the number of matches will vary. Each query sequence
may match one or more subject sequences, and each query-subject
pair may give more than one pairwise alignment (HSP).

For example, if your query sequence has a single copy of
domain X (at coordinates 10-100), but there is a similar subject
sequence with two copies of domain X (at coordinates 5-95 and
106-194), you might expect to get two pairwise alignments (HSPs),
one matching query:10-100 with subject:5-95, and the other
matching query:10-100 with subject:106-194.

This means that there are potentially multiple levels of structure
which we might want to iterate over. Simplistically I think that
Bio.SearchIO.parse(...) should iterate over the results for each
query. This means if your query file had 20,000 sequences,
using a for loop with Bio.SearchIO.parse(...) would give you
20,000 results.

[Well, up to 20,000 results. In some file formats like BLAST
tabular, when a query has no results, there is nothing in the
output file for it].
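As a rough sketch of how I picture that being used (the module
name Bio.SearchIO, the "blast-xml" format string, the filename and
the attribute names below are all just placeholders for discussion,
not a settled API):

    from Bio import SearchIO  # proposed module, not yet in Biopython

    # One result object per query sequence, parsed lazily so the
    # whole file never has to be held in memory at once:
    for result in SearchIO.parse("all_queries_vs_nr.xml", "blast-xml"):
        # query_id and len(result) (the number of matched subject
        # sequences) are illustrative names only.
        print("%s has %i matches" % (result.query_id, len(result)))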

Also as in Bio.SeqIO etc, a sister convenience function
Bio.SearchIO.read(...) would be for when the results are
for one and only one query sequence.
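For instance (again with placeholder names):

    from Bio import SearchIO  # proposed module

    # For output known to contain results for exactly one query:
    result = SearchIO.read("single_query_vs_nr.xml", "blast-xml")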

As in Bio.SeqIO.parse(...), the Bio.SearchIO.parse(...) code
should make a single pass through the file WITHOUT using any
handle tell/seek commands. That way it can be used with any
handle object, including stdin (output piped directly into Python)
and network handles.
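For example, something like this ought to work, assuming a local
copy of the database (the BLAST+ command line and -outfmt 5 for
XML output are real; the SearchIO calls remain hypothetical):

    import subprocess
    from Bio import SearchIO  # proposed module

    # Pipe BLAST+ XML output straight into the parser, which only
    # ever reads forwards through the handle (no seek/tell needed):
    child = subprocess.Popen(
        ["blastp", "-query", "queries.fasta", "-db", "nr",
         "-outfmt", "5"],
        stdout=subprocess.PIPE,
        universal_newlines=True)
    for result in SearchIO.parse(child.stdout, "blast-xml"):
        print(result.query_id)  # placeholder attribute name
    child.stdout.close()
    child.wait()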

Each of these query sequence results would give you zero or
more match (subject) sequences, and for each matched sequence
there would be one or more pairwise alignments (possibly with
the aligned sequence information, as in BLAST XML; possibly
without it, as in standard BLAST tabular output).
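In code, I imagine that looking something like this (attribute
names again being placeholders):

    from Bio import SearchIO  # proposed module

    for result in SearchIO.parse("all_queries_vs_nr.xml", "blast-xml"):
        for match in result:   # zero or more matched subject sequences
            for hsp in match:  # one or more pairwise alignments (HSPs)
                print("%s vs %s, e-value %g"
                      % (result.query_id, match.subject_id, hsp.evalue))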

-------------------------------------------------------------

With regards to the object hierarchy: For Bio.SeqIO everything
uses SeqRecord objects, which are designed to be extendable
via the annotations dictionary etc. For Bio.AlignIO everything
uses the same MultipleSeqAlignment object (which uses
SeqRecord objects inside it). For Bio.Phylo, there is a common
base class for the trees, but there are also more specialised
subclasses for the more detailed tree file formats. The aim is
to have a common representation regardless of the file format,
making working with, and converting between, different file
formats as easy as possible.

With Bio.SearchIO, some file formats include pairwise alignments
(e.g. FASTA -m10, BLAST XML, EMBOSS needle/water) while
others do not and only give you match positions as co-ordinates
(e.g. BLAST's standard 12 column tabular output). What I was
picturing was a base HSP class (using BLAST's terminology)
describing a match with coordinates, and a subclass which also
holds a pairwise alignment.
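Here is a deliberately minimal sketch of what I mean - the class
and attribute names are just for discussion, not a proposed final
API:

    class HSP(object):
        """A match described only by coordinates and scores.

        This is roughly what one row of BLAST's standard 12 column
        tabular output can tell you.
        """
        def __init__(self, query_id, subject_id,
                     query_start, query_end,
                     subject_start, subject_end,
                     evalue=None, bit_score=None):
            self.query_id = query_id
            self.subject_id = subject_id
            self.query_start = query_start
            self.query_end = query_end
            self.subject_start = subject_start
            self.subject_end = subject_end
            self.evalue = evalue
            self.bit_score = bit_score


    class HSPWithAlignment(HSP):
        """An HSP which also holds the pairwise alignment itself,
        as available in richer formats like BLAST XML or FASTA -m10.
        """
        def __init__(self, query_id, subject_id,
                     query_start, query_end,
                     subject_start, subject_end,
                     alignment, **kwargs):
            HSP.__init__(self, query_id, subject_id,
                         query_start, query_end,
                         subject_start, subject_end, **kwargs)
            # e.g. a two-row alignment of the query and subject
            self.alignment = alignment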

The idea would be that we could, in theory, inter-convert between two
rich file formats like FASTA -m10 and BLAST XML, or from a
rich format to a simple format (e.g. BLAST XML to BLAST
tabular). However, you can't convert the standard BLAST
tabular output to BLAST XML because so much data is missing.
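I would hope we could offer something modelled on
Bio.SeqIO.convert(...), perhaps along these hypothetical lines:

    from Bio import SearchIO  # proposed module, names are placeholders

    # A rich format down to a simpler one should be fine:
    SearchIO.convert("results.xml", "blast-xml",
                     "results.tabular", "blast-tab")

    # ...but going the other way would have to fail (or fill in
    # placeholder values), since the tabular file lacks most of
    # the information the XML output records.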

As part of the GSoC work, I would expect you to write unit
tests covering this kind of interconversion. It is not as easy
as it might first seem.

In fact, even converting from BLAST XML to the standard 12
column BLAST tabular output is surprisingly difficult. I wrote
a Python script to do this as a Galaxy Tool, available on the
Galaxy Tool Shed and on my Galaxy Bitbucket repository
under the tools branch; have a look at the comments about
sequence complexity masking and its impact:

https://bitbucket.org/peterjc/galaxy-central/src/default/tools/ncbi_blast_plus/blastxml_to_tabular.py
https://bitbucket.org/peterjc/galaxy-central/src/default/tools/ncbi_blast_plus/blastxml_to_tabular.xml
http://toolshed.g2.bx.psu.edu/

Note that Galaxy has a separate BLAST XML to tabular tool
which produces a different set of columns including some
useful ones that the BLAST+ command line tools don't offer:

https://bitbucket.org/galaxy/galaxy-central/src/default/tools/metag_tools/megablast_xml_parser.py
https://bitbucket.org/galaxy/galaxy-central/src/default/tools/metag_tools/megablast_xml_parser.xml

-------------------------------------------------------------

There is some existing code from when I was exploring
this idea last year on this branch, looking at parsing
FASTA -m10, BLAST XML, BLAST plain text (all using the existing
code in Biopython) and BLAST tabular. This may or may
not be helpful. Looking at it now there are things I would
do differently if I started again today.

https://github.com/peterjc/biopython/tree/search-io-test

-------------------------------------------------------------

Indexing: I'm assuming the reader is familiar with the existing
Bio.SeqIO.index(...) and Bio.SeqIO.index_db(...) functionality.
The idea there is you have a Python dictionary-like object
where the keys are record identifiers, and the values are
SeqRecord objects which are parsed on demand. This works
by recording the file offset where the record begins.
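As a quick reminder of the existing usage (the filename and record
identifier here are made up, but Bio.SeqIO.index itself is real):

    from Bio import SeqIO

    # Dictionary-like access to a large sequence file; each record
    # is only parsed when looked up, using its recorded file offset.
    records = SeqIO.index("predicted_genes.fasta", "fasta")
    record = records["gene0001"]  # a SeqRecord, parsed on demand
    records.close()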

At a minimum I would hope to be able to do the same for
SearchIO, where the keys would be query sequence identifiers,
and the values would be their search results. That should be
quite straightforward, but will still be quite a bit of work to
cover all the supported file formats.
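i.e. Something along these lines (placeholder names again):

    from Bio import SearchIO  # proposed module

    # Same idea as Bio.SeqIO.index(), keyed on query identifiers:
    results = SearchIO.index("all_queries_vs_nr.xml", "blast-xml")
    result = results["query0001"]  # parsed on demand from its offset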

What is more interesting (and this starts to link with the
object hierarchy) is double indexing on query name and
subject name, to get the list of HSPs for the combination -
without having to parse any other results for that query.
Perhaps this could be done with a dictionary interface
where the keys can also be tuples of (query, subject)
identifiers - but that isn't the only option.
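Purely as a speculative illustration:

    from Bio import SearchIO  # proposed module, speculative interface

    results = SearchIO.index("all_queries_vs_nr.xml", "blast-xml")

    # Single key: everything for one query sequence.
    query_result = results["query0001"]

    # Tuple key: just the HSPs for one query/subject combination,
    # without parsing the rest of that query's matches.
    hsps = results["query0001", "subject0042"]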

-------------------------------------------------------------

I hope I haven't scared you all!

Peter


