[Biojava-dev] SearchIO module for biojava

Paolo Pavan paolo.pavan at gmail.com
Sat Aug 22 17:26:55 UTC 2015


Dear Biojava developers,
Some time ago there was a discussion about including a SearchIO module in biojava, to define common data structures that model Hsps, Hits and Results independently of the underlying search program. Andreas also asked for: A) ease of maintenance, B) extensibility for BLAST variants, C) general applicability to any database search (the potential to hook up BLAST alternatives).

I have uploaded a preview of the additions to my biojava fork on GitHub (user paolopavan, SearchIO branch). Please have a look.

Also note that this is a required part of another module I have written that may be of interest to the community: a biojava-run module, to give it a name similar to something you have already heard of. It aims to be a generic module for running analyses performed by an external program. In my case I needed an NCBI BLAST search, so the API was written to declare a database of biojava Sequence objects, pass a collection of query sequences and retrieve the output as Result objects of the SearchIO module.
I know from previous attempts discussed on the mailing list that the project intended to reimplement the BLAST algorithm in pure Java, and I agree that it would be a great idea. But as far as I know that effort is behind schedule, so I solved the platform portability issue by including binaries for all the platforms (well, the major ones) and packaging them together in one jar file, relying upon this great Java facility.
Anyway, all this came later. 
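
To make the "binaries in one jar" idea concrete, here is a minimal sketch of how a bundled executable could be picked per platform and unpacked at run time. It is not the code of the run module: the class name BundledBinary, the /bin/<platform>/<tool> resource layout and the platform keys are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

class BundledBinary {

    // Maps the os.name / os.arch system properties to a directory name.
    // The keys ("linux-x86_64", ...) are a hypothetical naming scheme.
    static String platformKey(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String family;
        if (os.contains("mac")) {
            family = "macos";
        } else if (os.contains("win")) {
            family = "windows";
        } else {
            family = "linux";
        }
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (arch.equals("amd64")) {
            arch = "x86_64";
        }
        return family + "-" + arch;
    }

    // Copies the stream to a fresh temporary file and marks it executable.
    static Path extract(InputStream in, String fileName) throws IOException {
        Path dir = Files.createTempDirectory("bundled-bin");
        Path target = dir.resolve(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        target.toFile().setExecutable(true);
        target.toFile().deleteOnExit();
        return target;
    }

    // Looks the tool up on the classpath (i.e. inside the jar) and unpacks it.
    static Path locate(String tool) throws IOException {
        String resource = "/bin/"
                + platformKey(System.getProperty("os.name"), System.getProperty("os.arch"))
                + "/" + tool;
        try (InputStream in = BundledBinary.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No bundled binary at " + resource);
            }
            return extract(in, tool);
        }
    }
}
```

The path returned by locate could then be handed to a ProcessBuilder; how the run module actually starts the external program is not shown here.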

Just a few technical comments on the SearchIO module:
- it is included in the core module, since it defines a new base data structure
- it includes a dependency on biojava-alignment. This is not compulsory; it is there because the alignment data structure lives in that package. In my opinion, moving this important data structure into core would solve this and avoid similar problems in the future. This is also the reason why I chose to add the newly implemented Hit/Hsp etc. classes directly in core; after all, search is one of the most important tasks in bioinformatics.
- the BlastXML parser is implemented in the BlastXMLQuery class. The name may not be very meaningful: it comes from the original class, which is still in biojava although it does not seem to be used much, and which I initially set out to improve while staying close to the original project. This also explains the use of the XMLHelper class and some deprecated tags I added. From the old thread I understood that biojava had no "elective choice" for XML parsing, but the job was already done with the XMLHelper module, and so this class came to new life.
- it was designed to be easy to extend: to add support for a new file format, a developer just has to write a single class that implements the ResultFactory interface (I have also implemented a BLAST tabular parser to show this). The API for the biojava user does not change; it is just:
        SearchIO reader = new SearchIO(new File("BlastReport.blastxml"), blastResultFactory);

- file formats can be recognised automatically from the standard file extension. Just use a different constructor:
        SearchIO reader = new SearchIO(new File("BlastReport.blastxml"));

- results are easily accessed through nested iterators, following the idea that a report contains one or more results, every result contains one or more hits, and every hit contains one or more hsps:

        for (Result result: reader){
            System.out.println(result.getQueryDef()+"("+ result.getQueryID()+")");
            for (Hit hit: result){
                System.out.print(hit.getHitDef());
                System.out.print("(");
                for (Hsp hsp: hit){
                    System.out.print(hsp.getHspEvalue()+",");
                }
                System.out.println(")");
            }
            
        }
- the use of common data structures allows the definition of common operations. For example, I defined a method that retrieves the hsp alignment as a biojava alignment.
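
For readers who want to see how the pieces listed above fit together without checking out the branch, here is a self-contained sketch of the same design: one parser class per format, lookup by file extension, and nested iteration over results, hits and hsps. It does not use the real classes. ResultParser, TabularParser, parserFor and the "blasttabular" extension are hypothetical stand-ins, and the actual ResultFactory interface in the fork may declare different methods. The 12-column layout assumed by the parser is the default NCBI BLAST+ tabular output (-outfmt 6), where the e-value is the 11th column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SearchIOSketch {

    // Simplified stand-ins for the Hsp / Hit / Result data model.
    static class Hsp {
        final double evalue;
        Hsp(double evalue) { this.evalue = evalue; }
    }

    static class Hit implements Iterable<Hsp> {
        final String id;
        final List<Hsp> hsps = new ArrayList<>();
        Hit(String id) { this.id = id; }
        @Override public Iterator<Hsp> iterator() { return hsps.iterator(); }
    }

    static class Result implements Iterable<Hit> {
        final String queryId;
        final List<Hit> hits = new ArrayList<>();
        Result(String queryId) { this.queryId = queryId; }
        @Override public Iterator<Hit> iterator() { return hits.iterator(); }
    }

    // Hypothetical extension point: one implementation per file format.
    interface ResultParser {
        List<String> extensions();
        List<Result> parse(List<String> lines);
    }

    // Parses default BLAST+ tabular lines (qseqid, sseqid, ..., evalue, bitscore).
    static class TabularParser implements ResultParser {
        @Override public List<String> extensions() {
            return Arrays.asList("blasttabular"); // assumed extension
        }

        @Override public List<Result> parse(List<String> lines) {
            Map<String, Result> results = new LinkedHashMap<>();
            Map<String, Hit> hits = new LinkedHashMap<>();
            for (String line : lines) {
                if (line.isEmpty() || line.startsWith("#")) continue;
                String[] f = line.split("\t");
                String queryId = f[0];
                String hitId = f[1];
                Result result = results.computeIfAbsent(queryId, Result::new);
                // One Hit per (query, subject) pair; every line adds one Hsp to it.
                Hit hit = hits.computeIfAbsent(queryId + "\t" + hitId, k -> {
                    Hit h = new Hit(hitId);
                    result.hits.add(h);
                    return h;
                });
                hit.hsps.add(new Hsp(Double.parseDouble(f[10])));
            }
            return new ArrayList<>(results.values());
        }
    }

    // Chooses a parser from the file extension, as the second constructor does.
    static ResultParser parserFor(String fileName, List<ResultParser> known) {
        String ext = fileName.substring(fileName.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        for (ResultParser p : known) {
            if (p.extensions().contains(ext)) return p;
        }
        throw new IllegalArgumentException("No parser registered for ." + ext);
    }

    public static void main(String[] args) {
        List<String> report = Arrays.asList(
                "q1\ts1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180",
                "q1\ts1\t90.0\t50\t5\t0\t120\t170\t120\t170\t1e-10\t60",
                "q2\ts7\t85.0\t40\t6\t0\t1\t40\t1\t40\t0.001\t40");
        ResultParser parser = parserFor("BlastReport.blasttabular",
                Arrays.asList(new TabularParser()));
        for (Result result : parser.parse(report)) {
            System.out.println(result.queryId);
            for (Hit hit : result) {
                System.out.print("  " + hit.id + " (");
                for (Hsp hsp : hit) {
                    System.out.print(hsp.evalue + ",");
                }
                System.out.println(")");
            }
        }
    }
}
```

Supporting another format in this sketch means adding one more ResultParser implementation to the list passed to parserFor; the iteration code in main stays the same, which is the property the module aims for.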


If you agree that this feature would be interesting for the project, I can send a pull request for the SearchIO part and then also push the run module to my GitHub.
Just have a look at it!

Greetings!
Paolo
