[Biojava-dev] First draft of a remote blast service class

Scooter Willis HWillis at scripps.edu
Thu Jun 11 15:58:22 UTC 2009


Sylvain

My first reaction was that I expected BLAST code but came across
RemotePairwiseAlignementService, which made me pause because the name
suggests sequence alignment code. RemoteBLASTService would be a better
name, specific to running a remote BLAST.

I agree that everything should be an enum if possible but encapsulated
in a single search/parameter class.

The enums should not have any URL-specific association with the remote
service. They should be abstracted to something that makes sense to a
developer who wants to use a service they know nothing about and don't
want to take the time to read up on. The query parameters should be
defined as a Java class that can be passed to different service
providers; internally, each provider maps the values to the specific
requirements of its service. A quick look at the NCBI BLASTN form shows
human-readable labels that, when the query is submitted, map to the
shorthand values the programmer chose.

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_PROGRAMS=megaBlast&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&BLAST_SPEC=&LINK_LOC=blasttab&LAST_PAGE=tblastx
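
As a rough illustration of keeping the enums free of service vocabulary,
here is a minimal sketch. BlastProgram, OutputFormat and QBlastMapper are
hypothetical names; only the PROGRAM and FORMAT_TYPE keys and the
Text/XML/HTML values are taken from Sylvain's code and the URL above.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical provider-neutral enums: nothing here knows about URLs or CGI keys.
enum BlastProgram { BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX }

enum OutputFormat { TEXT, XML, HTML }

// Hypothetical mapper owned by the QBlast implementation; another provider
// would ship its own mapper for the same enums.
final class QBlastMapper {
    private static final Map<OutputFormat, String> FORMATS =
            new EnumMap<OutputFormat, String>(OutputFormat.class);
    static {
        // Values copied from the FORMAT_TYPE strings in RemoteQBlastService.
        FORMATS.put(OutputFormat.TEXT, "Text");
        FORMATS.put(OutputFormat.XML, "XML");
        FORMATS.put(OutputFormat.HTML, "HTML");
    }

    private QBlastMapper() { }

    static String program(BlastProgram p) {
        // Assumption: the service takes the lower-case name, as in PROGRAM=blastn.
        return "PROGRAM=" + p.name().toLowerCase(Locale.ROOT);
    }

    static String format(OutputFormat f) {
        return "FORMAT_TYPE=" + FORMATS.get(f);
    }
}

class EnumMappingDemo {
    public static void main(String[] args) {
        System.out.println(QBlastMapper.program(BlastProgram.TBLASTX));
        System.out.println(QBlastMapper.format(OutputFormat.XML));
    }
}
```

The calling code only ever sees BlastProgram.TBLASTX or OutputFormat.XML;
the strings a given service expects stay inside that service's mapper.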

If you click on the blastn, blastp, blastx, tblastn and tblastx tabs on the above
link you will see that the forms are very similar but do have
variations. I would use each input form as the model for the class to do
the appropriate search. What is common to the 5 tabs would be in the
base abstract search class and any input requirements that are different
would go in an extended class. This gives you a generic class for
modeling the search parameters that is easily understood. The hard part
is then mapping the easy to understand version to the specific search
query parameters of a particular service. Either way you should be able
to pass the search class to different providers without knowing anything
about that specific service.
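
A minimal sketch of that layering, under stated assumptions: every class
name below is hypothetical, the fields are only a small assumed subset of
what the forms offer, and the scoring matrix default is my own placeholder
rather than something taken from this thread. The numeric defaults come
from the option list in Sylvain's Javadoc.

```java
// Shared settings go in the abstract base class.
abstract class BlastSearch {
    private final String querySequence;
    private final String database;
    // Default taken from the -e entry in the setAdvancedOptions Javadoc.
    private double expectThreshold = 10.0;

    protected BlastSearch(String querySequence, String database) {
        if (querySequence == null || database == null) {
            throw new IllegalArgumentException("query and database are required");
        }
        this.querySequence = querySequence;
        this.database = database;
    }

    String getQuerySequence() { return querySequence; }
    String getDatabase() { return database; }
    double getExpectThreshold() { return expectThreshold; }
    void setExpectThreshold(double e) { this.expectThreshold = e; }

    /** Provider-neutral program name, e.g. "blastn". */
    abstract String programName();
}

// Nucleotide-only settings (defaults from the -r and -q entries).
final class BlastnSearch extends BlastSearch {
    private int matchReward = 1;
    private int mismatchPenalty = -3;

    BlastnSearch(String query, String database) { super(query, database); }

    int getMatchReward() { return matchReward; }
    int getMismatchPenalty() { return mismatchPenalty; }
    @Override String programName() { return "blastn"; }
}

// Protein-only setting; the matrix name is an assumed placeholder.
final class BlastpSearch extends BlastSearch {
    private String scoringMatrix = "BLOSUM62";

    BlastpSearch(String query, String database) { super(query, database); }

    String getScoringMatrix() { return scoringMatrix; }
    @Override String programName() { return "blastp"; }
}

// A provider accepts any BlastSearch and does its own mapping internally.
interface BlastProvider {
    String submit(BlastSearch search);
}

// Stand-in provider that only reports what it would send; no network access.
final class EchoProvider implements BlastProvider {
    public String submit(BlastSearch search) {
        return search.programName() + " against " + search.getDatabase()
                + " (expect " + search.getExpectThreshold() + ")";
    }
}

class SearchHierarchyDemo {
    public static void main(String[] args) {
        BlastProvider provider = new EchoProvider();
        System.out.println(provider.submit(new BlastnSearch("ACGT", "nt")));
        System.out.println(provider.submit(new BlastpSearch("MKV", "nr")));
    }
}
```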

It would also be nice to have a listener interface so the class that is
responsible for doing the query also checks if the results are available
based on some poll value. The external calling code shouldn't need to
worry about bookkeeping of unique identifiers for a particular service
provider. The implementation class should hide all those details.
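
One possible shape for such a listener, as a sketch only: the interface
and class names are hypothetical, and JobStatusSource stands in for
whatever call a real provider uses to ask "is this job finished?". The
job identifier never leaves the implementation except as an opaque string
handed back to the listener.

```java
// Hypothetical callback the calling code implements.
interface BlastResultListener {
    void resultsReady(String jobId);
    void searchFailed(String jobId, Exception cause);
}

// Hypothetical hook for the provider-specific status check.
interface JobStatusSource {
    boolean isReady(String jobId) throws Exception;
}

// Polls at a fixed interval and notifies the listener exactly once.
final class PollingBlastJob {
    private final JobStatusSource source;
    private final long pollMillis;
    private final int maxPolls;

    PollingBlastJob(JobStatusSource source, long pollMillis, int maxPolls) {
        this.source = source;
        this.pollMillis = pollMillis;
        this.maxPolls = maxPolls;
    }

    void await(String jobId, BlastResultListener listener) {
        try {
            for (int i = 0; i < maxPolls; i++) {
                if (source.isReady(jobId)) {
                    listener.resultsReady(jobId);
                    return;
                }
                // Sleep between checks instead of spinning on the clock.
                Thread.sleep(pollMillis);
            }
            listener.searchFailed(jobId,
                    new IllegalStateException("gave up after " + maxPolls + " polls"));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            listener.searchFailed(jobId, e);
        } catch (Exception e) {
            listener.searchFailed(jobId, e);
        }
    }
}

class PollingDemo {
    public static void main(String[] args) {
        final int[] calls = {0};
        // Fake status source that reports "ready" on the third check.
        JobStatusSource fake = new JobStatusSource() {
            public boolean isReady(String jobId) { return ++calls[0] >= 3; }
        };
        new PollingBlastJob(fake, 10L, 5).await("demo-job", new BlastResultListener() {
            public void resultsReady(String jobId) {
                System.out.println(jobId + " is ready");
            }
            public void searchFailed(String jobId, Exception cause) {
                System.out.println(jobId + " failed: " + cause.getMessage());
            }
        });
    }
}
```

A real implementation would probably run await on a background thread so
the caller is not blocked; that is left out here to keep the sketch short.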

You also have the results returning in text, XML or HTML. It would be
nice if the results could be returned as a collection of
SeqSimilaritySearchResult and a collection of SeqSimilaritySearchHit, as
described at http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser
This may require you to parse the text/HTML/XML in your implementation
class, which lets you tweak or adjust for anything specific to the
service provider. Other BLAST web service (WSDL) providers will return a
collection of Java classes specific to that implementation, which then
need to be mapped to SeqSimilaritySearchResult and
SeqSimilaritySearchHit. The benefit is that the API hides all the ugly
details from the developer who is using the BLAST service.
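
To show the mapping step in isolation, here is a small sketch. SimpleHit
and HitMapper are hypothetical stand-ins, not the BioJava
SeqSimilaritySearchHit API, and the three-column tab-separated input
(subject id, e-value, bit score) is a format invented for this example;
a real text, XML or HTML report needs a proper parser.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical provider-neutral hit; stands in for a richer result class.
final class SimpleHit {
    final String subjectId;
    final double eValue;
    final double bitScore;

    SimpleHit(String subjectId, double eValue, double bitScore) {
        this.subjectId = subjectId;
        this.eValue = eValue;
        this.bitScore = bitScore;
    }
}

// Each provider implementation would supply its own mapper.
interface HitMapper {
    List<SimpleHit> map(String rawReport);
}

// Assumed input: one hit per line as "subjectId<TAB>eValue<TAB>bitScore".
final class TabularHitMapper implements HitMapper {
    public List<SimpleHit> map(String rawReport) {
        List<SimpleHit> hits = new ArrayList<SimpleHit>();
        for (String line : rawReport.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] cols = trimmed.split("\t");
            if (cols.length < 3) {
                continue; // not a hit line in the assumed format
            }
            hits.add(new SimpleHit(cols[0],
                    Double.parseDouble(cols[1]),
                    Double.parseDouble(cols[2])));
        }
        return Collections.unmodifiableList(hits);
    }
}

class HitMappingDemo {
    public static void main(String[] args) {
        String raw = "# subject\tevalue\tbits\nsp|P1\t1e-30\t120.5\nsp|P2\t0.002\t35\n";
        for (SimpleHit hit : new TabularHitMapper().map(raw)) {
            System.out.println(hit.subjectId + " e=" + hit.eValue + " bits=" + hit.bitScore);
        }
    }
}
```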


NCBIBlast has a formal WSDL interface which may make the process easier
for you: http://bioinfo.unice.fr/web_services/Using_NCBI-Blast.html
If you click on this link
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl you will
see all the web services magic that you hand off to your favorite IDE,
which writes the code for you. I did a quick test in NetBeans: they are
using JAX-RPC for the web service calls, and I don't see a nice set of
Java classes for structured results, which means parsing a string. It
also appears they provide a similar interface for WU-Blast:
http://bioinfo.unice.fr/web_services/Using_WU-Blast.html#General_Information
and http://www.ebi.ac.uk/Tools/webservices/wsdl/WSWUBlast.wsdl
The advantage of using the web service interface is that it should be
stable, whereas you can't control changes they make to the CGI form
submission, which would break the BioJava code.

Scooter



-----Original Message-----
From: biojava-dev-bounces at lists.open-bio.org
[mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Sylvain
Foisy
Sent: Thursday, June 11, 2009 9:52 AM
To: biojava-dev at lists.open-bio.org
Subject: [Biojava-dev] First draft of a remote blast service class

Hi to all,

I've been working on this for the past week or so and, after discussing
this with Andreas, I am putting my code here for critical review. I'll
put this stuff in biojava-live as soon as Andreas can fix my SVN access.

First, an interface called RemotePairwiseAlignementService defines the
basic components of a remote service: sequence/database/program/run
options/output options. RemoteQBlastService implements this interface
and runs remote QBlast requests and creates output in either text, XML
or HTML. At the present time, regular blastall programs work; there is
no blastpgp/megablast support yet.

I'll need some guidance to make it work on other types of web services
like EBI.

Best regards

Sylvain

===================================================================

 Sylvain Foisy, Ph. D.
 Consultant Bio-informatique / Bioinformatics
 Diploide.net - TI pour la vie / IT for Life

 Courriel: sylvain.foisy at diploide.net
 Web: http://www.diploide.net
 Tel: (514) 893-4363
===================================================================

import java.io.InputStream;

import org.biojava.bio.BioException;
/**
 * This interface specifies minimal information needed to execute a
 * pairwise alignment on a remote service.
 * 
 * Example of service: QBlast service at NCBI
 *                     Web Service at EBI
 * 
 * @author Sylvain Foisy
 * @since 1.8
 *
 */
public interface RemotePairwiseAlignementService {

    /**
     * This field specifies that the output format of results
     * is text.
     * 
     */
    public static final int TEXT = 0;

    /**
     * This field specifies that the output format of results
     * is XML.
     * 
     */
    public static final int XML = 1;

    /**
     * This field specifies that the output format of results
     * is HTML.
     * 
     */
    public static final int HTML = 2;

    /**
     * Setting the database to use for doing the pairwise alignment
     *  
     * @param db a <code>String</code> with a valid database ID for the
     *           service used.
     *  
     */
    public void setDatabase(String db);

    /**
     * Setting the sequence to be aligned for this request
     *  
     * @param seq: a <code>String</code> with a sequence to be aligned.
     *  
     */
    public void setSequence(String seq);

    /**
     * Setting the program to use for this pairwise alignment
     *  
     * @param prog a <code>String</code> with a valid program name for
     *             the service used.
     *  
     */
    public void setProgram(String prog);

    /**
     * Setting all other options to use for this pairwise alignment
     *  
     * @param str a <code>String</code> with the additional options, in
     *            the form expected by the service used.
     *  
     */    
    public void setAdvancedOptions(String str);
    
    /**
     * Doing the actual analysis on the instantiated service
     * 
     * @throws BioException
     */
    public void executeSearch() throws BioException;
    
    /**
     * Getting the actual alignment results from this instantiated
     * service
     * 
     * @return an <code>InputStream</code> with the actual alignment
     *         results
     * @throws BioException
     */
    public InputStream getAlignmentResults() throws BioException;
}

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

import org.biojava.bio.BioException;

/**
 * RemoteQBlastService - A simple way of submitting BLAST requests to the
 * QBlast service at NCBI.
 * 
 * <p>
 * NCBI provides a Blast server through a CGI-BIN interface.
 * RemoteQBlastService simply encapsulates access to it by giving users
 * get/set methods to fix the sequence, program and database as well as
 * advanced options.
 * </p>
 * 
 * <p>
 * As of version 1.0, only blastall programs are usable. blastpgp and
 * megablast are high priorities.
 * </p>
 * 
 * @author Sylvain Foisy
 * @version 1.0
 * @since 1.8
 * 
 * 
 */
public class RemoteQBlastService implements RemotePairwiseAlignementService {

//    public static final int TEXT = 0;
//    public static final int XML = 1;
//    public static final int HTML = 2;

    private static final String baseurl =
            "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi";
    private URL aUrl;
    private URLConnection uConn;
    private OutputStreamWriter fromQBlast;
    private BufferedReader rd;

    private String seq = null;
    private String prog = null;
    private String db = null;
    private String outputFormat = null;
    private String advanced = null;

    private String rid;
    private long step;
    private boolean done = false;
    private long start;

    public RemoteQBlastService() throws BioException {
        try {
            aUrl = new URL(baseurl);
            uConn = setQBlastProperties(aUrl.openConnection());

            outputFormat = "Text";
        }
        /*
         * Needed but should never be thrown since the URL is static and
         * known to exist
         */
        catch (MalformedURLException e) {
            throw new BioException(
                    "It looks like the URL for NCBI QBlast service is bad");
        }
        /*
         * Intercept if the program can't connect to QBlast service
         */
        catch (IOException e) {
            throw new BioException(
                    "Impossible to connect to QBlast service at this time. "
                            + "Check your network connection");
        }
    }

    /**
     * This method executes the Blast request via the Put command of the
     * CGI-BIN interface. It gets the estimated time of completion by
     * capturing the value of the RTOE variable and sets a loop that will
     * check for completion of analysis at intervals specified by RTOE.
     * 
     * <p>
     * It also captures the value of the RID variable, necessary for
     * fetching the actual results after completion.
     * </p>
     * 
     * @throws BioException
     *             if it is not possible to send the BLAST command
     */
    public void executeSearch() throws BioException {

        if (seq == null || db == null || prog == null) {
            throw new BioException(
                    "Impossible to execute QBlast request. One or more of "
                            + "seq|db|prog has not been set");
        }
        /*
         * sending the command to execute the Blast analysis
         */
        String cmd = "CMD=Put&SERVICE=plain" + "&" + seq + "&" + prog + "&"
                + db + "&" + "FORMAT_TYPE=HTML";

        if (advanced != null) {
            // Append the options once; "cmd += cmd + ..." duplicated the request
            cmd += "&" + advanced;
        }

        try {

            uConn = setQBlastProperties(aUrl.openConnection());

            fromQBlast = new OutputStreamWriter(uConn.getOutputStream());

            fromQBlast.write(cmd);
            fromQBlast.flush();

            // Get the response
            rd = new BufferedReader(new InputStreamReader(uConn
                    .getInputStream()));

            String line = "";

            while ((line = rd.readLine()) != null) {
                if (line.contains("RID")) {
                    String[] arr = line.split("=");
                    rid = arr[1].trim();
                } else if (line.contains("RTOE")) {
                    String[] arr = line.split("=");
                    step = Long.parseLong(arr[1].trim()) * 1000;
                    start = System.currentTimeMillis() + step;
                }
            }
        } catch (IOException e) {
            throw new BioException(
                    "Can't submit sequence to BLAST server at this time.");
        }
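        /*
         * Getting the info out of the NCBI system, sleeping between checks
         */
        while (!done) {
            done = isReady(rid, System.currentTimeMillis());
            if (!done) {
                try {
                    Thread.sleep(Math.max(1000L, start - System.currentTimeMillis()));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new BioException("Interrupted while waiting for QBlast");
                }
            }
        }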
    }

    /**
     * <p>This method is used only by the executeSearch method to check
     * for completion of the request using the NCBI specified RTOE
     * variable</p>
     * 
     * @param id
     * @param present
     * @return
     */
    private boolean isReady(String id, long present) {

        boolean ready = false;
        String check = "CMD=Get&RID=" + id;
        /*
         * If present time is less than the start of the search added to
         * step obtained from NCBI, just do nothing ;-)
         */
        if (present < start) {
            ;
        }
        /*
         * If we are at least step seconds in the future from the actual
         * call of method executeSearch()
         */
        else {
            try {
                uConn = setQBlastProperties(aUrl.openConnection());

                fromQBlast = new OutputStreamWriter(uConn.getOutputStream());
                fromQBlast.write(check);
                fromQBlast.flush();

                rd = new BufferedReader(new InputStreamReader(uConn
                        .getInputStream()));

                String line = "";

                while ((line = rd.readLine()) != null) {
                    if (line.contains("READY")) {
                        ready = true;
                    } else if (line.contains("WAITING")) {
                        /*
                         * Else, move start forward in time...
                         */
                        start = present + step;
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return ready;
    }

    /**
     * <p>This method extracts the actual Blast report. The default
     * format is Text but can be changed beforehand with the method
     * setQBlastOutputFormat.</p>
     * 
     * 
     * @return
     * @throws BioException
     */
    public InputStream getAlignmentResults() throws BioException {
        String srid = "CMD=Get&RID=" + rid;
        srid += "&FORMAT_TYPE=" + outputFormat;

        if(!this.done){
            throw new BioException("Unable to get report at this time. "
                    + "Your Blast request has not been processed yet.");
        }
        
        try {
            uConn = setQBlastProperties(aUrl.openConnection());

            fromQBlast = new OutputStreamWriter(uConn.getOutputStream());
            fromQBlast.write(srid);
            fromQBlast.flush();

            return uConn.getInputStream();

        } catch (IOException ioe) {
            throw new BioException(
                    "It is not possible to fetch Blast report from NCBI at "
                            + "this time");
        }
    }

    /**
     * <p>
     * Set the sequence to be blasted using the String that corresponds
     * to the sequence.
     * </p>
     * 
     * <p>
     * Take note that this method is mutually exclusive to setGIToBlast()
     * for a given Blast request.
     * </p>
     * 
     * @param aStr
     *            : a String with the sequence
     */
    public void setSequence(String aStr) {
        this.seq = "QUERY=" + aStr;
    }

    /**
     * Simply return a string with the blasted sequence.
     * 
     * @return seq : a string with the sequence
     */
    public String getSeqToBlast() {
        return this.seq;
    }

    /**
     * <p>
     * Set the sequence to be blasted using the NCBI GI value. At this
     * time, there is no effort made to check the validity of this GI.
     * </p>
     * 
     * <p>
     * Take note that this method is mutually exclusive to setSequence()
     * for a given Blast request.
     * </p>
     * 
     * @param gi
     *            : a String holding an NCBI GI
     */
    public void setGIToBlast(String gi) {
        this.seq = "QUERY=" + gi;
    }

    /**
     * <p>
     * Simply return a string with the sequence blasted.
     * </p>
     * 
     * @return GI : a String with the GI of the blasted sequence
     */
    public String getGIToBlast() {
        return this.seq;
    }

    /**
     * <p>
     * This method sets the program to be used to blast the given
     * sequence/GI. At this time, there is no attempt at checking the
     * matching of sequence type to program.
     * </p>
     * 
     * @param prog
     *            : a String representing the program specified for this
     *            QBlast request.
     * 
     */
    public void setProgram(String prog) {
        this.prog = "PROGRAM=" + prog;
    }

    /**
     * <p>
     * Simply returns the program used for the given Blast request.
     * </p>
     * 
     * @return prog : a String with the program used for this QBlast
     *         request.
     */
    public String getProgram() {
        return this.prog;
    }

    /**
     * <p>
     * This method sets the database to be used to blast the given
     * sequence/GI. At this time, there is no attempt at checking the
     * matching of sequence type to database.
     * </p>
     * 
     * @param db: a String for the database specified for this QBlast
     *            request
     */
    public void setDatabase(String db) {
        this.db = "DATABASE=" + db;
    }

    /**
     * <p>
     * Simply returns the database used for the given Blast request.
     * </p>
     * 
     * @return db: a String with the database used for this QBlast
     *         request.
     */
    public String getBlastDatabase() {
        return this.db;
    }

    /**
     * <p>This method lets the user specify which format to use for
     * generating the output.</p>
     * 
     * @param type: an integer taken from the static constants of this
     *              class, one of TEXT, XML or HTML
     */
    public void setQBlastOutputFormat(int type) {

        switch (type) {
            case TEXT:
                this.outputFormat = "Text";
                break;
            case XML:
                this.outputFormat = "XML";
                break;
            case HTML:
                this.outputFormat = "HTML";
                break;
        }
    }

    /**
     * <p>
     * Simply returns the output format used for the given Blast report.
     * </p>
     * 
     * @return outputFormat : a String with the format specified for the
     *         QBlast report.
     */
    public String getQBlastOutputFormat() {
        return this.outputFormat;
    }

    /**
     * <p>This method is to be used if a request is to use non-default
     * values at submission. According to QBlast info, the accepted
     * parameters for PUT requests are:</p>
     * 
     * <ul>
     * <li>-G: cost to create a gap. Default = 5 (nuc-nuc) / 11 (protein) /
     * non-affine for megablast</li>
     * <li>-E: cost to extend a gap. Default = 2 (nuc-nuc) / 1 (protein) /
     * non-affine for megablast</li>
     * <li>-r: integer to reward for match. Default = 1</li>
     * <li>-q: negative integer for penalty to allow mismatch. Default =
     * -3</li>
     * <li>-e: expectation value. Default = 10.0</li>
     * <li>-W: word size. Default = 3 (proteins) / 11 (nuc-nuc) / 28
     * (megablast)</li>
     * <li>-y: dropoff for blast extensions in bits, using default if not
     * specified. Default = 20 for blastn, 7 for all others (except
     * megablast for which it is not applicable).</li>
     * <li>-X: X dropoff value for gapped alignment, in bits. Default = 30
     * for blastn/megablast, 15 for all others.</li>
     * <li>-Z: final X dropoff value for gapped alignment, in bits.
     * Default = 50 for blastn, 25 for all others (except megablast for
     * which it is not applicable)</li>
     * <li>-P: equals 0 for multiple hits 1-pass, 1 for single hit 1-pass.
     * Does not apply to blastn or megablast.</li>
     * <li>-A: multiple hits window size. Default = 0 (for single hit
     * algorithm)</li>
     * <li>-I: number of database sequences to save hits for. Default =
     * 500</li>
     * <li>-Y: effective length of the search space. Default = 0 (0
     * represents using the whole space)</li>
     * <li>-z: a real specifying the effective length of the database to
     * use. Default = 0 (0 represents the real size)</li>
     * <li>-c: an integer representing pseudocount constant for PSI-BLAST.
     * Default = 7</li>
     * <li>-F: any filtering directive</li>
     * </ul>
     * 
     * <p>Be aware that at no moment is there any error checking on the
     * use of these parameters by this class.</p>
     * @param aStr: a String with any number of optional parameters with
     *              an associated value.
     *
     */
    public void setAdvancedOptions(String aStr) {
        this.advanced = "OTHER_ADVANCED=" + aStr;
    }

    /**
     * 
     * Simply return the string given as argument via setAdvancedOptions
     * 
     * @return advanced: the string with the advanced options
     */
    public String getBlastAdvancedOptions() {
        return this.advanced;
    }

    /**
     * 
     * Simply return the QBlast RID for this specific QBlast request
     * 
     * @return rid: the string with the RID
     */
    public String getBlastRID() {
        return this.rid;
    }

    /**
     * A simple method to check the availability of the QBlast service
     * 
     * @throws BioException
     */
    public void printRemoteBlastInfo() throws BioException {
        try {
            OutputStreamWriter out = new OutputStreamWriter(uConn
                    .getOutputStream());

            out.write("CMD=Info");
            out.flush();

            // Get the response
            BufferedReader rd = new BufferedReader(new
InputStreamReader(uConn
                    .getInputStream()));

            String line = "";

            while ((line = rd.readLine()) != null) {
                System.out.println(line);
            }

            out.close();
            rd.close();
        } catch (IOException e) {
            throw new BioException(
                    "Impossible to get info from QBlast service at this "
                            + "time. Check your network connection");
        }
    }

    private URLConnection setQBlastProperties(URLConnection conn) {

        conn.setDoOutput(true);
        conn.setUseCaches(false);

        conn.setRequestProperty("User-Agent", "Biojava/RemoteQBlastService");
        conn.setRequestProperty("Connection", "Keep-Alive");
        conn.setRequestProperty("Content-type",
                "application/x-www-form-urlencoded");
        // No fixed Content-length: a hard-coded 200 only matches a body of
        // exactly 200 bytes, and the request bodies here vary in size.

        return conn;
    }
}
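
The class above concatenates raw values into the request body and finds
RID and RTOE with contains(). As a hedged sketch of a stricter variant,
the hypothetical helper below URL-encodes each form value and reads
"KEY = value" lines. It assumes the response carries lines of that shape,
which is what the parsing in executeSearch implies; the sample RID is
made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the posted code.
final class QBlastForm {
    private QBlastForm() { }

    // Builds an application/x-www-form-urlencoded body, keeping insertion order.
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    // Returns the value of a "KEY = value" line, or null when the key differs.
    static String valueOf(String key, String line) {
        int eq = line.indexOf('=');
        if (eq < 0 || !line.substring(0, eq).trim().equals(key)) {
            return null;
        }
        return line.substring(eq + 1).trim();
    }
}

class QBlastFormDemo {
    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("CMD", "Put");
        params.put("PROGRAM", "blastn");
        params.put("OTHER_ADVANCED", "-G 5 -E 2");
        System.out.println(QBlastForm.encode(params));
        System.out.println(QBlastForm.valueOf("RID", "    RID = ABC123"));
    }
}
```

Matching the key exactly avoids picking up unrelated lines that merely
contain the letters "RID" somewhere in the page.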

