[Biojava-l] Extract accession number out of xml blast result

Richard HOLLAND hollandr at gis.a-star.edu.sg
Thu Nov 10 21:15:27 EST 2005


As documented in BioJava in Anger, the subject's accession can be
obtained from a SeqSimilaritySearchHit by calling getSubjectID().

According to the API, the query's accession can be obtained from a
SeqSimilaritySearchResult by calling getQuerySequence().getName().

However... unfortunately, the query accession method above does not work
if you follow the BioJava in Anger example code!

BlastLikeSearchBuilder requires a SequenceDB and a
SequenceDBInstallation. The former should contain all sequences used in
the query, and the latter should be able to provide SequenceDB instances
corresponding to the databases used in the blast. For instance, if you
blasted query "A12345" vs. database "nr", then the SequenceDB instance
should return a meaningful value for getSequence("A12345"), and the
SequenceDBInstallation instance should return a meaningful value for
getSequenceDB("nr").

The example at BioJava in Anger uses a DummySequenceDB and
DummySequenceDBInstallation to pass to the BlastLikeSearchBuilder. Both
these instances generate the exact same response no matter what values
you pass to getSequence() and getSequenceDB() - they return a Sequence
or SequenceDB with the name of "dummy".

If you are really interested in the actual query accession, you need to
provide your own SequenceDB that returns appropriately named sequences.
If your queries all come from an existing SequenceDB object, you can
just pass this straight in. Likewise, if you are really interested in
the target database name, you can construct or reuse some other
SequenceDBInstallation that provides the appropriate SequenceDB
instances.

BUT... you can get round all this object overkill by knowing a few
things about your query data before trying to parse it. First, when you
run BLAST on multiple query sequences in a single input file, the report
generated will contain the query sequences in the same order as the
input file. Second, the SeqSimilaritySearchResult objects are returned
in the same order as the results appear in the BLAST report, and there
will be one SeqSimilaritySearchResult object per query sequence. So, if
you have a list of your query sequence accessions in the order they
appear in the input file to BLAST, you can then maintain a counter which
increments each time you obtain the next SeqSimilaritySearchResult, and
that counter will provide a direct pointer into your list to tell you
which query accession you are currently working with. Likewise, you
should know already what blast database you blasted against, so you
shouldn't really need to get this information from the results.
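
A minimal sketch of that counter idea, in plain Java with no BioJava
calls. The class QueryOrder and its two methods are hypothetical names
made up for illustration. It assumes the accession is the first
whitespace-delimited token of each FASTA header line, that accessions
are unique, and that the parser really does return exactly one result
per query in input order, as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of BioJava): matches results to query accessions by position. */
final class QueryOrder {

    /**
     * Collects the accession from each FASTA header line, in file order.
     * Assumption: the accession is the first whitespace-delimited token after '>'.
     */
    static List<String> readAccessions(List<String> fastaLines) {
        List<String> accessions = new ArrayList<>();
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                String header = line.substring(1).trim();
                accessions.add(header.split("\\s+")[0]);
            }
        }
        return accessions;
    }

    /**
     * Pairs the i-th result with the i-th accession using a running counter.
     * R stands in for whatever result type the parser hands back.
     * Assumption: accessions are unique; a repeated accession would overwrite an earlier entry.
     */
    static <R> Map<String, R> pairInOrder(List<String> accessions, List<R> results) {
        if (accessions.size() != results.size()) {
            throw new IllegalArgumentException(
                "Expected one result per query: " + accessions.size()
                + " accessions but " + results.size() + " results");
        }
        Map<String, R> byAccession = new LinkedHashMap<>();
        int counter = 0; // index into the accession list
        for (R result : results) {
            byAccession.put(accessions.get(counter), result);
            counter++;
        }
        return byAccession;
    }
}
```

You would feed readAccessions() the lines of the same multi-FASTA file
you gave to BLAST (for example from Files.readAllLines), and give
pairInOrder() the list of SeqSimilaritySearchResult objects in the order
the parser produced them. The size check is there so that a mismatch
fails loudly instead of silently shifting every accession by one.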

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199


> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org 
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of 
> Andreas Scheucher
> Sent: Thursday, November 10, 2005 6:49 PM
> To: biojava-l at biojava.org
> Subject: [Biojava-l] Extract accession number out of xml blast result
> 
> 
> Hi,
> 
> I'm parsing a BLAST result file for a multi-FASTA search
> with BioJava.
> Now I'm wondering whether there really is no way to get the
> accession number out of a BLAST hit. The XML tag with the
> information
> is there, but where is the corresponding function?
> 
> Thanks for your effort.
> 
> Regards,
> Andreas