[Biojava-l] BLAST Parser for extracting all BLAST data?
hollandr at gis.a-star.edu.sg
Sun Jun 26 11:33:14 EDT 2005
BioJava's BLAST framework parses files and fires an event for every piece of information it finds. The SeqSimilarityAdapter class is an example of how to catch these events and construct basic BLAST result objects (SimpleSeqSimilarityHit). However, those objects are not comprehensive and do not record the full details of every hit.
If you want the kind of detail you mention below, you will have to write your own content handler for BLAST parsing and pass it to the BLASTLikeSAXParser when parsing a file. This event handler should implement the ContentHandler interface. Look at the source of SeqSimilarityAdapter for guidance. You will then receive events for every part of the file, from which you can construct your own custom BLAST result objects to describe the hits.
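As a rough illustration of such a custom result object, here is a minimal sketch that holds the counts from the "Identities = ..." summary line of one alignment. It assumes your handler ends up with that line as plain text; if the parser hands you the numbers as separate attributes instead, you would fill the fields directly. The class name HspStats and its methods are made up for this example and are not part of BioJava.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value object for one alignment's summary counts (not a BioJava class).
final class HspStats {
    final int identities;       // e.g. 54 in "Identities = 54/124"
    final int positives;        // -1 if the line has no "Positives" field
    final int gaps;             // 0 if the line has no "Gaps" field
    final int alignmentLength;  // the denominator, e.g. 124

    private HspStats(int identities, int positives, int gaps, int alignmentLength) {
        this.identities = identities;
        this.positives = positives;
        this.gaps = gaps;
        this.alignmentLength = alignmentLength;
    }

    // Returns {count, total} for "Label = count/total", or null if the label is absent.
    private static int[] field(String line, String label) {
        Matcher m = Pattern.compile(label + "\\s*=\\s*(\\d+)/(\\d+)").matcher(line);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    // Parses a line such as
    // "Identities = 54/124 (43%), Positives = 79/124 (63%), Gaps = 3/124 (2%)".
    static HspStats parse(String line) {
        int[] id = field(line, "Identities");
        if (id == null) {
            throw new IllegalArgumentException("no Identities field in: " + line);
        }
        int[] pos = field(line, "Positives");
        int[] gap = field(line, "Gaps");
        return new HspStats(id[0], pos == null ? -1 : pos[0], gap == null ? 0 : gap[0], id[1]);
    }

    // Whole-number percentage of the alignment length, truncated rather than rounded.
    int percent(int count) {
        return 100 * count / alignmentLength;
    }
}
```

For the hit quoted below, HspStats.parse(...) gives 54, 79 and 3 over an alignment length of 124, and percent() returns 43, 63 and 2, the same figures the report prints for this example.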
If you're not sure which tag names to listen for in your ContentHandler, the easiest thing to do is run the parser once with a handler that dumps every tag name it receives and see what you get.
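A dump-everything handler along those lines might look like the sketch below. It uses only the standard org.xml.sax classes; TagDumper is a name invented here. So that the example runs without BioJava on the classpath, the dump() helper feeds it from the JDK's own XML parser, and the element names in the usage note are placeholders, not the names the BLAST parser actually emits. With BioJava you would instead hand the same handler to the BLAST parser, presumably the same way SeqSimilarityAdapter is registered, and read the real names off the output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Prints every element it is handed and remembers each distinct name (hypothetical helper).
class TagDumper extends DefaultHandler {
    private final Set<String> seen = new LinkedHashSet<>();
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Parsers that are not namespace-aware leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        seen.add(name);

        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");  // indent by nesting depth
        }
        line.append(name);
        for (int i = 0; i < atts.getLength(); i++) {
            line.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
        }
        System.out.println(line);
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    // Distinct element names in the order they were first seen.
    List<String> seenNames() {
        return new ArrayList<>(seen);
    }

    // Stand-in driver for this sketch only: runs the handler over an XML string
    // with the JDK parser instead of over a BLAST report.
    static TagDumper dump(String xml) {
        TagDumper handler = new TagDumper();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return handler;
    }
}
```

For example, TagDumper.dump("<Report><Hit id=\"h1\"><Summary/></Hit></Report>") prints the three placeholder elements, indented by depth, and seenNames() then lists Report, Hit and Summary.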
From: biojava-l-bounces at portal.open-bio.org on behalf of Y D Sun
Sent: Sun 6/26/2005 5:42 PM
To: biojava-l at biojava.org
Subject: [Biojava-l] BLAST Parser for extracting all BLAST data?
I want to extract all data from BLASTP results. In the following hit,
for example, I need to get the lengths of query and subject proteins,
the identities (including all data 54, 124 and 43%), the positives (all
data 79, 124 and 63%), and the gaps (3, 124 and 2%). Can the
BLASTLikeSAXParser extract all of this information? I can't find the
methods in SeqSimilaritySearchHit and SeqSimilaritySearchSubHit APIs to
retrieve these data. Does Biojava provide any methods for this purpose?
BLASTP 2.2.5 [Nov-16-2002]
2407 sequences; 662,866 total letters
Sequences producing significant alignments: (bits)
Length = 138
Score = 100 bits (250), Expect = 1e-23
Identities = 54/124 (43%), Positives = 79/124 (63%), Gaps = 3/124 (2%)
Query: 18 NARTKFTDIAKTLNLTEAAIRKRIKKLEENQIIKRYSIDIDYKKLGYNMAIIGLDIDMDY
NAR T IAK LN+TEAA+RKRI LE + I Y I+YKK+G + ++ G+D+D D
Sbjct: 15 NARIPKTRIAKELNVTEAAVRKRIANLERREEILGYKAIINYKKVGLSASLTGVDVDPDK
Query: 78 FPKIIKELEKRKEFLHIYSSAGDHDIMVIAIYK---DLEEIYNYLKNLKGVKRVCPAIII
K+++EL+ + ++ + GDH IM I K +L EI+ + ++GVKRVCP+II
Sbjct: 75 LWKVVEELKDLESVKSLWLTTGDHTIMAEIIAKSVQELSEIHQKIAEMEGVKRVCPSIIT
Query: 135 DQIK 138
Sbjct: 135 DIVK 138