[Biojava-l] Question about BioJava DASClient classes

Keith James kdj@sanger.ac.uk
12 Sep 2002 17:10:38 +0100


>>>>> "Patrick" == Patrick McConnell <MCCon012@mc.duke.edu> writes:

    Patrick> I also noticed an inconsistency with the
    Patrick> numberOfIdentities and the percentageIdentity of the
    Patrick> HSPSummary element.  The identities are 0 but the
    Patrick> percentage identity is 100.  This seems strange to me.
    Patrick> Am I interpreting the results incorrectly?

    Patrick>             <biojava:HSPSummary score="1036.8"
    Patrick> expectValue="4e-54" numberOfIdentities="0"
    Patrick> alignmentSize="141" percentageIdentity="100.0"
    Patrick> querySequenceType="protein" hitSequenceType="protein">

numberOfIdentities is calculated from the MatchConsensus data i.e.
all those :.- symbols in the '; al_cons:' tag.
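
As a rough illustration of that calculation (a sketch only, not the
actual BioJava code), counting identities amounts to counting one
symbol in the consensus string. The class and method names here are
made up for the example, and it assumes ':' is the symbol that marks
an identical position. If the '; al_cons:' block is missing, the
consensus is empty and the count comes out as 0.

```java
public final class ConsensusCount {

    /**
     * Counts how often the given symbol occurs in a consensus string.
     * Line breaks and any other characters are simply not counted.
     */
    static int count(String consensus, char symbol) {
        int n = 0;
        for (int i = 0; i < consensus.length(); i++) {
            if (consensus.charAt(i) == symbol) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical two-line consensus, as it might follow '; al_cons:'.
        String cons = "::::::::::\n:::.: -:::";
        // Assumption: ':' marks an identity.
        System.out.println("identities: " + count(cons, ':'));
        // With no al_cons block there is nothing to count.
        System.out.println("identities (no al_cons): " + count("", ':'));
    }
}
```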

Here's a chunk of -m 10 output from the example file in demos/files.
Note the mp_argv data in the header, which shows that the actual
command line was:
 fasta33 -m 10 NMA0159.aa files/fp_demo.db

>>>NMA0159.aa, 425 aa vs files/fp_demo.db library
; mp_name: fasta33
; mp_ver: 33t06
; mp_argv: fasta33 -m 10 NMA0159.aa files/fp_demo.db
; pg_name: FASTA
; pg_ver: 3.36 June 2000
; pg_matrix: BL50 (15:-5)
; pg_gap-pen: -12 -2
; pg_ktup: 2
; pg_optcut: 25
; pg_cgap: 37
; mp_extrap: 60000 2064
; mp_stats:  Expectation_n fit: rho(ln(x))= 7.0580+/-0.00388; mu= 0.0108+/- 0.216  mean_var=85.1078+/-17.948, 0's: 0 Z-trim: 1  B-trim: 192 in 2/38  Lambda= 0.1390
; mp_KS: 0.0185 (N=29) at  44
>>NMA0159 putative two-component trancriptional regulator 143485:144762 reverse MW:46361
; fa_frame: f
; fa_initn: 2739
; fa_init1: 2739
; fa_opt: 2739
; fa_z-score: 2972.7
; fa_bits: 559.1
; fa_expect: 1.9e-160
; sw_score: 2739
; sw_ident: 1.000
; sw_gident: 1.000
; sw_overlap: 425
>NMA0159 ..
; sq_len: 425
; sq_offset: 1
; sq_type: p
; al_start: 1
; al_stop: 425
; al_display_start: 1
MRSSDILIVDDEIGIRDLLSEILQDEGYSVALAENAEEARKLRHQARPAM
VLLDIWMPDCDGITLLKEWAKNGQLNMPVVMMSGHASIDTAVEATKIGAL
DFLEKPISLQKLLSAVENALKYGAAQTETGPVFDKLGNSAAIQEMNREVG
AAVKCASPVLLTGEAGSPFETVARYFHKNGTPWVSPARVEYLINMPMELL
QKAEGGVLYVGDIAQYSRNIQAGIAFIVGKAEHRRVRVVASGSRAAGSDG
IACEEKLAELLSESVVRIPPLRMQHEDIPFLIQGITCNVAESQKIAPASF
SEDALAALTRYEWPGNFDQLSSVVATLLLEADGQEIGAGAVSSLLGQNVP
AEGAEDMVGGFNFNLPLRELREEVERRYFEYHIAQEGQNMSKVAQKVGLE
RTHLYRKLKQLGIGVSRRAGEKTEE
>NMA0159 ..
; sq_len: 425
; sq_type: p
; al_start: 1
; al_stop: 425
; al_display_start: 1
MRSSDILIVDDEIGIRDLLSEILQDEGYSVALAENAEEARKLRHQARPAM
VLLDIWMPDCDGITLLKEWAKNGQLNMPVVMMSGHASIDTAVEATKIGAL
DFLEKPISLQKLLSAVENALKYGAAQTETGPVFDKLGNSAAIQEMNREVG
AAVKCASPVLLTGEAGSPFETVARYFHKNGTPWVSPARVEYLINMPMELL
QKAEGGVLYVGDIAQYSRNIQAGIAFIVGKAEHRRVRVVASGSRAAGSDG
IACEEKLAELLSESVVRIPPLRMQHEDIPFLIQGITCNVAESQKIAPASF
SEDALAALTRYEWPGNFDQLSSVVATLLLEADGQEIGAGAVSSLLGQNVP
AEGAEDMVGGFNFNLPLRELREEVERRYFEYHIAQEGQNMSKVAQKVGLE
RTHLYRKLKQLGIGVSRRAGEKTEE
; al_cons:
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::

My guess is that your version of Fasta is omitting the lower section of
data (the '; al_cons:' block).

Something in the back of my brain reminded me to look at the
Makefile... it's only a *compile-time* option isn't it? Ack! I had
forgotten about that because our Fasta is always built with it
enabled. Sorry about that.

From the readme:

>> November 8, 1996

fasta30t7 differs from fasta30t6 in the amount of information provided
with the -m 10 option.

(1) The query and library sequence identifiers are no longer abbreviated.

(2) New information about the program and program version are provided:

The new information provided is:

	mp_name: program name (actually argv[0])
	mp_ver: main program version (can be different from function version)
	mp_argv: command line arguments (duplicates argv[0])

    Some statistical information is provided as well:
	mp_extrap: XXXX YYY - statistics extrapolated from XXX to YYY
	mp_stats: indicates type of statistics used for E() value
	mp_KS: Kolmogorov-Smirnoff statistic

The "mp_" (main program) information is function independent, while the "pg_"
information is produced by a particular comparison function (ssearch,
fastx, fasta, etc).  "pg_" should probably be called "fn_", and "mp_"
called "pg_", but I remain backwards compatible.

(3) The end of the "parseable" records is denoted with:

	>>><<<

(4) There is now a compile-time option, -DM10_CONS, that allows you to
display a final alignment summary:

;al_cons:
     .::.:-   .:: ..  :.    .:.---:   :  .--.:. : 
..  .---  ..: :: ... :..: .::.:. .  .---.  .   .: 
 : .  . . :    ..   .    :..: .--. . : .:. .. :  .
 .:.:::  ..:. :

So Fasta has to be built with -DM10_CONS. I've added this to the
Javadoc of FastaSearchSAXParser.
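
If you want to check whether a given Fasta binary was built that way,
one option is to scan its -m 10 output for the al_cons tag before
trusting numberOfIdentities. The following is only a sketch of such a
check; the class and method names are hypothetical and it is not part
of BioJava. It accepts the tag with or without a space after the ';',
since both forms appear above.

```java
public final class M10ConsCheck {

    /**
     * Returns true if any line of the -m 10 output is an al_cons tag,
     * i.e. a ';' followed by optional whitespace and then "al_cons:".
     */
    static boolean hasConsensus(String m10Output) {
        for (String line : m10Output.split("\\r?\\n")) {
            String t = line.trim();
            if (t.startsWith(";") && t.substring(1).trim().startsWith("al_cons:")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical fragments of -m 10 output, with and without the tag.
        String with = "; al_stop: 425\nMRSSDILIV\n; al_cons:\n:::::::::\n";
        String without = "; al_stop: 425\nMRSSDILIV\n";
        System.out.println("with tag:    " + hasConsensus(with));
        System.out.println("without tag: " + hasConsensus(without));
    }
}
```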

Keith

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -