[Bioperl-l] Changes in FASTA output format
William R. Pearson
wrp at virginia.edu
Fri Mar 30 18:05:15 UTC 2007
The next major revision of the FASTA program package will
substantially improve the strategy for calculating statistical
significance, particularly when a small library is being searched
(high-scoring sequences will be shuffled and used to estimate a
second set of statistical parameters).
As a result, I am considering some changes in FASTA output.
(1) I would like to expand the line that shows the algorithm and
scoring matrix parameters to multiple lines. Currently it looks like:
Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007) function [BL50 matrix (15:-5)], open/ext: -12/-2
Scan time: 2.140
I would like to allow at least two lines here, one for the algorithm
and version, a second for the scoring parameters:
Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007) function
BL50 matrix (15:-5), open/ext: -12/-2
Scan time: 2.140
I could even imagine tagging the lines:
Algorithm: Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007)
Parameters: BL50 matrix (15:-5), open/ext: -12/-2
Scan time: 2.140
I don't think this would break many FASTA parsers, but I wanted to
check.
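For parser authors who want to check, here is a minimal Python sketch of a reader that accepts all three layouts quoted above: the current single line, the untagged two-line form, and the tagged form. The function name parse_search_header and the dictionary keys are hypothetical, chosen only for illustration. They are not part of FASTA or BioPerl, and the regular expression assumes the parameters always start at the "<name> matrix" token.

```python
import re

# Splits an untagged header into algorithm text and scoring parameters.
# Assumption: the parameters begin at "<name> matrix", optionally inside "[".
UNTAGGED_RE = re.compile(
    r"^(?P<alg>.*?)\s*(?:function\s*)?\[?(?P<params>\S+ matrix .*)$"
)

def parse_search_header(lines):
    """Return a dict with 'algorithm' and 'parameters' (hypothetical keys)."""
    info = {}
    untagged = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("Scan time:"):
            continue  # blank lines and the timing line are not parsed here
        tag, sep, rest = line.partition(":")
        if sep and tag in ("Algorithm", "Parameters"):
            info[tag.lower()] = rest.strip()  # tagged layout
        else:
            untagged.append(line)  # one- or two-line untagged layout
    if untagged and not info:
        text = " ".join(untagged)
        m = UNTAGGED_RE.match(text)
        if m:
            info["algorithm"] = m.group("alg")
            # Drop the closing "]" that the single-line layout carries.
            info["parameters"] = m.group("params").replace("]", "", 1)
        else:
            info["algorithm"] = text  # unrecognised layout: keep the raw text
    return info
```

A parser written this way returns the same two fields for every layout, so moving from one line to two lines, or to tagged lines, would not change what downstream code sees.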
(2) I am also thinking about displaying multiple E()-values,
depending on whether they are calculated from the similarity search
or the shuffled high scores, e.g., going from:
The best scores are:                                       s-w bits E(231210)
gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-tran ( 218) 1497 349.6 6.1e-96
gi|121717|sp|P04905|GSTM1_RAT Glutathione S-transf ( 218) 1413 330.4 3.8e-90
gi|399829|sp|Q00285|GSTMU_CRILO Glutathione S-tran ( 218) 1354 316.9 4.5e-86
To:
The best scores are:                                       s-w bits E(231210) ES()
gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-tran ( 218) 1497 349.6 6.1e-96 5.5e-95
gi|121717|sp|P04905|GSTM1_RAT Glutathione S-transf ( 218) 1413 330.4 3.8e-90 2.2e-89
gi|399829|sp|Q00285|GSTMU_CRILO Glutathione S-tran ( 218) 1354 316.9 4.5e-86 8.3e-85
I think this output would break many more FASTA parsers, so one
option would be to add it, at first, only to the alignment output.
Naturally, it will initially be easy to revert to the classic format.
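To illustrate why a parser might or might not break, here is a minimal Python sketch that reads a "best scores" line with either one or two trailing E()-values. The name parse_hit and the keys e_value and e_shuffled are hypothetical labels for the E() and ES() columns. The sketch assumes the columns are whitespace-separated and that the sequence length is the last parenthesized integer before the score.

```python
import re

# One hit line: description, "( length)", score, bits, then one or more E-values.
HIT_RE = re.compile(
    r"^(?P<desc>.+?)\s*\(\s*(?P<length>\d+)\)\s+"
    r"(?P<score>\d+)\s+(?P<bits>[\d.]+)\s+"
    r"(?P<evalues>[\d.eE+-]+(?:\s+[\d.eE+-]+)*)\s*$"
)

def parse_hit(line):
    """Parse one hit line; return None if the line does not look like a hit."""
    m = HIT_RE.match(line)
    if m is None:
        return None
    evalues = [float(tok) for tok in m.group("evalues").split()]
    desc = m.group("desc")
    return {
        "id": desc.split()[0],
        "description": desc,
        "length": int(m.group("length")),
        "score": int(m.group("score")),
        "bits": float(m.group("bits")),
        "e_value": evalues[0],  # E(): from the similarity search
        # ES(): from the shuffled high scores; absent in the classic format
        "e_shuffled": evalues[1] if len(evalues) > 1 else None,
    }
```

A parser that instead anchors the E()-value to the end of the line, or splits the line into a fixed number of fields, would either fail or report the ES() value as E(), which is the kind of breakage the extra column could cause.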
I would appreciate any comments on the problems these changes might
cause.
Bill Pearson