[Biojava-l] biojava BLAST parser proposal

Peter Keller keller@ebi.ac.uk
Fri, 18 Feb 2000 15:01:46 +0000 (GMT)


Hi Simon (and others),

I am a little bemused by all this BLAST parser stuff. BLAST output was
never meant to be parsed, although lots of people try, with varying
degrees of success. The format of BLAST output files is not stable, so
programs that parse them can break from time to time.

If you are really interested in moving the software handling of BLAST
output forward in a big way, I would have thought that the first step
would be to write a nice easy-to-parse output option for BLAST, and
persuade the NCBI to include it in future versions. FASTA already has
this in the form of the '-m 10' option, and I have put a sample of the
output from this below. Surely this has to be better than trying to
parse output that was designed to be human-readable? This kind of format
has the additional advantage that new fields can be added without
breaking existing software, as long as that software is written to
ignore fields that it isn't interested in or doesn't know about.
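
To make that last point concrete, here is a minimal Python sketch of a
reader that picks out only the fields it wants from '; key: value' lines
and skips everything else. It is not part of any existing tool: the
function name parse_fields and the field fa_newfield are made up for
illustration, and it assumes each field sits on one line as in the sample
at the end of this message.

```python
def parse_fields(lines, wanted):
    """Collect '; key: value' lines, keeping only the keys in `wanted`."""
    fields = {}
    for line in lines:
        if not line.startswith(";"):
            continue  # header ('>>...') or sequence line, not a field
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # no colon at all: not a 'key: value' line
        key = key.strip()
        if key in wanted:  # unknown or unwanted keys are silently ignored
            fields[key] = value.strip()
    return fields


sample = [
    ">>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.",
    "; fa_initn:  81",
    "; fa_opt: 112",
    "; fa_expect:    1.8",
    "; fa_newfield: 42",  # hypothetical field added by a later version
]

hit = parse_fields(sample, {"fa_opt", "fa_expect"})
```

A reader written this way keeps working when a later version of the
program adds a field such as the made-up fa_newfield above, because any
key it does not recognise is simply dropped.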

Regards,
Peter.

========================================================================
Peter Keller.                     | "Research without indebtedness is
European Bioinformatics Institute,|  suspect, and somebody must always,
Hinxton Hall,                     |  somehow, be thanked."
Cambridge, CB10 1SD, UK           |                     --- Umberto Eco
-----------------------------------
Email: keller@ebi.ac.uk |
Tel. (+44/0)1223 494637 | Macromolecular Structure Database
Fax. (+44/0)1223 494468 | http://msd.ebi.ac.uk
========================================================================


----------------- Sample of 'fasta3 -m 10' output starts here ------------

>>>test.seq, 120 nt vs /ebi/services/idata/fastadb/em_ov library
; mp_name: fasta3
; mp_ver: 30t78.2
; mp_argv: fasta3 -m 10
; pg_name: FASTA
; pg_ver: 3.06 Sept, 1996
; pg_matrix: DNA
; pg_gap-pen: -16 -4
; pg_ktup: 6
; pg_optcut: 31
; pg_cgap: 46
; mp_extrap: 50000 19977
; mp_stats: Expectation fit: rho(ln(x))= 5.3978+/-0.000719; mu= 12.0045+/- 0.047
;  mean_var=90.4621+/-14.975
; mp_KS: 0.0227 (N=29) at  40
>>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.
; fa_initn:  81
; fa_init1:  51
; fa_opt: 112
; fa_z-score: 107.3
; fa_expect:    1.8
; fa_ident: 0.627
; fa_overlap: 83
>BOVPRL ..
; sq_len: 120
; sq_type: D
; al_start: 2
; al_stop: 81
; al_display_start: 1
-----------------------------TGCTTGGCTGAGGAGCCATAG
GACGAGAGC---TTCCTGGTGAAGTGTGTTTCTTGAAATCATCACCACCA
TGGACAGCAAAGGTTCGTCGCAGAAAGGGTCCCGCCTGCTCCTGCTGCTG
GT
>EM_OV:GGU87449 ..
; sq_len: 4543
; sq_type: D
; al_start: 3887
; al_stop: 3968
; al_display_start: 3857
ACACCTGGGCCCCATGCGGATGTCACTGCAGCGGGGCTGAGGAACAAGGT
GATGCCAGCACACCCCGTGTGACCTCTGTTTCAGCACAGCTTCACCAACA
CGGGCA-CAACGGAGGGCCAGGGAGCAGTGCTCCAACGGGACCCAGCAGG
CCCAGAAAAGCACAGCATTGCCTTCTCGTG
;al_cons:
                              ==mm=========m=m=mmm
==m=mm===---mm==mm====mm=m======mmm=m=m=m======m==
m==m==-===m==