[Biojava-l] biojava BLAST parser proposal

Ewan Birney birney@ebi.ac.uk
Fri, 18 Feb 2000 15:07:54 +0000


On Fri, 18 Feb 2000, Peter Keller wrote:

> Hi Simon (and others),
> 
> I am a little bemused by all this blast parser stuff. Blast output was
> never meant to be parsed, although lots of people try with varying
> degrees of success. However, the format of blast output files is not
> stable, and programs that parse them can break from time to time.


No shit!

> 
> If you are really interested in moving the software handling of blast
> output forward in a big way, I would have thought that the first step
> would be to write a nice easy-to-parse output option for blast, and
> persuade the NCBI to include it in future versions. FastA already has
> this in the form of the '-m 10' option, and I have put a sample of the
> output from this below. Surely this has to be better than trying to
> parse output that was designed to be human-readable? This kind of format
> has the additional advantage that new fields can be added without
> breaking existing software, as long as that software is written to
> ignore fields that it isn't interested in or doesn't know about.
> 


Peter - I almost broke into real deep laughter. The idea that NCBI would
take other people's code from outside is pretty unusual, and you clearly
haven't gone into the guts of BLAST 2.0 -> ooooh boy.


....and NCBI would claim that there already is a structured format,
namely the ASN.1 model.

(It's just that they change the ASN.1 definition regularly.)





> Regards,
> Peter.
> 
> ========================================================================
> Peter Keller.                     | "Research without indebtedness is
> European Bioinformatics Institute,|  suspect, and somebody must always,
> Hinxton Hall,                     |  somehow, be thanked."
> Cambridge, CB10 1SD, UK           |                     --- Umberto Eco
> -----------------------------------
> Email: keller@ebi.ac.uk |
> Tel. (+44/0)1223 494637 | Macromolecular Structure Database
> Fax. (+44/0)1223 494468 | http://msd.ebi.ac.uk
> ========================================================================
> 
> 
> ----------------- Sample of 'fasta3 -m 10' output starts here ------------
> 
> >>>test.seq, 120 nt vs /ebi/services/idata/fastadb/em_ov library
> ; mp_name: fasta3
> ; mp_ver: 30t78.2
> ; mp_argv: fasta3 -m 10
> ; pg_name: FASTA
> ; pg_ver: 3.06 Sept, 1996
> ; pg_matrix: DNA
> ; pg_gap-pen: -16 -4
> ; pg_ktup: 6
> ; pg_optcut: 31
> ; pg_cgap: 46
> ; mp_extrap: 50000 19977
> ; mp_stats: Expectation fit: rho(ln(x))= 5.3978+/-0.000719; mu= 12.0045+/- 0.047
> ;  mean_var=90.4621+/-14.975
> ; mp_KS: 0.0227 (N=29) at  40
> >>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.
> ; fa_initn:  81
> ; fa_init1:  51
> ; fa_opt: 112
> ; fa_z-score: 107.3
> ; fa_expect:    1.8
> ; fa_ident: 0.627
> ; fa_overlap: 83
> >BOVPRL ..
> ; sq_len: 120
> ; sq_type: D
> ; al_start: 2
> ; al_stop: 81
> ; al_display_start: 1
> -----------------------------TGCTTGGCTGAGGAGCCATAG
> GACGAGAGC---TTCCTGGTGAAGTGTGTTTCTTGAAATCATCACCACCA
> TGGACAGCAAAGGTTCGTCGCAGAAAGGGTCCCGCCTGCTCCTGCTGCTG
> GT
> >EM_OV:GGU87449 ..
> ; sq_len: 4543
> ; sq_type: D
> ; al_start: 3887
> ; al_stop: 3968
> ; al_display_start: 3857
> ACACCTGGGCCCCATGCGGATGTCACTGCAGCGGGGCTGAGGAACAAGGT
> GATGCCAGCACACCCCGTGTGACCTCTGTTTCAGCACAGCTTCACCAACA
> CGGGCA-CAACGGAGGGCCAGGGAGCAGTGCTCCAACGGGACCCAGCAGG
> CCCAGAAAAGCACAGCATTGCCTTCTCGTG
> ;al_cons:
>                               ==mm=========m=m=mmm
> ==m=mm===---mm==mm====mm=m======mmm=m=m=m======m==
> m==m==-===m==
> 
> 
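
To make the "ignore fields you don't know about" point concrete, here
is a minimal sketch in Python of reading the '; key: value' lines from
the quoted 'fasta3 -m 10' sample. It is not part of FASTA or biojava:
the function name parse_m10_hits, the default list of wanted fields,
and the rule that a '>>' line opens a hit while a single '>' line
closes its hit-level fields are all assumptions read off that one
sample, not a documented grammar.

```python
def parse_m10_hits(text, wanted=("fa_opt", "fa_expect", "fa_ident")):
    """Collect selected '; key: value' fields for each '>>' hit.

    Hypothetical helper: the record layout is inferred from the quoted
    sample only. Any field not named in `wanted` is skipped silently,
    so new fields in the output do not break the reader.
    """
    hits = []
    current = None  # dict for the hit being read, or None
    for line in text.splitlines():
        if line.startswith(">>>"):
            # Query header; the run parameters after it are not hit fields.
            current = None
        elif line.startswith(">>"):
            # Start of a hit: keep the first token as its identifier.
            current = {"id": line[2:].split()[0]}
            hits.append(current)
        elif line.startswith(">"):
            # Per-sequence alignment block: stop collecting hit fields.
            current = None
        elif line.startswith(";") and current is not None:
            key, sep, value = line[1:].partition(":")
            key = key.strip()
            if sep and key in wanted:
                current[key] = value.strip()
    return hits


# A few lines copied from the quoted sample (without the "> " quoting).
SAMPLE = """\
>>>test.seq, 120 nt vs /ebi/services/idata/fastadb/em_ov library
; mp_name: fasta3
; pg_ktup: 6
>>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.
; fa_initn:  81
; fa_opt: 112
; fa_expect:    1.8
; fa_ident: 0.627
; fa_overlap: 83
>BOVPRL ..
; sq_len: 120
; al_start: 2
"""

for hit in parse_m10_hits(SAMPLE):
    print(hit)
```

Values are kept as strings on purpose; converting fa_expect and friends
to numbers is left to the caller, since the sketch makes no claim about
which fields are always numeric.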

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230
<birney@ebi.ac.uk>
-----------------------------------------------------------------