[Biojava-l] biojava BLAST parser proposal

Bernard French bfrench@cas.org
Fri, 18 Feb 2000 10:15:40 -0500 (EST)


Hi Folks,

First, let me say I'm enjoying following the conversations going on
within this group. Second, I agree with Peter about the blast parser.
I believe NCBI's blast returns an ASN.1 SeqAlign structure, which is
then used to generate the blast output report. Whether the SeqAlign
structure is accessible through the blast software, which would allow
you to format your own output, I don't know.

-- Bernie

***********************************************************************
*                                                                     *
*  Bernard T. French, Ph.D.                Chemical Abstracts Service *
*  Department Manager                      2540 Olentangy River Road  *
*  Molecular Biology and Genetics Dept.    Columbus, Ohio 43202       *
*  Email: bfrench@cas.org                                             *
*                                                                     *
***********************************************************************

|Date: Fri, 18 Feb 2000 15:01:46 +0000 (GMT)
|From: Peter Keller <keller@ebi.ac.uk>
|To: biojava-l@biojava.org
|Subject: Re: [Biojava-l] biojava BLAST parser proposal
|Organization: "EBI - European Bioinformatics Institute"
|MIME-Version: 1.0
|Content-Type: TEXT/PLAIN; charset=US-ASCII
|Sender: biojava-l-admin@biojava.org
|X-Mailman-Version: 1.0rc3
|List-Id: Biojava discussion list <biojava-l.biojava.org>
|X-BeenThere: biojava-l@biojava.org
|
|Hi Simon (and others),
|
|I am a little bemused by all this blast parser stuff. Blast output was
|never meant to be parsed, although lots of people try with varying
|degrees of success. However, the format of blast output files is not
|stable, and programs that parse them can break from time to time.
|
|If you are really interested in moving the software handling of blast
|output forward in a big way, I would have thought that the first step
|would be to write a nice easy-to-parse output option for blast, and
|persuade the NCBI to include it in future versions. FastA already has
|this in the form of the '-m 10' option, and I have put a sample of the
|output from this below. Surely this has to be better than trying to
|parse output that was designed to be human-readable? This kind of format
|has the additional advantage that new fields can be added without
|breaking existing software, as long as that software is written to
|ignore fields that it isn't interested in or doesn't know about.
|
|Regards,
|Peter.
|
|========================================================================
|Peter Keller.                     | "Research without indebtedness is
|European Bioinformatics Institute,|  suspect, and somebody must always,
|Hinxton Hall,                     |  somehow, be thanked."
|Cambridge, CB10 1SD, UK           |                     --- Umberto Eco
|-----------------------------------
|Email: keller@ebi.ac.uk |
|Tel. (+44/0)1223 494637 | Macromolecular Structure Database
|Fax. (+44/0)1223 494468 | http://msd.ebi.ac.uk
|========================================================================
|
|
|----------------- Sample of 'fasta3 -m 10' output starts here ------------
|
|>>>test.seq, 120 nt vs /ebi/services/idata/fastadb/em_ov library
|; mp_name: fasta3
|; mp_ver: 30t78.2
|; mp_argv: fasta3 -m 10
|; pg_name: FASTA
|; pg_ver: 3.06 Sept, 1996
|; pg_matrix: DNA
|; pg_gap-pen: -16 -4
|; pg_ktup: 6
|; pg_optcut: 31
|; pg_cgap: 46
|; mp_extrap: 50000 19977
|; mp_stats: Expectation fit: rho(ln(x))= 5.3978+/-0.000719; mu= 12.0045+/- 0.047
|;  mean_var=90.4621+/-14.975
|; mp_KS: 0.0227 (N=29) at  40
|>>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.
|; fa_initn:  81
|; fa_init1:  51
|; fa_opt: 112
|; fa_z-score: 107.3
|; fa_expect:    1.8
|; fa_ident: 0.627
|; fa_overlap: 83
|>BOVPRL ..
|; sq_len: 120
|; sq_type: D
|; al_start: 2
|; al_stop: 81
|; al_display_start: 1
|-----------------------------TGCTTGGCTGAGGAGCCATAG
|GACGAGAGC---TTCCTGGTGAAGTGTGTTTCTTGAAATCATCACCACCA
|TGGACAGCAAAGGTTCGTCGCAGAAAGGGTCCCGCCTGCTCCTGCTGCTG
|GT
|>EM_OV:GGU87449 ..
|; sq_len: 4543
|; sq_type: D
|; al_start: 3887
|; al_stop: 3968
|; al_display_start: 3857
|ACACCTGGGCCCCATGCGGATGTCACTGCAGCGGGGCTGAGGAACAAGGT
|GATGCCAGCACACCCCGTGTGACCTCTGTTTCAGCACAGCTTCACCAACA
|CGGGCA-CAACGGAGGGCCAGGGAGCAGTGCTCCAACGGGACCCAGCAGG
|CCCAGAAAAGCACAGCATTGCCTTCTCGTG
|;al_cons:
|                              ==mm=========m=m=mmm
|==m=mm===---mm==mm====mm=m======mmm=m=m=m======m==
|m==m==-===m==
|
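
To illustrate Peter's point that a tagged format lets software ignore
fields it doesn't know about, here is a minimal Java sketch. It is not
BioJava code and not a full parser for the '-m 10' format: the class and
method names (TaggedFieldSketch, parseFields) are made up for this
example, and the only assumption about the format is what the sample
above shows, namely that field lines look like "; key: value".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, named for this example only.
class TaggedFieldSketch {

    /**
     * Collects "; key: value" lines whose key is in 'wanted'.
     * Header lines (">>>", ">>", ">"), sequence data and any field
     * the caller did not ask for are silently skipped, so new fields
     * added to the format later do not break this code.
     */
    public static Map<String, String> parseFields(List<String> lines, Set<String> wanted) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.startsWith(";")) {
                continue; // not a field line
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // no "key:" part, skip rather than fail
            }
            String key = line.substring(1, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (wanted.contains(key)) {
                // Keys such as sq_len repeat per sequence in the sample;
                // this sketch keeps only the first occurrence. A real
                // parser would start a new record at each ">" header.
                fields.putIfAbsent(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Lines copied from the hit record in the sample above.
        List<String> hit = List.of(
            ">>EM_OV:GGU87449 U87449 GALLUS GALLUS OPSIN GENE, COMPLETE CDS.",
            "; fa_initn:  81",
            "; fa_init1:  51",
            "; fa_opt: 112",
            "; fa_z-score: 107.3",
            "; fa_expect:    1.8",
            "; fa_ident: 0.627",
            "; fa_overlap: 83");

        Map<String, String> fields =
            parseFields(hit, Set.of("fa_opt", "fa_expect", "fa_ident"));
        System.out.println(fields);
    }
}
```

Only the three requested fields end up in the map; fa_initn, fa_z-score
and the rest are read and dropped. If a later FASTA release added, say,
a new "; fa_something:" line, this code would carry on unchanged, which
is exactly the property the human-readable blast report lacks.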