[Biojava-l] Blast parser?

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Wed, 16 Feb 2000 18:41:21 +0000



Forwarded message from Terry (my fault 'cos I replied to him, and not to
the list as well).

--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com


Date: Wed, 16 Feb 2000 09:37:03 -0800 (PST)
From: Jeffrey Chang <jchang@SMI.Stanford.EDU>
Sender: jchang@crg-gw.Stanford.EDU
To: Simon Brocklehurst <simon.brocklehurst@cambridgeantibody.com>
cc: biojava-l@biojava.org
Subject: Re: [Biojava-l] status?  Blast parser?

Hello all,

On Wed, 16 Feb 2000, Simon Brocklehurst wrote:
> 
> It's not hard to add SAX functionality to our systems, and if the
> consensus view of people on the list is that we should go ahead and
> use CAT's code as the basis for the biojava BLAST parser, we will
> definitely implement that fairly quickly. If we get to that point,
> we'll need to agree on some standard meta data conventions and I'll
> post a proposal (e.g. we are keen for this software to work with
> software that has generic BLAST-like output (e.g. HMMER) with the
> minimum of effort, so our proposal would probably reflect that).

The biopython parsers are built around a SAX-like event model like the one
you've described.  The design discussions are archived in the list threads
from November and December 1999:
http://www.biopython.org/pipermail/biopython/1999-November/thread.html
http://www.biopython.org/pipermail/biopython/1999-December/thread.html

The final design is documented within the CVS tree, but it's relatively
long, so I won't post it here.  Basically, it's built around a
Scanner/Consumer model where a Scanner object goes through a stream,
recognizes content, and passes it into a Consumer object that does the
final processing.  A Parser object then contains both a Scanner and a
Consumer, and thus can take an input stream and process it into some
final data structure.

It's relatively flexible, as you can substitute different consumers
depending on what kind of data you want.
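
As a rough illustration of that arrangement, here is a minimal Python
sketch.  It is not the biopython code: the class names, the toy
"name: value" input format, and the choice of the two events `title` and
`score` are all assumptions made for the example.

```python
import io


class ToyScanner:
    """Walks a stream line by line and fires one event per recognized line.

    Hypothetical input format: lines such as 'title: ...' or 'score: ...'.
    """

    EVENTS = ("title", "score")  # the only events this toy scanner knows

    def feed(self, handle, consumer):
        for line in handle:
            key, sep, value = line.partition(":")
            key = key.strip()
            if not sep or key not in self.EVENTS:
                continue  # unrecognized content is skipped
            handler = getattr(consumer, key, None)
            if callable(handler):
                handler(value.strip())


class TitleListConsumer:
    """Keeps only the hit titles."""

    def __init__(self):
        self.data = []

    def title(self, text):
        self.data.append(text)


class BestScoreConsumer:
    """Keeps only the highest score seen."""

    def __init__(self):
        self.data = None

    def score(self, text):
        value = float(text)
        if self.data is None or value > self.data:
            self.data = value


class ToyParser:
    """Bundles a scanner with a consumer and returns the consumer's result."""

    def __init__(self, consumer_factory):
        self._scanner = ToyScanner()
        self._consumer_factory = consumer_factory

    def parse(self, handle):
        consumer = self._consumer_factory()
        self._scanner.feed(handle, consumer)
        return consumer.data


SAMPLE = "title: hit A\nscore: 42.5\ntitle: hit B\nscore: 17\n"

titles = ToyParser(TitleListConsumer).parse(io.StringIO(SAMPLE))
best = ToyParser(BestScoreConsumer).parse(io.StringIO(SAMPLE))
```

The same scanner serves both parsers; only the consumer changes, which is
the substitution described above.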

We've already decided upon a meta-data convention for BLAST content.  My
feeling is that you'll run into trouble if you try to have one standard for
all similarity algorithms.  You'll be better off creating a standard
specifically for each algorithm, and then a more general one that they can
map into.
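
One way to picture this is a vocabulary per algorithm plus a table that maps
each onto a smaller shared vocabulary.  The sketch below is only an
illustration: the general names (`hit_name`, `hit_score`) and the HMMER-side
names are invented for the example; only the BLAST-side names `title`,
`score` and `frame` are actual BLAST scanner events.

```python
# Hypothetical mapping tables: algorithm-specific event -> general event.
BLAST_TO_GENERAL = {"title": "hit_name", "score": "hit_score"}
HMMER_TO_GENERAL = {"model": "hit_name", "bits": "hit_score"}  # names invented


def generalize(events, mapping):
    """Translate (name, value) events into the general vocabulary,
    dropping events that have no general equivalent."""
    return [(mapping[name], value) for name, value in events if name in mapping]


# Hand-written sample events, not real program output.
blast_events = [("title", "hit A"), ("score", "42.5"), ("frame", "+1")]
hmmer_events = [("model", "globin"), ("bits", "88.1")]

general_from_blast = generalize(blast_events, BLAST_TO_GENERAL)
general_from_hmmer = generalize(hmmer_events, HMMER_TO_GENERAL)
```

Algorithm-specific details such as `frame` stay available to consumers that
want them, while generic tools only look at the mapped events.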

Jeff




BLAST Scanners produce the following events:
SECTION NAME                      COMMENTS
    EVENT NAME

header
    version
    reference
    query_info
    database_info

descriptions
    round                         psi blast
    model_sequences               psi blast
    nonmodel_sequences            psi blast
    converged                     psi blast
    description
    no_hits

alignment
    multalign                     master-slave
    title                         pairwise
    length                        pairwise
  hsp
    score                         pairwise
    identities                    pairwise
    strand                        pairwise, blastn
    frame                         pairwise, blastx, tblastn, tblastx
    query                         pairwise
    align                         pairwise
    sbjct                         pairwise

database_report
    database
    posted_date
    num_letters_in_database
    num_sequences_in_database
    num_letters_searched          RESERVED.  Currently unused.
    num_sequences_searched        RESERVED.  Currently unused.
                                  (I've never seen either one, but they're
                                  in blastool.c.)
    ka_params
    gapped                        not blastp
    ka_params_gap                 gapped mode (not tblastx)

parameters
    matrix
    gap_penalties                 gapped mode (not tblastx)
    num_hits                      
    num_sequences                 
    num_extends                   
    num_good_extends              
    num_seqs_better_e
    hsps_no_gap                   gapped (not tblastx) and not blastn
    hsps_prelim_gapped            gapped (not tblastx) and not blastn
    hsps_prelim_gap_attempted     gapped (not tblastx) and not blastn
    hsps_gapped                   gapped (not tblastx) and not blastn
    query_length
    database_length
    effective_hsp_length
    effective_query_length
    effective_database_length
    effective_search_space
    effective_search_space_used
    frameshift                    blastx or tblastn or tblastx
    threshold
    window_size
    dropoff_1st_pass
    gap_x_dropoff
    gap_x_dropoff_final           gapped (not tblastx) and not blastn
    gap_trigger
    blast_cutoff
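
With this many events, a consumer that cares about only a few of them needs
some way to ignore the rest.  One option, sketched below in Python, is a base
class whose `__getattr__` hands back a do-nothing handler for any event the
subclass does not define.  The class names and the hand-written event
sequence are assumptions for illustration; only the event names come from
the table above.

```python
class IgnoreUnhandled:
    """Base consumer: any event without a method becomes a no-op."""

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return lambda *args: None


class MatrixAndQueryLength(IgnoreUnhandled):
    """Records just two events from the parameters section."""

    def __init__(self):
        self.seen = {}

    def matrix(self, text):
        self.seen["matrix"] = text

    def query_length(self, text):
        self.seen["query_length"] = int(text)


# Hand-written stand-in for what a scanner might emit; values are made up.
events = [
    ("matrix", "BLOSUM62"),
    ("num_hits", "1000"),
    ("query_length", "120"),
    ("gap_trigger", "22.0"),
]

consumer = MatrixAndQueryLength()
for name, text in events:
    getattr(consumer, name)(text)
```

After the loop, `consumer.seen` holds only the two events the subclass
defined; `num_hits` and `gap_trigger` were accepted and discarded.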



