[BioPython] import Standalone problems
Rohini Damle
rohini.damle at gmail.com
Wed Jul 19 19:09:06 UTC 2006
Hi,
Can someone suggest me for which version of Blast, the Biopython's
(text or xml) parser works fine?
I will download that blast version locally and can use biopython's parser.
thanx,
Rohini
On 7/18/06, Jacob Joseph <jacob at jjoseph.org> wrote:
> Hi.
> I encountered similar difficulties over the past few days myself and
> have made some improvements to the XML parser. Well, that is, it now
> functions with blastall, but I have made no effort to parse the other
> blast programs. I do not expect I have done any harm to other parsing,
> however.
>
> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
> yet spent significant time to clean up my changes. Without getting into
> specific modifications, I have made an effort to make consistent the
> variables in Record and NCBIXML, focusing primarily on what I needed
> this week.
>
> One portion I am not settled on reinitialization of Record.Blast at
> every call to iterator.next(), and, by extension, BlastParser.parse().
> See NCBIXML.py, line 114. Without re-initializing this class, we run
> the risk of retaining portions of a Record from previously parsed
> queries. This causes the bug 1970, mentioned below. Unfortunately,
> this re-initialization exacts a significant performance penalty of at
> least a factor of 10 by some rough measures. I would appreciate any
> suggestions for improvement here.
>
> I do apologize for not being more specific about my changes. When I get
> a chance(next week?), I will package them up as a proper patch and file
> a bug. Perhaps what I have done so far will be of use until then.
>
> fyi, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
> not have separate <?xml> blocks within its output, requiring a different
> method of iteration.
>
> -Jacob
>
> Peter wrote:
> > Rohini Damle wrote:
> >> Hi,
> >> I have a XML file with 4 blast records (for proteins P1, P2, P3, P4)
> >> I am trying to extract alignment information for each of them.
> >> So I wrote the following code:
> >>
> >> for b_record in b_iterator :
> >>
> >> E_VALUE_THRESH =20
> >> for alignment in b_record.alignments:
> >> for hsp in alignment.hsps:
> >> if hsp.expect< E_VALUE_THRESH:
> >>
> >> print '****Alignment****'
> >> print 'sequence:', alignment.title.split()[0]
> >>
> >> With this code, I am getting information for P1,
> >> then information for P1 + P2
> >> then for P1+P2 +P3
> >> and finally for P1+P2+P3+P4
> >> why this is so?
> >> is there something wrong with the looping?
> >
> > I'm aware of something funny with the XML parsing, Bug 1970, which might
> > well be the same issue:
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> >
> > I confess I haven't looked into exactly what is going wrong here - too
> > many other demands on my time to learn about XML and how BioPython
> > parses it.
> >
> > Does the work around on the bug report help? Depending on which version
> > of standalone blast you have installed, you might have better luck with
> > plain text output - the trouble is this is a moving target and the NBCI
> > keeps tweaking it.
> >
> > Peter
> >
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
> # Patches by Mike Poidinger to support multiple databases.
>
> """
> This module provides code to work with the standalone version of
> BLAST, either blastall or blastpgp, provided by the NCBI.
> http://www.ncbi.nlm.nih.gov/BLAST/
>
> Classes:
> LowQualityBlastError Except that indicates low quality query sequences.
> BlastParser Parses output from blast.
> BlastErrorParser Parses output and tries to diagnose possible errors.
> PSIBlastParser Parses output from psi-blast.
> Iterator Iterates over a file of blast results.
>
> _Scanner Scans output from standalone BLAST.
> _BlastConsumer Consumes output from blast.
> _PSIBlastConsumer Consumes output from psi-blast.
> _HeaderConsumer Consumes header information.
> _DescriptionConsumer Consumes description information.
> _AlignmentConsumer Consumes alignment information.
> _HSPConsumer Consumes hsp information.
> _DatabaseReportConsumer Consumes database report information.
> _ParametersConsumer Consumes parameters information.
>
> Functions:
> blastall Execute blastall.
> blastpgp Execute blastpgp.
> rpsblast Execute rpsblast.
>
> """
>
> from __future__ import generators
> import os
> import re
>
> from Bio import File
> from Bio.ParserSupport import *
> from Bio.Blast import Record
>
>
> class LowQualityBlastError(Exception):
> """Error caused by running a low quality sequence through BLAST.
>
> When low quality sequences (like GenBank entries containing only
> stretches of a single nucleotide) are BLASTed, they will result in
> BLAST generating an error and not being able to perform the BLAST.
> search. This error should be raised for the BLAST reports produced
> in this case.
> """
> pass
>
> class ShortQueryBlastError(Exception):
> """Error caused by running a short query sequence through BLAST.
>
> If the query sequence is too short, BLAST outputs warnings and errors:
> Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch failed.
> [blastall] ERROR: [000.000] AT1G08320: Blast:
> [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at least wordsize
> done
>
> This exception is raised when that condition is detected.
>
> """
> pass
>
>
> class _Scanner:
> """Scan BLAST output from blastall or blastpgp.
>
> Tested with blastall and blastpgp v2.0.10, v2.0.11
>
> Methods:
> feed Feed data into the scanner.
>
> """
> def feed(self, handle, consumer):
> """S.feed(handle, consumer)
>
> Feed in a BLAST report for scanning. handle is a file-like
> object that contains the BLAST report. consumer is a Consumer
> object that will receive events as the report is scanned.
>
> """
> if isinstance(handle, File.UndoHandle):
> uhandle = handle
> else:
> uhandle = File.UndoHandle(handle)
>
> # Try to fast-forward to the beginning of the blast report.
> read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
> # Now scan the BLAST report.
> self._scan_header(uhandle, consumer)
> self._scan_rounds(uhandle, consumer)
> self._scan_database_report(uhandle, consumer)
> self._scan_parameters(uhandle, consumer)
>
> def _scan_header(self, uhandle, consumer):
> # BLASTP 2.0.10 [Aug-26-1999]
> #
> #
> # Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaf
> # Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> # "Gapped BLAST and PSI-BLAST: a new generation of protein database sea
> # programs", Nucleic Acids Res. 25:3389-3402.
> #
> # Query= test
> # (140 letters)
> #
> # Database: sdqib40-1.35.seg.fa
> # 1323 sequences; 223,339 total letters
> #
>
> consumer.start_header()
>
> read_and_call(uhandle, consumer.version, contains='BLAST')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the reference lines and the following blank line.
> # There might be a <pre> line, for qblast output.
> attempt_read_and_call(uhandle, consumer.noevent, start="<pre>")
> read_and_call(uhandle, consumer.reference, start='Reference')
> while 1:
> line = uhandle.readline()
> if is_blank_line(line) or line.startswith("RID"):
> consumer.noevent(line)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> break
> consumer.reference(line)
>
> # blastpgp has a Reference for composition-based statistics.
> if attempt_read_and_call(
> uhandle, consumer.reference, start="Reference"):
> read_and_call_until(uhandle, consumer.reference, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the Query lines and the following blank line.
> read_and_call(uhandle, consumer.query_info, start='Query=')
> read_and_call_until(uhandle, consumer.query_info, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the database lines and the following blank line.
> read_and_call_until(uhandle, consumer.database_info, end='total letters')
> read_and_call(uhandle, consumer.database_info, contains='sequences')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_header()
>
> def _scan_rounds(self, uhandle, consumer):
> # Scan a bunch of rounds.
> # Each round begins with either a "Searching......" line
> # or a 'Score E' line followed by descriptions and alignments.
> # The email server doesn't give the "Searching....." line.
> # If there is no 'Searching.....' line then you'll first see a
> # 'Results from round' line
>
> while 1:
> line = safe_peekline(uhandle)
> if (not line.startswith('Searching') and
> not line.startswith('Results from round') and
> re.search(r"Score +E", line) is None and
> line.find('No hits found') == -1):
> break
>
> self._scan_descriptions(uhandle, consumer)
> self._scan_alignments(uhandle, consumer)
>
> def _scan_descriptions(self, uhandle, consumer):
> # Searching..................................................done
> # Results from round 2
> #
> #
> # Sc
> # Sequences producing significant alignments: (b
> # Sequences used in model and found again:
> #
> # d1tde_2 3.4.1.4.4 (119-244) Thioredoxin reductase [Escherichia ...
> # d1tcob_ 1.31.1.5.16 Calcineurin regulatory subunit (B-chain) [B...
> # d1symb_ 1.31.1.2.2 Calcyclin (S100) [RAT (RATTUS NORVEGICUS)]
> #
> # Sequences not found previously or not previously below threshold:
> #
> # d1osa__ 1.31.1.5.11 Calmodulin [Paramecium tetraurelia]
> # d1aoza3 2.5.1.3.3 (339-552) Ascorbate oxidase [zucchini (Cucurb...
> #
>
> # If PSI-BLAST, may also have:
> #
> # CONVERGED!
>
> consumer.start_descriptions()
>
> # Read 'Searching'
> # This line seems to be missing in BLASTN 2.1.2 (others?)
> attempt_read_and_call(uhandle, consumer.noevent, start='Searching')
>
> # blastpgp 2.0.10 from NCBI 9/19/99 for Solaris sometimes crashes here.
> # If this happens, the handle will yield no more information.
> if not uhandle.peekline():
> raise SyntaxError, "Unexpected end of blast report. " + \
> "Looks suspiciously like a PSI-BLAST crash."
>
> # BLASTN 2.2.3 sometimes spews a bunch of warnings and errors here:
> # Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch
> # [blastall] ERROR: [000.000] AT1G08320: Blast:
> # [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at leas
> # done
> # Reported by David Weisman.
> # Check for these error lines and ignore them for now. Let
> # the BlastErrorParser deal with them.
> line = uhandle.peekline()
> if line.find("ERROR:") != -1 or line.startswith("done"):
> read_and_call_while(uhandle, consumer.noevent, contains="ERROR:")
> read_and_call(uhandle, consumer.noevent, start="done")
>
> # Check to see if this is PSI-BLAST.
> # If it is, the 'Searching' line will be followed by:
> # (version 2.0.10)
> # Searching.............................
> # Results from round 2
> # or (version 2.0.11)
> # Searching.............................
> #
> #
> # Results from round 2
>
> # Skip a bunch of blank lines.
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> # Check for the results line if it's there.
> if attempt_read_and_call(uhandle, consumer.round, start='Results'):
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Three things can happen here:
> # 1. line contains 'Score E'
> # 2. line contains "No hits found"
> # 3. no descriptions
> # The first one begins a bunch of descriptions. The last two
> # indicates that no descriptions follow, and we should go straight
> # to the alignments.
> if not attempt_read_and_call(
> uhandle, consumer.description_header,
> has_re=re.compile(r'Score +E')):
> # Either case 2 or 3. Look for "No hits found".
> attempt_read_and_call(uhandle, consumer.no_hits,
> contains='No hits found')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> #Psiblast can repeat the Searching...No hits found section
> if attempt_read_and_call(uhandle, consumer.noevent,
> start='Searching'):
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> read_and_call(uhandle, consumer.noevent,
> contains='No hits found')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_descriptions()
> # Stop processing.
> return
>
> # Read the score header lines
> read_and_call(uhandle, consumer.description_header,
> start='Sequences producing')
>
> # If PSI-BLAST, read the 'Sequences used in model' line.
> attempt_read_and_call(uhandle, consumer.model_sequences,
> start='Sequences used in model')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the descriptions and the following blank lines, making
> # sure that there are descriptions.
> if not uhandle.peekline().startswith('Sequences not found'):
> read_and_call_until(uhandle, consumer.description, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # If PSI-BLAST, read the 'Sequences not found' line followed
> # by more descriptions. However, I need to watch out for the
> # case where there were no sequences not found previously, in
> # which case there will be no more descriptions.
> if attempt_read_and_call(uhandle, consumer.nonmodel_sequences,
> start='Sequences not found'):
> # Read the descriptions and the following blank lines.
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> l = safe_peekline(uhandle)
> # Brad -- added check for QUERY. On some PSI-BLAST outputs
> # there will be a 'Sequences not found' line followed by no
> # descriptions. Check for this case since the first thing you'll
> # get is a blank line and then 'QUERY'
> if not l.startswith('CONVERGED') and l[0] != '>' \
> and not l.startswith('QUERY'):
> read_and_call_until(uhandle, consumer.description, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> attempt_read_and_call(uhandle, consumer.converged, start='CONVERGED')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_descriptions()
>
> def _scan_alignments(self, uhandle, consumer):
> # qblast inserts a helpful line here.
> attempt_read_and_call(uhandle, consumer.noevent, start="ALIGNMENTS")
>
> # First, check to see if I'm at the database report.
> line = safe_peekline(uhandle)
> if line.startswith(' Database'):
> return
> elif line[0] == '>':
> # XXX make a better check here between pairwise and masterslave
> self._scan_pairwise_alignments(uhandle, consumer)
> else:
> # XXX put in a check to make sure I'm in a masterslave alignment
> self._scan_masterslave_alignment(uhandle, consumer)
>
> def _scan_pairwise_alignments(self, uhandle, consumer):
> while 1:
> line = safe_peekline(uhandle)
> if line[0] != '>':
> break
> self._scan_one_pairwise_alignment(uhandle, consumer)
>
> def _scan_one_pairwise_alignment(self, uhandle, consumer):
> consumer.start_alignment()
>
> self._scan_alignment_header(uhandle, consumer)
>
> # Scan a bunch of score/alignment pairs.
> while 1:
> line = safe_peekline(uhandle)
> if not line.startswith(' Score'):
> break
> self._scan_hsp(uhandle, consumer)
> consumer.end_alignment()
>
> def _scan_alignment_header(self, uhandle, consumer):
> # >d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus
> # stearothermophilus]
> # Length = 81
> #
> read_and_call(uhandle, consumer.title, start='>')
> while 1:
> line = safe_readline(uhandle)
> if line.lstrip().startswith('Length ='):
> consumer.length(line)
> break
> elif is_blank_line(line):
> # Check to make sure I haven't missed the Length line
> raise SyntaxError, "I missed the Length in an alignment header"
> consumer.title(line)
>
> # Older versions of BLAST will have a line with some spaces.
> # Version 2.0.14 (maybe 2.0.13?) and above print a true blank line.
> if not attempt_read_and_call(uhandle, consumer.noevent,
> start=' '):
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> def _scan_hsp(self, uhandle, consumer):
> consumer.start_hsp()
> self._scan_hsp_header(uhandle, consumer)
> self._scan_hsp_alignment(uhandle, consumer)
> consumer.end_hsp()
>
> def _scan_hsp_header(self, uhandle, consumer):
> # Score = 22.7 bits (47), Expect = 2.5
> # Identities = 10/36 (27%), Positives = 18/36 (49%)
> # Strand = Plus / Plus
> # Frame = +3
> #
>
> read_and_call(uhandle, consumer.score, start=' Score')
> read_and_call(uhandle, consumer.identities, start=' Identities')
> # BLASTN
> attempt_read_and_call(uhandle, consumer.strand, start = ' Strand')
> # BLASTX, TBLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.frame, start = ' Frame')
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> def _scan_hsp_alignment(self, uhandle, consumer):
> # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
> # GRGVS+ TC Y + + V GGG+ + EE L + I R+
> # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
> #
> # Query: 64 AEKILIKR 71
> # I +K
> # Sbjct: 70 PNIIQLKD 77
> #
>
> while 1:
> # Blastn adds an extra line filled with spaces before Query
> attempt_read_and_call(uhandle, consumer.noevent, start=' ')
> read_and_call(uhandle, consumer.query, start='Query')
> read_and_call(uhandle, consumer.align, start=' ')
> read_and_call(uhandle, consumer.sbjct, start='Sbjct')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> line = safe_peekline(uhandle)
> # Alignment continues if I see a 'Query' or the spaces for Blastn.
> if not (line.startswith('Query') or line.startswith(' ')):
> break
>
> def _scan_masterslave_alignment(self, uhandle, consumer):
> consumer.start_alignment()
> while 1:
> line = safe_readline(uhandle)
> # Check to see whether I'm finished reading the alignment.
> # This is indicated by 1) database section, 2) next psi-blast
> # round, which can also be a 'Results from round' if no
> # searching line is present
> # patch by chapmanb
> if line.startswith('Searching') or \
> line.startswith('Results from round'):
> uhandle.saveline(line)
> break
> elif line.startswith(' Database'):
> uhandle.saveline(line)
> break
> elif is_blank_line(line):
> consumer.noevent(line)
> else:
> consumer.multalign(line)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> consumer.end_alignment()
>
> def _scan_database_report(self, uhandle, consumer):
> # Database: sdqib40-1.35.seg.fa
> # Posted date: Nov 1, 1999 4:25 PM
> # Number of letters in database: 223,339
> # Number of sequences in database: 1323
> #
> # Lambda K H
> # 0.322 0.133 0.369
> #
> # Gapped
> # Lambda K H
> # 0.270 0.0470 0.230
> #
>
> consumer.start_database_report()
>
> # Subset of the database(s) listed below
> # Number of letters searched: 562,618,960
> # Number of sequences searched: 228,924
> if attempt_read_and_call(uhandle, consumer.noevent, start=" Subset"):
> read_and_call(uhandle, consumer.noevent, contains="letters")
> read_and_call(uhandle, consumer.noevent, contains="sequences")
> read_and_call(uhandle, consumer.noevent, start=" ")
>
> # Sameet Mehta reported seeing output from BLASTN 2.2.9 that
> # was missing the "Database" stanza completely.
> while attempt_read_and_call(uhandle, consumer.database,
> start=' Database'):
> # BLAT output ends abruptly here, without any of the other
> # information. Check to see if this is the case. If so,
> # then end the database report here gracefully.
> if not uhandle.peekline():
> consumer.end_database_report()
> return
>
> # Database can span multiple lines.
> read_and_call_until(uhandle, consumer.database, start=' Posted')
> read_and_call(uhandle, consumer.posted_date, start=' Posted')
> read_and_call(uhandle, consumer.num_letters_in_database,
> start=' Number of letters')
> read_and_call(uhandle, consumer.num_sequences_in_database,
> start=' Number of sequences')
> read_and_call(uhandle, consumer.noevent, start=' ')
>
> line = safe_readline(uhandle)
> uhandle.saveline(line)
> if line.find('Lambda') != -1:
> break
>
> read_and_call(uhandle, consumer.noevent, start='Lambda')
> read_and_call(uhandle, consumer.ka_params)
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> # not BLASTP
> attempt_read_and_call(uhandle, consumer.gapped, start='Gapped')
> # not TBLASTX
> if attempt_read_and_call(uhandle, consumer.noevent, start='Lambda'):
> read_and_call(uhandle, consumer.ka_params_gap)
>
> # Blast 2.2.4 can sometimes skip the whole parameter section.
> # Thus, I need to be careful not to read past the end of the
> # file.
> try:
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> except SyntaxError, x:
> if str(x) != "Unexpected end of stream.":
> raise
> consumer.end_database_report()
>
> def _scan_parameters(self, uhandle, consumer):
> # Matrix: BLOSUM62
> # Gap Penalties: Existence: 11, Extension: 1
> # Number of Hits to DB: 50604
> # Number of Sequences: 1323
> # Number of extensions: 1526
> # Number of successful extensions: 6
> # Number of sequences better than 10.0: 5
> # Number of HSP's better than 10.0 without gapping: 5
> # Number of HSP's successfully gapped in prelim test: 0
> # Number of HSP's that attempted gapping in prelim test: 1
> # Number of HSP's gapped (non-prelim): 5
> # length of query: 140
> # length of database: 223,339
> # effective HSP length: 39
> # effective length of query: 101
> # effective length of database: 171,742
> # effective search space: 17345942
> # effective search space used: 17345942
> # T: 11
> # A: 40
> # X1: 16 ( 7.4 bits)
> # X2: 38 (14.8 bits)
> # X3: 64 (24.9 bits)
> # S1: 41 (21.9 bits)
> # S2: 42 (20.8 bits)
>
> # Blast 2.2.4 can sometimes skip the whole parameter section.
> # Thus, check to make sure that the parameter section really
> # exists.
> if not uhandle.peekline():
> return
>
> # BLASTN 2.2.9 looks like it reverses the "Number of Hits" and
> # "Number of Sequences" lines.
> consumer.start_parameters()
>
> # Matrix line may be missing in BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.matrix, start='Matrix')
> # not TBLASTX
> attempt_read_and_call(uhandle, consumer.gap_penalties, start='Gap')
>
> attempt_read_and_call(uhandle, consumer.num_sequences,
> start='Number of Sequences')
> read_and_call(uhandle, consumer.num_hits,
> start='Number of Hits')
> attempt_read_and_call(uhandle, consumer.num_sequences,
> start='Number of Sequences')
> read_and_call(uhandle, consumer.num_extends,
> start='Number of extensions')
> read_and_call(uhandle, consumer.num_good_extends,
> start='Number of successful')
>
> read_and_call(uhandle, consumer.num_seqs_better_e,
> start='Number of sequences')
>
> # not BLASTN, TBLASTX
> if attempt_read_and_call(uhandle, consumer.hsps_no_gap,
> start="Number of HSP's better"):
> # BLASTN 2.2.9
> if attempt_read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's gapped:"):
> read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.noevent,
> start="Number of extra gapped extensions")
> else:
> read_and_call(uhandle, consumer.hsps_prelim_gapped,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.hsps_prelim_gap_attempted,
> start="Number of HSP's that")
> read_and_call(uhandle, consumer.hsps_gapped,
> start="Number of HSP's gapped")
> # not in blastx 2.2.1
> attempt_read_and_call(uhandle, consumer.query_length,
> has_re=re.compile(r"[Ll]ength of query"))
> read_and_call(uhandle, consumer.database_length,
> has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase"))
>
> # BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.noevent,
> start="Length adjustment")
> attempt_read_and_call(uhandle, consumer.effective_hsp_length,
> start='effective HSP')
> # Not in blastx 2.2.1
> attempt_read_and_call(
> uhandle, consumer.effective_query_length,
> has_re=re.compile(r'[Ee]ffective length of query'))
> read_and_call(
> uhandle, consumer.effective_database_length,
> has_re=re.compile(r'[Ee]ffective length of \s*[Dd]atabase'))
> # Not in blastx 2.2.1, added a ':' to distinguish between
> # this and the 'effective search space used' line
> attempt_read_and_call(
> uhandle, consumer.effective_search_space,
> has_re=re.compile(r'[Ee]ffective search space:'))
> # Does not appear in BLASTP 2.0.5
> attempt_read_and_call(
> uhandle, consumer.effective_search_space_used,
> has_re=re.compile(r'[Ee]ffective search space used'))
>
> # BLASTX, TBLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.frameshift, start='frameshift')
> # not in BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.threshold, start='T')
> read_and_call(uhandle, consumer.window_size, start='A')
> read_and_call(uhandle, consumer.dropoff_1st_pass, start='X1')
> read_and_call(uhandle, consumer.gap_x_dropoff, start='X2')
> # not BLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
> start='X3')
> read_and_call(uhandle, consumer.gap_trigger, start='S1')
> # not in blastx 2.2.1
> # first we make sure we have additional lines to work with, if
> # not then the file is done and we don't have a final S2
> if not is_blank_line(uhandle.peekline(), allow_spaces=1):
> read_and_call(uhandle, consumer.blast_cutoff, start='S2')
>
> consumer.end_parameters()
>
> class BlastParser(AbstractParser):
> """Parses BLAST data into a Record.Blast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _BlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class PSIBlastParser(AbstractParser):
> """Parses BLAST data into a Record.PSIBlast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _PSIBlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class _HeaderConsumer:
> def start_header(self):
> self._header = Record.Header()
>
> def version(self, line):
> c = line.split()
> self._header.application = c[0]
> self._header.version = c[1]
> self._header.date = c[2][1:-1]
>
> def reference(self, line):
> if line.startswith('Reference: '):
> self._header.reference = line[11:]
> else:
> self._header.reference = self._header.reference + line
>
> def query_info(self, line):
> if line.startswith('Query= '):
> self._header.query = line[7:]
> elif not line.startswith(' '): # continuation of query_info
> self._header.query = "%s%s" % (self._header.query, line)
> else:
> letters, = _re_search(
> r"([0-9,]+) letters", line,
> "I could not find the number of letters in line\n%s" % line)
> self._header.query_letters = _safe_int(letters)
>
> def database_info(self, line):
> line = line.rstrip()
> if line.startswith('Database: '):
> self._header.database = line[10:]
> elif not line.endswith('total letters'):
> self._header.database = self._header.database + line.strip()
> else:
> sequences, letters =_re_search(
> r"([0-9,]+) sequences; ([0-9,-]+) total letters", line,
> "I could not find the sequences and letters in line\n%s" %line)
> self._header.database_sequences = _safe_int(sequences)
> self._header.database_letters = _safe_int(letters)
>
> def end_header(self):
> # Get rid of the trailing newlines
> self._header.reference = self._header.reference.rstrip()
> self._header.query = self._header.query.rstrip()
>
> class _DescriptionConsumer:
> def start_descriptions(self):
> self._descriptions = []
> self._model_sequences = []
> self._nonmodel_sequences = []
> self._converged = 0
> self._type = None
> self._roundnum = None
>
> self.__has_n = 0 # Does the description line contain an N value?
>
> def description_header(self, line):
> if line.startswith('Sequences producing'):
> cols = line.split()
> if cols[-1] == 'N':
> self.__has_n = 1
>
> def description(self, line):
> dh = self._parse(line)
> if self._type == 'model':
> self._model_sequences.append(dh)
> elif self._type == 'nonmodel':
> self._nonmodel_sequences.append(dh)
> else:
> self._descriptions.append(dh)
>
> def model_sequences(self, line):
> self._type = 'model'
>
> def nonmodel_sequences(self, line):
> self._type = 'nonmodel'
>
> def converged(self, line):
> self._converged = 1
>
> def no_hits(self, line):
> pass
>
> def round(self, line):
> if not line.startswith('Results from round'):
> raise SyntaxError, "I didn't understand the round line\n%s" % line
> self._roundnum = _safe_int(line[18:].strip())
>
> def end_descriptions(self):
> pass
>
> def _parse(self, description_line):
> line = description_line # for convenience
> dh = Record.Description()
>
> # I need to separate the score and p-value from the title.
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77 1
> # special cases to handle:
> # - title must be preserved exactly (including whitespaces)
> # - score could be equal to e-value (not likely, but what if??)
> # - sometimes there's an "N" score of '1'.
> cols = line.split()
> if len(cols) < 3:
> raise SyntaxError, \
> "Line does not appear to contain description:\n%s" % line
> if self.__has_n:
> i = line.rfind(cols[-1]) # find start of N
> i = line.rfind(cols[-2], 0, i) # find start of p-value
> i = line.rfind(cols[-3], 0, i) # find start of score
> else:
> i = line.rfind(cols[-1]) # find start of p-value
> i = line.rfind(cols[-2], 0, i) # find start of score
> if self.__has_n:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-3], cols[-2], cols[-1]
> else:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-2], cols[-1], 1
> dh.num_alignments = _safe_int(dh.num_alignments)
> dh.score = _safe_int(dh.score)
> dh.e = _safe_float(dh.e)
> return dh
>
> class _AlignmentConsumer:
> # This is a little bit tricky. An alignment can either be a
> # pairwise alignment or a multiple alignment. Since it's difficult
> # to know a-priori which one the blast record will contain, I'm going
> # to make one class that can parse both of them.
> def start_alignment(self):
> self._alignment = Record.Alignment()
> self._multiple_alignment = Record.MultipleAlignment()
>
> def title(self, line):
> self._alignment.title = "%s%s" % (self._alignment.title,
> line.lstrip())
>
> def length(self, line):
> self._alignment.length = line.split()[2]
> self._alignment.length = _safe_int(self._alignment.length)
>
> def multalign(self, line):
> # Standalone version uses 'QUERY', while WWW version uses blast_tmp.
> if line.startswith('QUERY') or line.startswith('blast_tmp'):
> # If this is the first line of the multiple alignment,
> # then I need to figure out how the line is formatted.
>
> # Format of line is:
> # QUERY 1 acttg...gccagaggtggtttattcagtctccataagagaggggacaaacg 60
> try:
> name, start, seq, end = line.split()
> except ValueError:
> raise SyntaxError, "I do not understand the line\n%s" \
> % line
> self._start_index = line.index(start, len(name))
> self._seq_index = line.index(seq,
> self._start_index+len(start))
> # subtract 1 for the space
> self._name_length = self._start_index - 1
> self._start_length = self._seq_index - self._start_index - 1
> self._seq_length = line.rfind(end) - self._seq_index - 1
>
> #self._seq_index = line.index(seq)
> ## subtract 1 for the space
> #self._seq_length = line.rfind(end) - self._seq_index - 1
> #self._start_index = line.index(start)
> #self._start_length = self._seq_index - self._start_index - 1
> #self._name_length = self._start_index
>
> # Extract the information from the line
> name = line[:self._name_length]
> name = name.rstrip()
> start = line[self._start_index:self._start_index+self._start_length]
> start = start.rstrip()
> if start:
> start = _safe_int(start)
> end = line[self._seq_index+self._seq_length:].rstrip()
> if end:
> end = _safe_int(end)
> seq = line[self._seq_index:self._seq_index+self._seq_length].rstrip()
> # right pad the sequence with spaces if necessary
> if len(seq) < self._seq_length:
> seq = seq + ' '*(self._seq_length-len(seq))
>
> # I need to make sure the sequence is aligned correctly with the query.
> # First, I will find the length of the query. Then, if necessary,
> # I will pad my current sequence with spaces so that they will line
> # up correctly.
>
> # Two possible things can happen:
> # QUERY
> # 504
> #
> # QUERY
> # 403
> #
> # Sequence 504 will need padding at the end. Since I won't know
> # this until the end of the alignment, this will be handled in
> # end_alignment.
> # Sequence 403 will need padding before being added to the alignment.
>
> align = self._multiple_alignment.alignment # for convenience
> align.append((name, start, seq, end))
>
> # This is old code that tried to line up all the sequences
> # in a multiple alignment by using the sequence title's as
> # identifiers. The problem with this is that BLAST assigns
> # different HSP's from the same sequence the same id. Thus,
> # in one alignment block, there may be multiple sequences with
> # the same id. I'm not sure how to handle this, so I'm not
> # going to.
>
> # # If the sequence is the query, then just add it.
> # if name == 'QUERY':
> # if len(align) == 0:
> # align.append((name, start, seq))
> # else:
> # aname, astart, aseq = align[0]
> # if name != aname:
> # raise SyntaxError, "Query is not the first sequence"
> # aseq = aseq + seq
> # align[0] = aname, astart, aseq
> # else:
> # if len(align) == 0:
> # raise SyntaxError, "I could not find the query sequence"
> # qname, qstart, qseq = align[0]
> #
> # # Now find my sequence in the multiple alignment.
> # for i in range(1, len(align)):
> # aname, astart, aseq = align[i]
> # if name == aname:
> # index = i
> # break
> # else:
> # # If I couldn't find it, then add a new one.
> # align.append((None, None, None))
> # index = len(align)-1
> # # Make sure to left-pad it.
> # aname, astart, aseq = name, start, ' '*(len(qseq)-len(seq))
> #
> # if len(qseq) != len(aseq) + len(seq):
> # # If my sequences are shorter than the query sequence,
> # # then I will need to pad some spaces to make them line up.
> # # Since I've already right padded seq, that means aseq
> # # must be too short.
> # aseq = aseq + ' '*(len(qseq)-len(aseq)-len(seq))
> # aseq = aseq + seq
> # if astart is None:
> # astart = start
> # align[index] = aname, astart, aseq
>
> def end_alignment(self):
> # Remove trailing newlines
> if self._alignment:
> self._alignment.title = self._alignment.title.rstrip()
>
> # This code is also obsolete. See note above.
> # If there's a multiple alignment, I will need to make sure
> # all the sequences are aligned. That is, I may need to
> # right-pad the sequences.
> # if self._multiple_alignment is not None:
> # align = self._multiple_alignment.alignment
> # seqlen = None
> # for i in range(len(align)):
> # name, start, seq = align[i]
> # if seqlen is None:
> # seqlen = len(seq)
> # else:
> # if len(seq) < seqlen:
> # seq = seq + ' '*(seqlen - len(seq))
> # align[i] = name, start, seq
> # elif len(seq) > seqlen:
> # raise SyntaxError, \
> # "Sequence %s is longer than the query" % name
>
> # Clean up some variables, if they exist.
> try:
> del self._seq_index
> del self._seq_length
> del self._start_index
> del self._start_length
> del self._name_length
> except AttributeError:
> pass
>
> class _HSPConsumer:
> def start_hsp(self):
> self._hsp = Record.HSP()
>
> def score(self, line):
> self._hsp.bits, self._hsp.score = _re_search(
> r"Score =\s*([0-9.e+]+) bits \(([0-9]+)\)", line,
> "I could not find the score in line\n%s" % line)
> self._hsp.score = _safe_float(self._hsp.score)
> self._hsp.bits = _safe_float(self._hsp.bits)
>
> x, y = _re_search(
> r"Expect\(?(\d*)\)? = +([0-9.e\-|\+]+)", line,
> "I could not find the expect in line\n%s" % line)
> if x:
> self._hsp.num_alignments = _safe_int(x)
> else:
> self._hsp.num_alignments = 1
> self._hsp.expect = _safe_float(y)
>
> def identities(self, line):
> x, y = _re_search(
> r"Identities = (\d+)\/(\d+)", line,
> "I could not find the identities in line\n%s" % line)
> self._hsp.identities = _safe_int(x), _safe_int(y)
>
> if line.find('Positives') != -1:
> x, y = _re_search(
> r"Positives = (\d+)\/(\d+)", line,
> "I could not find the positives in line\n%s" % line)
> self._hsp.positives = _safe_int(x), _safe_int(y)
>
> if line.find('Gaps') != -1:
> x, y = _re_search(
> r"Gaps = (\d+)\/(\d+)", line,
> "I could not find the gaps in line\n%s" % line)
> self._hsp.gaps = _safe_int(x), _safe_int(y)
>
>
> def strand(self, line):
> self._hsp.strand = _re_search(
> r"Strand = (\w+) / (\w+)", line,
> "I could not find the strand in line\n%s" % line)
>
> def frame(self, line):
> # Frame can be in formats:
> # Frame = +1
> # Frame = +2 / +2
> if line.find('/') != -1:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123]) / ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
> else:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
>
> # Match a space, if one is available. Masahir Ishikawa found a
> # case where there's no space between the start and the sequence:
> # Query: 100tt 101
> # line below modified by Yair Benita, Sep 2004
> _query_re = re.compile(r"Query: \s*(\d+)\s*(.+) (\d+)")
> def query(self, line):
> m = self._query_re.search(line)
> if m is None:
> raise SyntaxError, "I could not find the query in line\n%s" % line
>
> # line below modified by Yair Benita, Sep 2004.
> # added the end attribute for the query
> start, seq, end = m.groups()
> self._hsp.query = self._hsp.query + seq
> if self._hsp.query_start is None:
> self._hsp.query_start = _safe_int(start)
>
> # line below added by Yair Benita, Sep 2004.
> # added the end attribute for the query
> self._hsp.query_end = _safe_int(end)
> self._query_start_index = m.start(2)
> self._query_len = len(seq)
>
> def align(self, line):
> seq = line[self._query_start_index:].rstrip()
> if len(seq) < self._query_len:
> # Make sure the alignment is the same length as the query
> seq = seq + ' ' * (self._query_len-len(seq))
> elif len(seq) < self._query_len:
> raise SyntaxError, "Match is longer than the query in line\n%s" % \
> line
> self._hsp.match = self._hsp.match + seq
>
> def sbjct(self, line):
> # line below modified by Yair Benita, Sep 2004
> # added the end group and the -? to allow parsing
> # of BLAT output in BLAST format.
> start, seq, end = _re_search(
> r"Sbjct: (-?\d+)\s*(.+) (-?\d+)", line,
> "I could not find the sbjct in line\n%s" % line)
> #mikep 26/9/00
> #On occasion, there is a blast hit with no subject match
> #so far, it only occurs with 1-line short "matches"
> #I have decided to let these pass as they appear
> if not seq.strip():
> seq = ' ' * self._query_len
> self._hsp.sbjct = self._hsp.sbjct + seq
> if self._hsp.sbjct_start is None:
> self._hsp.sbjct_start = _safe_int(start)
>
> self._hsp.sbjct_end = _safe_int(end)
> if len(seq) != self._query_len:
> raise SyntaxError, \
> "QUERY and SBJCT sequence lengths don't match in line\n%s" \
> % line
>
> del self._query_start_index # clean up unused variables
> del self._query_len
>
> def end_hsp(self):
> pass
>
> class _DatabaseReportConsumer:
>
> def start_database_report(self):
> self._dr = Record.DatabaseReport()
>
> def database(self, line):
> m = re.search(r"Database: (.+)$", line)
> if m:
> self._dr.database_name.append(m.group(1))
> elif self._dr.database_name:
> # This must be a continuation of the previous name.
> self._dr.database_name[-1] = "%s%s" % (self._dr.database_name[-1],
> line.strip())
>
> def posted_date(self, line):
> self._dr.posted_date.append(_re_search(
> r"Posted date:\s*(.+)$", line,
> "I could not find the posted date in line\n%s" % line))
>
> def num_letters_in_database(self, line):
> letters, = _get_cols(
> line, (-1,), ncols=6, expected={2:"letters", 4:"database:"})
> self._dr.num_letters_in_database.append(_safe_int(letters))
>
> def num_sequences_in_database(self, line):
> sequences, = _get_cols(
> line, (-1,), ncols=6, expected={2:"sequences", 4:"database:"})
> self._dr.num_sequences_in_database.append(_safe_int(sequences))
>
> def ka_params(self, line):
> x = line.split()
> self._dr.ka_params = map(_safe_float, x)
>
> def gapped(self, line):
> self._dr.gapped = 1
>
> def ka_params_gap(self, line):
> x = line.split()
> self._dr.ka_params_gap = map(_safe_float, x)
>
> def end_database_report(self):
> pass
>
> class _ParametersConsumer:
> def start_parameters(self):
> self._params = Record.Parameters()
>
> def matrix(self, line):
> self._params.matrix = line[8:].rstrip()
>
> def gap_penalties(self, line):
> x = _get_cols(
> line, (3, 5), ncols=6, expected={2:"Existence:", 4:"Extension:"})
> self._params.gap_penalties = map(_safe_float, x)
>
> def num_hits(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=11, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=6, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
>
> def num_sequences(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
>
> def num_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
>
> def num_good_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=10, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=5, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
>
> def num_seqs_better_e(self, line):
> self._params.num_seqs_better_e, = _get_cols(
> line, (-1,), ncols=7, expected={2:"sequences"})
> self._params.num_seqs_better_e = _safe_int(
> self._params.num_seqs_better_e)
>
> def hsps_no_gap(self, line):
> self._params.hsps_no_gap, = _get_cols(
> line, (-1,), ncols=9, expected={3:"better", 7:"gapping:"})
> self._params.hsps_no_gap = _safe_int(self._params.hsps_no_gap)
>
> def hsps_prelim_gapped(self, line):
> self._params.hsps_prelim_gapped, = _get_cols(
> line, (-1,), ncols=9, expected={4:"gapped", 6:"prelim"})
> self._params.hsps_prelim_gapped = _safe_int(
> self._params.hsps_prelim_gapped)
>
> def hsps_prelim_gapped_attempted(self, line):
> self._params.hsps_prelim_gapped_attempted, = _get_cols(
> line, (-1,), ncols=10, expected={4:"attempted", 7:"prelim"})
> self._params.hsps_prelim_gapped_attempted = _safe_int(
> self._params.hsps_prelim_gapped_attempted)
>
> def hsps_gapped(self, line):
> self._params.hsps_gapped, = _get_cols(
> line, (-1,), ncols=6, expected={3:"gapped"})
> self._params.hsps_gapped = _safe_int(self._params.hsps_gapped)
>
> def query_length(self, line):
> self._params.query_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"query:"})
> self._params.query_length = _safe_int(self._params.query_length)
>
> def database_length(self, line):
> self._params.database_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"database:"})
> self._params.database_length = _safe_int(self._params.database_length)
>
> def effective_hsp_length(self, line):
> self._params.effective_hsp_length, = _get_cols(
> line, (-1,), ncols=4, expected={1:"HSP", 2:"length:"})
> self._params.effective_hsp_length = _safe_int(
> self._params.effective_hsp_length)
>
> def effective_query_length(self, line):
> self._params.effective_query_length, = _get_cols(
> line, (-1,), ncols=5, expected={1:"length", 3:"query:"})
> self._params.effective_query_length = _safe_int(
> self._params.effective_query_length)
>
> def effective_database_length(self, line):
> self._params.effective_database_length, = _get_cols(
> line.lower(), (-1,), ncols=5, expected={1:"length", 3:"database:"})
> self._params.effective_database_length = _safe_int(
> self._params.effective_database_length)
>
> def effective_search_space(self, line):
> self._params.effective_search_space, = _get_cols(
> line, (-1,), ncols=4, expected={1:"search"})
> self._params.effective_search_space = _safe_int(
> self._params.effective_search_space)
>
> def effective_search_space_used(self, line):
> self._params.effective_search_space_used, = _get_cols(
> line, (-1,), ncols=5, expected={1:"search", 3:"used:"})
> self._params.effective_search_space_used = _safe_int(
> self._params.effective_search_space_used)
>
> def frameshift(self, line):
> self._params.frameshift = _get_cols(
> line, (4, 5), ncols=6, expected={0:"frameshift", 2:"decay"})
>
> def threshold(self, line):
> self._params.threshold, = _get_cols(
> line, (1,), ncols=2, expected={0:"T:"})
> self._params.threshold = _safe_int(self._params.threshold)
>
> def window_size(self, line):
> self._params.window_size, = _get_cols(
> line, (1,), ncols=2, expected={0:"A:"})
> self._params.window_size = _safe_int(self._params.window_size)
>
> def dropoff_1st_pass(self, line):
> score, bits = _re_search(
> r"X1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the dropoff in line\n%s" % line)
> self._params.dropoff_1st_pass = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff(self, line):
> score, bits = _re_search(
> r"X2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff in line\n%s" % line)
> self._params.gap_x_dropoff = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff_final(self, line):
> score, bits = _re_search(
> r"X3: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff final in line\n%s" % line)
> self._params.gap_x_dropoff_final = _safe_int(score), _safe_float(bits)
>
> def gap_trigger(self, line):
> score, bits = _re_search(
> r"S1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap trigger in line\n%s" % line)
> self._params.gap_trigger = _safe_int(score), _safe_float(bits)
>
> def blast_cutoff(self, line):
> score, bits = _re_search(
> r"S2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the blast cutoff in line\n%s" % line)
> self._params.blast_cutoff = _safe_int(score), _safe_float(bits)
>
> def end_parameters(self):
> pass
>
>
> class _BlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> # This Consumer is inherits from many other consumer classes that handle
> # the actual dirty work. An alternate way to do it is to create objects
> # of those classes and then delegate the parsing tasks to them in a
> # decorator-type pattern. The disadvantage of that is that the method
> # names will need to be resolved in this classes. However, using
> # a decorator will retain more control in this class (which may or
> # may not be a bad thing). In addition, having each sub-consumer as
> # its own object prevents this object's dictionary from being cluttered
> # with members and reduces the chance of member collisions.
> def __init__(self):
> self.data = None
>
> def round(self, line):
> # Make sure nobody's trying to pass me PSI-BLAST data!
> raise ValueError, \
> "This consumer doesn't handle PSI-BLAST data"
>
> def start_header(self):
> self.data = Record.Blast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def end_descriptions(self):
> self.data.descriptions = self._descriptions
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self.data.alignments.append(self._alignment)
> if self._multiple_alignment.alignment:
> self.data.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
>
> class _PSIBlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> def __init__(self):
> self.data = None
>
> def start_header(self):
> self.data = Record.PSIBlast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def start_descriptions(self):
> self._round = Record.Round()
> self.data.rounds.append(self._round)
> _DescriptionConsumer.start_descriptions(self)
>
> def end_descriptions(self):
> _DescriptionConsumer.end_descriptions(self)
> self._round.number = self._roundnum
> if self._descriptions:
> self._round.new_seqs.extend(self._descriptions)
> self._round.reused_seqs.extend(self._model_sequences)
> self._round.new_seqs.extend(self._nonmodel_sequences)
> if self._converged:
> self.data.converged = 1
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self._round.alignments.append(self._alignment)
> if self._multiple_alignment:
> self._round.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
>
> class Iterator:
> """Iterates over a file of multiple BLAST results.
>
> Methods:
> next Return the next record from the stream, or None.
>
> """
> def __init__(self, handle, parser=None):
> """__init__(self, handle, parser=None)
>
> Create a new iterator. handle is a file-like object. parser
> is an optional Parser object to change the results into another form.
> If set to None, then the raw contents of the file will be returned.
>
> """
> try:
> handle.readline
> except AttributeError:
> raise ValueError(
> "I expected a file handle or file-like object, got %s"
> % type(handle))
> self._uhandle = File.UndoHandle(handle)
> self._parser = parser
>
> def next(self):
> """next(self) -> object
>
> Return the next Blast record from the file. If no more records,
> return None.
>
> """
> lines = []
> while 1:
> line = self._uhandle.readline()
> if not line:
> break
> # If I've reached the next one, then put the line back and stop.
> if lines and (line.startswith('BLAST')
> or line.startswith('BLAST', 1)
> or line.startswith('<?xml ')):
> self._uhandle.saveline(line)
> break
> lines.append(line)
>
> if not lines:
> return None
>
> data = ''.join(lines)
> if self._parser is not None:
> return self._parser.parse(File.StringHandle(data))
> return data
>
> def __iter__(self):
> return iter(self.next, None)
>
> def blastall(blastcmd, program, database, infile, **keywds):
> """blastall(blastcmd, program, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastall. blastcmd is the command
> used to launch the 'blastall' executable. program is the blast program
> to use, e.g. 'blastp', 'blastn', etc. database is the path to the database
> to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastall.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> nuc_match Nucleotide match reward. (BLASTN)
> nuc_mismatch Nucleotide mismatch penalty. (BLASTN)
> query_genetic_code Genetic code for Query.
> db_genetic_code Genetic code for database. (TBLAST[NX])
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F (not for TBLASTX)
> expectation Expectation value cutoff.
> wordsize Word size.
> strands Query strands to search against database.([T]BLAST[NX])
> keep_hits Number of best hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
>
> Processing
> filter Filter query sequence? T/F
> believe_query Believe the query defline. T/F
> restrict_gi Restrict search to these GI's.
> nprocessors Number of processors to use.
> oldengine Force use of old engine [T/F]
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'nuc_match' : '-r',
> 'nuc_mismatch' : '-q',
> 'query_genetic_code' : '-Q',
> 'db_genetic_code' : '-D',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'strands' : '-S',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-z',
> 'search_length' : '-Y',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'restrict_gi' : '-l',
> 'nprocessors' : '-a',
> 'oldengine' : '-V',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastall does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['program'], program])
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
>
> def blastpgp(blastcmd, database, infile, **keywds):
> """blastpgp(blastcmd, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastpgp. blastcmd is the command
> used to launch the 'blastpgp' executable. database is the path to the
> database to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastpgp.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> window_size Multiple hits window size.
> npasses Number of passes.
> passes Hits/passes. Integer 0-2.
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F
> expectation Expectation value cutoff.
> wordsize Word size.
> keep_hits Number of beset hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> pseudocounts Pseudocounts constants for multiple passes.
> xdrop_final X dropoff for final gapped alignment.
> xdrop_extension Dropoff for blast extensions.
> model_threshold E-value threshold to include in multipass model.
> required_start Start of required region in query.
> required_end End of required region in query.
>
> Processing
> XXX should document default values
> program The blast program to use. (PHI-BLAST)
> filter Filter query sequence with SEG? T/F
> believe_query Believe the query defline? T/F
> nprocessors Number of processors to use.
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
> checkpoint_outfile Output file for PSI-BLAST checkpointing.
> restart_infile Input file for PSI-BLAST restart.
> hit_infile Hit file for PHI-BLAST.
> matrix_outfile Output file for PSI-BLAST matrix in ASCII.
> align_infile Input alignment file for PSI-BLAST restart.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'window_size' : '-A',
> 'npasses' : '-j',
> 'passes' : '-P',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-Z',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'pseudocounts' : '-c',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'model_threshold' : '-h',
> 'required_start' : '-S',
> 'required_end' : '-H',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o',
> 'checkpoint_outfile' : '-C',
> 'restart_infile' : '-R',
> 'hit_infile' : '-k',
> 'matrix_outfile' : '-Q',
> 'align_infile' : '-B'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastpgp does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
>
> def rpsblast(blastcmd, database, infile, align_view="7", **keywds):
> """rpsblast(blastcmd, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from standalone RPS-BLAST. blastcmd is the
> command used to launch the 'rpsblast' executable. database is the path
> to the database to search against. infile is the path to the file
> containing the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by rpsblast.
>
> Please note that this function will give XML output by default, by
> setting align_view to seven (i.e. command line option -m 7).
> You should use the NCBIXML.BlastParser() to read the resulting output.
> This is because NCBIStandalone.BlastParser() does not understand the
> plain text output format from rpsblast.
>
> WARNING - The following text and associated parameter handling has not
> received extensive testing. Please report any errors we might have made...
>
> Algorithm/Scoring
> gapped Whether to do a gapped alignment. T/F
> multihit 0 for multiple hit (default), 1 for single hit
> expectation Expectation value cutoff.
> range_restriction Range restriction on query sequence (Format: start,stop) blastp only
> 0 in 'start' refers to the beginning of the sequence
> 0 in 'stop' refers to the end of the sequence
> Default = 0,0
> xdrop Dropoff value (bits) for gapped alignments.
> xdrop_final X dropoff for final gapped alignment (in bits).
> xdrop_extension Dropoff for blast extensions (in bits).
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> protein Query sequence is protein. T/F
> db_length Effective database length.
>
> Processing
> filter Filter query sequence with SEG? T/F
> case_filter Use lower case filtering of FASTA sequence T/F, default F
> believe_query Believe the query defline. T/F
> nprocessors Number of processors to use.
> logfile Name of log file to use, default rpsblast.log
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-9.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
>
> """
> att2param = {
> 'multihit' : '-P',
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'range_restriction' : '-L',
> 'xdrop' : '-X',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'protein' : '-p',
> 'db_length' : '-z',
>
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'case_filter' : '-U',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
> 'logfile' : '-l',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "rpsblast does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
> params.extend([att2param['align_view'], align_view])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
> def _re_search(regex, line, error_msg):
> m = re.search(regex, line)
> if not m:
> raise SyntaxError, error_msg
> return m.groups()
>
> def _get_cols(line, cols_to_get, ncols=None, expected={}):
> cols = line.split()
>
> # Check to make sure number of columns is correct
> if ncols is not None and len(cols) != ncols:
> raise SyntaxError, "I expected %d columns (got %d) in line\n%s" % \
> (ncols, len(cols), line)
>
> # Check to make sure columns contain the correct data
> for k in expected.keys():
> if cols[k] != expected[k]:
> raise SyntaxError, "I expected '%s' in column %d in line\n%s" % (
> expected[k], k, line)
>
> # Construct the answer tuple
> results = []
> for c in cols_to_get:
> results.append(cols[c])
> return tuple(results)
>
> def _safe_int(str):
> try:
> return int(str)
> except ValueError:
> # Something went wrong. Try to clean up the string.
> # Remove all commas from the string
> str = str.replace(',', '')
> try:
> # try again.
> return int(str)
> except ValueError:
> pass
> # If it fails again, maybe it's too long?
> # XXX why converting to float?
> return long(float(str))
>
> def _safe_float(str):
> # Thomas Rosleff Soerensen (rosleff at mpiz-koeln.mpg.de) noted that
> # float('e-172') does not produce an error on his platform. Thus,
> # we need to check the string for this condition.
>
> # Sometimes BLAST leaves of the '1' in front of an exponent.
> if str and str[0] in ['E', 'e']:
> str = '1' + str
> try:
> return float(str)
> except ValueError:
> # Remove all commas from the string
> str = str.replace(',', '')
> # try again.
> return float(str)
>
> class _BlastErrorConsumer(_BlastConsumer):
> def __init__(self):
> _BlastConsumer.__init__(self)
> def noevent(self, line):
> if line.find("Query must be at least wordsize") != -1:
> raise ShortQueryBlastError, "Query must be at least wordsize"
> # Now pass the line back up to the superclass.
> method = getattr(_BlastConsumer, 'noevent',
> _BlastConsumer.__getattr__(self, 'noevent'))
> method(line)
>
> class BlastErrorParser(AbstractParser):
> """Attempt to catch and diagnose BLAST errors while parsing.
>
> This utilizes the BlastParser module but adds an additional layer
> of complexity on top of it by attempting to diagnose SyntaxError's
> that may actually indicate problems during BLAST parsing.
>
> Current BLAST problems this detects are:
> o LowQualityBlastError - When BLASTing really low quality sequences
> (ie. some GenBank entries which are just short streches of a single
> nucleotide), BLAST will report an error with the sequence and be
> unable to search with this. This will lead to a badly formatted
> BLAST report that the parsers choke on. The parser will convert the
> SyntaxError to a LowQualityBlastError and attempt to provide useful
> information.
>
> """
> def __init__(self, bad_report_handle = None):
> """Initialize a parser that tries to catch BlastErrors.
>
> Arguments:
> o bad_report_handle - An optional argument specifying a handle
> where bad reports should be sent. This would allow you to save
> all of the bad reports to a file, for instance. If no handle
> is specified, the bad reports will not be saved.
> """
> self._bad_report_handle = bad_report_handle
>
> #self._b_parser = BlastParser()
> self._scanner = _Scanner()
> self._consumer = _BlastErrorConsumer()
>
> def parse(self, handle):
> """Parse a handle, attempting to diagnose errors.
> """
> results = handle.read()
>
> try:
> self._scanner.feed(File.StringHandle(results), self._consumer)
> except SyntaxError, msg:
> # if we have a bad_report_file, save the info to it first
> if self._bad_report_handle:
> # send the info to the error handle
> self._bad_report_handle.write(results)
>
> # now we want to try and diagnose the error
> self._diagnose_error(
> File.StringHandle(results), self._consumer.data)
>
> # if we got here we can't figure out the problem
> # so we should pass along the syntax error we got
> raise
> return self._consumer.data
>
> def _diagnose_error(self, handle, data_record):
> """Attempt to diagnose an error in the passed handle.
>
> Arguments:
> o handle - The handle potentially containing the error
> o data_record - The data record partially created by the consumer.
> """
> line = handle.readline()
>
> while line:
> # 'Searchingdone' instead of 'Searching......done' seems
> # to indicate a failure to perform the BLAST due to
> # low quality sequence
> if line.startswith('Searchingdone'):
> raise LowQualityBlastError("Blast failure occured on query: ",
> data_record.query)
> line = handle.readline()
>
>
>
>
> # BLAST XML parsing
> """This module provides code to work with the BLAST XML output
> following the DTD available on the NCBI FTP
> ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/NCBI_BlastOutput.dtd
>
> Classes:
> BlastParser Parses XML output from BLAST.
>
> _XMLParser Generic SAX parser.
> """
> from Bio.Blast import Record
> import xml.sax
> from xml.sax.handler import ContentHandler
>
> class _XMLparser(ContentHandler):
> """Generic SAX Parser
>
> Just a very basic SAX parser.
>
> Redefine the methods startElement, characters and endElement.
> """
> def __init__(self):
> """Constructor
> """
> self._tag = []
> self._value = ''
>
> def _secure_name(self, name):
> """Removes 'dangerous' from tag names
>
> name -- name to be 'secured'
> """
> # Replace '-' with '_' in XML tag names
> return name.replace('-', '_')
>
> def startElement(self, name, attr):
> """Found XML start tag
>
> No real need of attr, BLAST DTD doesn't use them
>
> name -- name of the tag
>
> attr -- tag attributes
> """
> self._tag.append(name)
>
> # Try to call a method
> try:
> eval(self._secure_name('self._start_' + name))()
> except AttributeError:
> # Doesn't exist (yet)
> pass
>
> def characters(self, ch):
> """Found some text
>
> ch -- characters read
> """
> self._value += ch # You don't ever get the whole string
>
> def endElement(self, name):
> """Found XML end tag
>
> name -- tag name
> """
> # Strip character buffer
> self._value = self._value.strip()
>
> # Try to call a method (defined in subclasses)
> try:
> eval(self._secure_name('self._end_' + name))()
> except AttributeError: # Method doesn't exist (yet ?)
> pass
>
> # Reset character buffer
> self._value = ''
>
> class BlastParser(_XMLparser):
> """Parse XML BLAST data into a Record.Blast object
>
> Methods:
> parse Parses BLAST XML data.
>
> All XML 'action' methods are private methods and may be:
> _start_TAG called when the start tag is found
> _end_TAG called when the end tag is found
> """
>
> def __init__(self):
> """Constructor
> """
> # Calling superclass method
> _XMLparser.__init__(self)
>
> self._parser = xml.sax.make_parser()
> self._parser.setContentHandler(self)
>
> # To avoid ValueError: unknown url type: NCBI_BlastOutput.dtd
> self._parser.setFeature(xml.sax.handler.feature_validation, 0)
> self._parser.setFeature(xml.sax.handler.feature_namespaces, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_pes, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_ges, 0)
>
> self._blast = Record.Blast()
>
> def parse(self, handler):
> """Parses the XML data
>
> handler -- file handler or StringIO
> """
> # initialize a new Blast Record
> #self._blast = Record.Blast()
> # FIXME: very slow?
> self._blast.__init__()
>
> # bugfix: changed `filename` to `handler`. Iddo 12/20/2004
> self._parser.parse(handler)
> return self._blast
>
> # Header
> def _end_BlastOutput_program(self):
> """BLAST program, e.g., blastp, blastn, etc.
> """
> self._blast.application = self._value.uuper()
>
> def _end_BlastOutput_version(self):
> """version number of the BLAST engine (e.g., 2.1.2)
> """
> self._blast.version = self._value.split()[1]
> self._blast.date = self._value.split()[2][1:-1]
>
> def _end_BlastOutput_reference(self):
> """a reference to the article describing the algorithm
> """
> self._blast.reference = self._value
>
> def _end_BlastOutput_db(self):
> """the database(s) searched
> """
> self._blast.database = self._value
>
> def _end_BlastOutput_query_ID(self):
> """the identifier of the query
> """
> self._blast.query_id = self._value
>
> def _end_BlastOutput_query_def(self):
> """the definition line of the query
> """
> self._blast.query = self._value
>
> def _end_BlastOutput_query_len(self):
> """the length of the query
> """
> self._blast.query_length = int(self._value)
>
> ## def _end_BlastOutput_query_seq(self):
> ## """the query sequence
> ## """
> ## pass # XXX Missing in Record.Blast ?
>
> ## def _end_BlastOutput_iter_num(self):
> ## """the psi-blast iteration number
> ## """
> ## pass # XXX TODO PSI
>
> # non-existent in blastall 2.2.13 output
> def _end_BlastOutput_hits(self):
> """hits to the database sequences, one for every sequence
> """
> self._blast.num_hits = int(self._value)
>
> ## def _end_BlastOutput_message(self):
> ## """error messages
> ## """
> ## pass # XXX What to do ?
>
> # Parameters
> def _end_Parameters_matrix(self):
> """matrix used (-M)
> """
> self._blast.matrix = self._value
>
> def _end_Parameters_expect(self):
> """expect values cutoff (-e)
> """
> self._blast.expect = self._value
>
> ## def _end_Parameters_include(self):
> ## """inclusion threshold for a psi-blast iteration (-h)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Parameters_sc_match(self):
> """match score for nucleotide-nucleotide comparaison (-r)
> """
> self._blast.sc_match = int(self._value)
>
> def _end_Parameters_sc_mismatch(self):
> """mismatch penalty for nucleotide-nucleotide comparaison (-r)
> """
> self._blast.sc_mismatch = int(self._value)
>
> def _end_Parameters_gap_open(self):
> """gap existence cost (-G)
> """
> self._blast.gap_penalties[0] = int(self._value)
>
> def _end_Parameters_gap_extend(self):
> """gap extension cose (-E)
> """
> self._blast.gap_penalties[1] = int(self._value)
>
> def _end_Parameters_filter(self):
> """filtering options (-F)
> """
> self._blast.filter = self._value
>
> ## def _end_Parameters_pattern(self):
> ## """pattern used for phi-blast search
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Parameters_entrez_query(self):
> ## """entrez query used to limit search
> ## """
> ## pass # XXX TODO PSI
>
> # Hits
> def _start_Hit(self):
> self._blast.alignments.append(Record.Alignment())
> self._blast.descriptions.append(Record.Description())
> self._blast.multiple_alignment = []
> self._hit = self._blast.alignments[-1]
> self._descr = self._blast.descriptions[-1]
> self._descr.num_alignments = 0
>
> # Hit_num is useless
>
> def _end_Hit_id(self):
> """identifier of the matched database sequence
> """
> self._hit.title_id = self._value
> self._descr.title_id = self._value
>
> def _end_Hit_def(self):
> """definition line (title) of the database sequence
> """
> self._hit.title = self._value
> self._descr.title = self._value
>
> # not necessary?
> def _end_Hit_accession(self):
> """accession of the database sequence
> """
> self._hit.accession = self._value
> self._descr.accession = self._value
>
> def _end_Hit_len(self):
> self._hit.length = int(self._value)
>
> # HSPs
> def _start_Hsp(self):
> self._hit.hsps.append(Record.HSP())
> self._hsp = self._hit.hsps[-1]
> self._descr.num_alignments += 1
> self._blast.multiple_alignment.append(Record.MultipleAligment())
> self._mult_al = self._blast.multiple_alignment[-1]
>
> # Hsp_num is useless
> def _end_Hsp_score(self):
> """raw score of HSP
> """
> self._hsp.score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.score == None:
> self._descr.score = float(self._value)
>
> def _end_Hsp_bit_score(self):
> """bit score of HSP
> """
> self._hsp.bit_score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.bit_score == None:
> self._descr.bit_score = float(self._value)
>
> def _end_Hsp_evalue(self):
> """expect value of the HSP
> """
> self._hsp.evalue = float(self._value)
> if self._descr.evalue == None:
> self._descr.evalue = float(self._value)
>
> def _end_Hsp_query_from(self):
> """offset of query at the start of the alignment (one-offset)
> """
> self._hsp.query_start = int(self._value)
>
> def _end_Hsp_query_to(self):
> """offset of query at the end of the alignment (one-offset)
> """
> self._hsp.query_stop = int(self._value)
>
> def _end_Hsp_hit_from(self):
> """offset of the database at the start of the alignment (one-offset)
> """
> self._hsp.sbjct_start = int(self._value)
>
> def _end_Hsp_hit_to(self):
> """offset of the database at the start of the alignment (one-offset)
> """
> self._hsp.sbjct_stop = int(self._value)
>
>
> ## def _end_Hsp_pattern_from(self):
> ## """start of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Hsp_pattern_to(self):
> ## """end of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Hsp_query_frame(self):
> """frame of the query if applicable
> """
> self._hsp.frame = (int(self._value),)
>
> def _end_Hsp_hit_frame(self):
> """frame of the database sequence if applicable
> """
> self._hsp.frame += (int(self._value),)
>
> def _end_Hsp_identity(self):
> """number of identities in the alignment
> """
> self._hsp.identities = (int(self._value), None)
>
> def _end_Hsp_positive(self):
> """number of positive (conservative) substitutions in the alignment
> """
> self._hsp.positives = (int(self._value), None)
>
> def _end_Hsp_align_len(self):
> """length of the alignment
> """
> self._hsp.align_length = int(self._value)
>
> def _end_Hsp_gaps(self):
> """number of gaps in the alignment
> """
> self._hsp.gaps = (int(self._value),None)
>
> ## def _en_Hsp_density(self):
> ## """score density
> ## """
> ## pass # XXX ???
>
> def _end_Hsp_qseq(self):
> """alignment string for the query
> """
> self._hsp.query = self._value
>
> def _end_Hsp_hseq(self):
> """alignment string for the database
> """
> self._hsp.sbjct = self._value
>
> def _end_Hsp_midline(self):
> """Formatting middle line as normally seen in BLAST report
> """
> self._hsp.match = self._value
>
> # Statistics
> def _end_Statistics_db_num(self):
> """number of sequences in the database
> """
> self._blast.num_sequences_in_database = int(self._value)
>
> def _end_Statistics_db_len(self):
> """number of letters in the database
> """
> self._blast.num_letters_in_database = int(self._value)
>
> def _end_Statistics_hsp_len(self):
> """the effective HSP length
> """
> self._blast.effective_hsp_length = int(self._value)
>
> def _end_Statistics_eff_space(self):
> """the effective search space
> """
> self._blast.effective_search_space = float(self._value)
>
> def _end_Statistics_kappa(self):
> """Karlin-Altschul parameter K
> """
> self._blast.ka_params[0] = float(self._value)
>
> def _end_Statistics_lambda(self):
> """Karlin-Altschul parameter Lambda
> """
> self._blast.ka_params[1] = float(self._value)
>
> def _end_Statistics_entropy(self):
> """Karlin-Altschul parameter H
> """
> self._blast.ka_params[2] = float(self._value)
>
>
> if __name__ == '__main__':
> import sys
> p = BlastParser()
> r = p.parse(sys.argv[1])
>
> # Small test
> print 'Blast of', r.query
> print 'Found %s alignments with a total of %s HSPs' % (len(r.alignments),
> reduce(lambda a,b: a+b,
> [len(a.hsps) for a in r.alignments]))
>
> for al in r.alignments:
> print al.title[:50], al.length, 'bp', len(al.hsps), 'HSPs'
>
> # Cookbook example
> E_VALUE_THRESH = 0.04
> for alignment in r.alignments:
> for hsp in alignment.hsps:
> if hsp.expect < E_VALUE_THRESH:
> print '*****'
> print 'sequence', alignment.title
> print 'length', alignment.length
> print 'e value', hsp.expect
> print hsp.query[:75] + '...'
> print hsp.match[:75] + '...'
> print hsp.sbjct[:75] + '...'
>
>
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
>
> """Record classes to hold BLAST output.
>
> Classes:
> Blast Holds all the information from a blast search.
> PSIBlast Holds all the information from a psi-blast search.
>
> Header Holds information from the header.
> Description Holds information about one hit description.
> Alignment Holds information about one alignment hit.
> HSP Holds information about one HSP.
> MultipleAlignment Holds information about a multiple alignment.
> DatabaseReport Holds information from the database report.
> Parameters Holds information from the parameters.
>
> """
> # XXX finish printable BLAST output
>
> import string
>
> from Bio.Align import Generic
>
> class Header:
> """Saves information from a blast header.
>
> Members:
> application The name of the BLAST flavor that generated this data.
> version Version of blast used.
> date Date this data was generated.
> reference Reference for blast.
>
> query Name of the query sequence.
> query_id Query ID (necessary?) (str)
> query_length Length of the query sequence. (int)
>
> database Name of the database.
> """
> def __init__(self):
> self.application = ''
> self.version = ''
> self.date = ''
> self.reference = ''
>
> self.query = ''
> self.query_id = ''
> self.query_length = None
>
> self.database = ''
>
> class Description:
> """Stores information about one hit in the descriptions section.
>
> Members:
> title Title of the hit.
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> score Number of bits. (int)
> evalue Expect value. (float)
> num_alignments Number of alignments for the same subject. (int)
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.score = None
> self.bit_score = None
> self.evalue = None
> self.num_alignments = None
> def __str__(self):
> return "%-66s %5s %s %s" % (self.title, self.score, self.bit_score, self.evalue)
>
> class Alignment:
> """Stores information about one hit in the alignments section.
>
> Members:
> title Name of the matched sequence. (str)
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> length Length of matched sequence. (int)
> hsps A list of HSP objects.
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.length = None
> self.hsps = []
> def __str__(self):
> lines = []
> titles = string.split(self.title, '\n')
> for i in range(len(titles)):
> if i:
> lines.append(" ")
> lines.append("%s\n" % titles[i])
> lines.append(" Length = %s\n" % self.length)
> return string.join(lines, '')
>
> class HSP:
> """Stores information about one hsp in an alignment hit.
>
> Members:
> score BLAST score of hit. (float)
> bit_score Number of bits for that score. (float)
> evalue Expect value. (float)
> num_alignments Number of alignments for same subject. (int)
> identities Number of identities/total aligned. tuple of (int, int)
> positives Number of positives/total aligned. tuple of (int, int)
> gaps Numer of gaps/total aligned. tuple of (int, int)
> strand Tuple of (query, target) strand.
> frame Tuple of 1 or 2 frame shifts, depending on the flavor.
>
> query The query sequence.
> query_start The start residue for the query sequence. (1-based)
> query_end The end residue for the query sequence. (1-based)
> match The match sequence.
> sbjct The sbjct sequence.
> sbjct_start The start residue for the sbjct sequence. (1-based)
> sbjct_end The end residue for the sbjct sequence. (1-based)
>
> align_length Length of the alignment. (int)
>
> Not all flavors of BLAST return values for every attribute:
> score expect identities positives strand frame
> BLASTP X X X X
> BLASTN X X X X X
> BLASTX X X X X X
> TBLASTN X X X X X
> TBLASTX X X X X X/X
>
> Note: for BLASTX, the query sequence is shown as a protein sequence,
> but the numbering is based on the nucleotides. Thus, the numbering
> is 3x larger than the number of amino acid residues. A similar effect
> can be seen for the sbjct sequence in TBLASTN, and for both sequences
> in TBLASTX.
>
> Also, for negative frames, the sequence numbering starts from
> query_start and counts down.
>
> """
> def __init__(self):
> self.score = None
> self.bit_score = None
> self.expect = None
> self.num_alignments = None
> self.identities = (None, None)
> self.positives = (None, None)
> self.gaps = (None, None)
> self.strand = (None, None)
> self.frame = ()
>
> self.query = ''
> self.query_start = None
> self.query_stop = None
> self.match = ''
> self.sbjct = ''
> self.sbjct_start = None
> self.sbjct_stop = None
>
> self.align_length = None
>
> class MultipleAlignment:
> """Holds information about a multiple alignment.
>
> Members:
> alignment A list of tuples (name, start residue, sequence, end residue).
>
> The start residue is 1-based. It may be blank, if that sequence is
> not aligned in the multiple alignment.
>
> """
> def __init__(self):
> self.alignment = []
>
> def to_generic(self, alphabet):
> """Retrieve generic alignment object for the given alignment.
>
> Instead of the tuples, this returns an Alignment object from
> Bio.Align.Generic, through which you can manipulate and query
> the object.
>
> alphabet is the specified alphabet for the sequences in the code (for
> example IUPAC.IUPACProtein.
>
> Thanks to James Casbon for the code.
> """
> seq_parts = []
> seq_names = []
> parse_number = 0
> n = 0
> for name, start, seq, end in self.alignment:
> if name == 'QUERY': #QUERY is the first in each alignment block
> parse_number = parse_number + 1
> n = 0
>
> if parse_number == 1: # create on first_parse, append on all others
> seq_parts.append(seq)
> seq_names.append(name)
> else:
> seq_parts[n] = seq_parts[n] + seq
> n = n + 1
>
> generic = Generic.Alignment(alphabet)
> for (name,seq) in zip(seq_names,seq_parts):
> generic.add_sequence(name, seq)
>
> return generic
>
> class Round:
> """Holds information from a PSI-BLAST round.
>
> Members:
> number Round number. (int)
> reused_seqs Sequences in model, found again. List of Description objects.
> new_seqs Sequences not found, or below threshold. List of Description.
> alignments A list of Alignment objects.
> multiple_alignment A MultipleAlignment object.
>
> """
> def __init__(self):
> self.number = None
> self.reused_seqs = []
> self.new_seqs = []
> self.alignments = []
> self.multiple_alignment = None
>
> class DatabaseReport:
> """Holds information about a database report.
>
> Members:
> database_name List of database names. (can have multiple dbs)
> num_letters_in_database Number of letters in the database. (int)
> num_sequences_in_database List of number of sequences in the database. (int)
> posted_date List of the dates the databases were posted.
> ka_params Array of [lambda, k, h] values. (floats)
> gapped # XXX this isn't set right!
> ka_params_gap Array of [lambda, k, h] values. (floats)
>
> """
> def __init__(self):
> self.database_name = []
> self.posted_date = []
> self.num_letters_in_database = None
> self.num_sequences_in_database = None
> self.ka_params = [None, None, None]
> self.gapped = 0
> self.ka_params_gap = [None, None, None]
>
> class Parameters:
> """Holds information about the parameters.
>
> Members:
> matrix Name of the matrix.
> filter Filter parameter. (str)
> gap_penalties Array of [open, extend] penalties. (floats)
> sc_match Match score for nucleotide-nucleotide comparison
> sc_mismatch Mismatch penalty for nucleotide-nucleotide comparison
> num_hits Number of hits to the database. (int)
> num_sequences Number of sequences. (int)
> num_good_extends Number of extensions. (int)
> expect Expectation value. (int)
> hsps_no_gap Number of HSP's better, without gapping. (int)
> hsps_prelim_gapped Number of HSP's gapped in prelim test. (int)
> hsps_prelim_gapped_attemped Number of HSP's attempted in prelim. (int)
> hsps_gapped Total number of HSP's gapped. (int)
> query_length Length of the query. (int)
> database_length Number of letters in the database. (int)
> effective_hsp_length Effective HSP length. (int)
> effective_query_length Effective length of query. (int)
> effective_database_length Effective length of database. (int)
> effective_search_space Effective search space. (int)
> effective_search_space_used Effective search space used. (int)
> frameshift Frameshift window. Tuple of (int, float)
> threshold Threshold. (int)
> window_size Window size. (int)
> dropoff_1st_pass Tuple of (score, bits). (int, float)
> gap_x_dropoff Tuple of (score, bits). (int, float)
> gap_x_dropoff_final Tuple of (score, bits). (int, float)
> gap_trigger Tuple of (score, bits). (int, float)
> blast_cutoff Tuple of (score, bits). (int, float)
> """
> def __init__(self):
> self.matrix = ''
> self.filter = ''
> self.gap_penalties = [None, None]
> self.sc_match = None
> self.sc_mismatch = None
> self.num_hits = None
> self.num_sequences = None
> self.num_good_extends = None
> self.expect = None
> self.hsps_no_gap = None
> self.hsps_prelim_gapped = None
> self.hsps_prelim_gapped_attemped = None
> self.hsps_gapped = None
> self.query_length = None
> self.database_length = None
> self.effective_hsp_length = None
> self.effective_query_length = None
> self.effective_database_length = None
> self.effective_search_space = None
> self.effective_search_space_used = None
> self.frameshift = (None, None)
> self.threshold = None
> self.window_size = None
> self.dropoff_1st_pass = (None, None)
> self.gap_x_dropoff = (None, None)
> self.gap_x_dropoff_final = (None, None)
> self.gap_trigger = (None, None)
> self.blast_cutoff = (None, None)
>
> class Blast(Header, DatabaseReport, Parameters):
> """Saves the results from a blast search.
>
> Members:
> descriptions A list of Description objects.
> alignments A list of Alignment objects.
> multiple_alignment A MultipleAlignment object.
> + members inherited from base classes
>
> """
> def __init__(self):
> Header.__init__(self)
> DatabaseReport.__init__(self)
> Parameters.__init__(self)
> self.descriptions = []
> self.alignments = []
> self.multiple_alignment = None
>
> class PSIBlast(Header, DatabaseReport, Parameters):
> """Saves the results from a blastpgp search.
>
> Members:
> rounds A list of Round objects.
> converged Whether the search converged.
> + members inherited from base classes
>
> """
> def __init__(self):
> Header.__init__(self)
> DatabaseReport.__init__(self)
> Parameters.__init__(self)
> self.rounds = []
> self.converged = 0
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>
More information about the Biopython
mailing list