[BioPython] Problem with blastx output parsing =~
Italo Maia
italo.maia at gmail.com
Mon Jun 4 17:22:15 UTC 2007
Well, I have 24 thousand of those, so I think it would be very painful to
redo them... I'll file the bug, but could there be a workaround? The
file follows below:
<<<begin>>>
BLASTX 2.2.15 [Oct-15-2006]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= 26
(858 letters)
Database: Leigo
4,535,438 sequences; 1,573,298,872 total letters
Searching..................................................done
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value
gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus]     39   0.33
gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus]          38   0.57
gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus]               38   0.57
gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus]             38   0.75
>gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus]
Length = 843
Score = 38.9 bits (89), Expect = 0.33
Identities = 24/89 (26%), Positives = 42/89 (47%), Gaps = 1/89 (1%)
Frame = +1
Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738
+ P G++ RGK + G IW+ HP + P G+ H +N +S + + RK
Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTTRQSFGVEPSGSRHIDNSASSTTSCLHQSAVRKT 284
Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHR 825
+ ++ K +++GRA PS+ R
Sbjct: 285 AYSHLSTSKRQSSSGRAVELHNIPPSSVR 313
>gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus]
Length = 843
Score = 38.1 bits (87), Expect = 0.57
Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%)
Frame = +1
Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738
+ P G++ RGK + G IWS HP + P G+ H +N +S + + RK
Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRSFGVEPSGSGHIDNSASSTSSCLHQSAVRKT 284
Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828
+ ++ K +++G A P++ R+
Sbjct: 285 AYSHLSTSKRQSSSGHAVEFHNIPPNSARS 314
>gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus]
Length = 843
Score = 38.1 bits (87), Expect = 0.57
Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%)
Frame = +1
Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738
+ P G++ RGK + G IW+ HP + P G+ H +N +S + + RK
Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTSRRSFGVEPSGSGHIDNSASSASSCLHQSAVRKT 284
Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828
+ ++ K +++G A PS+ R+
Sbjct: 285 AYSHLSTSKRQSSSGHAVELLNIPPSSARS 314
>gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus]
Length = 843
Score = 37.7 bits (86), Expect = 0.75
Identities = 24/90 (26%), Positives = 41/90 (45%), Gaps = 1/90 (1%)
Frame = +1
Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738
+ P G++ RGK + G IWS HP P G+ H +N +S + + RK
Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRPFGVEPSGSGHIDNTASSTSSCLHQSAVRKT 284
Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828
+ ++ K +++G A PS+ R+
Sbjct: 285 AYSHLSTSKRQSSSGHAVELHNIPPSSARS 314
Database: Leigo
Posted date: Jan 22, 2007 11:26 AM
Number of letters in database: 1,573,298,872
Number of sequences in database: 4,535,438
Lambda K H
0.318 0.134 0.401
Gapped
Lambda K H
0.267 0.0410 0.140
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 4535438
Number of Hits to DB: 2,724,816,234
Number of extensions: 65999927
Number of successful extensions: 158184
Number of sequences better than 2.0: 4
Number of HSP's gapped: 158133
Number of HSP's successfully gapped: 4
Length of query: 286
Length of database: 1,573,298,872
Length adjustment: 130
Effective length of query: 156
Effective length of database: 983,691,932
Effective search space: 153455941392
Effective search space used: 153455941392
Neighboring words threshold: 12
Window for multiple hits: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.7 bits)
S2: 32 (16.9 bits)
<<<end>>>
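To show where the parser seems to trip, here is a small sketch that applies the pattern named in the traceback quoted below to a few lines from the statistics block of the report above. The pattern string is copied from the traceback; the helper `first_match` and the use of `search` semantics are my own assumptions, not Biopython code.

```python
import re

# Pattern copied verbatim from the traceback (used in _scan_parameters).
LENGTH_OF_DB = re.compile(r"[Ll]ength of \s*[Dd]atabase")

# A few lines from the statistics block of the report above, in file order.
trailer = [
    "Number of sequences better than 2.0: 4",
    "Number of HSP's gapped: 158133",
    "Number of HSP's successfully gapped: 4",
    "Length of query: 286",
    "Length of database: 1,573,298,872",
]


def first_match(lines, pattern):
    """Return the index of the first line the pattern matches, or None."""
    for index, line in enumerate(lines):
        if pattern.search(line):
            return index
    return None


# Print which lines the pattern accepts; only the last one should match.
for line in trailer:
    status = "matches" if LENGTH_OF_DB.search(line) else "does not match"
    print("%-40s %s" % (line, status))
```

The "Number of HSP's gapped" line sits where the parser apparently expects "Length of database", which would explain the SyntaxError.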
2007/6/4, Peter <biopython at maubp.freeserve.co.uk>:
>
> Italo Maia wrote:
> > Well, I have a perfectly fine blastx output that throws an error when
> > parsed by Biopython.
> > It gives me this output:
> >
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 624, in parse
> >     self._scanner.feed(handle, self._consumer)
> >   File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 99, in feed
> >     self._scan_parameters(uhandle, consumer)
> >   File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 570, in _scan_parameters
> >     has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase"))
> >   File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, in read_and_call
> >     raise SyntaxError, errmsg
> > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase':
> > Number of HSP's gapped: 136690
> >
> > What could I do? I'm using Ubuntu Feisty here.
>
> It looks like you are using the plain text output from blast, so we
> would recommend you try the XML output instead.
>
> See section 3.4 of the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>
> If you really want to use the plain text output, please file a bug
> (including Biopython version number) and then attach the plain text
> blast output which fails. But no promises - it's an uphill battle to keep
> the parser up to date with each version of Blast!
>
> Peter
>
>
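One possible stopgap for plain-text reports that already exist, sketched with only the standard library: pull the hit headers and the "Score = ..., Expect = ..." lines out with regular expressions instead of going through NCBIStandalone. The name `iter_hsps` and both patterns are my own assumptions based on the report above, not part of Biopython. Descriptions that wrap over several lines are cut to their first line, and the alignments themselves are ignored.

```python
import re

# Hit header, e.g. ">gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [...]"
HIT_RE = re.compile(r"^>(\S+)\s*(.*)$")
# HSP statistics, e.g. "Score = 38.9 bits (89), Expect = 0.33"
SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect(?:\(\d+\))?\s*=\s*(\S+)"
)


def iter_hsps(lines):
    """Yield (hit_id, description, bits, expect) for each HSP found."""
    hit_id = description = None
    for line in lines:
        line = line.rstrip("\n")
        header = HIT_RE.match(line)
        if header:
            # Remember the current hit; its HSP lines follow.
            hit_id = header.group(1)
            description = header.group(2).strip()
            continue
        stats = SCORE_RE.search(line)
        if stats and hit_id is not None:
            expect = stats.group(3).rstrip(",")
            # BLAST prints very small values as "e-105"; float() needs "1e-105".
            if expect.startswith("e"):
                expect = "1" + expect
            yield hit_id, description, float(stats.group(1)), float(expect)
```

Since any iterable of lines works, an open file handle for one of the 24 thousand reports can be passed to `iter_hsps` directly.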
--
"Arrogance is the weapon of the weak."
===========================
Italo Moreira Campelo Maia
Computer Science - UECE
Web Developer
Java, Python Programmer
My blog ^^ http://eusouolobomal.blogspot.com/
===========================