[Biopython-dev] NCBIStandalone Blast HSP parsing

Mon Oct 17 11:27:28 EDT 2005

Just to make sure I understand what you're doing:

Are the query_end and sbjct_end attributes found in the Blast output, or do
you calculate them from the other attributes in the Blast output? If they're
in the Blast output,
1) Do they always appear in the Blast output, or does it depend on the query?
In the latter case, does the modified Blast parser choke on Blast output that
does not contain these attributes?
2) Do these attributes also appear in Blast XML output? The XML parser is
easier to maintain than the text-based parser in NCBIStandalone, and may
therefore become the main Blast parser in Biopython in the long run.

--Michiel. 

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

-----Original Message-----
From: biopython-dev-bounces at portal.open-bio.org on behalf of Mark Hoebeke
Sent: Mon 10/17/2005 10:07 AM
To: biopython-dev at biopython.org
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I wanted a quick and easy way to determine the endpoints of HSPs extracted
from Blast reports parsed with NCBIStandalone. Unfortunately, the HSP class
lacks the query_end and sbjct_end attributes. Googling around led me to a
recipe describing how to compute the endpoints using the total length, gap
length and other niceties. Not exactly intuitive to me.
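
For reference, the kind of calculation such a recipe performs can be sketched
roughly as follows. This is only an illustration, not the recipe itself: the
function name hsp_end and its parameters are hypothetical, and it assumes that
gaps are written as "-" and that the start coordinate and the aligned string
are already available (for example from the HSP's query_start and query
attributes).

```python
def hsp_end(start, aligned_seq, step=1, forward=True):
    """Return the coordinate of the last aligned residue of an HSP.

    start       -- 1-based coordinate of the first aligned residue
    aligned_seq -- aligned string as printed by Blast, gaps written as "-"
    step        -- coordinate units per aligned character (assumption: 3 for
                   a translated nucleotide sequence, 1 otherwise)
    forward     -- False when coordinates decrease along the alignment
    """
    # Gap characters do not consume any positions of the sequence itself.
    residues = len(aligned_seq) - aligned_seq.count("-")
    if residues == 0:
        raise ValueError("aligned sequence contains no residues")
    offset = residues * step - 1
    return start + offset if forward else start - offset


if __name__ == "__main__":
    print(hsp_end(5, "ACG--TA"))
    print(hsp_end(100, "ACGT", forward=False))
```

Every caller has to know about gaps, strand and translation, which is why
reading the end coordinate directly from the report is attractive.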

Hence I dove into the NCBIStandalone and HSP modules and made some slight
modifications. Basically I added the two attributes to HSP and the following
snippets to NCBIStandalone (release 1.4b):

972c972
<     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
---
>     _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
977,978c977
<         start, seq, end = m.groups()
<       self._hsp.query_end=string.atoi(end);
---
>         start, seq = m.groups()
997,998c996,997
<         start, seq, end = _re_search(
<             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
---
>         start, seq = _re_search(
>             r"Sbjct: (\d+)\s*(.+) \d", line,
1014c1013
<       self._hsp.sbjct_end=string.atoi(end)
---
>

Looks too easy to be true, I thought. Now sorry if I'm missing some important
issues here (I'm quite new to Biopython), but is there a reason no one has
made this patch yet?
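
The core of the change is capturing the trailing number on each alignment
line. A minimal standalone sketch of that idea is given here, using the
"Query:" pattern from the diff. The function name parse_query_line and the
sample line are made up for illustration, and int() stands in for
string.atoi, which no longer exists in Python 3.

```python
import re

# Same pattern as in the patch: start coordinate, aligned text, end coordinate.
QUERY_RE = re.compile(r"Query: (\d+)\s*(.+) (\d+)")


def parse_query_line(line):
    """Split one 'Query:' alignment line into (start, sequence, end)."""
    m = QUERY_RE.search(line)
    if m is None:
        raise ValueError("not a Query alignment line: %r" % line)
    start, seq, end = m.groups()
    # strip() drops any extra padding left between sequence and end coordinate.
    return int(start), seq.strip(), int(end)


if __name__ == "__main__":
    # Illustrative line in the old text report layout, not real Blast output.
    print(parse_query_line("Query: 12   MKV-LAAGIVG 21"))
```

An HSP usually spans several such lines, so the value to keep is presumably
the one from the last line parsed. Report layouts that do not start the line
with "Query: " would need a different pattern.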

Thanks for any comments (flames and others.)

Cheers,

Mark

- --
-
----------------------------Mark.Hoebeke at jouy.inra.fr-----------------------
Unité Statistique & Génome    _/_/_/    _/_/_/  http://stat.genopole.cnrs.fr
Tél : +33 (0)1 60 87 38 03  _/        _/          Fax : +33 (0)1 60 87 38 09
Tour Evry 2,                 _/_/    _/  _/_/         523, pl. des Terrasses
F-91000,                        _/  _/    _/                            Evry
PGP : A2AD52E3           _/_/_/      _/_/_/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDU7ARa3nTV6KtUuMRArBqAKC/m4i+VpVaU3clvOkMuYkfRrZQ+QCfbRKg
gBBW5wNKS3sb/Uqr31eumx8=
=vSWV
-----END PGP SIGNATURE-----