[Biopython-dev] NCBIStandalone Blast HSP parsing

Mark Hoebeke Mark.Hoebeke at jouy.inra.fr
Mon Oct 17 13:05:53 EDT 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michiel De Hoon wrote:
> Just to make sure I understand what you're doing:
> 
> Are the query_end and sbjct_end attributes found in the Blast output, or do
> you calculate them from the other attributes in the Blast output? 

I directly grab them from the Blast report.
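
For contrast, the computed route (the kind of recipe mentioned in the original message quoted below) can be sketched as follows. This is only an illustration: the function name end_from_alignment is hypothetical, and it assumes forward-strand coordinates with one coordinate unit per residue, which does not hold for translated searches or minus-strand hits.

```python
def end_from_alignment(start, aligned_seq, gap_char="-"):
    """Return the 1-based end coordinate implied by a start and an aligned string.

    Assumption: each non-gap character advances the coordinate by one.
    """
    residues = len(aligned_seq) - aligned_seq.count(gap_char)
    return start + residues - 1


# Made-up aligned query string: 14 residues and 3 gap characters.
computed_end = end_from_alignment(1, "MKV-LAAGIV--GLLLA")
```

Reading the number straight from the report avoids having to get these assumptions right.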

> If they're
> in the Blast output,
> 1) Do they always appear in the Blast output, or does it depend on the query?
> In the latter case, does the modified Blast parser choke on Blast output that
> does not contain these attributes?

The patterns in the official release 1.4b module check for "a single
digit" following the string of sequence characters at the end of the
alignment lines.

All I did was to extend the patterns to "one or more digits" and to
capture them in order to store their contents in the HSP attributes. So
AFAIK, the patch does not change the way reports are currently parsed.
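
A minimal sketch of the difference, using the two patterns from the patch on a made-up alignment line (the line and the variable names are illustrative, not taken from a real report). It uses int() where the patch uses string.atoi, since string.atoi no longer exists in Python 3.

```python
import re

# Pattern as in release 1.4b: only checks for a trailing digit.
OLD_QUERY_RE = re.compile(r"Query: (\d+)\s*(.+) \d")
# Patched pattern: captures the whole trailing number.
NEW_QUERY_RE = re.compile(r"Query: (\d+)\s*(.+) (\d+)")

# Hypothetical alignment line in the classic text report layout.
line = "Query: 1   MKVLAAGIVGLLLA 14"

old_start, old_seq = OLD_QUERY_RE.search(line).groups()
new_start, new_seq, new_end = NEW_QUERY_RE.search(line).groups()

# Both patterns extract the same start and sequence;
# only the patched one also yields the end coordinate.
query_end = int(new_end)
```

On this line both patterns return the same start and sequence, which is why the patch should not affect how existing reports are parsed.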

> 2) Do these attributes also appear in Blast XML output? The XML parser is
> easier to maintain than the text-based parser in NCBIStandalone, and may
> therefore become the main Blast parser in Biopython in the long run.

With the sequence set I'm currently working on (and with NCBI Blast
2.2.12), the XML output does indeed contain the elements Hsp_query-to
and Hsp_hit-to, which seem to have the intended meaning.
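
As a rough sketch of how those elements could be read, the snippet below uses the standard library's xml.etree.ElementTree rather than the Biopython XML parser. The XML fragment is hand-written, not real Blast output, and the helper name hsp_endpoints and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; only the element names follow the Blast XML output.
HSP_XML = """
<Hsp>
  <Hsp_num>1</Hsp_num>
  <Hsp_query-from>1</Hsp_query-from>
  <Hsp_query-to>14</Hsp_query-to>
  <Hsp_hit-from>23</Hsp_hit-from>
  <Hsp_hit-to>36</Hsp_hit-to>
</Hsp>
"""


def hsp_endpoints(hsp):
    """Collect start and end coordinates from a single <Hsp> element."""
    return {
        "query_start": int(hsp.findtext("Hsp_query-from")),
        "query_end": int(hsp.findtext("Hsp_query-to")),
        "sbjct_start": int(hsp.findtext("Hsp_hit-from")),
        "sbjct_end": int(hsp.findtext("Hsp_hit-to")),
    }


coords = hsp_endpoints(ET.fromstring(HSP_XML))
```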

I suppose I should be able to adapt the XML parser while I'm at it, if
the patch is officially accepted.

Mark

> 
> --Michiel. 
> 
> 
> 
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1150 St Nicholas Avenue
> New York, NY 10032
> 
> 
> 
> -----Original Message-----
> From: biopython-dev-bounces at portal.open-bio.org on behalf of Mark Hoebeke
> Sent: Mon 10/17/2005 10:07 AM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
>  
> Hi all,
> 
> I wanted a quick and easy way to determine the endpoints of HSPs extracted
> from Blast reports parsed with NCBIStandalone. Unfortunately the HSP class
> lacks the query_end and sbjct_end attributes. Googling around led me to a
> recipe describing how to compute the endpoint using the total length, gap
> length and other niceties. Not exactly intuitive to me.
> 
> Hence I dove into the NCBIStandalone and HSP modules and made some slight
> modifications. Basically I added the two attributes to HSP and the following
> snippets to NCBIStandalone (release 1.4b):
> 
> 972c972
> <     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
> ---
> >     _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
> 
> 977,978c977
> <         start, seq, end = m.groups()
> <       self._hsp.query_end=string.atoi(end);
> ---
> >         start, seq = m.groups()
> 
> 997,998c996,997
> <         start, seq, end = _re_search(
> <             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
> ---
> >         start, seq = _re_search(
> >             r"Sbjct: (\d+)\s*(.+) \d", line,
> 
> 1014c1013
> <       self._hsp.sbjct_end=string.atoi(end)
> ---
> 
> 
> Looks too easy to be true, I thought. Now sorry if I'm missing some important
> issues here (I'm quite new to Biopython), but is there a reason no one has
> made this patch yet?
> 
> Thanks for any comments (flames and others.)
> 
> Cheers,
> 
> Mark
> 
> 

- --
- -------------------------Mark.Hoebeke at jouy.inra.fr---------------------
Unité Statistique & Génome                                    Unité MIG
+33 (0)1 60 87 38 03                   Tél.        +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                   Fax.        +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses            INRA - Domaine de Vilvert
F - 91000 Evry                            F - 78352 Jouy-en-Josas CEDEX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDU9nxa3nTV6KtUuMRApqXAJ9a9z7J0bvigZ1NiZZxmTUziMocIgCdE0O9
EvX5Bm6f7dMcAUFGfNIO8tk=
=mWo3
-----END PGP SIGNATURE-----

