[Bioperl-l] Hierarchical location parsing

Brian Osborne brian_osborne at cognia.com
Tue Mar 29 08:09:05 EST 2005


Mark,

I didn't see any "join(join..." statements in that Genbank entry, as part of
a source feature or anywhere else. I used this URL:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=21909536


Brian O.


-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Mark Hoebeke
Sent: Friday, March 25, 2005 3:24 PM
To: Brian Osborne
Cc: bioperl-l at portal.open-bio.org
Subject: RE: [Bioperl-l] Hierarchical location parsing


Brian,

an example of a nested location is found in the 'source' feature of the
Genbank entry having accession AE014074 (Streptococcus pyogenes MGAS315
complete genome). As the file is over 1 MB in size once compressed, it
might not be a good idea to attach it to this mail, which is CC'ed to
bioperl-l ;D

Regarding the performance hit of my fix, I feared that replacing a
compiled regexp with a split and a loop over every character of the
string could have a significant impact. As it stands, I timed a simple
parsing script swallowing Genbank files and spitting out each feature
location as a GFF string, on 131 complete microbial genomes. There is no
difference in output between the bioperl-live FTLocationFactory and its
patched version (basically meaning that this test sample did not contain
nested locations). The times are comparable, with even a slight
advantage to the patched version (915.66user 19.53system 15:42.19elapsed
99%CPU vs. 938.06user 17.33system 16:04.15elapsed 99%CPU).
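To make the "split and a loop over every character" idea concrete, here is a
minimal sketch in Python rather than Perl. It is not the actual
FTLocationFactory patch. The function names split_top_level and
parse_location are hypothetical, and only the join, order and complement
operators are assumed. The idea is to track parenthesis depth while scanning,
so that only commas at depth 0 separate sub-locations:

```python
def split_top_level(location: str) -> list[str]:
    """Split a location string on commas that are not inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(location):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced ')' at offset {i}")
        elif ch == "," and depth == 0:
            # A comma at depth 0 ends the current sub-location.
            parts.append(location[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced '(' in location string")
    parts.append(location[start:])
    return parts


def parse_location(location: str):
    """Return a nested (operator, [children]) tuple, or the plain range string.

    Hypothetical helper: expects a single location expression, not a
    comma-separated list of them.
    """
    location = location.strip()
    for op in ("join", "order", "complement"):
        prefix = op + "("
        if location.startswith(prefix) and location.endswith(")"):
            inner = location[len(prefix):-1]
            return (op, [parse_location(p) for p in split_top_level(inner)])
    return location
```

For example, parse_location("join(1..10,join(20..30,40..50))") yields
("join", ["1..10", ("join", ["20..30", "40..50"])]), keeping the inner join
as a single child instead of cutting it at its internal comma.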

When comparing the outputs of the parser run on a file with a nested
location, it appears that without the bugfix, the nested location yields
an incorrect GFF string as shown by the diff below.

[mark at homer Loc]$ diff MGAS315 MGAS315_patched
1c1
< join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521),)
---
> join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521))))))
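A quick way to see what is wrong with the first string is to count
parentheses. The sketch below is in Python, and unclosed_parens is a
hypothetical helper name. It assumes the line breaks inside the two strings
come from mail wrapping, so that "12301" and "14" belong together as 1230114:

```python
def unclosed_parens(location: str) -> int:
    """Return how many '(' are still open at the end of the string."""
    depth = 0
    for ch in location:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("')' without matching '('")
    return depth


# The two location strings from the diff, with the wrapping removed.
unpatched = (
    "join(1..749107,join(788646..977266,join(1018339..1137553,"
    "join(1171973..1230114,join(1271911..1313193,"
    "join(1351400..1410541,1450556..1900521),)"
)
patched = (
    "join(1..749107,join(788646..977266,join(1018339..1137553,"
    "join(1171973..1230114,join(1271911..1313193,"
    "join(1351400..1410541,1450556..1900521))))))"
)

print(unclosed_parens(unpatched))  # 4: six join( opened, only two closed
print(unclosed_parens(patched))    # 0: all six join( are closed
```

The unpatched output also ends in ",)", which leaves an empty element in one
of the joins. The patched output is balanced and has no empty elements.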

I'm still cautious about the bugfix because I only produced the diffs
on microbial genomes, which probably have simpler location definitions
than those of higher eukaryotes.

Greetings,

Mark

On Friday, 25 March 2005 at 11:52 -0500, Brian Osborne wrote:
> Mark,
>
> Can you also attach the sequence file that you used in order to test your
> code? That way I can write a test specifically for the parsing of
> hierarchical locations.
>
> You wrote "I'm not sure the new patch won't slow down location parsing
> considerably..." Have you actually timed the parsing using the old and new
> code?
>
> Thanks again,
>
> Brian O.
>

--
--------------------------Mark.Hoebeke at jouy.inra.fr----------------------
Unité Statistique & Génome                                     Unité MIG
+33 (0)1 60 87 38 03                  Tél.          +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                  Fax.          +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses             INRA - Domaine de Vilvert
F - 91000 Evry                             F - 78352 Jouy-en-Josas CEDEX




