[Bioperl-l] Hierarchical location parsing

Mark Hoebeke Mark.Hoebeke at jouy.inra.fr
Fri Mar 25 15:23:42 EST 2005


Brian,

an example of a nested location is found in the 'source' feature of the
GenBank entry with accession AE014074 (Streptococcus pyogenes MGAS315
complete genome). As the file is over 1 MB in size once compressed, it
might not be a good idea to attach it to this mail, which is CC'ed to
bioperl-l ;D

Regarding the performance hit of my fix, I feared that replacing a
compiled regexp with a split and a loop over every character of the
string could have a significant impact. To check, I timed a simple
parsing script swallowing GenBank files and spitting out each feature
location as a GFF string, on 131 complete microbial genomes. There is no
difference in output between the bioperl-live FTLocationFactory and its
patched version (basically meaning that this test sample did not contain
nested locations). The times are comparable, with even a slight
advantage to the patched version (patched: 915.66user 19.53system
15:42.19elapsed 99%CPU vs. bioperl-live: 938.06user 17.33system
16:04.15elapsed 99%CPU).
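
To illustrate why a plain split on commas is not enough for nested
locations, here is a minimal sketch of a character loop that tracks
parenthesis depth. It is written in Python rather than Perl, it is not
the actual FTLocationFactory patch, and the names split_top_level and
parse_location are hypothetical.

```python
def split_top_level(location):
    """Split a location string at commas that are not inside parentheses."""
    parts = []
    depth = 0   # current parenthesis nesting level
    start = 0   # index where the current piece begins
    for i, ch in enumerate(location):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(location[start:i])
            start = i + 1
    parts.append(location[start:])
    return parts


def parse_location(location):
    """Return (operator, [children]) for an operator such as join(...),
    or the unchanged string for a leaf range such as 1..10."""
    head, paren, rest = location.partition("(")
    if paren and rest.endswith(")"):
        children = split_top_level(rest[:-1])  # drop the closing ")"
        return (head, [parse_location(child) for child in children])
    return location
```

Calling parse_location("join(1..10,join(20..30,40..50))") keeps the inner
join together as one child, whereas splitting that string on every comma
would cut the inner join in two.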

When comparing the outputs of the parser run on a file with a nested
location, it appears that without the bugfix, the nested location yields
an incorrect GFF string as shown by the diff below.

[mark at homer Loc]$ diff MGAS315 MGAS315_patched
1c1
< join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521),)
---
> join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521))))))
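
One mechanical way to spot this kind of damage is to check that the
parentheses in the location string balance. The following is a small
sketch in Python (not part of BioPerl; parens_balanced is a hypothetical
helper), applied to the two strings from the diff above.

```python
def parens_balanced(location):
    """Return True if every "(" has a matching ")" in the right order."""
    depth = 0
    for ch in location:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" with no matching "("
                return False
    return depth == 0          # leftover "(" means unbalanced


# Location strings copied from the diff: unpatched vs. patched output.
unpatched = ("join(1..749107,join(788646..977266,join(1018339..1137553,"
             "join(1171973..1230114,join(1271911..1313193,"
             "join(1351400..1410541,1450556..1900521),)")
patched = ("join(1..749107,join(788646..977266,join(1018339..1137553,"
           "join(1171973..1230114,join(1271911..1313193,"
           "join(1351400..1410541,1450556..1900521))))))")

print(parens_balanced(unpatched), parens_balanced(patched))
```

A check like this only flags unbalanced output; it says nothing about
whether the coordinates themselves are right.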

I'm still cautious about the bugfix because I only produced the diffs
on microbial genomes, which probably have simpler location definitions
than higher eukaryotes.

Greetings,

Mark

On Friday 25 March 2005 at 11:52 -0500, Brian Osborne wrote:
> Mark,
> 
> Can you also attach the sequence file that you used in order to test your
> code? That way I can write a test specifically for the parsing of
> hierarchical locations.
> 
> You wrote "I'm not sure the new patch won't slow down location parsing
> considerably..." Have you actually timed the parsing using the old and new
> code?
> 
> Thanks again,
> 
> Brian O.
> 

-- 
--------------------------Mark.Hoebeke at jouy.inra.fr----------------------
Unité Statistique & Génome                                     Unité MIG
+33 (0)1 60 87 38 03                  Tél.          +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                  Fax.          +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses             INRA - Domaine de Vilvert
F - 91000 Evry                             F - 78352 Jouy-en-Josas CEDEX
