[Bioperl-l] Hierarchical location parsing

Hoebeke Mark Mark.Hoebeke at jouy.inra.fr
Wed Mar 30 01:20:18 EST 2005


Hi Brian,

you are right, I downloaded the Genbank file again from:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Streptococcus_pyogenes_MGAS315/AE14074.gbk

and indeed, the source feature now has an ordinary simple location.
It seems they corrected the original submission: the modification
date now reads "mar 9", whereas the copy I initially fetched was
dated "18 jul 2002" (which happens to be the date given in the
LOCUS line).

I guess this makes parsing hierarchical location descriptors a moot
point until I come up with another example...

Mark


On Tuesday 29 March 2005 at 08:09 -0500, Brian Osborne wrote:
> Mark,
> 
> I didn't see any "join(join..." statements in that Genbank entry, as part of
> a source feature or anywhere else. I used this URL:
> 
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=21909536
> 
> 
> Brian O.
> 
> 
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Mark Hoebeke
> Sent: Friday, March 25, 2005 3:24 PM
> To: Brian Osborne
> Cc: bioperl-l at portal.open-bio.org
> Subject: RE: [Bioperl-l] Hierarchical location parsing
> 
> 
> Brian,
> 
> an example of a nested location is found in the 'source' feature of the
> Genbank entry having accession AE014074 (Streptococcus pyogenes MGAS315
> complete genome). As the file is over 1 Meg in size once compressed it
> might not be a good idea to attach it to this mail which is CC'ed to
> bioperl-l ;D
> 
> Regarding the performance hit of my fix, I feared that replacing a
> compiled regexp with a split and a loop over every character of the
> string could have a significant impact. As it stands, I timed a simple
> parsing script swallowing Genbank files and spitting out each feature
> location as a GFF string, on 131 complete microbial genomes. There is no
> difference in output between the bioperl-live FTLocationFactory and its
> patched version (basically meaning that this test sample did not contain
> nested locations). The times are comparable, with even a slight
> advantage to the patched version (915.66user 19.53system 15:42.19elapsed
> 99%CPU vs. 938.06user 17.33system 16:04.15elapsed 99%CPU).
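
To put the two timings side by side, here is a small Python sketch that turns the `time` figures quoted above into relative differences. The helper name `elapsed_seconds` and the dictionary layout are illustrative only; the numbers are the ones from the message. Note that this is a single run of each version, so a gap of a few percent may well be within run-to-run noise.

```python
def elapsed_seconds(stamp: str) -> float:
    """Convert a 'MM:SS.ss' wall-clock stamp (as printed by time) to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the message: patched version first, bioperl-live second.
patched = {"user": 915.66, "system": 19.53, "elapsed": elapsed_seconds("15:42.19")}
original = {"user": 938.06, "system": 17.33, "elapsed": elapsed_seconds("16:04.15")}

for key in ("user", "system", "elapsed"):
    delta = original[key] - patched[key]   # positive means the patched run was faster
    pct = 100 * delta / original[key]
    print(f"{key:8s} original={original[key]:8.2f}s "
          f"patched={patched[key]:8.2f}s diff={delta:+7.2f}s ({pct:+.1f}%)")
```
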
> 
> When comparing the outputs of the parser run on a file with a nested
> location, it appears that without the bugfix, the nested location yields
> an incorrect GFF string as shown by the diff below.
> 
> [mark at homer Loc]$ diff MGAS315 MGAS315_patched
> 1c1
> < join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521),)
> ---
> > join(1..749107,join(788646..977266,join(1018339..1137553,join(1171973..1230114,join(1271911..1313193,join(1351400..1410541,1450556..1900521))))))
> 
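
For readers who want to experiment with nested locations like the one in the diff above, here is a minimal Python sketch of the general idea of walking the string character by character and splitting only at commas that sit outside any parentheses. It is not the FTLocationFactory patch itself (which is Perl and is not shown in this thread); the function names `split_top_level`, `parse_location` and `flatten` are made up for illustration, and only `join(...)` and plain `start..end` ranges are handled.

```python
def split_top_level(args: str) -> list[str]:
    """Split a comma-separated string at commas that are outside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(args):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in location string")
        elif ch == "," and depth == 0:
            parts.append(args[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced '(' in location string")
    parts.append(args[start:])
    return parts


def parse_location(loc: str):
    """Return ("join", [children]) for a join, or (start, end) for a simple range."""
    loc = loc.strip()
    if loc.startswith("join(") and loc.endswith(")"):
        inner = loc[len("join("):-1]
        return ("join", [parse_location(part) for part in split_top_level(inner)])
    start, end = loc.split("..")
    return (int(start), int(end))


def flatten(tree):
    """Yield the simple (start, end) ranges of a parsed location, left to right."""
    if tree[0] == "join":
        for child in tree[1]:
            yield from flatten(child)
    else:
        yield tree


# The well-formed string from the patched output in the diff above.
PATCHED = ("join(1..749107,join(788646..977266,join(1018339..1137553,"
           "join(1171973..1230114,join(1271911..1313193,"
           "join(1351400..1410541,1450556..1900521))))))")

for start, end in flatten(parse_location(PATCHED)):
    print(start, end)
```

Feeding the unpatched output (the one ending in `),)`) to `parse_location` raises a `ValueError` in this sketch, because the parentheses no longer balance.
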
> I'm still cautious about the bugfix because I only produced the diffs
> on microbial genomes, which probably have simpler location definitions
> than higher eukaryotes.
> 
> Greetings,
> 
> Mark
> 
> On Friday 25 March 2005 at 11:52 -0500, Brian Osborne wrote:
> > Mark,
> >
> > Can you also attach the sequence file that you used in order to test your
> > code? That way I can write a test specifically for the parsing of
> > hierarchical locations.
> >
> > You wrote "I'm not sure the new patch won't slow down location parsing
> > considerably..." Have you actually timed the parsing using the old and new
> > code?
> >
> > Thanks again,
> >
> > Brian O.
> >
> 
> --
> --------------------------Mark.Hoebeke at jouy.inra.fr----------------------
> Unité Statistique & Génome                                     Unité MIG
> +33 (0)1 60 87 38 03                  Tél.          +33 (0)1 34 65 28 85
> +33 (0)1 60 87 38 09                  Fax.          +33 (0)1 34 65 29 01
> Tour Evry 2, 523 pl. des Terrasses             INRA - Domaine de Vilvert
> F - 91000 Evry                             F - 78352 Jouy-en-Josas CEDEX
> 
-- 
-------------------------Mark.Hoebeke at jouy.inra.fr---------------------
Unité Statistique & Génome                                    Unité MIG
+33 (0)1 60 87 38 03                   Tél.        +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                   Fax.        +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses            INRA - Domaine de Vilvert
F - 91000 Evry                            F - 78352 Jouy-en-Josas CEDEX
