[Bioperl-l] Memory requirements for conversion from embl to genbank
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Aug 31 20:33:48 UTC 2006
Sendu Bala wrote:
> Martin MOKREJŠ wrote:
>
>> I observe the same. Test case here. Please add it to the test cases.
>> It will be helpful in the future when the parser should cope with the
>> two /note feature lines.
So, to recap, the script used to generate UTRdb (supposedly UTRdb_gen)
mangles its GenBank- or EMBL-formatted input. According to the notes
on the FTP server, EMBL rel. 86 was used to generate this record:
ID 5HGB000664 standard; mRNA; VRL; 1892 BP.
XX
AC BB199698;
XX
DT 20-NOV-2002 (Rel. 16, Created)
DT 20-NOV-2002 (Rel. 16, Last updated, Version 1)
XX
DE 5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
XX
DR EMBL; AJ428955;
DR UTR; CC221018;
XX
OS Hepatitis GB virus B
OS Encephalomyocarditis virus
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC Cardiovirus.
XX
UT 5'UTR;
XX
FH Key Location/Qualifiers
FH
FT 5'UTR 1..1892
FT /source="EMBL::AJ428955:1..1892"
FT /product="non-structural polyprotein"
FT VECTOR 477..1274
FT /source="EMBL::AJ428955:477..1274"
FT /evidence="Similarity"
FT /db_xref="EMBL:"
FT /note="Possible vector contamination"
FT /note="Length=798 BP. Identities=99.6%"
XX
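The record above carries two OS lines, so records of this kind can be flagged before they ever reach the parser. Below is a minimal sketch in Python (not BioPerl code), assuming that each OS line names exactly one organism and that every record ends with the usual // terminator. The function name species_per_record is made up for illustration.

```python
def species_per_record(lines):
    """Yield (record_id, [species, ...]) for each EMBL record in `lines`.

    Assumption: every OS line holds one complete organism name.
    """
    record_id, species = None, []
    for line in lines:
        code = line[:2]            # two-letter EMBL line code
        rest = line[2:].strip()
        if code == "ID":
            parts = rest.split()
            # The first token is the entry name; newer releases append ';'
            record_id = parts[0].rstrip(";") if parts else None
            species = []
        elif code == "OS":
            species.append(rest)
        elif code == "//":         # end of record
            yield record_id, species
            record_id, species = None, []
    if record_id is not None:      # last record lacked a terminator
        yield record_id, species


sample = [
    "ID 5HGB000664 standard; mRNA; VRL; 1892 BP.",
    "OS Hepatitis GB virus B",
    "OS Encephalomyocarditis virus",
    "//",
]
multi = [(rid, sp) for rid, sp in species_per_record(sample) if len(sp) > 1]
```

Here multi lists only the records that name more than one organism, which is the case under discussion.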
>
>
> Well the cause of the hang is the multiple species defined for one
> sequence. Is that valid? Desired? Should the fix be to somehow store and
> be able to output multiple species again, or to ignore all but one of
> the species? You have two sequences with this problem in the large file
> originally posted.
But the original record in both GenBank and EMBL does make sense, right?
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=21727885
LOCUS GVI428955 8027 bp mRNA linear VRL 15-APR-2005
DEFINITION Hepatitis GB virus B subgenomic replicon neoRepB.
ACCESSION AJ428955
VERSION AJ428955.1 GI:21727885
KEYWORDS core-neo fusion protein; core-neo gene; polyprotein.
SOURCE Hepatitis GB virus B
ORGANISM Hepatitis GB virus B
Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
REFERENCE 1
AUTHORS De Tomassi,A., Pizzuti,M., Graziani,R., Sbardellati,A.,
Altamura,S., Paonessa,G. and Traboni,C.
TITLE Cell clones selected from the Huh7 human hepatoma cell line support
efficient replication of a subgenomic GB virus B replicon
JOURNAL J. Virol. 76 (15), 7736-7746 (2002)
PUBMED 12097587
REFERENCE 2 (bases 1 to 8027)
AUTHORS Traboni,C.
TITLE Direct Submission
JOURNAL Submitted (22-JAN-2002) Traboni C., Biochemistry, IRBM P.Angeletti,
via Pontina, km.30, 600. 00040 Pomezia (Roma), ITALY
COMMENT related sequence AJ277947.
FEATURES Location/Qualifiers
source join(1..1281,1893..8027)
/organism="Hepatitis GB virus B"
/mol_type="mRNA"
/isolate="FL3"
/db_xref="taxon:39113"
/focus
source 1282..1892
/organism="Encephalomyocarditis virus"
/mol_type="mRNA"
/db_xref="taxon:12104"
5'UTR 1..445
/experiment="experimental evidence, no additional details
recorded"
CDS 446..1273
/function="core-neo fusion protein"
/codon_start=1
/product="neomycin phosphotransferase"
/protein_id="CAD21956.1"
/db_xref="GI:21727886"
/db_xref="GOA:Q8JKE5"
/db_xref="InterPro:IPR002575"
/db_xref="UniProtKB/TrEMBL:Q8JKE5"
/translation="MPVISTQTGRAMIEQDGLHAGSPAAWVERLFGYDWAQQTIGCSD
AAVFRLSAQGRPVLFVKTDLSGALNELQDEAARLSWLATTGVPCAAVLDVVTEAGRDW
LLLGEVPGQDLLSSHLAPAEKVSIMADAMRRLHTLDPATCPFDHQAKHRIERARTRME
AGLVDQDDLDEEHQGLAPAELFARLKARMPDGEDLVVTHGDACLPNIMVENGRFSGFI
DCGRLGVADRYQDIALATRDIAEELGGEWADRFLVLYGIAAPDSQRIAFYRLLDEFF"
misc_feature 1282..1892
/note="internal ribosome entry site (IRES)"
[...]
The above official GenBank record cannot be parsed: the parsing code
silently falls through and exits with no data written out. I have filed
bug #2087.
Go to http://www.ebi.ac.uk/cgi-bin/emblfetch and search for AJ428955
ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
AC AJ428955;
XX
DT 09-JUL-2002 (Rel. 72, Created)
DT 15-APR-2005 (Rel. 83, Last updated, Version 4)
XX
DE Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW core-neo fusion protein; core-neo gene; polyprotein.
XX
OS Hepatitis GB virus B
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS Encephalomyocarditis virus
OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC Cardiovirus.
XX
RN [1]
RP 1-8027
RA Traboni C.;
RT ;
RL Submitted (22-JAN-2002) to the EMBL/GenBank/DDBJ databases.
RL Traboni C., Biochemistry, IRBM P.Angeletti, via Pontina, km.30, 600. 00040
RL Pomezia (Roma), ITALY.
XX
RN [2]
RX DOI; 10.1128/JVI.76.15.7736-7746.2002
RX PUBMED; 12097587.
RA De Tomassi A., Pizzuti M., Graziani R., Sbardellati A., Altamura S.,
RA Paonessa G., Traboni C.;
RT "Cell clones selected from the Huh7 human hepatoma cell line support
RT efficient replication of a subgenomic GB virus B replicon";
RL J. Virol. 76(15):7736-7746(2002).
XX
CC related sequence AJ277947
XX
FH Key Location/Qualifiers
FH
FT source 1..8027
FT /organism="Hepatitis GB virus B"
FT /focus
FT /isolate="FL3"
FT /mol_type="mRNA"
FT /db_xref="taxon:39113"
FT source join(1..1281,1893..8027)
FT /organism="Hepatitis GB virus B"
FT /mol_type="mRNA"
FT /db_xref="taxon:39113"
FT source 1282..1892
FT /organism="Encephalomyocarditis virus"
FT /mol_type="mRNA"
FT /db_xref="taxon:12104"
FT 5'UTR 1..445
FT /experiment="experimental evidence, no additional details
FT recorded"
FT CDS 446..1273
FT /product="neomycin phosphotransferase"
FT /function="core-neo fusion protein"
FT /db_xref="GOA:Q8JKE5"
FT /db_xref="HSSP:P00552"
FT /db_xref="InterPro:IPR002575"
FT /db_xref="InterPro:IPR011009"
FT /db_xref="InterPro:IPR012149"
FT /db_xref="UniProtKB/TrEMBL:Q8JKE5"
FT /protein_id="CAD21956.1"
FT /translation="MPVISTQTGRAMIEQDGLHAGSPAAWVERLFGYDWAQQTIGCSDA
FT AVFRLSAQGRPVLFVKTDLSGALNELQDEAARLSWLATTGVPCAAVLDVVTEAGRDWLL
FT LGEVPGQDLLSSHLAPAEKVSIMADAMRRLHTLDPATCPFDHQAKHRIERARTRMEAGL
FT VDQDDLDEEHQGLAPAELFARLKARMPDGEDLVVTHGDACLPNIMVENGRFSGFIDCGR
FT LGVADRYQDIALATRDIAEELGGEWADRFLVLYGIAAPDSQRIAFYRLLDEFF"
FT misc_feature 1282..1892
FT /note="internal ribosome entry site (IRES)"
[...]
This official EMBL record cannot be parsed either:
------------- EXCEPTION -------------
MSG: Can't see new qualifier in: /focus
from:
/organism="Hepatitis GB virus B"
/focus
/isolate="FL3"
/mol_type="mRNA"
/db_xref="taxon:39113"
STACK Bio::SeqIO::embl::_read_FTHelper_EMBL /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:1245
STACK Bio::SeqIO::embl::next_seq /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:383
STACK toplevel testparsing.pl:20
--------------------------------------
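The exception is raised on /focus, which is a legitimate qualifier that simply takes no value. To show what tolerating it could look like, here is a rough sketch in Python. It is not the logic of Bio::SeqIO::embl, and the name parse_qualifiers is made up for illustration. It assumes the leading "FT" has already been stripped, and it joins continuation lines with a single space (fine for free text, wrong for /translation).

```python
def parse_qualifiers(lines):
    """Return (name, value) pairs; value is None for flag qualifiers."""
    qualifiers = []
    open_value = False  # True while inside an unterminated quoted value
    for raw in lines:
        text = raw.strip()
        if open_value:
            # Continuation of the previous quoted value
            name, value = qualifiers[-1]
            qualifiers[-1] = (name, value + " " + text)
            open_value = not text.endswith('"')
            continue
        if not text.startswith("/"):
            raise ValueError(f"expected a qualifier, got: {text!r}")
        name, sep, value = text[1:].partition("=")
        if not sep:
            # No '=' at all: a valueless qualifier such as /focus
            qualifiers.append((name, None))
            continue
        qualifiers.append((name, value))
        closed = len(value) > 1 and value.endswith('"')
        open_value = value.startswith('"') and not closed
    # Drop the surrounding quotes from quoted values
    return [(n, v.strip('"') if v is not None else None)
            for n, v in qualifiers]


source_quals = parse_qualifiers([
    '/organism="Hepatitis GB virus B"',
    '/focus',
    '/isolate="FL3"',
    '/mol_type="mRNA"',
    '/db_xref="taxon:39113"',
])
```

With the lines from the exception message, /focus comes back as ("focus", None) instead of aborting the whole record.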
Shall I file another bug report, or attach this to bug #2077, my favourite one? ;-)
>
> If this has 'worked' for you before it is probably because a completely
> meaningless composite species classification was generated. The new
> taxonomy system 'ensures' that the taxonomic data parsed is sane enough
> to be output correctly again.
I don't have the originally generated files anymore but parsing finished
"successfully" with "some" data written out. ;)
--
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs