[Bioperl-l] Memory requirements for conversion from embl to genbank

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Thu Aug 31 20:33:48 UTC 2006


Sendu Bala wrote:
> Martin MOKREJŠ wrote:
> 
>> I observe the same. Testcase here. Please push it into the test cases.
>> It will be helpful in the future when the parser should cope with the
>> two /note feature lines.

So, to recap, the script used to generate UTRdb (supposedly UTRdb_gen)
mangles its GenBank- or EMBL-formatted input. According to the notes
on the FTP server, EMBL rel. 86 was used to generate this record:


ID   5HGB000664 standard; mRNA; VRL; 1892 BP.
XX
AC   BB199698;
XX
DT   20-NOV-2002 (Rel. 16, Created)
DT   20-NOV-2002 (Rel. 16, Last updated, Version 1)
XX
DE   5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
XX
DR   EMBL; AJ428955;
DR   UTR; CC221018;
XX
OS   Hepatitis GB virus B
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..1892
FT                   /source="EMBL::AJ428955:1..1892"
FT                   /product="non-structural polyprotein"
FT   VECTOR          477..1274
FT                   /source="EMBL::AJ428955:477..1274"
FT                   /evidence="Similarity"
FT                   /db_xref="EMBL:"
FT                   /note="Possible vector contamination"
FT                   /note="Length=798 BP. Identities=99.6%"
XX
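
The record above carries two OS lines followed by the OC lines of both
lineages. As a minimal sketch of how the organisms could be kept apart
instead of merged into one composite classification, the Python below
pairs each OS line with one lineage. It is not BioPerl code: the function
name species_from_embl is hypothetical, and the pairing relies on the
assumption that every lineage ends with a period and that no taxon name
contains one.

```python
def species_from_embl(record_text):
    """Pair each OS organism with one OC lineage from an EMBL-style record.

    Assumption (heuristic): each lineage is terminated by a period, so the
    n-th period-terminated OC chunk belongs to the n-th OS line.
    """
    organisms = []
    lineage_text = []
    for line in record_text.splitlines():
        # EMBL lines start with a two-letter code, data begins at column 6.
        code, data = line[:2], line[5:].strip()
        if code == "OS":
            organisms.append(data)
        elif code == "OC":
            lineage_text.append(data)

    lineages = []
    for chunk in " ".join(lineage_text).split("."):
        names = [name.strip() for name in chunk.split(";") if name.strip()]
        if names:
            lineages.append(names)

    if len(organisms) != len(lineages):
        raise ValueError(
            "found %d OS lines but %d OC lineages"
            % (len(organisms), len(lineages))
        )
    return list(zip(organisms, lineages))
```

The same function also accepts the layout where each OS line is followed
directly by its own OC lines, because only the order of the lines matters.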


> 
> 
> Well the cause of the hang is the multiple species defined for one
> sequence. Is that valid? Desired? Should the fix be to somehow store and
> be able to output multiple species again, or to ignore all but one of
> the species? You have two sequences with this problem in the large file
> originally posted.

But the original record in both GenBank and EMBL does make sense, right?


http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=21727885

LOCUS       GVI428955               8027 bp    mRNA    linear   VRL 15-APR-2005
DEFINITION  Hepatitis GB virus B subgenomic replicon neoRepB.
ACCESSION   AJ428955
VERSION     AJ428955.1  GI:21727885
KEYWORDS    core-neo fusion protein; core-neo gene; polyprotein.
SOURCE      Hepatitis GB virus B
  ORGANISM  Hepatitis GB virus B
            Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
REFERENCE   1
  AUTHORS   De Tomassi,A., Pizzuti,M., Graziani,R., Sbardellati,A.,
            Altamura,S., Paonessa,G. and Traboni,C.
  TITLE     Cell clones selected from the Huh7 human hepatoma cell line support
            efficient replication of a subgenomic GB virus B replicon
  JOURNAL   J. Virol. 76 (15), 7736-7746 (2002)
   PUBMED   12097587
REFERENCE   2  (bases 1 to 8027)
  AUTHORS   Traboni,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-JAN-2002) Traboni C., Biochemistry, IRBM P.Angeletti,
            via Pontina, km.30, 600. 00040 Pomezia (Roma), ITALY
COMMENT     related sequence AJ277947.
FEATURES             Location/Qualifiers
     source          join(1..1281,1893..8027)
                     /organism="Hepatitis GB virus B"
                     /mol_type="mRNA"
                     /isolate="FL3"
                     /db_xref="taxon:39113"
                     /focus
     source          1282..1892
                     /organism="Encephalomyocarditis virus"
                     /mol_type="mRNA"
                     /db_xref="taxon:12104"
     5'UTR           1..445
                     /experiment="experimental evidence, no additional details
                     recorded"
     CDS             446..1273
                     /function="core-neo fusion protein"
                     /codon_start=1
                     /product="neomycin phosphotransferase"
                     /protein_id="CAD21956.1"
                     /db_xref="GI:21727886"
                     /db_xref="GOA:Q8JKE5"
                     /db_xref="InterPro:IPR002575"
                     /db_xref="UniProtKB/TrEMBL:Q8JKE5"
                     /translation="MPVISTQTGRAMIEQDGLHAGSPAAWVERLFGYDWAQQTIGCSD
                     AAVFRLSAQGRPVLFVKTDLSGALNELQDEAARLSWLATTGVPCAAVLDVVTEAGRDW
                     LLLGEVPGQDLLSSHLAPAEKVSIMADAMRRLHTLDPATCPFDHQAKHRIERARTRME
                     AGLVDQDDLDEEHQGLAPAELFARLKARMPDGEDLVVTHGDACLPNIMVENGRFSGFI
                     DCGRLGVADRYQDIALATRDIAEELGGEWADRFLVLYGIAAPDSQRIAFYRLLDEFF"
     misc_feature    1282..1892
                     /note="internal ribosome entry site (IRES)"
[...]


The above official GenBank record cannot be parsed: the parsing code
silently falls through and exits with no data written out. I have filed
bug #2087.



Go to http://www.ebi.ac.uk/cgi-bin/emblfetch and search for AJ428955

ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
AC   AJ428955;
XX
DT   09-JUL-2002 (Rel. 72, Created)
DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
XX
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
RN   [1]
RP   1-8027
RA   Traboni C.;
RT   ;
RL   Submitted (22-JAN-2002) to the EMBL/GenBank/DDBJ databases.
RL   Traboni C., Biochemistry, IRBM P.Angeletti, via Pontina, km.30, 600. 00040
RL   Pomezia (Roma), ITALY.
XX
RN   [2]
RX   DOI; 10.1128/JVI.76.15.7736-7746.2002
RX   PUBMED; 12097587.
RA   De Tomassi A., Pizzuti M., Graziani R., Sbardellati A., Altamura S.,
RA   Paonessa G., Traboni C.;
RT   "Cell clones selected from the Huh7 human hepatoma cell line support
RT   efficient replication of a subgenomic GB virus B replicon";
RL   J. Virol. 76(15):7736-7746(2002).
XX
CC   related sequence AJ277947
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..8027
FT                   /organism="Hepatitis GB virus B"
FT                   /focus
FT                   /isolate="FL3"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:39113"
FT   source          join(1..1281,1893..8027)
FT                   /organism="Hepatitis GB virus B"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:39113"
FT   source          1282..1892
FT                   /organism="Encephalomyocarditis virus"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:12104"
FT   5'UTR           1..445
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT   CDS             446..1273
FT                   /product="neomycin phosphotransferase"
FT                   /function="core-neo fusion protein"
FT                   /db_xref="GOA:Q8JKE5"
FT                   /db_xref="HSSP:P00552"
FT                   /db_xref="InterPro:IPR002575"
FT                   /db_xref="InterPro:IPR011009"
FT                   /db_xref="InterPro:IPR012149"
FT                   /db_xref="UniProtKB/TrEMBL:Q8JKE5"
FT                   /protein_id="CAD21956.1"
FT                   /translation="MPVISTQTGRAMIEQDGLHAGSPAAWVERLFGYDWAQQTIGCSDA
FT                   AVFRLSAQGRPVLFVKTDLSGALNELQDEAARLSWLATTGVPCAAVLDVVTEAGRDWLL
FT                   LGEVPGQDLLSSHLAPAEKVSIMADAMRRLHTLDPATCPFDHQAKHRIERARTRMEAGL
FT                   VDQDDLDEEHQGLAPAELFARLKARMPDGEDLVVTHGDACLPNIMVENGRFSGFIDCGR
FT                   LGVADRYQDIALATRDIAEELGGEWADRFLVLYGIAAPDSQRIAFYRLLDEFF"
FT   misc_feature    1282..1892
FT                   /note="internal ribosome entry site (IRES)"
[...]

This official EMBL record cannot be parsed either:

------------- EXCEPTION  -------------
MSG: Can't see new qualifier in: /focus
from:
/organism="Hepatitis GB virus B"
/focus
/isolate="FL3"
/mol_type="mRNA"
/db_xref="taxon:39113"

STACK Bio::SeqIO::embl::_read_FTHelper_EMBL /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:1245
STACK Bio::SeqIO::embl::next_seq /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO/embl.pm:383
STACK toplevel testparsing.pl:20

--------------------------------------
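
Judging by the message, the exception appears to be triggered by the
value-less /focus qualifier. The Python below is a minimal sketch, not the
Bio::SeqIO::embl implementation, of reading a qualifier block so that flag
qualifiers without "=value" are accepted. The name parse_qualifiers is
hypothetical, and joining continuation lines with a single space is an
assumption that is wrong for /translation, where the pieces should be
joined without one.

```python
import re

# "/key" optionally followed by "=value"; the value part may be absent.
QUALIFIER_RE = re.compile(r"^/(?P<key>[A-Za-z_][\w-]*)(?:=(?P<value>.*))?$")


def parse_qualifiers(lines):
    """Return (key, value) pairs; value is None for flags such as /focus."""
    parsed = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("/"):
            match = QUALIFIER_RE.match(line)
            if match is None:
                raise ValueError("malformed qualifier: %r" % line)
            parsed.append([match.group("key"), match.group("value")])
        elif parsed and parsed[-1][1] is not None:
            # Continuation of a multi-line quoted value.
            parsed[-1][1] += " " + line
        else:
            raise ValueError("unexpected line in qualifier block: %r" % line)

    # Drop the surrounding double quotes from quoted values.
    return [
        (key, value.strip('"') if value is not None else None)
        for key, value in parsed
    ]
```

Called on the five qualifier lines quoted in the exception, this returns
("focus", None) for the flag and ordinary string values for the rest.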

Shall I file another bug report, or attach this to bug #2077, my favourite one? ;-)

> 
> If this has 'worked' for you before it is probably because a completely
> meaningless composite species classification was generated. The new
> taxonomy system 'ensures' that the taxonomic data parsed is sane enough
> to be output correctly again.

I don't have the originally generated files anymore but parsing finished
"successfully" with "some" data written out. ;)

-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs


