[Bioperl-l] Memory requirements for conversion from embl to genbank

Chris Fields cjfields at uiuc.edu
Thu Aug 31 16:22:31 UTC 2006


Sendu, Martin,

This has been the problem with these particular example sequences.  The
issue is that they do NOT conform to the EMBL standard or any sane sequence
format standard.  Not that we stick to a standard vehemently ourselves, but
we expect some sane formatting.  IMHO, (as I have repeatedly stated) we
should not be responsible for trying to 'fix' broken sequence formats unless
it is sanely possible and doesn't degrade performance/quality.  

That said, I do believe we should at least issue a warning or throw the
appropriate error.  So if duplicate species are present, shouldn't an error
be thrown?

So far, here's the tally of formatting errors, for those who wish to follow:

1)  Missing quotes?  Check!

FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase

2)  Extra quotes?  Check!

FT   5'UTR           1..60
FT                   /source="EMBL::AJ487471:1..60"
FT                   /gene="f19c24.25""
FT                   /product="putative epsilon subunit of mitochondrial F1-ATPase"

3)  Extra species?  Check!

OS   Hepatitis GB virus B
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
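
For anyone who wants to pre-screen files for the three problems tallied above, here is a rough sketch in Python. It is not BioPerl code and not how the EMBL parser works; the function name check_embl_record and the warning texts are made up for illustration. It assumes the usual EMBL layout (feature key starting in column 6, qualifiers starting in column 22) and that an embedded quote is escaped by doubling it, so a well-formed qualifier always holds an even number of double quotes. A second OS line is only reported, not treated as fatal, since a long organism name might conceivably wrap.

```python
def check_embl_record(lines):
    """Return (line_number, message) warnings for one EMBL record."""
    warnings = []
    os_seen = 0
    current = None  # (start line number, quote count) of the open qualifier

    def flush():
        # An odd number of double quotes covers both the missing-quote
        # and the extra-quote cases from the tally.
        if current is not None and current[1] % 2 == 1:
            warnings.append((current[0], "unbalanced double quotes"))

    for number, line in enumerate(lines, start=1):
        if line.startswith("OS "):
            os_seen += 1
            if os_seen > 1:
                warnings.append((number, "more than one OS line"))

        # Qualifier text sits after column 21 when the key columns are blank.
        in_value_area = line.startswith("FT") and line[5:21].strip() == ""
        value = line[21:] if in_value_area else ""

        if in_value_area and not value.startswith("/") and current is not None:
            # Continuation line of the qualifier that is still open.
            current = (current[0], current[1] + value.count('"'))
            continue

        flush()
        current = None
        if in_value_area and value.startswith("/"):
            current = (number, value.count('"'))

    flush()
    return warnings


PAD = "FT" + " " * 19  # "FT" plus blanks up to the qualifier column
sample = [
    "OS   Hepatitis GB virus B",
    "OS   Encephalomyocarditis virus",
    "FT   5'UTR           1..60",
    PAD + '/source="EMBL::AJ487471:1..60"',
    PAD + '/gene="f19c24.25""',
    PAD + '/product="putative epsilon subunit of mitochondrial',
    PAD + 'F1-ATPase"',
]
for number, message in check_embl_record(sample):
    print(f"line {number}: {message}")
```

On this sample the second OS line and the /gene qualifier are reported; the two-line /product qualifier is accepted because its quotes balance once the continuation line is counted.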

BTW, I didn't need to run a fix on genbank.pm for duplicate tags; they are
handled accurately (they are stored in an array, not a hash).

EMBL:

FT   VECTOR          477..1274
FT                   /source="EMBL::AJ428955:477..1274"
FT                   /evidence="Similarity"
FT                   /db_xref="EMBL:"
FT                   /note="Possible vector contamination"
FT                   /note="Length=798 BP. Identities=99.6%"

GenBank:

     VECTOR          477..1274
                     /db_xref="EMBL:"
                     /source="EMBL::AJ428955:477..1274"
                     /evidence=Similarity
                     /note="Possible vector contamination"
                     /note="Length=798 BP. Identities=99.6%"
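
To illustrate why array storage keeps both /note lines, here is a generic sketch in Python. It is not the genbank.pm implementation; the variable names are hypothetical and it only shows the general difference between keying on the tag name and keeping an ordered list of tag/value pairs.

```python
# Qualifiers from the VECTOR feature above, in input order.
qualifiers = [
    ("source", "EMBL::AJ428955:477..1274"),
    ("evidence", "Similarity"),
    ("db_xref", "EMBL:"),
    ("note", "Possible vector contamination"),
    ("note", "Length=798 BP. Identities=99.6%"),
]

# Keyed by tag name: the second "note" silently replaces the first.
by_tag = dict(qualifiers)
notes_from_dict = [by_tag["note"]]

# Kept as a list of pairs: both "note" values survive, in order.
notes_from_list = [value for tag, value in qualifiers if tag == "note"]

print(len(notes_from_dict), len(notes_from_list))
```

With a plain tag-to-value mapping one of the two notes would be lost on output; with the list both are written back.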

Chris

> -----Original Message-----
> From: Sendu Bala [mailto:bix at sendu.me.uk]
> Sent: Thursday, August 31, 2006 10:51 AM
> To: bioperl-l at lists.open-bio.org
> Cc: Martin MOKREJŠ; Chris Fields
> Subject: Re: [Bioperl-l] Memory requirements for conversion from embl to
> genbank
> 
> Martin MOKREJŠ wrote:
> > I observe the same. Testcase here. Please push it into testcases.
> > It will be helpful in the future when the parser should cope with the
> > two /note feature lines.
> 
> Well the cause of the hang is the multiple species defined for one
> sequence. Is that valid? Desired? Should the fix be to somehow store and
> be able to output multiple species again, or to ignore all but one of
> the species? You have two sequences with this problem in the large file
> originally posted.
> 
> If this has 'worked' for you before it is probably because a completely
> meaningless composite species classification was generated. The new
> taxonomy system 'ensures' that the taxonomic data parsed is sane enough
> to be output correctly again.




