Bioperl: Embl parsing oddities (long)

Ewan Birney
Thu, 16 Mar 2000 12:01:45 +0000 (GMT)

On 16 Mar 2000, Keith James wrote:

> I've been evaluating the pre-release 2 as a replacement for the Sanger
> prEMBL modules for parsing EMBL feature tables in some of our
> heavily-used scripts. I've figured out the basics of how the info is
> stored by running some test scripts in the debugger, but there are a
> couple of points that I could do with some advice on.

Many thanks keith. I am sure we can make this stuff work well for you.
You are very welcome to get a cvs login to fix things directly in
the modules.

> First off, the S. coelicolor genome project is cosmid based and so has
> many CDSs which are partial because they run off the end of a
> cosmid. I wrote a script to produce a database of translations for our
> blast server which also collects the 'partial' CDSs and joins their
> translations where possible, so that they can also be searched as
> full-length sequences while we are still in progress.
> Using the prEMBL modules I can get the location in the form:
> <1..228
> which indicates a partial CDS. There is no /partial qualifier as this
> seems to be optional where the < implies that information. Neither
> does there have to be a /codon_start=n, which would also warn of a
> partial.
> Looking at Bio::SeqIO::embl I can see a couple of lines where the <>'s
> are stripped out, but can't see where this information has been stored
> (has it been?). The resultant feature tells me that it starts at base
> 1, whereas I would say that biologically speaking it starts beyond
> base 1 (of the cosmid). It's a vital piece of information.

Yup. I find this a really annoying feature of EMBL parsing. (semantics
about partial predictions put into the range feature. Ugh). We
need to add a tag when this happens.

> It looks like I'm going to be writing some Embl->Ace stuff too and
> need the same info to tag CDSs with Start_not_found/End_not_found.


> Secondly, what behaviour can we expect for non-standard tags? Our
> annotations are littered with 'made up qualifiers' while they are in
> progress. At the moment the splitting of qualifiers into tag & value
> seems to be unpredictable as I'm getting codon_start=2 or
> label=SCJ4.52c as tags with the value being empty (looks like I should
> send that to the bug list).

Yup. Or fix it. They should be going into the tags cleanly, but it
is possible that there is a parsing error. The parsing code here is 
extremely messy as the way EMBL/GenBank allow a number of ways of
constructing these tags (including the horrible "" system).

> Finally, I'm not clear on which way the treatment of stop codons is
> going to go. A CDS in EMBL includes the stop codon in its range, so
> calling translate returns a string of amino acids with a * at the
> end. Trying to do this with such a CDS
> my $bio_aa = $bio_nt->translate();
> throws an exception because the translation has a * which isn't
> allowed by the sub seq in BIO::PrimarySeq (so I hacked it to allow my
> tests to work). Also the length function includes the * when it
> calculates the aa length (reported).

The '*' making an exception has been fixed. Translate should probably
remove the '*' when it is at the end.

> So will the *'s be staying in the translations, or will they be
> disappearing in future? I can't think of a compelling reason for
> either choice, myself.
> On the whole it looks like a transition to Bioperl for some of the
> Pathogens group scripts won't make me want to chew off my own limbs ;)

That is good. But - it is clearly time you got a cvs login so you can
fix these problems yourself.

> cheers,
> Keith
> -- 
> Keith James  --  --
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL:
> For info about how to (un)subscribe, where messages are archived, etc:
> ====================================================================

Ewan Birney. Mobile: +44 (0)7970 151230

=========== Bioperl Project Mailing List Message Footer =======
Project URL:
For info about how to (un)subscribe, where messages are archived, etc: