[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Peter Cock p.j.a.cock at googlemail.com
Fri Aug 26 02:27:31 UTC 2011

On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 25/08/2011 01:44, Peter Cock wrote:
>> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>> However, as GFF3 is such a pain, I am wondering whether to switch the
>>> default feature format to something else - back to GFF2 or maybe to use
>>> GTF.
>> Sadly I have to agree with you - the current version of the GFF3
>> spec leaves far too much open to multiple interpretation, as we
>> have been discussing on the song-devel mailing lists. I'm not
>> sure that GFF2 or GTF are any better though.
> GTF is no good for EMBOSS ... way too picky about start and stop codons
> If pushed we could read it in using a version of the GTF parser but I see no
> point trying to write it using data from any source
>> I was expecting something like this (done by hand) where we follow the
>> example on http://www.sequenceontology.org/gff3.shtml and have a
>> single GFF gene feature represented by three lines linked by virtue of
>> having the same ID:
>> ...
>> On the downside, I have repeated all the annotation three times - but
>> that is what was done in the GFF3 example in the spec.
> Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
> How would you plan to read it back in? Transferring all features to the
> parent perhaps, with checks every time for an existing exact copy?

It would make sense to propose that the first line has all the annotation,
and the subsequent lines from the same feature just need the ID, plus
(if it is adopted) the part tag recently discussed on the song-devel
list to make the order of the sub-parts explicit.
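
A rough Python sketch of that idea, just to make it concrete: the writer
puts the full annotation on the first line only, and the reader groups
lines by ID, so there is no need to check for exact copies of the
annotation. The function names, the "seq1" record, the coordinates and the
product text are all made up for illustration, the "EMBL" source column is
an arbitrary choice, and the proposed part tag is left out since it has
not been adopted.

```python
from urllib.parse import unquote


def escape(value):
    """Percent-encode the characters GFF3 reserves in column 9."""
    # "%" must be handled first so later escapes are not re-encoded.
    for char in "%;=&,\t\n":
        value = value.replace(char, "%{:02X}".format(ord(char)))
    return value


def multi_location_gff3(seqid, ftype, parts, strand, feature_id,
                        attributes, phases=None, source="EMBL"):
    """Return one GFF3 line per part, all sharing the same ID.

    parts is a list of (start, end) pairs, 1-based and inclusive.
    attributes is a list of (key, value) pairs, written on line one only.
    phases is an optional list of phase values (needed for CDS features).
    """
    lines = []
    for index, (start, end) in enumerate(parts):
        attrs = [("ID", feature_id)]
        if index == 0:
            # Assumption under discussion: annotate the first line only.
            attrs.extend(attributes)
        column9 = ";".join(
            "{}={}".format(key, escape(value)) for key, value in attrs
        )
        phase = "." if phases is None else str(phases[index])
        lines.append("\t".join([
            seqid, source, ftype, str(start), str(end),
            ".", strand, phase, column9,
        ]))
    return lines


def merge_by_id(lines):
    """Read lines back, collecting parts and annotation under each ID."""
    features = {}
    for line in lines:
        cols = line.split("\t")
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if pair
        )
        feature = features.setdefault(
            attrs["ID"], {"parts": [], "attributes": {}}
        )
        feature["parts"].append((int(cols[3]), int(cols[4])))
        for key, value in attrs.items():
            if key != "ID":
                # Keep the first value seen; later lines may omit it.
                feature["attributes"].setdefault(key, unquote(value))
    return features


# Hypothetical two-part forward strand CDS, i.e. join(100..200,300..400)
example = multi_location_gff3(
    "seq1", "CDS", [(100, 200), (300, 400)], "+", "cds1",
    [("product", "example; protein")], phases=[0, 1],
)
merged = merge_by_id(example)
```

With 80 parts this would still give 80 lines, but only the first carries
the annotation, and reading it back is a dictionary lookup on the ID.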

> I am less impressed with GFF3 each time I look.

Me too.

> I think we'll go with the annotation of the "biological_region" parent and
> wait for anyone with a use case that actually requires massively replicated
> annotation.

Have you looked at the BioPerl GenBank to GFF3 conversion?
I understand GBrowse recommends this as a way to get
GenBank format data into GBrowse. I'm also pretty sure that
this is being used inside TogoWS for GenBank/EMBL to GFF3:

http://togows.dbcls.jp/entry/embl/V00508  <-- original EMBL
http://togows.dbcls.jp/entry/embl/V00508.gff  <-- as GFF3

Interestingly their GFF3 output is pretty close to your proposed
EMBOSS output, only they've got a "region" rather than
"biological_region" for the parent meta-feature.

However, I think introducing extra biological_region features to
act as the parent of multi-location features would run counter to
the canonical gene model given in the GFF3 specification (which
appears to be just a suggestion rather than a requirement).

Also, introducing this meta-feature would complicate any
future wish to try to express explicit parent/child relationships
between operon, gene, mRNA and CDS features. Of course, as
we've discussed, these biological relationships are only implicit
in the GenBank/EMBL feature table.

This is probably a good example to discuss on the GFF3
song-devel mailing list - small and apparently very simple
except for how to represent the (forward strand) join location.

Peter C.
