[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Thu Aug 25 13:52:30 UTC 2011

On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretations, as we
> have been discussing on the song-devel mailing list. I'm not
> sure that GFF2 or GTF are any better though.

GTF is no good for EMBOSS: it is way too picky about start and stop codons.

If pushed, we could read it in using a version of the GTF parser, but I 
see no point in trying to write it using data from arbitrary sources.

> I was expecting something like this (done by hand) where we follow the
> example on http://www.sequenceontology.org/gff3.shtml and have a
> single GFF gene feature represented by three lines linked by virtue of
> having the same ID:
>
>
> V00508  EMBL    databank_entry  1       3919    .       +       .       ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508  EMBL    CDS     2079    2171    .       +       0       ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     2294    2515    .       +       0       ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     3371    3499    .       +       0       ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>
> On the downside, I have repeated all the annotation three times - but
> that is what was done in the GFF3 example in the spec.

Urgh. How about a gene with 80 exons? That's what I was trying to avoid.

How would you plan to read it back in? Transferring all features to the 
parent perhaps, with checks every time for an existing exact copy?
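
To make the read-back cost concrete, here is a minimal sketch in Python (not EMBOSS code) of the "merge on shared ID, check for an existing exact copy" approach. The function names parse_attributes and merge_by_id are hypothetical, the input is assumed to be tab-separated nine-column GFF3, and percent-decoding of attribute values is left out.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into ordered (tag, value) pairs."""
    pairs = []
    for field in column9.strip().split(";"):
        if field:
            tag, _, value = field.partition("=")
            pairs.append((tag, value))
    return pairs


def merge_by_id(gff_lines):
    """Collapse rows that share an ID into one feature with several locations.

    Hypothetical helper: each attribute pair is kept once, so annotation
    repeated on every row of a multi-line feature is stored a single time.
    """
    features = {}
    for n, line in enumerate(gff_lines):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        feat_id = dict(attrs).get("ID")
        # Rows without an ID cannot be linked, so each gets its own key.
        key = feat_id if feat_id else ("no-id", n)
        feat = features.setdefault(
            key,
            {"seqid": cols[0], "type": cols[2], "locations": [], "attributes": []},
        )
        feat["locations"].append((int(cols[3]), int(cols[4])))
        for pair in attrs:
            # The "existing exact copy" check: a linear scan per attribute.
            if pair not in feat["attributes"]:
                feat["attributes"].append(pair)
    return list(features.values())


rows = [
    "V00508\tEMBL\tCDS\t2079\t2171\t.\t+\t0\tID=V00508.2;protein_id=CAA23766.1",
    "V00508\tEMBL\tCDS\t2294\t2515\t.\t+\t0\tID=V00508.2;protein_id=CAA23766.1",
    "V00508\tEMBL\tCDS\t3371\t3499\t.\t+\t0\tID=V00508.2;protein_id=CAA23766.1",
]
merged = merge_by_id(rows)
print(len(merged), merged[0]["locations"])
```

Every repeated row triggers one comparison per attribute against the list already held, so the work grows with both the number of exons and the number of qualifiers.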

I am less impressed with GFF3 each time I look.

I think we'll go with putting the annotation on the "biological_region" 
parent and wait for anyone with a use case that actually requires 
massively replicated annotation.
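
As an illustration only, the parent-carries-the-annotation layout could be sketched in Python as below. This is an assumed layout, not actual EMBOSS output: the function write_parent_and_parts, its parameters, and the choice to link segments through a Parent attribute are all hypothetical.

```python
def escape(value):
    """Percent-encode the characters GFF3 reserves in attribute values."""
    for ch in "%;=&,":  # "%" first so later escapes are not re-encoded
        value = value.replace(ch, "%%%02X" % ord(ch))
    return value


def write_parent_and_parts(seqid, source, part_type, parent_id,
                           segments, attributes, strand="+"):
    """Return GFF3 rows: one annotated parent, then one bare row per segment.

    segments   -- list of (start, end, phase) tuples; phase is "0", "1", "2" or "."
    attributes -- list of (tag, value) pairs, written once, on the parent only
    """
    start = min(s for s, _, _ in segments)
    end = max(e for _, e, _ in segments)
    attr_text = ";".join(
        ["ID=" + escape(parent_id)]
        + ["%s=%s" % (tag, escape(val)) for tag, val in attributes]
    )
    rows = ["\t".join([seqid, source, "biological_region",
                       str(start), str(end), ".", strand, ".", attr_text])]
    for s, e, phase in segments:
        # Children carry only the link back to the parent, no annotation.
        rows.append("\t".join([seqid, source, part_type, str(s), str(e),
                               ".", strand, phase,
                               "Parent=" + escape(parent_id)]))
    return rows


for row in write_parent_and_parts(
        "V00508", "EMBL", "CDS", "V00508.2",
        [(2079, 2171, "0"), (2294, 2515, "0"), (3371, 3499, "0")],
        [("protein_id", "CAA23766.1"), ("db_xref", "HGNC:4830")]):
    print(row)
```

With this layout the annotation is written once however many segments there are, so an 80-exon gene adds 80 short rows rather than 80 copies of every qualifier.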

regards,

Peter Rice
EMBOSS Team