[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Peter Rice
pmr at ebi.ac.uk
Thu Aug 25 13:52:30 UTC 2011
On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretation, as we
> have been discussing on the song-devel mailing lists. I'm not
> sure that GFF2 or GTF are any better though.
GTF is no good for EMBOSS: it is way too picky about start and stop codons.
If pushed we could read it in using a version of the GTF parser, but I
see no point in trying to write it using data from any source.
> I was expecting something like this (done by hand) where we follow the
> example on http://www.sequenceontology.org/gff3.shtml and have a
> single GFF gene feature represented by three lines linked by virtue of
> having the same ID:
>
>
> V00508 EMBL databank_entry 1 3919 . + . ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508 EMBL CDS 2079 2171 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508 EMBL CDS 2294 2515 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508 EMBL CDS 3371 3499 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>
> On the downside, I have repeated all the annotation three times - but
> that is what was done in the GFF3 example in the spec.
Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
How would you plan to read it back in? Transferring all features to the
parent perhaps, with checks every time for an existing exact copy?
I am less impressed with GFF3 each time I look.
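
To make the read-back question concrete, here is a rough Python sketch of the "transfer everything to one feature, checking for exact copies" approach. It is not EMBOSS code: the function names (parse_attributes, merge_by_id) and the dictionary layout are made up for illustration, it assumes tab-separated nine-column rows, and it skips the URL-unescaping that a real GFF3 reader would need.

```python
from collections import OrderedDict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a list of (key, value) pairs."""
    pairs = []
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        pairs.append((key, value))
    return pairs


def merge_by_id(gff_lines):
    """Collapse rows that share an ID into one feature with several segments.

    Each attribute pair is kept once, so annotation repeated on every
    row of a multi-line feature is stored a single time.
    """
    features = OrderedDict()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        pairs = parse_attributes(attrs)
        feature_id = dict(pairs).get("ID")
        # Rows without an ID cannot be linked, so each gets its own slot.
        key = feature_id if feature_id is not None else ("anon", len(features))
        feat = features.setdefault(key, {
            "seqid": seqid, "type": ftype, "strand": strand,
            "segments": [], "attributes": [],
        })
        feat["segments"].append((int(start), int(end)))
        for pair in pairs:
            if pair not in feat["attributes"]:  # the "exact copy" check
                feat["attributes"].append(pair)
    return list(features.values())
```

The membership test on a list is the check for an existing exact copy, and it runs for every attribute on every row. For a gene with 80 exons and a long translation qualifier, that is 80 comparisons of the same long string, which is the cost being objected to here.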
I think we'll go with putting the annotation on the "biological_region"
parent and wait for anyone with a use case that actually requires massively
replicated annotation.
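
For comparison, a minimal sketch of the parent-annotation layout: one "biological_region" row carries the ID and all qualifiers, and each segment is a child row that carries only a Parent attribute pointing at it. This only illustrates the idea and is not the actual EMBOSS writer. The function name write_parent_and_children and its arguments are hypothetical, and attribute values are written without the escaping GFF3 requires for characters such as ";" and "=".

```python
def write_parent_and_children(seqid, source, feature_id, segments, strand,
                              attributes, child_type="CDS",
                              parent_type="biological_region"):
    """Return GFF3 rows: one annotated parent plus one bare row per segment.

    segments   -- list of (start, end, phase) tuples for the child rows
    attributes -- list of (key, value) pairs, written on the parent only
    """
    start = min(seg[0] for seg in segments)
    end = max(seg[1] for seg in segments)
    column9 = ";".join(["ID=" + feature_id] +
                       ["%s=%s" % (key, value) for key, value in attributes])
    # The parent spans all segments and holds the annotation once.
    rows = ["\t".join([seqid, source, parent_type, str(start), str(end),
                       ".", strand, ".", column9])]
    for seg_start, seg_end, phase in segments:
        # Children repeat nothing except the link back to the parent.
        rows.append("\t".join([seqid, source, child_type, str(seg_start),
                               str(seg_end), ".", strand, str(phase),
                               "Parent=" + feature_id]))
    return rows
```

With the V00508 CDS above this gives four rows instead of three, but the qualifiers appear once however many exons there are.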
regards,
Peter Rice
EMBOSS Team