[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Peter Cock
p.j.a.cock at googlemail.com
Thu Aug 25 00:44:47 UTC 2011
On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> However, as GFF3 is such a pain, I am wondering whether to switch the
> default feature format to something else - back to GFF2 or maybe to use GTF.
>
Sadly I have to agree with you - the current version of the GFF3
spec leaves far too much open to interpretation, as we have been
discussing on the song-devel mailing list. I'm not sure that GFF2
or GTF are any better though.
On Wed, Aug 24, 2011 at 3:45 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 08/24/2011 11:36 AM, Peter Rice wrote:
>>
>> On 08/17/2011 11:37 AM, Peter Cock wrote:
>>
>>> ------------------------------------------
>>>
>>> Problem Seven - No parent/child relationships
>>>
>>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>>> but not in the way I expected (and not in a way the validator likes).
>
> As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I
> can make the CDS "parent" feature change its type to "biological_region" and
> add a featflags tag with the true type. Code (not yet checked in) can
> reconstruct the EMBL feature table from this GFF.
>
> However, the EMBL tags are all on the parent (now biological_region)
> feature.
>
> Any suggestions where I should stick them for them to be useful in GFF3?
>
> EMBL feature table:
>
> FT source 1..3919
> FT /organism="Homo sapiens"
> FT /mol_type="genomic DNA"
> FT /db_xref="taxon:9606"
> FT CDS join(2079..2171,2294..2515,3371..3499)
> FT /db_xref="GDB:119299"
> FT /db_xref="GOA:P02100"
> FT /db_xref="HGNC:4830"
> FT /db_xref="InterPro:IPR000971"
> FT /db_xref="InterPro:IPR002337"
> FT /db_xref="InterPro:IPR009050"
> FT /db_xref="InterPro:IPR012292"
> FT /db_xref="PDB:1A9W"
> FT /db_xref="UniProtKB/Swiss-Prot:P02100"
> FT /protein_id="CAA23766.1"
> FT /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
> FT FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
> FT KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"
>
> proposed GFF3 version
>
> V00508 EMBL databank_entry 1 3919 . + .
> ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508 EMBL biological_region 2079 3499 . + 0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508 EMBL CDS 2079 2171 . + 0
> Parent=V00508.2
> V00508 EMBL CDS 2294 2515 . + 0
> Parent=V00508.2
> V00508 EMBL CDS 3371 3499 . + 0
> Parent=V00508.2
>
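To make the parent/child layout quoted above concrete, here is a minimal
Python sketch of how the EMBL location and the true feature type might be
read back from it. This is not the unchecked-in EMBOSS code; the function
names are hypothetical, the rows are assumed to be tab-separated on a single
line each, and the attribute column is abbreviated for readability.

```python
from collections import defaultdict

# Rows from the proposal above, one feature per line, tab-separated.
# The attribute column is shortened; the real rows carry all the db_xref tags.
GFF3 = """\
V00508\tEMBL\tbiological_region\t2079\t3499\t.\t+\t0\tID=V00508.2;featflags=type:CDS;protein_id=CAA23766.1
V00508\tEMBL\tCDS\t2079\t2171\t.\t+\t0\tParent=V00508.2
V00508\tEMBL\tCDS\t2294\t2515\t.\t+\t0\tParent=V00508.2
V00508\tEMBL\tCDS\t3371\t3499\t.\t+\t0\tParent=V00508.2
"""


def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict of lists (keys may repeat)."""
    attrs = defaultdict(list)
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key].append(value)
    return attrs


def embl_locations_from_parents(gff_text):
    """Return {parent_id: (true_type, embl_location)} for forward-strand features."""
    true_type = {}
    spans = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        start, end = int(cols[3]), int(cols[4])
        if "Parent" in attrs:
            # Child row: remember its span under the parent's ID.
            spans[attrs["Parent"][0]].append((start, end))
        elif "ID" in attrs:
            # Parent row: prefer the type stored in featflags, if any.
            ftype = cols[2]
            for flag in attrs.get("featflags", []):
                if flag.startswith("type:"):
                    ftype = flag[len("type:"):]
            true_type[attrs["ID"][0]] = ftype
    result = {}
    for parent_id, parts in spans.items():
        pieces = ["%d..%d" % span for span in sorted(parts)]
        location = pieces[0] if len(pieces) == 1 else "join(%s)" % ",".join(pieces)
        result[parent_id] = (true_type.get(parent_id, "unknown"), location)
    return result


locations = embl_locations_from_parents(GFF3)
print(locations["V00508.2"])
```

The sketch only handles the forward strand and ignores attribute escaping, so
it shows the idea and is not a general GFF3 reader.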
I was expecting something like this (done by hand), where we follow the
example on http://www.sequenceontology.org/gff3.shtml and have a
single GFF3 CDS feature represented by three lines, linked by virtue of
having the same ID:
V00508 EMBL databank_entry 1 3919 . + .
ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508 EMBL CDS 2079 2171 . + 0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508 EMBL CDS 2294 2515 . + 0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508 EMBL CDS 3371 3499 . + 0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
On the downside, I have repeated all the annotation three times - but
that is what was done in the GFF3 example in the spec.
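As a rough illustration of how a reader of the shared-ID layout might cope
with the repeated annotation, here is a small Python sketch that merges rows
carrying the same ID into one feature and rebuilds the EMBL join location.
The function names and the tuple layout are my own assumptions, the
attributes are abbreviated, and rejecting rows whose annotation differs is
just one possible policy, since the spec does not say what a parser should do.

```python
# The three CDS rows from the hand-made example, as tuples of the nine
# GFF3 columns (attributes shortened; the real rows repeat every tag).
ROWS = [
    ("V00508", "EMBL", "CDS", 2079, 2171, ".", "+", "0",
     "ID=V00508.2;protein_id=CAA23766.1"),
    ("V00508", "EMBL", "CDS", 2294, 2515, ".", "+", "0",
     "ID=V00508.2;protein_id=CAA23766.1"),
    ("V00508", "EMBL", "CDS", 3371, 3499, ".", "+", "0",
     "ID=V00508.2;protein_id=CAA23766.1"),
]


def merge_by_id(rows):
    """Group rows sharing an ID into one feature with several spans."""
    merged = {}
    for seqid, source, ftype, start, end, score, strand, phase, attributes in rows:
        feature_id = None
        for pair in attributes.split(";"):
            key, _, value = pair.partition("=")
            if key == "ID":
                feature_id = value
        if feature_id is None:
            continue  # rows without an ID cannot be part of a multi-line feature
        entry = merged.setdefault(
            feature_id, {"type": ftype, "attributes": attributes, "spans": []}
        )
        # Assumed policy: every line of one feature must repeat the same
        # type and annotation, as in the example above.
        if entry["type"] != ftype or entry["attributes"] != attributes:
            raise ValueError("inconsistent rows for ID %s" % feature_id)
        entry["spans"].append((start, end))
    return merged


def embl_location(spans):
    """Format forward-strand spans as an EMBL location string."""
    parts = ["%d..%d" % span for span in sorted(spans)]
    return parts[0] if len(parts) == 1 else "join(%s)" % ",".join(parts)


features = merge_by_id(ROWS)
print(embl_location(features["V00508.2"]["spans"]))
```

With this approach the repeated annotation costs file size but needs no
separate parent line, which is the trade-off mentioned above.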
Perhaps this should be raised on the song-devel mailing list along
with our other GFF3 queries.
Regards,
Peter C.