[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Peter Cock p.j.a.cock at googlemail.com
Wed Aug 17 16:05:13 UTC 2011


On Wed, Aug 17, 2011 at 4:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 11:37, Peter Cock wrote:
>> ------------------------------------------
>>
>> Problem Four - Wrong tag for database cross references
>>
>> I had noticed the NCBI using a local tag (lower case) db_xref rather
>> than the standard (upper case = reserved) tag Dbxref. EMBOSS
>> does the same - is this deliberate and if so why?
>
> It is deliberate - we are using the db_xref tag from the EMBL/GenBank
> feature table.
>
> But we could convert to the GFF3 tag (and back again on reading). I'll
> have a look at how easy that would be.

Do you want to check this one with Lincoln on the song-devel mailing list
first? After all, using a lower case tag is perfectly valid GFF3.
My point is that this does seem to be exactly what the reserved tag
Dbxref is intended for.
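
For illustration only, here is a minimal Python sketch of the kind of
conversion being discussed, working on the GFF3 attributes column (column 9).
It is not how EMBOSS does or would do it. The function name rename_db_xref
and the identifiers in the usage line are made up, and it assumes the
db_xref values are already in the DBTAG:ID form that Dbxref expects.

```python
def rename_db_xref(attributes: str) -> str:
    """Fold lower-case db_xref tags into the reserved GFF3 Dbxref tag.

    Hypothetical helper: repeated tags are merged into a single
    comma-separated value list, as GFF3 uses for multi-valued tags.
    """
    merged = {}  # tag -> list of values, in order of first appearance
    for field in attributes.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        if tag == "db_xref":
            tag = "Dbxref"
        merged.setdefault(tag, []).extend(value.split(","))
    return ";".join(f"{tag}={','.join(values)}" for tag, values in merged.items())


# Made-up identifiers, for illustration only.
print(rename_db_xref("ID=gene1;db_xref=GeneID:12345;db_xref=GI:67890"))
```

Going back to db_xref on reading would be the same rename in reverse.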

>> ------------------------------------------
>>
>> Problem Seven - No parent/child relationships
>>
>> The NCBI GFF3 file had no parent/child relationships at all.
>>
>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>> but not in the way I expected (and not in a way the validator likes).
>> As discussed above, for the GenBank join locations EMBOSS
>> seems to create broad parent features with children for each
>> sub-location (parent/child relations of the same type = bad).
>>
>> What I'm expecting instead is parent/child relationships between
>> the CDS and gene features, between tRNA and gene features, etc.
>> Note that these relationships are implicit in the GenBank (and EMBL)
>> flat files, so I accept trying to deduce them might be hard (and
>> perhaps best not doing immediately - the other issues are more
>> pressing).
>
> Could be possible by matching common exons (stored internally as
> subfeatures). I'll have a look.

Usually yes, but not all the time. I've seen GenBank files where
the gene and CDS features have slightly different locations which
makes doing this automatically hard. Off the top of my head this
was a programmed frame shift example... I'll see if I can find you
a specific example.
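
To show why slightly different locations are awkward, here is a rough Python
sketch of one possible heuristic: pick the gene on the same strand that
contains the child feature, with an optional tolerance. This is an assumption
for discussion, not what EMBOSS does, and the names Feature, find_parent_gene
and slack are all hypothetical. It gives up when the match is ambiguous, for
example with overlapping genes.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    """Hypothetical minimal feature record (1-based inclusive, as in GFF3)."""
    ftype: str
    start: int
    end: int
    strand: str
    fid: str


def find_parent_gene(child: Feature, genes: List[Feature],
                     slack: int = 0) -> Optional[str]:
    """Return the ID of the single gene containing child, else None.

    slack is how many bases the child may stick out beyond the gene at
    either end, to cope with gene and CDS locations that differ slightly.
    """
    candidates = [
        g for g in genes
        if g.strand == child.strand
        and g.start - slack <= child.start
        and child.end <= g.end + slack
    ]
    # Refuse to guess when there is no match or more than one match.
    return candidates[0].fid if len(candidates) == 1 else None


genes = [Feature("gene", 100, 500, "+", "gene1"),
         Feature("gene", 600, 900, "+", "gene2")]
cds = Feature("CDS", 98, 500, "+", "cds1")
print(find_parent_gene(cds, genes))           # strict containment
print(find_parent_gene(cds, genes, slack=3))  # small tolerance
```

A feature spanning the origin of a circular genome would need extra care
that this sketch does not attempt.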

>> ------------------------------------------
>>
>> So my conclusion is that while the EMBOSS generated GFF3 is
>> better than those produced by the NCBI, it still is invalid and needs
>> some work.
>>
>> As usual, I am of course happy to help with testing fixes. And if
>> there are any mistakes in my understanding of the GFF3 spec,
>> please tell me ;)
>
> Many, many thanks for finding these.

I've come to value NC_005213.gbk as a reasonably small circular
genome with some rather complicated annotation - it's one of my
favourite test cases.

> EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as
> subfeatures, which makes all this much easier to handle.

Oh good - that restructuring should now pay dividends :)

Peter C.


