[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Peter Rice
pmr at ebi.ac.uk
Wed Aug 17 15:52:21 UTC 2011
On 17/08/2011 11:37, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files. I'd already
> found the NCBI currently provides some very broken GFF3 files:
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.
>
> EMBOSS has used feature type databank_entry and generated feature ID
> NC_005213.1 for the landmark. However, this should include the special
> tag entry Is_circular=true, since this is the landmark feature for the whole
> circular chromosome.
Thanks. I'll make sure we add it for the next release.
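
For reference, a minimal sketch of what the landmark line could look like with the tag added. This is not the EMBOSS writer; the function name, the "EMBOSS" source column and the "+" strand are assumptions made for illustration.

```python
def landmark_line(seqid, length, circular, ftype="databank_entry", source="EMBOSS"):
    """Build one tab-separated GFF3 line for the whole-sequence landmark.

    The source column value and the '+' strand are assumptions.
    """
    attrs = [f"ID={seqid}"]
    if circular:
        # Reserved GFF3 attribute flagging a circular landmark feature.
        attrs.append("Is_circular=true")
    cols = [seqid, source, ftype, "1", str(length), ".", "+", ".", ";".join(attrs)]
    return "\t".join(cols)


print(landmark_line("NC_005213.1", 490885, circular=True))
```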
> ------------------------------------------
>
> Problem Three - Missing ID tags on multi-location features
>
> Unlike the NCBI file which fails to cross link multi-location features like
> trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think
> you are following the expected pattern as used in the canonical GFF3
> examples.
>
> In the GenBank file, this tRNA is join(35233..35301,3254..3289)
>
> For the gene and tRNA features for NEQ_t01, EMBOSS is generating
> three GFF3 lines. First a very broad parent feature 3254 to 35301,
> then two children 35233 to 35301 and 3254 to 3289.
>
> I would expect two GFF3 lines (for each of gene and tRNA), just
> 35233 to 35301 and 3254 to 3289 which would be linked by virtue
> of having the same ID.
EMBOSS is reporting what is stored internally (feature and subfeatures
for the exons). Looks like we should skip reporting the feature. I'll
check what that means for the IDs.
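
A rough sketch of the pattern being asked for, as I understand it: one GFF3 line per sub-location, all carrying the same ID, and no broad parent line. The ID value, the source column and the strand below are placeholders, not what EMBOSS emits.

```python
def multilocation_lines(seqid, ftype, segments, strand, feature_id, source="EMBOSS"):
    """Return one GFF3 line per segment; the shared ID links them as one feature."""
    lines = []
    for start, end in segments:
        cols = [seqid, source, ftype, str(start), str(end),
                ".", strand, ".", f"ID={feature_id}"]
        lines.append("\t".join(cols))
    return lines


# join(35233..35301,3254..3289) from the GenBank file; the ID is hypothetical.
for line in multilocation_lines("NC_005213.1", "tRNA",
                                [(35233, 35301), (3254, 3289)],
                                "+", "NEQ_t01.tRNA"):
    print(line)
```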
> This is related to "Problem Six" and "Problem Seven" below.
>
> ------------------------------------------
>
> Problem Four - Wrong tag for database cross references
>
> I had noticed the NCBI using a local tag (lower case) db_xref rather
> than the standard (upper case = reserved) tag Dbxref. EMBOSS
> does the same - is this deliberate and if so why?
It is deliberate - we are using the db_xref tag from the EMBL/GenBank
feature table.
But we could convert to the GFF3 tag (and back again on reading). I'll
have a look at how easy that would be.
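
Roughly what I have in mind, sketched in Python rather than the EMBOSS C code. The function names are made up, the cross-reference values are placeholders, and escaping of special characters in values is left out.

```python
def qualifiers_to_gff3(qualifiers):
    """Rename db_xref to Dbxref on writing; other keys pass through unchanged."""
    attrs = {}
    for key, values in qualifiers.items():
        tag = "Dbxref" if key == "db_xref" else key
        attrs.setdefault(tag, []).extend(values)
    return attrs


def gff3_to_qualifiers(attrs):
    """Rename Dbxref back to db_xref on reading."""
    quals = {}
    for tag, values in attrs.items():
        key = "db_xref" if tag == "Dbxref" else tag
        quals.setdefault(key, []).extend(values)
    return quals


def format_attrs(attrs):
    """Join attributes as tag=v1,v2;tag=... (no escaping in this sketch)."""
    return ";".join(f"{tag}={','.join(values)}" for tag, values in attrs.items())


quals = {"db_xref": ["ExampleDB:12345", "OtherDB:67"]}  # placeholder values
print(format_attrs(qualifiers_to_gff3(quals)))
```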
> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> This is essentially the same point I raised above with NEQ_t01, but
> with the added complication of spanning the origin.
Ah, something to do with the way start and end positions are stored
internally. I'll fix that along with other circular feature issues.
> Thankfully this potential confusion has been addressed in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.
I'll try to write (and read) that way too.
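
The arithmetic for writing and reading, as a small sketch. It assumes the segments arrive in biological order and are contiguous across the origin; the function names are only for illustration.

```python
def wrap_origin(segments, seq_length):
    """Collapse origin-spanning segments into one (start, end) with end > length."""
    start = segments[0][0]
    end = segments[-1][1]
    if end < start:
        # The feature runs past the end of the circular sequence.
        end += seq_length
    return start, end


def unwrap_origin(start, end, seq_length):
    """Split a (start, end) that runs past the sequence end back into segments."""
    if end <= seq_length:
        return [(start, end)]
    return [(start, seq_length), (1, end - seq_length)]


# NEQ003: 490883..490885 joined to 1..879 on a 490885 bp chromosome.
print(wrap_origin([(490883, 490885), (1, 879)], 490885))  # (490883, 491764)
print(unwrap_origin(490883, 491764, 490885))
```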
> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The NCBI GFF3 file had no parent/child relationships at all.
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).
Could be possible by matching common exons (stored internally as
subfeatures). I'll have a look.
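
One possible way to do the matching, sketched very loosely: treat a gene as the parent of a CDS or tRNA when they are on the same strand and every child segment lies inside some gene segment. The data layout and IDs here are invented for the example and say nothing about the EMBOSS internals.

```python
def covers(gene_segments, child_segments):
    """True if every child segment lies within some gene segment."""
    return all(
        any(gs <= cs and ce <= ge for gs, ge in gene_segments)
        for cs, ce in child_segments
    )


def assign_parents(genes, children):
    """Map child ID -> gene ID. Inputs are {id: (strand, [(start, end), ...])}."""
    parents = {}
    for child_id, (strand, segments) in children.items():
        for gene_id, (gene_strand, gene_segments) in genes.items():
            if gene_strand == strand and covers(gene_segments, segments):
                parents[child_id] = gene_id
                break  # take the first matching gene
    return parents


genes = {"gene_NEQ_t01": ("+", [(35233, 35301), (3254, 3289)])}
children = {"tRNA_NEQ_t01": ("+", [(35233, 35301), (3254, 3289)])}
print(assign_parents(genes, children))
```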
> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved
Pah! We use the EMBL/GenBank tag names. Looks like we will have to
convert to lower case, so we may as well include that with the
db_xref/Dbxref conversion in GFF3 writing and reading.
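
A sketch of the writing side, assuming the rule is that tags starting with an upper case letter are reserved by GFF3. The reserved list is my reading of the spec and the function name is hypothetical. Reading back would need an explicit table (for example ec_number to EC_number), since lower-casing alone is not reversible.

```python
# Attribute names that GFF3 itself defines (my reading of the spec).
GFF3_RESERVED = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def gff3_safe_tag(tag):
    """Lower-case a flat-file tag that would clash with the reserved namespace."""
    if tag in GFF3_RESERVED:
        return tag  # a genuine GFF3 tag, leave it alone
    if tag[:1].isupper():
        return tag.lower()
    return tag


print(gff3_safe_tag("EC_number"))  # ec_number
```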
> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)
Many, many thanks for finding these.
EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as
subfeatures, which makes all this much easier to handle.
regards,
Peter Rice
EMBOSS Team