[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Peter Cock p.j.a.cock at googlemail.com
Wed Aug 17 10:37:06 UTC 2011


Hi again Peter R. (et al.),

Following yesterday's discussion about GFF3 files from UniProt,
I'm trying to use seqret to produce GFF3 from GenBank files. I had
already found that the NCBI currently provides some very broken GFF3 files:

http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html

$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff
$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
$ seqret --version
EMBOSS:6.4.0.0
$ seqret -filter -feature -sequence NC_005213.gbk -sformat=genbank -osformat=gff3 | head -n 20
##gff-version 3
##sequence-region NC_005213 1 490885
#!Date 2011-08-17
#!Type DNA
#!Source-version EMBOSS 6.4.0.0
NC_005213	EMBL	databank_entry	1	490885	.	+	.	ID=NC_005213.1;organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908
NC_005213	EMBL	gene	3254	35301	.	+	.	ID=NC_005213.2;locus_tag=NEQ_t01;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	gene	35233	35301	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	gene	3254	3289	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	tRNA	3254	35287	.	+	.	ID=NC_005213.5;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	tRNA	35249	35287	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	tRNA	3254	3289	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	gene	1	490885	.	-	.	ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620
NC_005213	EMBL	gene	490883	490885	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	gene	1	879	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	CDS	1	490885	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	sequence_feature	7	879	.	-	.	ID=NC_005213.14;locus_tag=NEQ001;note=CRISPR/Cas system-associated RAMP superfamily protein Cas6%3B Region: Cas6-I-III%3B cl11443;db_xref=CDD:196236
NC_005213	EMBL	gene	883	2691	.	+	.	ID=NC_005213.15;locus_tag=NEQ003;db_xref=GeneID:2654355

I've deliberately cut the example here to include all of NEQ_t01, an
interesting trans-spliced tRNA, and all of NEQ001, an interesting gene
because it spans the origin of this circular genome. I use these examples
in the blog post and discuss them again below.

Given some of the points below, I suspect EMBOSS is producing GFF3
according to a version of the specification prior to the additions made
in v1.18 (24 June 2010) regarding circular genomes.

The following numbering reflects the issues listed on my blog post
about the NCBI version of the GFF3 file (link given above).

------------------------------------------

Problem One - Invalid Feature Types

EMBOSS looks OK here: you're converting the GenBank feature types
source and misc_feature into databank_entry and sequence_feature
respectively.

------------------------------------------

Problem Two - Circular features not marked

EMBOSS is also lacking in this area.

EMBOSS has used feature type databank_entry and generated feature ID
NC_005213.1 for the landmark. However, this should include the special
tag entry Is_circular=true, since this is the landmark feature for the whole
circular chromosome.
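
To make the request concrete, here is a minimal Python sketch of the
change I have in mind for the landmark line. The function name
mark_circular is hypothetical (it is not part of EMBOSS), and the
example line is shortened from the output above.

```python
def mark_circular(gff3_line: str) -> str:
    """Append Is_circular=true to the attributes column of one GFF3 line."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    attributes = columns[8]
    if "Is_circular=" in attributes:
        return "\t".join(columns)  # already marked, leave unchanged
    if attributes in ("", "."):
        columns[8] = "Is_circular=true"
    else:
        columns[8] = attributes + ";Is_circular=true"
    return "\t".join(columns)


# Landmark line, shortened from the seqret output above (hypothetical input).
landmark = "NC_005213\tEMBL\tdatabank_entry\t1\t490885\t.\t+\t.\tID=NC_005213.1;strain=Kin4-M"
print(mark_circular(landmark))
```

Only the landmark feature for the circular chromosome would get the
tag; ordinary features would be left alone.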

------------------------------------------

Problem Three - Missing ID tags on multi-location features

Unlike the NCBI file, which fails to cross-link multi-location features
like the trans-spliced NEQ_t01, EMBOSS does link them. However, I don't
think you are following the expected pattern as used in the canonical
GFF3 examples.

In the GenBank file, this tRNA is join(35233..35301,3254..3289)

For the gene and tRNA features for NEQ_t01, EMBOSS is generating
three GFF3 lines. First a very broad parent feature 3254 to 35301,
then two children 35233 to 35301 and 3254 to 3289.

I would expect two GFF3 lines (for each of gene and tRNA), just
35233 to 35301 and 3254 to 3289 which would be linked by virtue
of having the same ID.
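
As a rough illustration of the pattern I would expect, here is a short
Python sketch that turns a multi-part location into one GFF3 line per
part, all sharing one ID. The helper join_to_gff3 is a made-up name,
and I have reused the ID that EMBOSS generated for the gene.

```python
def join_to_gff3(seqid, source, ftype, parts, strand, feature_id):
    """Return one GFF3 line per sub-location, all carrying the same ID."""
    lines = []
    for start, end in parts:
        columns = [seqid, source, ftype, str(start), str(end),
                   ".", strand, ".", f"ID={feature_id}"]
        lines.append("\t".join(columns))
    return lines


# The two parts of the NEQ_t01 gene, as listed above.
gene_lines = join_to_gff3("NC_005213", "EMBL", "gene",
                          [(35233, 35301), (3254, 3289)], "+", "NC_005213.2")
for line in gene_lines:
    print(line)
```

There is no broad 3254 to 35301 parent line and no Parent tag here; the
shared ID alone ties the two lines together.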

The online GFF3 validator would seem to support my interpretation,
reporting errors like this:

8            [ERROR]   invalid type pair - check all parents (at line 7; gene to gene)
11           [ERROR]   invalid type pair - check all parents (at line 10; tRNA to tRNA)
14           [ERROR]   invalid type pair - check all parents (at line 13; gene to gene)
17           [ERROR]   invalid type pair - check all parents (at line 16; CDS to CDS)
28           [ERROR]   invalid type pair - check all parents (at line 27; sequence_feature to sequence_feature)


This is related to "Problem Six" and "Problem Seven" below.

------------------------------------------

Problem Four - Wrong tag for database cross references

I had noticed the NCBI using a local tag (lower case) db_xref rather
than the standard (upper case = reserved) tag Dbxref. EMBOSS
does the same - is this deliberate and if so why?
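
For what it is worth, the switch looks mechanical. Here is a small
Python sketch, assuming the attributes column is a plain
semicolon-separated list of tag=value pairs; the function name
use_dbxref is my own invention. GFF3 allows multiple values for one tag
separated by commas, so repeated db_xref entries fold into one Dbxref.

```python
def use_dbxref(attributes: str) -> str:
    """Fold db_xref=... pairs into a single reserved Dbxref tag."""
    kept, xrefs = [], []
    for pair in attributes.split(";"):
        tag, _, value = pair.partition("=")
        if tag == "db_xref":
            xrefs.append(value)
        else:
            kept.append(pair)
    if xrefs:
        # Multiple values of one tag are comma-separated in GFF3.
        kept.append("Dbxref=" + ",".join(xrefs))
    return ";".join(kept)


# Attributes shortened from the NEQ001 CDS line above.
cds_attributes = "ID=NC_005213.11;locus_tag=NEQ001;db_xref=GI:41614797;db_xref=GeneID:2732620"
print(use_dbxref(cds_attributes))
# prints ID=NC_005213.11;locus_tag=NEQ001;Dbxref=GI:41614797,GeneID:2732620
```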

------------------------------------------

Problem Five - Missing stop codon in CDS features

EMBOSS looks OK here.

------------------------------------------

Problem Six - Features wrapping the origin of a circular genome

Related to the landmark feature lacking the Is_circular=true tag, the
gene and CDS features for the origin-wrapping NEQ001 look funny to me.
EMBOSS seems to be generating three GFF3 lines for each of the gene and
CDS for NEQ001: a surprisingly broad entry 1 to 490885 and two children
490883 to 490885 and 1 to 879 (which do look sensible).

This is essentially the same point I raised above with NEQ_t01, but
with the added complication of spanning the origin.

Based on the old specification, I had expected two GFF3 lines each for the
gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
by virtue of having the same ID.

Thankfully this potential confusion has been addressed in the updated
specification, so I would expect a single GFF3 line for each of the gene
and CDS for NEQ001, using start 490883 and end of 879+490885=491764.
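
The arithmetic can be sketched in a few lines of Python. This assumes a
two-part location where the first part runs to the end of the sequence
and the second starts at position 1; wrap_origin is a hypothetical
helper name.

```python
def wrap_origin(parts, genome_length):
    """Collapse a two-part origin-spanning location into one (start, end).

    The end coordinate is allowed to exceed the genome length, as in the
    updated GFF3 specification for circular genomes.
    """
    (first_start, first_end), (second_start, second_end) = parts
    if first_end != genome_length or second_start != 1:
        raise ValueError("location does not wrap the origin")
    return first_start, second_end + genome_length


print(wrap_origin([(490883, 490885), (1, 879)], 490885))
# prints (490883, 491764)
```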

------------------------------------------

Problem Seven - No parent/child relationships

The NCBI GFF3 file had no parent/child relationships at all.

The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
but not in the way I expected (and not in a way the validator likes).
As discussed above, for the GenBank join locations EMBOSS
seems to create broad parent features with children for each
sub-location (parent/child relations of the same type = bad).

What I'm expecting instead is parent/child relationships between
the CDS and gene features, between tRNA and gene features, etc.
Note that these relationships are implicit in the GenBank (and EMBL)
flat files, so I accept that trying to deduce them might be hard (and
is perhaps best not done immediately - the other issues are more
pressing).

------------------------------------------

Problem Eight - Invalid tags

The online validator complains that EMBOSS too is using EC_number
(tags beginning with an uppercase letter are reserved).
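
A quick way to spot such tags is sketched below in Python. The set of
reserved tags is the list from the GFF3 specification as I understand
it; the function name invalid_tags and the EC number value in the
example are made up for illustration.

```python
# Reserved attribute tags in GFF3 (the only ones allowed to start uppercase).
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def invalid_tags(attributes: str):
    """Return tags that start with an uppercase letter but are not reserved."""
    bad = []
    for pair in attributes.split(";"):
        tag = pair.partition("=")[0]
        if tag[:1].isupper() and tag not in RESERVED_TAGS:
            bad.append(tag)
    return bad


# Hypothetical attributes column; the EC number value is invented.
print(invalid_tags("ID=NC_005213.99;EC_number=1.1.1.1;locus_tag=NEQ999"))
# prints ['EC_number']
```

Renaming the tag to start with a lowercase letter should be enough to
satisfy this particular check.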

------------------------------------------

So my conclusion is that while the EMBOSS-generated GFF3 is
better than that produced by the NCBI, it is still invalid and needs
some work.

As usual, I am of course happy to help with testing fixes. And if
there are any mistakes in my understanding of the GFF3 spec,
please tell me ;)

Regards,

Peter C.



