[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Peter Rice pmr at ebi.ac.uk
Wed Aug 24 10:36:34 UTC 2011


On 08/17/2011 11:37 AM, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files.
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.

Current status: circular tags will be passed through better in the next
EMBOSS release. Sequence inputs will have a new -scircular qualifier and
feature inputs will have -fcircular to cover cases where the input
format does not define a circular sequence (if the format does define
one, these qualifiers will not turn it off).

We will tag a feature with Is_circular in the output, even if we have to 
make one up.

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> Based on the old specification, I had expected two GFF3 lines each for the
> gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
> by virtue of the having the same ID.
>
> Thankfully this potential confusion has been address in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.

Unfortunately, GFF3 gives few details on how to define the sequence
length. There appears to be no standard way to state it, yet the length
is critical to interpreting a circular feature that crosses the origin,
because GFF3 makes the end position greater than the sequence length.

We will make a best guess but cannot guarantee we get the right answer.
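
To make the arithmetic concrete, here is a minimal Python sketch of the
coordinate convention under discussion, assuming the sequence length is
already known (which is exactly the part GFF3 does not pin down). The
function names are hypothetical and this is not EMBOSS code.

```python
def wrap_origin_span(start, end, seq_length):
    """Return (start, end) for a feature on a circular sequence.

    If the feature crosses the origin (end < start), the end is
    reported as end + seq_length, so it exceeds the sequence length.
    """
    if not (1 <= start <= seq_length and 1 <= end <= seq_length):
        raise ValueError("coordinates must lie within 1..seq_length")
    if end < start:
        end += seq_length
    return start, end


def unwrap_origin_span(start, end, seq_length):
    """Split a span whose end exceeds seq_length back into two segments."""
    if end <= seq_length:
        return [(start, end)]
    return [(start, seq_length), (1, end - seq_length)]
```

With the numbers from the NEQ003 case (length 490885), the two segments
490883..490885 and 1..879 collapse to a single span 490883..491764, and
a reader without the correct length cannot split it back reliably.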

> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).

The obvious fix is to lie about the feature types of the exons so the 
validator is happy. We could call them exons, but "region" would be safer.

But there is a silly complication with CDS features: we could keep the 
CDS parent record and have it as a parent of a group of "regions" for 
the processed exons. But GFF3 wants the exons to be type "CDS" so what 
do we call the parent?

So in the cobbled together example below, ignoring the circular aspects, 
we would want to keep the CDS on the parent (ID=NC_005213.11) record 
where all the annotation tags are, but I suspect GFF3 wants that to be 
something else. We could of course specifically lie about CDS features 
for EMBOSS generated GFF3 files (we tag the header) so we can restore 
the correct internal structure on input.

NC_005213	EMBL	CDS	490883	491764	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	ID=NC_005213.12;Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	ID=NC_005213.13;Parent=NC_005213.11
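
As an illustration of the first option only (renaming the child parts to
"region" so that parent and child no longer share a type), a rough
Python sketch follows. It assumes nine-column, tab-separated feature
lines with a single Parent per child and no comment or directive lines.
The function name is hypothetical, this is not what EMBOSS does, and it
does not settle what the CDS parent should be called.

```python
def retype_same_type_children(lines, child_type="region"):
    """Rename the type of any feature whose Parent has the same type."""
    rows = [line.rstrip("\n").split("\t") for line in lines]

    def attributes(row):
        # Column 9 holds tag=value pairs separated by ';'.
        return dict(p.split("=", 1) for p in row[8].split(";") if "=" in p)

    # Record the original type of every feature that carries an ID.
    type_by_id = {}
    for row in rows:
        attrs = attributes(row)
        if "ID" in attrs:
            type_by_id[attrs["ID"]] = row[2]

    out = []
    for row in rows:
        parent = attributes(row).get("Parent")
        if parent is not None and type_by_id.get(parent) == row[2]:
            row = row[:2] + [child_type] + row[3:]
        out.append("\t".join(row))
    return out
```

Applied to the three records above, the parent line stays CDS and the
two child lines become "region".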

> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved

Fixed and we can patch the release. Making all tags lower case is 
trivial - they are automatically converted on input to the internal 
mixed case.
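
For anyone post-processing existing files in the meantime, a minimal
Python sketch of the same idea: lower-case every attribute tag in
column 9 except the tags the GFF3 specification itself reserves. The
function name is hypothetical and this is not the EMBOSS patch. The EC
number in the usage note is a made-up value for illustration.

```python
# Attribute tags with predefined meanings in the GFF3 specification.
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def lowercase_custom_tags(attributes):
    """Lower-case non-reserved tag names in a GFF3 attribute string.

    Values are left untouched. A literal ';' inside a value is escaped
    as %3B in GFF3, so splitting on ';' is safe for well-formed input.
    """
    out = []
    for pair in attributes.split(";"):
        if not pair:
            continue
        tag, sep, value = pair.partition("=")
        if tag not in RESERVED_TAGS:
            tag = tag.lower()
        out.append(tag + sep + value)
    return ";".join(out)
```

For example, "ID=NC_005213.11;EC_number=1.1.1.1" becomes
"ID=NC_005213.11;ec_number=1.1.1.1".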

> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)

Hope this helps. Progress is being made.

However, as GFF3 is such a pain, I am wondering whether to switch the 
default feature format to something else - back to GFF2 or maybe to use GTF.

Does anyone have a preference?

regards,

Peter Rice
EMBOSS Team


