[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3

Peter Rice pmr at ebi.ac.uk
Tue Aug 30 15:48:25 UTC 2011


On 08/26/2011 03:27 AM, Peter Cock wrote:
> On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>> On 25/08/2011 01:44, Peter Cock wrote:

> It would make sense to propose that the first line has all the annotation,
> and the subsequent lines from the same feature just need the ID,
> and, if it is adopted, the part tag recently discussed on the song-devel
> list to make the order of the sub-parts explicit.
> http://sourceforge.net/mailarchive/message.php?msg_id=27960475

The part tag is interesting and would map to the internal "exon" 
attribute in EMBOSS which we reserve for sorting.

>> I think we'll go with the annotation of the "biological_region" parent and
>> wait for anyone with a use case that actually requires massively replicated
>> annotation.
>>
>
> Have you looked at the BioPerl GenBank to GFF3 conversion?
> I understand GBrowse recommends this as a way to get
> GenBank format data into GBrowse. I'm also pretty sure that
> this is being used inside TogoWS for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3

Hmmm .... the GFF3 has Parent references to the protein_id, but it 
doesn't appear as an ID.
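
As an illustration only (not EMBOSS, BioPerl or TogoWS code), a check for
this kind of dangling reference can be sketched in Python. The function
name is hypothetical, and it assumes tab-separated GFF3 feature lines with
simple key=value attributes in column 9:

```python
def dangling_parents(gff3_text):
    """Return Parent values that never appear as an ID in the same GFF3 text."""
    ids, parents = set(), set()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        for pair in cols[8].split(";"):
            key, _, value = pair.partition("=")
            if key == "ID":
                ids.add(value)
            elif key == "Parent":
                parents.update(value.split(","))  # Parent may list several IDs
    return parents - ids
```

An empty result means every Parent value resolves to an ID defined
somewhere in the same file.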

I do not like using a second region to put the description line in. 
Using the organism as the ID for the source line also looks odd.

> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.

I don't see a parent meta-feature there.

> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.

I tried the canonical gene example:

##gff-version 3
##sequence-region ctg123 1 9000
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	five_prime_UTR	1050	1200	.	+	.	Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	three_prime_UTR	7601	9000	.	+	.	Parent=mRNA00001
ctg123	.	cDNA_match	1050	1500	5.8e-42	+	.	ID=match00001;Target=cdna0123+12+462
ctg123	.	cDNA_match	5000	5500	8.1e-43	+	.	ID=match00001;Target=cdna0123+463+963
ctg123	.	cDNA_match	7000	9000	1.4e-40	+	.	ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg

I can now (code not yet checked in) reproduce this, apart from the 
example sequence being too short for the feature coordinates.

Internally, EMBOSS generates parent features for CDS and cDNA_match 
(where several features share an ID), and the parent structure is preserved.

On output, the generated features are not reported, so the GFF3 output 
is identical to the input.
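
The grouping described above can be sketched roughly as follows. This is
an assumption-laden illustration in Python, not the EMBOSS implementation:
the function name and the tuple layout are hypothetical, and it only
derives the span that an implied parent would cover for each ID shared by
more than one line.

```python
from collections import defaultdict

def implied_parents(feature_lines):
    """Map each ID shared by several lines to (seqid, type, start, end, strand)."""
    groups = defaultdict(list)
    for line in feature_lines:
        if line.startswith("#"):
            continue  # directives and comments carry no features
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        # Split column 9 into a key -> value dictionary
        attrs = dict(p.partition("=")[::2] for p in cols[8].split(";") if p)
        if "ID" in attrs:
            groups[attrs["ID"]].append(cols)
    parents = {}
    for fid, rows in groups.items():
        if len(rows) < 2:
            continue  # a single-line feature needs no implied parent
        start = min(int(r[3]) for r in rows)
        end = max(int(r[4]) for r in rows)
        parents[fid] = (rows[0][0], rows[0][2], start, end, rows[0][6])
    return parents
```

Applied to the four CDS lines of the canonical example, this would give a
single entry for cds00001 spanning 1201 to 7600. Because the implied
parents are kept separate from the input lines, writing the lines back
out unchanged reproduces the original GFF3.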

If we read EMBL/GenBank entries then we will generate a parent feature 
with type "biological_region" to attach the annotation from the join. 
Reproducing the "parent" relationships is a separate exercise that could 
be a separate application. In terms of reading one format and writing 
another I prefer not to generate any GFF3-specific extras.
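
To make the idea concrete, here is a minimal Python sketch of one way a
forward-strand join could be written with a "biological_region" parent.
It is not the EMBOSS code: the function name, the choice to put all
annotation on the parent, and the use of Parent= on the segments are
assumptions for illustration. The phase column is left as "." rather than
computed, and the annotation string is assumed to be already escaped for
GFF3.

```python
import re

def join_to_gff3(seqid, ftype, location, feature_id, annotation):
    """Turn a forward-strand 'join(a..b,c..d)' location into GFF3 lines."""
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    if not spans:
        raise ValueError("no a..b spans found in location: " + location)
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    # Parent meta-feature spans the whole join and carries the annotation
    lines = ["\t".join([seqid, ".", "biological_region", str(start), str(end),
                        ".", "+", ".", "ID=%s;%s" % (feature_id, annotation)])]
    # One line per segment, in join order, pointing back at the parent
    for s, e in spans:
        lines.append("\t".join([seqid, ".", ftype, str(s), str(e),
                                ".", "+", ".", "Parent=" + feature_id]))
    return lines
```

For example, join_to_gff3("seq1", "CDS", "join(10..20,30..40)", "cds1",
"Note=example") would yield one biological_region line covering 10 to 40
followed by two CDS lines.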

> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.

We could propose something for the 
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices page 
to describe how to represent EMBL/GenBank entries in GFF3 (after due 
discussion on the SONG-devel list).

regards,

Peter Rice
EMBOSS Team


