[emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Peter Rice
pmr at ebi.ac.uk
Tue Aug 30 15:48:25 UTC 2011
On 08/26/2011 03:27 AM, Peter Cock wrote:
> On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>> On 25/08/2011 01:44, Peter Cock wrote:
> It would make sense to propose that the first line has all the annotation,
> and the subsequent lines from the same feature just need the ID,
> and, if it is adopted, the part tag recently discussed on the song-devel
> list to make the order of the sub-parts explicit.
> http://sourceforge.net/mailarchive/message.php?msg_id=27960475
The part tag is interesting and would map to the internal "exon"
attribute in EMBOSS which we reserve for sorting.
>> I think we'll go with the annotation of the "biological_region" parent and
>> wait for anyone with a use case that actually requires massively replicated
>> annotation.
>>
>
> Have you looked at the BioPerl GenBank to GFF3 conversion?
> I understand GBrowse recommends this as a way to get
> GenBank format data into GBrowse. I'm also pretty sure that
> this is being used inside TogoWS for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3
Hmmm .... the GFF3 has Parent references to the protein_id, but it
doesn't appear as an ID.
I do not like using a second region to put the description line in.
Using the organism as the ID for the source line also looks odd.
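The dangling Parent reference noted above is easy to check for mechanically. The following is only a rough sketch in Python, not part of EMBOSS or BioPerl: the function names and the two-line example record are hypothetical, and it assumes tab-separated GFF3 with nine columns.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            attrs[tag] = value.split(",")
    return attrs


def dangling_parents(gff3_text):
    """Return the sorted Parent values that never appear as an ID."""
    ids, parents = set(), set()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence data follows, no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(columns[8])
        ids.update(attrs.get("ID", []))
        parents.update(attrs.get("Parent", []))
    return sorted(parents - ids)


# Hypothetical example: the CDS points at a parent that is never defined.
example = "\n".join([
    "\t".join(["seq1", ".", "gene", "1", "100", ".", "+", ".", "ID=gene1"]),
    "\t".join(["seq1", ".", "CDS", "1", "100", ".", "+", "0", "Parent=prot1"]),
])
print(dangling_parents(example))
```

An empty list would mean every Parent value resolves to some ID in the same file.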
> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.
I don't see a parent meta-feature there.
> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.
I tried the canonical gene example:
##gff-version 3
##sequence-region ctg123 1 9000
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . five_prime_UTR 1050 1200 . + . Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . three_prime_UTR 7601 9000 . + . Parent=mRNA00001
ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123+12+462
ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123+463+963
ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg
I can now (code not yet checked in) reproduce this, subject to the
sequence being too short.
Internally, EMBOSS generates parent features for CDS and cDNA_match
(where several features share an ID), and the parent structure is preserved.
On output, the generated features are not reported, so the GFF3 output is
identical to the input.
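To illustrate the idea of an implied parent for lines that share an ID, here is a small Python sketch. It is not the EMBOSS implementation (which is not shown in this message); the function names are made up for illustration, and the rows are copied from the canonical gene example above.

```python
from collections import defaultdict

# (type, start, end, attributes) rows taken from the canonical gene example
ROWS = [
    ("gene", 1000, 9000, "ID=gene00001;Name=EDEN"),
    ("CDS", 1201, 1500, "ID=cds00001;Parent=mRNA00001"),
    ("CDS", 3000, 3902, "ID=cds00001;Parent=mRNA00001"),
    ("CDS", 5000, 5500, "ID=cds00001;Parent=mRNA00001"),
    ("CDS", 7000, 7600, "ID=cds00001;Parent=mRNA00001"),
    ("cDNA_match", 1050, 1500, "ID=match00001;Target=cdna0123+12+462"),
    ("cDNA_match", 5000, 5500, "ID=match00001;Target=cdna0123+463+963"),
    ("cDNA_match", 7000, 9000, "ID=match00001;Target=cdna0123+964+2964"),
]


def feature_id(attributes):
    """Return the value of the ID tag, or None if the line has no ID."""
    for field in attributes.split(";"):
        if field.startswith("ID="):
            return field[3:]
    return None


def implied_parents(rows):
    """Summarise every ID used on more than one line.

    Returns {id: (type, min_start, max_end, number_of_parts)}. These
    summaries are internal only and would not be written on output.
    """
    parts = defaultdict(list)
    for ftype, start, end, attributes in rows:
        fid = feature_id(attributes)
        if fid is not None:
            parts[(ftype, fid)].append((start, end))
    parents = {}
    for (ftype, fid), segments in parts.items():
        if len(segments) > 1:
            starts = [s for s, _ in segments]
            ends = [e for _, e in segments]
            parents[fid] = (ftype, min(starts), max(ends), len(segments))
    return parents


for fid, summary in implied_parents(ROWS).items():
    print(fid, summary)
```

For these rows the sketch reports cds00001 as four parts spanning 1201 to 7600 and match00001 as three parts spanning 1050 to 9000; gene00001 appears once and gets no implied parent.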
If we read EMBL/GenBank entries then we will generate a parent feature
with type "biological_region" to attach the annotation from the join.
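As a rough picture of what that could look like, the Python sketch below turns a forward-strand join location into one biological_region line carrying the annotation plus one line per segment. Everything here is an assumption made for illustration: the function name, the use of Parent to link the segments, the generated ID "region1" and the example location are hypothetical, and the actual EMBOSS output may differ.

```python
import re
from urllib.parse import quote


def join_to_gff3(seqid, ftype, location, annotation, parent_id):
    """Render a forward-strand join(...) location as GFF3 lines.

    Only simple start..end segments are handled; complement(), single
    bases and remote references are ignored in this sketch. The phase
    column is left as "."; real CDS lines would need a computed phase.
    """
    segments = [(int(a), int(b))
                for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    if not segments:
        raise ValueError("no start..end segments found in %r" % location)
    start = min(s for s, _ in segments)
    end = max(e for _, e in segments)
    # Percent-encode reserved characters in values, keeping spaces readable.
    attrs = ["ID=" + parent_id]
    attrs += ["%s=%s" % (tag, quote(str(value), safe=" "))
              for tag, value in annotation]
    lines = ["\t".join([seqid, ".", "biological_region", str(start),
                        str(end), ".", "+", ".", ";".join(attrs)])]
    for s, e in segments:
        lines.append("\t".join([seqid, ".", ftype, str(s), str(e),
                                ".", "+", ".", "Parent=" + parent_id]))
    return lines


# Hypothetical feature: mRNA join(1050..1500,5000..5500,7000..9000)
lines = join_to_gff3("ctg123", "mRNA",
                     "join(1050..1500,5000..5500,7000..9000)",
                     [("gene", "EDEN"), ("note", "a;b")],
                     "region1")
print("\n".join(lines))
```

The annotation appears once, on the parent line, and the segment lines carry only the back-reference.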
Reproducing the "parent" relationships is a separate exercise that could
be a separate application. In terms of reading one format and writing
another I prefer to not generate any GFF3-specific extras.
> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.
We could propose something for the
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices page
to describe how to represent EMBL/GenBank entries in GFF3 (after due
discussion on the SONG-devel list).
regards,
Peter Rice
EMBOSS Team