[Open-bio-l] Best practice for modelling data in GFF

Leighton Pritchard lpritc at scri.ac.uk
Tue Jul 6 10:58:44 UTC 2010


Hi Dan,

GFF3 is just a file format, capable of representing the SO's hierarchical
subfeatures.  You can represent other things (including other ontologies) in
the same format.  How strictly you choose to stick to the SO's hierarchy is
up to you, whether you use GFF3 to represent your data or not: you are free
to be as canonical or noncanonical as you like.  You are not constrained to
having mRNA be the parent of a CDS by the file format - you can happily
create a model that has things the other way round and represent it in valid
GFF3.  It will be biological nonsense and not SO-compliant, but you can do
it:

broken	.	CDS	100	1100	.	+	0	ID=cds01
broken	.	mRNA	1	1500	.	+	.	ID=mRNA01;Parent=cds01

Putting your model into GFF3 is a separate issue to building the model, so
long as GFF3 is capable of representing your model.  And if the package
you're loading your GFF3 into doesn't care about ontologies and
relationships, you'll get away with it.
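
To make the point concrete, here is a minimal sketch of the kind of check a stricter package might run on the two lines above. The ALLOWED_PARENTS table is a hypothetical, hand-written subset for illustration only; it is not derived from the SO, and the function names are my own. The start coordinate is 1 because GFF3 coordinates are 1-based.

```python
# Hypothetical, hand-written subset of child type -> allowed parent types.
# A real check would derive this table from the SO itself.
ALLOWED_PARENTS = {
    "mRNA": {"gene"},
    "exon": {"mRNA"},
    "CDS": {"mRNA"},
}


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into (type, attributes)."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return cols[2], attrs


def find_noncanonical(lines):
    """List (child_id, child_type, parent_id, parent_type) for every
    Parent link whose type pairing is not in ALLOWED_PARENTS."""
    features = [
        parse_gff3_line(l)
        for l in lines
        if l.strip() and not l.startswith("#")
    ]
    type_by_id = {a["ID"]: t for t, a in features if "ID" in a}
    problems = []
    for ftype, attrs in features:
        # A feature may list several parents, separated by commas.
        for parent_id in filter(None, attrs.get("Parent", "").split(",")):
            parent_type = type_by_id.get(parent_id)
            if parent_type not in ALLOWED_PARENTS.get(ftype, set()):
                problems.append((attrs.get("ID"), ftype, parent_id, parent_type))
    return problems


broken = [
    "broken\t.\tCDS\t100\t1100\t.\t+\t0\tID=cds01",
    "broken\t.\tmRNA\t1\t1500\t.\t+\t.\tID=mRNA01;Parent=cds01",
]
print(find_noncanonical(broken))
```

The file parses without complaint; only the comparison against the table flags the mRNA-with-CDS-parent link, which is the separation between format and model described above.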

So yes, if you want to build a SO-compatible gene model, you had better make
sure the parent-child relationships correspond to the hierarchy in the SO.
This is true whether you want to represent the model in GFF3 or not.

Now, for your specific question about the exon/mRNA terms: an exon is_a
transcript_region, and a transcript_region is part_of a transcript.  [And a
transcript is_a gene_member_region, and a gene_member_region is member_of
a gene.]

Now, an mRNA is_a mature_transcript, which is_a transcript.  The exon that
is part_of a transcript can therefore be part_of an mRNA, because an mRNA is
a transcript.

So in the model at http://www.sequenceontology.org/gff3.shtml the transcript
you're looking for is the mRNA.  The same would be true if the parent
feature was a monocistronic_mRNA, which is_a mRNA, and also is_a
monocistronic_transcript, which is_a transcript.
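
The reasoning above can be sketched in a few lines of Python. The two dictionaries are a fragment hand-copied from the relationships named in this email, not loaded from the SO, and the function names (ancestors, can_be_part_of) are hypothetical; treat it as an illustration of the inference rather than a real ontology reasoner.

```python
# Fragment of the relationships named in this email, copied by hand.
IS_A = {
    "exon": ["transcript_region"],
    "mRNA": ["mature_transcript"],
    "mature_transcript": ["transcript"],
    "monocistronic_mRNA": ["mRNA", "monocistronic_transcript"],
    "monocistronic_transcript": ["transcript"],
}
PART_OF = {"transcript_region": ["transcript"]}


def ancestors(term):
    """Return the term itself plus everything reachable through is_a."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def can_be_part_of(child, parent):
    """True if some is_a ancestor of child is part_of a term that the
    parent is (directly or through is_a) a kind of."""
    parent_kinds = ancestors(parent)
    return any(
        whole in parent_kinds
        for term in ancestors(child)
        for whole in PART_OF.get(term, [])
    )


print(can_be_part_of("exon", "mRNA"))
print(can_be_part_of("exon", "monocistronic_mRNA"))
print(can_be_part_of("mRNA", "exon"))
```

The first two calls succeed because both mRNA and monocistronic_mRNA reach transcript through is_a; the third fails because nothing in this fragment makes an mRNA part_of an exon.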

Have you had a look at OBO-Edit?  It's a useful learning tool for getting
your head around these things, and you can browse through the SO in it.

Cheers,

L.


On Tuesday, 6 July 2010, 11:10, "Dan Bolser" <dan.bolser at gmail.com>
wrote:

> When you don't get a reply, you never know if your question was too
> dumb, too smart, or totally off topic.
> 
> Any hints?
> 
> Cheers,
> Dan.
> 
> On 1 July 2010 11:12, Dan Bolser <dan.bolser at gmail.com> wrote:
>> On 29 May 2010 00:08, Dan Bolser <dan.bolser at gmail.com> wrote:
>>> Thanks all for replies.
>> 
>> <snip>
>> 
>>> There is a canonical way to model a gene, so I was wondering if it
>>> makes sense to describe similar 'biology' (or in this case molecular
>>> biology) in standard ways (when the feature isn't simply described by
>>> a single line of GFF)?
>>> 
>>> Perhaps I've not understood SO properly, but I'm not sure how its
>>> structure is translated into GFF structure ... is there a 1 to 1
>>> mapping?
>> 
>> Lack of replies led me to believe that indeed, the GFF Parent
>> attribute should reflect (or be strictly determined by) the SO
>> 'relationships' (are they all 'part_of' relationships?)
>> 
>> However, I was trying to get some concepts clear in my head, and I
>> ended up creating a figure of a 'canonical gene' in SO [1], based on
>> the one in the GFF docs [2].
>> 
>> [1] http://imagebin.ca/view/Ni9BFbK.html
>> [2] http://www.sequenceontology.org/gff3.shtml
>> 
>> 
>> There is a transitive part_of relationship between 'mRNA' and 'gene',
>> which explains lines 4 to 6 of the canonical gene GFF [2].
>> 
>> However, the figure shows that 'exon' is part_of 'transcript', and not
>> part_of 'mRNA'. If I got the figure right, and if I understand
>> correctly, there is no way to transitively infer that exon is part_of
>> mRNA (lines 7 to 11 of the GFF [2]).
>> 
>> This implies that the 'structure' in GFF isn't strictly determined by SO.
>> 
>> Or is it a mistake in SO?
>> 
>> 
>> Sorry if this is a 'gotcha' that has been discussed before. Any links
>> to help me understand would be great.
>> 
>> Dan.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

