[Open-bio-l] Best practice for modelling data in GFF

Brad Chapman chapmanb at 50mail.com
Tue Jun 1 11:34:20 UTC 2010


Dan;
If your goal is to represent your data so that as many people as
possible can parse and reuse it, my suggestion would be to use
SAM/BAM to represent your alignments. You'll be using a standardized and
well-supported format specifically designed for this type of data.
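
As a rough illustration of why SAM fits here: the pairing cases you list
(one read missing, a pair spanning two reference sequences, wrong
proximity or orientation) are all recorded per read in the FLAG and RNEXT
fields. The sketch below decodes them in plain Python. The bit values
come from the SAM spec; the function names and the status labels are my
own, made up for the example.

```python
# FLAG bit values as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def pair_status(flag, rname, rnext):
    """Classify a read's pairing state (labels are illustrative only).

    rname and rnext are SAM columns 3 and 7; "=" in RNEXT means the
    mate is on the same reference as this read.
    """
    if not flag & 0x1:
        return "unpaired"
    if flag & 0x4 or flag & 0x8:
        return "one_end_unmapped"
    if rnext not in ("=", rname):
        return "spans_two_references"
    if flag & 0x2:
        return "proper_pair"
    return "improper_pair"
```

For example, decode_flag(99) lists the paired, proper_pair,
mate_reverse_strand and first_in_pair bits, and pair_status(65, "chr1",
"chr2") reports a pair split across two references.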

While you can do this with GFF, the parser support for correctly
dealing with match_part or part_of is likely to be less robust.
As data providers standardize on one way to represent nested
features, it should become easier to deal with them.
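
To make the GFF side concrete, here is a minimal sketch of what a
consumer has to do to reassemble pairs from match/match_part lines: read
column 9, follow the Parent attribute, and group the parts. The GFF3
records, the IDs (pair1, pair1.r1, ...) and the "example" source are
invented for illustration, and this is not how any particular parser
does it.

```python
from collections import defaultdict

# Hypothetical GFF3 records: pair1 has both ends, pair2 has only one.
GFF3 = """\
##gff-version 3
chr1\texample\tmatch\t100\t600\t.\t.\t.\tID=pair1
chr1\texample\tmatch_part\t100\t150\t.\t+\t.\tID=pair1.r1;Parent=pair1
chr1\texample\tmatch_part\t550\t600\t.\t-\t.\tID=pair1.r2;Parent=pair1
chr1\texample\tmatch\t900\t950\t.\t.\t.\tID=pair2
chr1\texample\tmatch_part\t900\t950\t.\t+\t.\tID=pair2.r1;Parent=pair2
"""


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    pairs = (item.split("=", 1) for item in column9.split(";") if item)
    return {key: value for key, value in pairs}


def group_parts(gff_text):
    """Map each parent ID to its match_part (start, end, strand) tuples."""
    parts = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if cols[2] != "match_part":
            continue
        attrs = parse_attributes(cols[8])
        parts[attrs["Parent"]].append((int(cols[3]), int(cols[4]), cols[6]))
    return dict(parts)
```

With the sample above, group_parts(GFF3) gives two parts for pair1 and
one for pair2, so a pair with a missing end shows up only as a parent
with a single child. Nothing in the file says why the mate is absent,
which is the kind of information SAM carries explicitly.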

Brad

> Thanks all for replies.
> 
> I'm aware of the GFF spec, and the SO ontology terms. The issue here
> (as I understand it) is that the feature isn't 'flat', but is a
> combination of two matching 'reads' that are grouped into a mate-pair
> depending on their proximity and orientation. As pointed out, not
> every pair is successfully mapped, specifically one read may be
> 'missing' from the pair, the pair may span two reference sequences, or
> the proximity or orientation of the pair may be incorrect.
> 
> Strictly speaking this can be handled by match and match_part (or
> read_pair and part_of) terms, however, the question is, does this
> reflect the biology adequately? (And specifically which terms should
> be used?)
> 
> There is a canonical way to model a gene, so I was wondering if it
> makes sense to describe similar 'biology' (or in this case molecular
> biology) in standard ways (when the feature isn't simply described by
> a single line of GFF)?
> 
> Perhaps I've not understood SO properly, but I'm not sure how its
> structure is translated into GFF structure ... is there a 1 to 1
> mapping?
> 
> 
> Cheers,
> Dan.
> 
> On 28 May 2010 18:49, Chris Fields <cjfields at illinois.edu> wrote:
> > All,
> >
> > Appears that link isn't up to date.  Current GFF3 spec (v. 1.16, updated May 25) here:
> >
> > http://www.sequenceontology.org/gff3.shtml
> >
> > chris
> >
> > On May 28, 2010, at 12:06 PM, Jason Stajich wrote:
> >
> >> It's covered in the GFF3 spec as match_part if that helps.
> >> http://song.sourceforge.net/gff3.shtml
> >>
> >> Dan Bolser wrote, On 5/28/10 9:29 AM:
> >>> Hi guys,
> >>>
> >>> Not sure if this is the right forum, but I just thought I'd ask...
> >>>
> >>> Where can I find information on 'best practices' for modelling
> >>> biological data in GFF?
> >>>
> >>> For example, I'd like to model paired-end sequence alignments in GFF.
> >>> One suggestion was to use match/match_part to link each end into a
> >>> pair. Another option is to use 'read_pair' with 'contig' for the
> >>> parent feature...
> >>>
> >>> Should I just be using SAM/BAM?
> >>>
> >>> Seems a shame not to have a standard way to do this in GFF...
> >>>
> >>>
> >>> Cheers,
> >>> Dan.
> >>> _______________________________________________
> >>> Open-Bio-l mailing list
> >>> Open-Bio-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l
> >>>
> >
> >
> 
