[Bioperl-l] Proposal for Meta data

Mon Dec 18 13:51:50 UTC 2006

Reading the discussion, I think it is time to draw some guidelines.

1. Base the Meta implementation on real use cases.

   MSA is a good example.

2. Allow generalisations

   If you can see another implementation of the same idea that can be merged
   with the first, do it, but do not hurt yourself if you cannot.

The most difficult question is how to separate case-specific attributes,
which are best implemented by subclassing with additional methods, from truly
widely variable meta data, which is best handled by a parallel-track class
that holds the meta information.
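
To make the two options concrete, here is a minimal sketch in Python (not
BioPerl code). All class and method names (Seq, QualitySeq, MetaTracks,
quality_at, set_track, column) are hypothetical and chosen only for
illustration; positions are assumed to be 1-based.

```python
class Seq:
    """Hypothetical plain sequence holder."""

    def __init__(self, residues):
        self.residues = residues


class QualitySeq(Seq):
    """Option A: a case-specific attribute added by subclassing.

    The attribute gets its own named method, so a user can see from the
    class what is available.
    """

    def __init__(self, residues, qualities):
        super().__init__(residues)
        if len(qualities) != len(residues):
            raise ValueError("need one quality value per residue")
        self.qualities = list(qualities)

    def quality_at(self, pos):
        # pos is 1-based
        return self.qualities[pos - 1]


class MetaTracks:
    """Option B: a parallel-track holder for open-ended per-position data.

    Any number of named tracks can be attached, as long as each one has
    exactly one value per position of the described object.
    """

    def __init__(self, length):
        self.length = length
        self._tracks = {}

    def set_track(self, name, values):
        if len(values) != self.length:
            raise ValueError("track length must match object length")
        self._tracks[name] = values

    def names(self):
        # Lets a user discover which tracks are present.
        return sorted(self._tracks)

    def column(self, pos):
        # All track values at one 1-based position.
        return {name: values[pos - 1] for name, values in self._tracks.items()}
```

In option A the interface itself documents what is there. In option B the
user has to ask the holder (here through names()) what it contains, which is
the trade-off discussed in this message.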

The problem I see with undefined, totally open meta annotation is that if you
can put anything in there, it is also totally confusing to a user. If you can
put anything in, how do you know what to get out, and how do you know that it
is there?

That leads to the third guideline:

3. Use separate meta classes only when there are several different ways of 
encoding data that is present in large numbers *and* when you are expecting 
to be assessing the data computationally rather than just checking if an 
attribute is there. 

	-Heikki

On Friday 15 December 2006 19:23, Chris Fields wrote:
> On Dec 15, 2006, at 8:28 AM, Jason Stajich wrote:
> > On Dec 14, 2006, at 9:21 PM, Chris Fields wrote:
> >> On Dec 14, 2006, at 7:45 PM, David Messina wrote:
> >>> Hey Chris,
> >>>
> >>> My thoughts below.
> >>>
> >>>> [Chris]
> >>>> This could be used to annotate any PrimarySeq, LocatableSeq,
> >>>> SimpleAlign, SeqFeature, or what-have-you, maybe in a collection
> >>>> (similar to AnnotationCollection).  I thought something like this
> >>>> may be of general use for any PrimarySeq (quality, structure),
> >>>> alignments like NEXUS and Stockholm, SeqFeatures where structure
> >>>> could be stored (tRNA or riboswitches), etc.
> >>>>
> >>>> However, this also seems to fall into the category of sequence
> >>>> annotation.  So, would it be better to have a set of
> >>>> Bio::Annotation classes used for this purpose?
> >>>
> >>> To me, all meta data is equal. That is, your classic Genbank feature
> >>> annotation and a user's arbitrary meta-tag like "Bob thinks this is
> >>> a kinase domain" aren't different in kind even if they are different
> >>> in content.
> >>>
> >>> As resequencing projects multiply, the ability to create arbitrary
> >>> meta tags, attach them to different types of objects, and use those
> >>> tags to link them together will become desirable, if not essential.
> >>>
> >>> Keeping a common interface to all of these meta data types would be
> >>> advantageous, plus new users won't have to determine whether they
> >>> need to use Bio::Meta objects or Bio::Annotation objects.
> >>>
> >>> So I would argue for all of the meta data types to live "under one
> >>> roof". Which roof isn't as important. Bio::Annotation, since it
> >>> already exists for today's meta data, seems like a reasonable choice
> >>> (assuming Annotation objects are flexible enough to be extended as
> >>> you propose).
> >>>
> >>> There, and no flames or jibes even. :)
> >>
> >> I guess what I want to know is whether there should be a
> >> distinction between 'normal' sequence annotation (comments,
> >> references, and so on) and annotation that could be best described as
> >> position-specific (like RNA or protein structural annotation).  The
> >> current meta implementation is for sequence data only; I felt it
> >> would be nice to have a generic implementation that would be
> >> applicable to any object data.
> >
> > my stream-of-consciousness for right now:
> >
> > I was thinking Bio::Annotation is where this should go - that
> > system doesn't have anything about it that makes it explicitly
> > sequence related. What we're trying to hammer out here on the
> > Alignment side - which fits with your RNA example - is have
> > features, basically SeqFeatures - associated with alignments so
> > columns can be annotated to cover things like character sets and
> > partitions for phylogenetic analyses.  As for data which annotates
> > non-contiguous things like RNA stems, we may have to be more
> > creative about that or model it with a splitLocation.
> >
> > So currently we've added code so that an Alignment is-a
> > Bio::AnnotatableI and is-a Bio::FeatureHolderI to move towards this
> > end, with the goal of being able to capture more of the data that
> > can be represented in a NEXUS file.
> >
> > It feels more like a hack than an elegant Meta-data solution, but I
> > am not totally sure what data you are thinking about handling at
> > this point; perhaps I need to spend more time thinking about it.
> > Or are you worried about the idea of whether the semantic mapping
> > of the data into features or annotations is confusing users?
>
> Sorry in advance for the longish response here...
>
> My original thought was to have a generic abstract class capable of
> positionally describing data in any other class, similar to
> Heikki's Bio::Seq::MetaI but not constrained to sequence data only.
> Implementing classes would be capable of having different data
> structures based on their use (simple string, array, AoA, AoH, AoO).
> One MetaCollection class to contain them all in a tag-like system, so
> you could have mixed data types describe the same object.  The latter
> Collection class is so similar to AnnotationCollection that I agree
> Bio::Annotation would be the best place for this.
>
> The way I reconfigured Stockholm alignment parsing/writing is to use
> Bio::Seq::Meta objects (which are LocatableSeq).  Each Seq::Meta is
> capable of holding a sequence and several meta strings, stored as
> tags or 'names'.  However, there is no Meta object for alignments
> (for RNA/protein structure consensus and other Rfam/Pfam markup); I
> hacked around this by using a Bio::Seq::Meta w/o a seq, but I would
> rather have a generic Meta object independent of the sequence cruft.
>
> So for this partial Pfam alignment,
>
> Q92SV1_RHIME/122-299         LAMALNLARGI...VDADVDF..REG
> #=GR Q92SV1_RHIME/122-299 pAS .........................
> Q883D2_PSESM/110-290         LGLMLGLRRRL...FDGNGAV..KRS
> Q8ZXP5_PYRAE/91-262          LALLLAPYKRI...IQYGEKM..KRG
> #=GR Q8ZXP5_PYRAE/91-262 SS  HHHHHHHHTTH...HHHHHHX..HTT
> #=GR Q8ZXP5_PYRAE/91-262 SA  00000000000...120030X..474
> #=GC SS_cons                 HHHHHHHHTTH...HHHHHHH..HTT
> #=GC SA_cons                 03002200312...1312414..676
> #=GC seq_cons                luhhLuhsRpl...hthppth..+pG
> //
>
> '#=GC' lines would be in generic meta string objects in the
> alignment, while '#=GR' tags would be in similar meta objects in the
> relevant sequences.  As long as both aren't AnnotatableI this isn't
> an issue.
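
The split described above can be sketched in a few lines of Python. This is
only an illustration of the idea, not the BioPerl Stockholm parser: the
function name split_stockholm_markup and the three returned dictionaries are
hypothetical, and the sketch assumes well-formed lines and ignores every
other '#' line (such as #=GF and #=GS).

```python
def split_stockholm_markup(lines):
    """Separate sequences, per-sequence (#=GR) and per-alignment (#=GC) markup.

    Returns three dictionaries:
      sequences:      seq_id -> aligned residues
      per_sequence:   seq_id -> {tag -> markup string}
      per_alignment:  tag -> markup string
    Strings for the same key are concatenated, so interleaved blocks work.
    """
    sequences = {}
    per_sequence = {}
    per_alignment = {}
    for line in lines:
        line = line.rstrip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC"):
            # "#=GC <tag> <markup>": belongs to the whole alignment
            _, tag, value = line.split(None, 2)
            per_alignment[tag] = per_alignment.get(tag, "") + value
        elif line.startswith("#=GR"):
            # "#=GR <seq_id> <tag> <markup>": belongs to one sequence
            _, seq_id, tag, value = line.split(None, 3)
            tracks = per_sequence.setdefault(seq_id, {})
            tracks[tag] = tracks.get(tag, "") + value
        elif line.startswith("#"):
            # header and other markup lines are skipped in this sketch
            continue
        else:
            seq_id, residues = line.split(None, 1)
            sequences[seq_id] = sequences.get(seq_id, "") + residues
    return sequences, per_sequence, per_alignment
```

Fed the partial Pfam alignment quoted above, per_alignment would hold the
SS_cons, SA_cons and seq_cons strings, while per_sequence would hold pAS for
Q92SV1_RHIME/122-299 and SS and SA for Q8ZXP5_PYRAE/91-262.
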
>
> Similarly, NEXUS files which contained any position-based values
> could hold a meta string/array object in a similar tag.
>
> The basic scheme is:
>
>                     |--String
> Annotation::Meta----|--Array
>                     |--HorriblyComplexDataStruct
>
> Then I started thinking about where this could be applied, and
> whether a true Meta object needs to be constrained only to describing
> position-based data.  This somewhat relates to this bug:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=1825
>
> which seems to need a simple but unconstrained hash-of-arrays-based
> meta object.
>
> Then my head appropriately exploded...
>
> Hope everything is going well at the hackathon!  Looks like some
> interesting stuff coming out of it.
>
> chris

-- 
______ _/      _/_____________________________________________________
      _/      _/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
    _/_/_/_/_/  Associate Professor    skype: heikki_lehvaslaiho
   _/  _/  _/  SANBI, South African National Bioinformatics Institute
  _/  _/  _/  University of Western Cape, South Africa
     _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________