[Bioperl-l] Proposal for Meta data

Mon Dec 18 21:50:00 UTC 2006

I agree with everything Heikki is saying; I just wanted to highlight
one paragraph:

> The problem I see with undefined, totally open meta annotation is that
> if you can put anything in there, it is also totally confusing to a user.
> If you can put anything in, how do you know what to get out and know that
> it is there?

One solution is to give your annotation/metadata-model formal  
computational semantics and use ontologies to give additional  
semantics to your metadata tags. This provides both user information  
in the form of documentation, and a means of specifying to the  
computer exactly what should be done with the tags.
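
To make that concrete, here is a minimal sketch of a tag vocabulary that doubles as documentation and as a machine-checkable rule. It is written in Python purely for illustration; every name in it (TagTerm, VOCAB, Annotated, the EX: identifiers) is hypothetical and none of it is BioPerl API. A real system would load the terms from an ontology rather than hard-code them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TagTerm:
    """One vocabulary entry: an ID, human documentation, and a value check."""
    term_id: str
    definition: str
    is_valid: Callable[[str], bool]


# Hypothetical two-term vocabulary (assumption: stands in for a real ontology).
VOCAB: Dict[str, TagTerm] = {
    "secondary_structure": TagTerm(
        "EX:0001",
        "Per-residue secondary structure string",
        lambda v: set(v) <= set("HETSGBCX.-"),
    ),
    "comment": TagTerm("EX:0002", "Free-text remark", lambda v: True),
}


class Annotated:
    """Holds tag/value pairs, but only for tags the vocabulary defines."""

    def __init__(self) -> None:
        self._tags: Dict[str, List[str]] = {}

    def add(self, tag: str, value: str) -> None:
        term = VOCAB.get(tag)
        if term is None:
            raise KeyError(f"unknown tag {tag!r}; known tags: {sorted(VOCAB)}")
        if not term.is_valid(value):
            raise ValueError(f"value {value!r} is not valid for tag {tag!r}")
        self._tags.setdefault(tag, []).append(value)

    def get(self, tag: str) -> List[str]:
        return list(self._tags.get(tag, []))

    def describe(self) -> Dict[str, str]:
        """Answer 'what can I get out?' with each stored tag's definition."""
        return {tag: VOCAB[tag].definition for tag in self._tags}
```

The point of the design is that the same table answers the user's question ("what is in here and what does it mean?") and the computer's question ("is this value acceptable for this tag?").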

This is probably overkill for BioPerl; but if the use cases being
proposed do lean in the direction of a new metadata system that is  
not necessarily backwards compatible with the existing one, then I'd  
recommend checking out what's already out there before re-inventing  
the wheel. Perl RDF libraries are getting a little better.

If anyone is interested in pursuing this sort of thing (probably on a
branch), let me know.

On Dec 18, 2006, at 5:51 AM, Heikki Lehvaslaiho wrote:

>
> Reading the discussion, I think it is time to draw some guidelines.
>
> 1. Base the Meta implementation on real use cases.
>
>    MSA is a good example.
>
> 2. Allow generalisations
>
>    If you can see another implementation of the same idea that can be
>    merged with the first, do it, but do not hurt yourself if you cannot.
>
>
> The most difficult question is how to separate case-specific attributes
> that are best implemented by subclassing with additional methods from
> truly widely variable meta data that is best done as a parallel track
> meta information holding class.
>
> The problem I see with undefined, totally open meta annotation is that
> if you can put anything in there, it is also totally confusing to a user.
> If you can put anything in, how do you know what to get out and know that
> it is there?
>
> That leads to the third guideline:
>
> 3. Use separate meta classes only when there are several different ways
> of encoding data that is present in large numbers *and* when you are
> expecting to be assessing the data computationally rather than just
> checking if an attribute is there.
>
>
> 	-Heikki
>
>
>
> On Friday 15 December 2006 19:23, Chris Fields wrote:
>> On Dec 15, 2006, at 8:28 AM, Jason Stajich wrote:
>>> On Dec 14, 2006, at 9:21 PM, Chris Fields wrote:
>>>> On Dec 14, 2006, at 7:45 PM, David Messina wrote:
>>>>> Hey Chris,
>>>>>
>>>>> My thoughts below.
>>>>>
>>>>>> [Chris]
>>>>>> This could be used to annotate any PrimarySeq, LocatableSeq,
>>>>>> SimpleAlign, SeqFeature, or what-have-you, maybe in a collection
>>>>>> (similar to AnnotationCollection).  I thought something like this
>>>>>> may be of general use for any PrimarySeq (quality, structure),
>>>>>> alignments like NEXUS and Stockholm, SeqFeatures where structure
>>>>>> could be stored (tRNA or riboswitches), etc.
>>>>>>
>>>>>> However, this also seems to fall into the category of sequence
>>>>>> annotation.  So, would it be better to have a set of
>>>>>> Bio::Annotation classes used for this purpose?
>>>>>
>>>>> To me, all meta data is equal. That is, your classic GenBank feature
>>>>> annotation and a user's arbitrary meta-tag like "Bob thinks this is a
>>>>> kinase domain" aren't different in kind even if they are different in
>>>>> content.
>>>>>
>>>>> As resequencing projects multiply, the ability to create arbitrary
>>>>> meta tags, attach them to different types of objects, and use those
>>>>> tags to link them together will become desirable, if not essential.
>>>>>
>>>>> Keeping a common interface to all of these meta data types would be
>>>>> advantageous, plus new users won't have to determine whether they
>>>>> need to use Bio::Meta objects or Bio::Annotation objects.
>>>>>
>>>>> So I would argue for all of the meta data types to live "under one
>>>>> roof". Which roof isn't as important. Bio::Annotation, since it
>>>>> already exists for today's meta data, seems like a reasonable choice
>>>>> (assuming Annotation objects are flexible enough to be extended as
>>>>> you propose).
>>>>>
>>>>> There, and no flames or jibes even. :)
>>>>
>>>> I guess what I want to know is whether there should be a
>>>> distinction between 'normal' sequence annotation (comments,
>>>> references, and so on) and annotation that could be best described as
>>>> position-specific (like RNA or protein structural annotation).  The
>>>> current meta implementation is for sequence data only; I felt it
>>>> would be nice to have a generic implementation that would be
>>>> applicable to any object data.
>>>
>>> my stream-of-consciousness for right now:
>>>
>>> I was thinking Bio::Annotation is where this should go - that
>>> system doesn't have anything about it that makes it explicitly
>>> sequence related. What we're trying to hammer out here on the
>>> Alignment side - which fits with your RNA example - is to have
>>> features, basically SeqFeatures, associated with alignments so
>>> columns can be annotated to cover things like character sets and
>>> partitions for phylogenetic analyses.  As for data which annotates
>>> non-contiguous things like RNA stems, we may have to be more
>>> creative about that or model it with a splitLocation.
>>>
>>> So currently we've added code so that an Alignment is-a
>>> Bio::AnnotatableI and is-a Bio::FeatureHolderI to move towards this
>>> end, with the goal of being able to capture more of the data that
>>> can be represented in a NEXUS file.
>>>
>>> It feels more like a hack than an elegant Meta-data solution, but I
>>> am not totally sure what data you are thinking about handling at
>>> this point; perhaps I need to spend more time thinking about it.
>>> Or are you worried about whether the semantic mapping of the data
>>> into features or annotations is confusing users?
>>
>> Sorry in advance for the longish response here...
>>
>> My original thought was to have a generic abstract class capable of
>> positionally describing data in any other class, similar to
>> Heikki's Bio::Seq::MetaI but not constrained to sequence data only.
>> Implementing classes would be capable of having different data
>> structures based on their use (simple string, array, AoA, AoH, AoO).
>> One MetaCollection class to contain them all in a tag-like system, so
>> you could have mixed data types describe the same object.  The latter
>> Collection class is so similar to AnnotationCollection that I agree
>> Bio::Annotation would be the best place for this.
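
A rough sketch of the tag-keyed collection described above, holding mixed position-based data types for one object. This is illustrative Python, not BioPerl; the class name MetaCollection is borrowed from the paragraph, but the methods (set_meta, at) and the 1-based positions are assumptions of mine.

```python
from typing import Dict, Sequence


class MetaCollection:
    """Maps a tag to a per-position value (string, list, ...).

    Every value must be exactly as long as the annotated object, so any
    position can be looked up across all tags at once.
    """

    def __init__(self, length: int) -> None:
        self.length = length
        self._by_tag: Dict[str, Sequence] = {}

    def set_meta(self, tag: str, value: Sequence) -> None:
        if len(value) != self.length:
            raise ValueError(
                f"{tag!r} has {len(value)} positions, expected {self.length}"
            )
        self._by_tag[tag] = value

    def tags(self):
        return sorted(self._by_tag)

    def at(self, position: int) -> Dict[str, object]:
        """Return every tag's value at a 1-based position."""
        if not 1 <= position <= self.length:
            raise IndexError(f"position {position} outside 1..{self.length}")
        return {tag: value[position - 1] for tag, value in self._by_tag.items()}
```

Because strings and lists both support len() and indexing, a structure string and an array of quality scores can sit side by side under different tags without separate container classes.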
>>
>> The way I reconfigured Stockholm alignment parsing/writing is to use
>> Bio::Seq::Meta objects (which are LocatableSeq).  Each Seq::Meta is
>> capable of holding a sequence and several meta strings, stored as
>> tags or 'names'.  However, there is no Meta object for alignments
>> (for RNA/protein structure consensus and other Rfam/Pfam markup); I
>> hacked around this by using a Bio::Seq::Meta w/o a seq, but I would
>> rather have a generic Meta object independent of the sequence cruft.
>>
>> So for this partial Pfam alignment,
>>
>> Q92SV1_RHIME/122-299         LAMALNLARGI...VDADVDF..REG
>> #=GR Q92SV1_RHIME/122-299 pAS .........................
>> Q883D2_PSESM/110-290         LGLMLGLRRRL...FDGNGAV..KRS
>> Q8ZXP5_PYRAE/91-262          LALLLAPYKRI...IQYGEKM..KRG
>> #=GR Q8ZXP5_PYRAE/91-262 SS  HHHHHHHHTTH...HHHHHHX..HTT
>> #=GR Q8ZXP5_PYRAE/91-262 SA  00000000000...120030X..474
>> #=GC SS_cons                 HHHHHHHHTTH...HHHHHHH..HTT
>> #=GC SA_cons                 03002200312...1312414..676
>> #=GC seq_cons                luhhLuhsRpl...hthppth..+pG
>> //
>>
>> '#=GC' lines would be in generic meta string objects in the
>> alignment, while '#=GR' tags would be in similar meta objects in the
>> relevant sequences.  As long as both aren't AnnotatableI this isn't
>> an issue.
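
The split between '#=GR' (per-sequence) and '#=GC' (per-alignment) markup can be sketched as follows. This is a simplified Python illustration, not the Bio::AlignIO Stockholm parser: the function name is hypothetical, and '#=GF' and '#=GS' lines and the header are simply skipped.

```python
from collections import defaultdict


def split_stockholm_markup(lines):
    """Separate sequences, per-sequence markup and per-alignment markup.

    Returns (seqs, per_seq, per_aln):
      seqs    : {seq_id: residues}
      per_seq : {seq_id: {tag: markup}}   from '#=GR' lines
      per_aln : {tag: markup}             from '#=GC' lines
    Values for a repeated ID or tag are concatenated, which is how
    interleaved alignment blocks are usually joined.
    """
    seqs = defaultdict(str)
    per_seq = defaultdict(lambda: defaultdict(str))
    per_aln = defaultdict(str)
    for raw in lines:
        line = raw.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GR"):
            _, seq_id, tag, value = line.split(None, 3)
            per_seq[seq_id][tag] += value
        elif line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            per_aln[tag] += value
        elif not line.startswith("#"):
            seq_id, residues = line.split(None, 1)
            seqs[seq_id] += residues
    return (
        dict(seqs),
        {seq_id: dict(tags) for seq_id, tags in per_seq.items()},
        dict(per_aln),
    )
```

Fed the partial Pfam alignment above, the 'SS' and 'SA' strings would end up under Q8ZXP5_PYRAE/91-262, while 'SS_cons', 'SA_cons' and 'seq_cons' would be attached to the alignment as a whole.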
>>
>> Similarly, NEXUS files which contained any position-based values
>> could hold a meta string/array object in a similar tag.
>>
>> The basic scheme is:
>>                     |--String
>> Annotation::Meta----|--Array
>>                     |--HorriblyComplexDataStruct
>>
>> Then I started thinking about where this could be applied, and
>> whether a true Meta object needs to be constrained only to describing
>> position-based data.  This somewhat relates to this bug:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1825
>>
>> which seems to need a simple but unconstrained hash-of-arrays-based
>> meta object.
>>
>> Then my head appropriately exploded...
>>
>> Hope everything is going well at the hackathon!  Looks like some
>> interesting stuff coming out of it.
>>
>> chris
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> -- 
> ______ _/      _/_____________________________________________________
>       _/      _/
>      _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
>     _/_/_/_/_/  Associate Professor    skype: heikki_lehvaslaiho
>    _/  _/  _/  SANBI, South African National Bioinformatics Institute
>   _/  _/  _/  University of Western Cape, South Africa
>      _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
> ___ _/_/_/_/_/________________________________________________________