[Bioperl-l] Proposal for Meta data
Chris Mungall
cjm at fruitfly.org
Mon Dec 18 21:50:00 UTC 2006
I agree with everything Heikki is saying; I just wanted to highlight
one paragraph:
> The problem I see with undefined, totally open meta annotation, is
> that if you can put anything in there, it is also totally confusing
> to a user. If you can put anything in, how do you know what to get
> out and know that it is there?
One solution is to give your annotation/metadata-model formal
computational semantics and use ontologies to give additional
semantics to your metadata tags. This provides both user information
in the form of documentation, and a means of specifying to the
computer exactly what should be done with the tags.
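
To make that concrete, here is a minimal sketch of the idea, written in Python only to keep it short. It is not BioPerl code and it does not use a real ontology or RDF library: a plain dictionary stands in for the controlled vocabulary, and the names TagDef, VOCAB and validate are all hypothetical. Each tag carries a description (documentation for the user) and a type plus a per-position flag (instructions for the computer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagDef:
    """One entry in a hypothetical controlled vocabulary of metadata tags."""
    name: str
    description: str     # human-readable documentation for users
    value_type: type     # machine-checkable type of the value
    per_position: bool   # True if the value must have one item per position


# Hypothetical vocabulary: "SS" is borrowed from the Stockholm markup
# quoted later in this thread, "comment" is made up for contrast.
VOCAB = {
    "SS": TagDef("SS", "secondary structure, one character per residue", str, True),
    "comment": TagDef("comment", "free-text note about the object", str, False),
}


def validate(tag, value, length):
    """Return value unchanged if it conforms to the tag's definition, else raise."""
    if tag not in VOCAB:
        raise KeyError(f"unknown metadata tag: {tag}")
    definition = VOCAB[tag]
    if not isinstance(value, definition.value_type):
        raise TypeError(f"{tag} expects {definition.value_type.__name__}")
    if definition.per_position and len(value) != length:
        raise ValueError(f"{tag} needs {length} positions, got {len(value)}")
    return value
```

With something like this, an unknown tag or a per-position string of the wrong length is rejected when it is stored, so a user can look up which tags exist and what they mean instead of guessing.
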
This is probably overkill for bioperl; but if the use cases being
proposed do lean in the direction of a new metadata system that is
not necessarily backwards compatible with the existing one, then I'd
recommend checking out what's already out there before re-inventing
the wheel. Perl RDF libraries are getting a little better.
If anyone is interested in pursuing this sort of thing (probably on a
branch), let me know.
On Dec 18, 2006, at 5:51 AM, Heikki Lehvaslaiho wrote:
>
> Reading the discussion, I think it is time to draw some guidelines.
>
> 1. Base the Meta implementation on real use cases.
>
> MSA is a good example.
>
> 2. Allow generalisations
>
> If you can see another implementation of the same idea that can be
> merged with the first, do it, but do not hurt yourself if you cannot.
>
>
> The most difficult question is how to separate case-specific
> attributes that are best implemented by subclassing with additional
> methods from truly widely variable meta data that is best done as a
> parallel track meta information holding class.
>
> The problem I see with undefined, totally open meta annotation, is
> that if you can put anything in there, it is also totally confusing
> to a user. If you can put anything in, how do you know what to get
> out and know that it is there?
>
> That leads to the third guideline:
>
> 3. Use separate meta classes only when there are several different
> ways of encoding data that is present in large numbers *and* when
> you are expecting to be assessing the data computationally rather
> than just checking if an attribute is there.
>
>
> -Heikki
>
>
>
> On Friday 15 December 2006 19:23, Chris Fields wrote:
>> On Dec 15, 2006, at 8:28 AM, Jason Stajich wrote:
>>> On Dec 14, 2006, at 9:21 PM, Chris Fields wrote:
>>>> On Dec 14, 2006, at 7:45 PM, David Messina wrote:
>>>>> Hey Chris,
>>>>>
>>>>> My thoughts below.
>>>>>
>>>>>> [Chris]
>>>>>> This could be used to annotate any PrimarySeq, LocatableSeq,
>>>>>> SimpleAlign, SeqFeature, or what-have-you, maybe in a
>>>>>> collection (similar to AnnotationCollection). I thought
>>>>>> something like this may be of general use for any PrimarySeq
>>>>>> (quality, structure), alignments like NEXUS and Stockholm,
>>>>>> SeqFeatures where structure could be stored (tRNA or
>>>>>> riboswitches), etc.
>>>>>>
>>>>>> However, this also seems to fall into the category of sequence
>>>>>> annotation. So, would it be better to have a set of
>>>>>> Bio::Annotation classes used for this purpose?
>>>>>
>>>>> To me, all meta data is equal. That is, your classic Genbank
>>>>> feature annotation and a user's arbitrary meta-tag like "Bob
>>>>> thinks this is a kinase domain" aren't different in kind even if
>>>>> they are different in content.
>>>>>
>>>>> As resequencing projects multiply, the ability to create arbitrary
>>>>> meta tags, attach them to different types of objects, and use
>>>>> those tags to link them together will become desirable, if not
>>>>> essential.
>>>>>
>>>>> Keeping a common interface to all of these meta data types would
>>>>> be advantageous, plus new users won't have to determine whether
>>>>> they need to use Bio::Meta objects or Bio::Annotation objects.
>>>>>
>>>>> So I would argue for all of the meta data types to live "under one
>>>>> roof". Which roof isn't as important. Bio::Annotation, since it
>>>>> already exists for today's meta data, seems like a reasonable
>>>>> choice (assuming Annotation objects are flexible enough to be
>>>>> extended as you propose).
>>>>>
>>>>> There, and no flames or jibes even. :)
>>>>
>>>> I guess what I want to know is whether there should be a
>>>> distinction between 'normal' sequence annotation (comments,
>>>> references, and so on) and annotation that could be best
>>>> described as position-specific (like RNA or protein structural
>>>> annotation). The current meta implementation is for sequence
>>>> data only; I felt it would be nice to have a generic
>>>> implementation that would be applicable to any object data.
>>>
>>> my stream-of-consciousness for right now:
>>>
>>> I was thinking Bio::Annotation is where this should go - that
>>> system doesn't have anything about it that makes it explicitly
>>> sequence related. What we're trying to hammer out here on the
>>> Alignment side - which fits with your RNA example - is to have
>>> features, basically SeqFeatures, associated with alignments so
>>> columns can be annotated to cover things like character sets and
>>> partitions for phylogenetic analyses. As for data which annotates
>>> non-contiguous things like RNA stems, we may have to be more
>>> creative about that or model it with a splitLocation.
>>>
>>> So currently we've added code so that an Alignment is-a
>>> Bio::AnnotatableI and is-a Bio::FeatureHolderI to move towards this
>>> end, with the goal of being able to capture more of the data that
>>> can be represented in a NEXUS file.
>>>
>>> It feels more like a hack than an elegant Meta-data solution, but I
>>> am not totally sure what data you are thinking about at this
>>> point; perhaps I need to spend more time thinking about it.
>>> Or are you worried that the semantic mapping of the data into
>>> features or annotations is confusing users?
>>
>> Sorry in advance for the longish response here...
>>
>> My original thought was to have a generic abstract class capable of
>> positionally describing data in any other class, similar to
>> Heikki's Bio::Seq::MetaI but not constrained to sequence data only.
>> Implementing classes would be capable of having different data
>> structures based on their use (simple string, array, AoA, AoH, AoO).
>> One MetaCollection class to contain them all in a tag-like system, so
>> you could have mixed data types describe the same object. The latter
>> Collection class is so similar to AnnotationCollection that I agree
>> Bio::Annotation would be the best place for this.
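
A minimal sketch of the tag-keyed collection described above, in Python for brevity. The class name MetaCollection comes from the paragraph, but the methods (add, get, tags) and the example tags are assumptions made for illustration; this is not the interface of BioPerl's AnnotationCollection.

```python
from collections import defaultdict


class MetaCollection:
    """Hypothetical tag-keyed container that accepts mixed data structures."""

    def __init__(self):
        # Each tag maps to a list, so one tag may hold several meta items.
        self._by_tag = defaultdict(list)

    def add(self, tag, meta):
        """Store a meta item (string, list, list of dicts, ...) under a tag."""
        self._by_tag[tag].append(meta)

    def get(self, tag):
        """Return a copy of the items stored under tag (empty if none)."""
        return list(self._by_tag.get(tag, []))

    def tags(self):
        """Return all tags that currently hold at least one item."""
        return sorted(self._by_tag)


# Mixed data types describing the same (imaginary) five-residue object.
meta = MetaCollection()
meta.add("structure", "HHHTT")             # simple string, one char per position
meta.add("quality", [30, 31, 29, 40, 12])  # array, one value per position
```

The container itself does not care what shape each item has, which is the flexibility asked for here, and also the reason the "how do you know what to get out" concern applies to it.
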
>>
>> The way I reconfigured Stockholm alignment parsing/writing is to use
>> Bio::Seq::Meta objects (which are LocatableSeq). Each Seq::Meta is
>> capable of holding a sequence and several meta strings, stored as
>> tags or 'names'. However, there is no Meta object for alignments
>> (for RNA/protein structure consensus and other Rfam/Pfam markup); I
>> hacked around this by using a Bio::Seq::Meta w/o a seq, but I would
>> rather have a generic Meta object independent of the sequence cruft.
>>
>> So for this partial Pfam alignment,
>>
>> Q92SV1_RHIME/122-299           LAMALNLARGI...VDADVDF..REG
>> #=GR Q92SV1_RHIME/122-299 pAS  .........................
>> Q883D2_PSESM/110-290           LGLMLGLRRRL...FDGNGAV..KRS
>> Q8ZXP5_PYRAE/91-262            LALLLAPYKRI...IQYGEKM..KRG
>> #=GR Q8ZXP5_PYRAE/91-262 SS    HHHHHHHHTTH...HHHHHHX..HTT
>> #=GR Q8ZXP5_PYRAE/91-262 SA    00000000000...120030X..474
>> #=GC SS_cons                   HHHHHHHHTTH...HHHHHHH..HTT
>> #=GC SA_cons                   03002200312...1312414..676
>> #=GC seq_cons                  luhhLuhsRpl...hthppth..+pG
>> //
>>
>> '#=GC' lines would be in generic meta string objects in the
>> alignment, while '#=GR' tags would be in similar meta objects in the
>> relevant sequences. As long as both aren't AnnotatableI this isn't
>> an issue.
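
The split between '#=GC' and '#=GR' lines can be sketched as follows, in Python for brevity. This is not the Bio::AlignIO Stockholm parser; the function name and the plain dictionaries it returns are assumptions, and only the three line types shown in the example above are handled.

```python
def split_stockholm_markup(lines):
    """Separate sequence rows, per-sequence (#=GR) and per-alignment (#=GC) markup."""
    seqs, per_seq, per_aln = {}, {}, {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "//":
            continue  # skip blank lines and the end-of-alignment marker
        if fields[0] == "#=GC" and len(fields) == 3:
            # #=GC <tag> <markup>: belongs to the alignment as a whole
            tag, markup = fields[1], fields[2]
            per_aln[tag] = per_aln.get(tag, "") + markup
        elif fields[0] == "#=GR" and len(fields) == 4:
            # #=GR <seqname> <tag> <markup>: belongs to one sequence
            name, tag, markup = fields[1], fields[2], fields[3]
            tags = per_seq.setdefault(name, {})
            tags[tag] = tags.get(tag, "") + markup
        elif not fields[0].startswith("#") and len(fields) == 2:
            # <seqname> <aligned sequence>
            name, residues = fields
            seqs[name] = seqs.get(name, "") + residues
    return seqs, per_seq, per_aln


# A few lines taken from the partial Pfam alignment quoted in this message.
SAMPLE = [
    "Q8ZXP5_PYRAE/91-262 LALLLAPYKRI...IQYGEKM..KRG",
    "#=GR Q8ZXP5_PYRAE/91-262 SS HHHHHHHHTTH...HHHHHHX..HTT",
    "#=GC SS_cons HHHHHHHHTTH...HHHHHHH..HTT",
    "//",
]
seqs, per_seq, per_aln = split_stockholm_markup(SAMPLE)
```

Here per_aln would feed the alignment-level meta objects and per_seq the meta objects attached to each sequence; markup for the same tag is concatenated so that interleaved blocks end up as one string.
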
>>
>> Similarly, NEXUS files which contained any position-based values
>> could hold a meta string/array object in a similar tag.
>>
>> The basic scheme is:
>>
>>                     |--String
>> Annotation::Meta----|--Array
>>                     |--HorriblyComplexDataStruct
>>
>> Then I started thinking about where this could be applied, and
>> whether a true Meta object needs to be constrained only to describing
>> position-based data. This somewhat relates to this bug:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1825
>>
>> which seems to need a simple but unconstrained hash-of-arrays-based
>> meta object.
>>
>> Then my head appropriately exploded...
>>
>> Hope everything is going well at the hackathon! Looks like some
>> interesting stuff coming out of it.
>>
>> chris
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> --
> ______ _/ _/_____________________________________________________
> _/ _/
> _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za
> _/_/_/_/_/ Associate Professor skype: heikki_lehvaslaiho
> _/ _/ _/ SANBI, South African National Bioinformatics Institute
> _/ _/ _/ University of Western Cape, South Africa
> _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512
> ___ _/_/_/_/_/________________________________________________________