[Bioperl-l] Sequence IDs and Comment()s

Wed, 24 Oct 2001 10:46:11 -0400

> Jason Eric Stajich wrote:
> > >
> > > So... What would people think of adding a type() (or class(),
> > > category(), meta(), etc.) method to Comment to optionally qualify the
> > > contents?
> > >
> >
> > I'm wary of this because it implies we are starting to interpret the data
> > rather than just provide a mechanism for storing it and manipulating it.
> > How would this work for a GenBank -> Seq -> BSML trip and back?

This is actually working pretty well in my hands, with orphaned data
being stuffed into BSML <Attribute>s. The major loss is join
information, since BSML2.2 does not support discontinuous features (it
looks like 3.0 will).

I recognize that fundamentally the concern is data loss vs. data
corruption - simple data structures are more likely to be properly
interpreted by diverse users and programs, but are not as capable at
capturing esoteric pieces of information as a structure with more
bells and whistles on it.

Hilmar Lapp wrote:
> 
> For me this rather calls for a generic tag/value function. Hm. Do
> we really want this? In an abstract sense, this looks to me like
> structured annotation added to a Comment object. Will the
> annotation overhaul support this?

I think that would be a great idea. I considered adding a GSF that
spanned the whole sequence (or hitchhiking on 'source' if present) and
decorating it with tag/values, but worried that as a proper feature it
might be misinterpreted (and it would be visual clutter).

I suspect that the vast majority of the users would treat unstructured
information with the appropriate caution, or would ignore it
altogether. For those that wanted to dig deeper, it would provide a
place to either mine with regexps, or at least to port to their
destination document's generic data containers. Presumably this is the
fate of the current Comment.

To make this a little more concrete, I'm porting sequence data out of
a database into Bio::Seq objects, then generating BSML from the
results. In addition to description lines (which can nicely be placed
in desc) the database includes tasty tidbits like "tissue
specificity", "alternative products" and "function" tags that are not
associated with a base range. These are specific enough that they
would not warrant their own Bio structures, but are of extreme
interest to researchers. Again, I'm now dumping these as
Comment->text("function: yadda yadda..."), and since I know to look for
": ", I can split the string to separate the meta data from the
content.
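
The split described above could look roughly like the sketch below. It is
written in Python purely for illustration and is not BioPerl code;
split_comment is a hypothetical helper name, and lower-casing the tag is an
assumption made here so that "Function" and "function" compare equal.

```python
def split_comment(text, sep=": "):
    """Split a 'tag: value' comment on the FIRST separator only.

    Returns (tag, value), or (None, text) when no separator is present,
    so free-text comments pass through untouched.
    """
    tag, found, value = text.partition(sep)
    if not found:
        # No ": " in the string: treat the whole thing as unstructured text.
        return None, text
    # Assumption: tags are compared case-insensitively.
    return tag.strip().lower(), value.strip()


tag, value = split_comment("function: yadda yadda...")
```

Splitting on the first separator only keeps any later ": " inside the value,
but it cannot tell a real tag from ordinary prose that happens to contain a
colon.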

A generic (associated with a Seq rather than a Feature) tag/value
attribute should allow more reliable association of meta data with
their values. This does not eliminate parsing ambiguity, but reduces
it: a "note" tag holding "not sure this is functional" is less likely
to be confused with a "function" tag than the flat string "Note: not
sure this is functional".
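
As a rough illustration of the difference, here is a small Python sketch (not
BioPerl; the class SeqRecordSketch and every method name in it are
hypothetical) that stores the same two pieces of information both as flat
comment strings and as sequence-level tag/value pairs.

```python
from collections import defaultdict


class SeqRecordSketch:
    """Hypothetical stand-in for a sequence object; not a BioPerl class."""

    def __init__(self, seq_id):
        self.seq_id = seq_id
        self.comments = []              # flat, unstructured strings
        self.tags = defaultdict(list)   # tag -> list of values

    def add_comment(self, text):
        self.comments.append(text)

    def add_tag_value(self, tag, value):
        # Assumption: tags are stored lower-cased.
        self.tags[tag.lower()].append(value)

    def values_for(self, tag):
        # Exact match on the tag name; values are never searched.
        return list(self.tags.get(tag.lower(), []))


def comments_mentioning(record, word):
    """Naive text search over flat comments (substring, case-insensitive)."""
    return [c for c in record.comments if word.lower() in c.lower()]


rec = SeqRecordSketch("demo1")
rec.add_comment("function: yadda yadda...")
rec.add_comment("Note: not sure this is functional")
rec.add_tag_value("function", "yadda yadda...")
rec.add_tag_value("note", "not sure this is functional")

flat_hits = comments_mentioning(rec, "function")   # matches both comments
tag_hits = rec.values_for("function")              # matches only the real tag
```

The text search picks up the "Note" comment because "functional" contains
"function", while the tag lookup returns only the value stored under
"function".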

-Charles

-- 
Charles Tilford, Bioinformatics-Applied Genomics
Bristol-Myers Squibb PRI, Hopewell 3A039
P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
charles.tilford@bms.com