[Bioperl-l] proposed additions to SeqFeatureI, RangeI and FeatureHolderI

Chris Mungall cjm at fruitfly.org
Wed Nov 19 21:47:13 EST 2003


I have some proposed changes I would like to commit to bioperl, mostly
for using GFF3.

In both SeqFeatureI and SeqFeature::Generic I would like to add some
accessor methods. They would all map to tag-values.

  ID         - synonym for tag_value('ID')[0]
  ParentIDs  - synonym for tag_value('Parent')

and also

  add_ParentID
  remove_ParentID
  remove_ParentIDs
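
As a rough sketch of the intended semantics (in Python rather than Perl,
purely for illustration): the class name Feature and its "tags" dict are
hypothetical stand-ins for a feature and its tag-values, not bioperl API.
Each accessor simply delegates to the tag store.

```python
class Feature:
    """Hypothetical stand-in for a feature; not bioperl API."""

    def __init__(self):
        # tag name -> list of values, mirroring tag/value pairs
        self.tags = {}

    def ID(self):
        # First value of the 'ID' tag, or None if the tag is absent
        values = self.tags.get("ID", [])
        return values[0] if values else None

    def ParentIDs(self):
        # All values of the 'Parent' tag (a copy, possibly empty)
        return list(self.tags.get("Parent", []))

    def add_ParentID(self, parent_id):
        self.tags.setdefault("Parent", []).append(parent_id)

    def remove_ParentID(self, parent_id):
        # Drop one specific parent id; remove the tag once it is empty
        remaining = [p for p in self.tags.get("Parent", []) if p != parent_id]
        if remaining:
            self.tags["Parent"] = remaining
        else:
            self.tags.pop("Parent", None)

    def remove_ParentIDs(self):
        # Drop every parent id and return the removed values
        return self.tags.pop("Parent", [])
```

Whether remove_ParentIDs should return the removed values is an
assumption here; the proposal itself does not say.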

Question - should the method be Parent or ParentID? In GFF3, the tag
is "Parent". But an accessor method called "Parents()" feels like it
should return objects, so I think ParentIDs() is better.

Also, I realise it's contrary to bioperl convention to have method
names in caps, but it's nice to be consistent with the GFF3 tags.

I also notice that in SeqFeatureI we have an accessor declaration and
implementation for "primary_id", but its semantics are not defined
anywhere.

I propose either eliminating this, or making it a synonym of ID().

I think we need clearly defined semantics for these fields. I think
the semantics should be such that the ID should uniquely identify the
feature. This is problematic, as most sources don't issue a unique
accession or identifier for features. For example, genbank files
provide a /gene for a lot of features, but this isn't even unique,
e.g. with multicopy genes. In cases where the data source does not
provide a unique ID, we may want a way to generate them. So I think
there should also be a method:

  generateID()

which sets the ID field to something that's guaranteed unique. I'm not
sure how. Perhaps a combination of the timestamp and the object memory
reference?
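
A minimal sketch of the timestamp-plus-memory-reference idea, in Python
for illustration only. The function name generate_id and the extra
per-process counter are assumptions, not part of the proposal: the
counter is added because a timestamp and an address alone could collide
if an object is freed and another is allocated at the same address
within one clock tick.

```python
import itertools
import time

# Per-process counter; an addition to the timestamp + reference idea
_counter = itertools.count()


def generate_id(feature):
    """Build an id from the time, the object's memory reference and a counter."""
    # time.time_ns(): nanoseconds since the epoch
    # id(feature):    the object's identity (its address in CPython)
    return "%d-%x-%d" % (time.time_ns(), id(feature), next(_counter))
```

This is only unique within one running process; ids that must stay
unique across processes or machines would need something more, such as
a host name or a UUID.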

Because I'm lazy I'd rather do all this in SeqFeatureI - it all
delegates to existing methods. But I am unsure as to bioperl
conventions regarding when an 'interface' has implementation code.

----

I also want to add some code to FeatureHolderI, for dealing with the
"nesting hierarchy" in bioperl, i.e. features that contain other
features.

The methods are:

  nest_features()

creates a feature nesting hierarchy based on the "ID" and "Parent"
tags. This is useful when parsing GFF3.
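
Roughly, the nesting step could look like the following Python sketch.
Plain dicts stand in for features, and the "children" key is a
hypothetical stand-in for subfeatures; none of this is bioperl API.
Treating a feature whose Parent id cannot be resolved as top-level is
also an assumption.

```python
def nest_features(features):
    """Nest features using their 'ID' and 'Parent' tags.

    features: list of dicts with 'ID' (str or None) and 'Parent' (list of str).
    Returns the top-level features; children are stored under 'children'.
    """
    # Index every feature that has an ID so parents can be looked up
    by_id = {f["ID"]: f for f in features if f.get("ID") is not None}
    for f in features:
        f.setdefault("children", [])

    top = []
    for f in features:
        parents = [by_id[p] for p in f.get("Parent", []) if p in by_id]
        if parents:
            # A feature may have several parents, e.g. an exon shared
            # by two transcripts; it is attached under each of them
            for p in parents:
                p["children"].append(f)
        else:
            top.append(f)
    return top
```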

Also:

  flatten_features()

for flattening the nesting hierarchy (so top_SeqFeatures and
get_SeqFeatures return the same thing).
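
Flattening can be sketched as a depth-first walk that collects every
feature into one list and detaches its subfeatures. This is Python for
illustration only; dicts with a hypothetical "children" key stand in
for features, and the parent-before-child output order is an
assumption.

```python
def flatten_features(top_features):
    """Return all features as one flat list, clearing the nesting."""
    flat = []
    seen = set()  # guards against a feature reachable via two parents
    stack = list(reversed(top_features))
    while stack:
        f = stack.pop()
        if id(f) in seen:
            continue
        seen.add(id(f))
        flat.append(f)
        children = f.get("children", [])
        f["children"] = []  # the feature no longer holds subfeatures
        # Push in reverse so children come out in their original order
        stack.extend(reversed(children))
    return flat
```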

Also:

  set_ParentIDs_from_hierarchy()

This will go through the FeatureHolder hierarchy; any time it sees a
feature with subfeatures, it will set the children's "Parent" tag
according to the "ID" tag of the parent. If the parent does not have
an ID, one will be generated.
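
A sketch of that walk, in Python for illustration only. Dicts with
hypothetical "ID", "Parent" and "children" keys stand in for features,
and the "auto<N>" id scheme is just a placeholder for whatever id
generation is chosen. Appending to an existing Parent list, rather than
overwriting it, is an assumption.

```python
import itertools

# Placeholder id source; a real implementation would need a better scheme
_auto_ids = itertools.count(1)


def set_parent_ids_from_hierarchy(features):
    """Set each child's 'Parent' tag from the 'ID' of its containing feature."""
    for parent in features:
        children = parent.get("children", [])
        if not children:
            continue  # leaf features do not need an ID of their own
        if parent.get("ID") is None:
            parent["ID"] = "auto%d" % next(_auto_ids)
        for child in children:
            parent_ids = child.setdefault("Parent", [])
            if parent["ID"] not in parent_ids:
                parent_ids.append(parent["ID"])
        # Recurse so deeper levels are tagged as well
        set_parent_ids_from_hierarchy(children)
```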

I particularly want this so I can take genbank files, feed them
through Bio::SeqFeature::Tools::Unflattener, call this method, then
dump the results as GFF3.

The one reservation I have about this is that there would then be two
(easily interchangeable) ways of dealing with hierarchies in bioperl:
nested subfeatures and ID/Parent tags. The
alternative is to do this conversion on the fly in the GFF3
adapter. But this messes with people who want to get/set ID and Parent
tags explicitly.

----

And nothing to do with the above code, I would like to add methods to
RangeI for interbase coordinates. Love em or hate em, these methods
will make some people's code easier at no cost to bioperl.

First the interbase equivalent of start/end:

  istart
  iend

Of course, iend is just a synonym for end, but it's nice for
completeness.

This is the equivalent of chado fmin/fmax.

I would also like:

  ifrom
  ito

For interbase directional coordinates. This is equivalent to
istart,iend in the + strand, and the reverse of this in the - strand.
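
The arithmetic, as a small Python sketch for illustration only. The
function name interbase is hypothetical, start/end are assumed to be
the usual 1-based inclusive coordinates, and treating strand 0 like
the + strand is an assumption.

```python
def interbase(start, end, strand=1):
    """Convert 1-based inclusive start/end to interbase coordinates."""
    istart = start - 1  # interbase positions count the gaps between bases
    iend = end          # same number as the 1-based inclusive end
    if strand < 0:
        # On the - strand the directional pair is reversed
        ifrom, ito = iend, istart
    else:
        ifrom, ito = istart, iend
    return {"istart": istart, "iend": iend, "ifrom": ifrom, "ito": ito}
```

For a feature at 10..20, istart is 9 and iend is 20, so the length is
simply iend - istart = 11; on the - strand ifrom is 20 and ito is 9.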

Let me know if there are any objections, otherwise I'll commit sometime
next week.




