[Bioperl-l] new GFF3 support methods added

Fri Mar 5 20:52:31 EST 2004

I have committed some new stuff to bioperl-live:

the script seq/unflatten_seq will now generate GFF3 - the unflattener
module is used to build the 'feature graph' connecting genes, transcripts,
exons and CDSs together. This means we can have GFF3 for anything in
genbank!

As far as I'm aware, the only other sensible output formats to use here
(ie formats that support feature graphs/containment hierarchies) are:
chado, chaos, and the write-only asciitree.

This feature graph is written out in the GFF3 using the ID and Parent
tags. To do this there is an extra intermediate step - the bioperl
FeatureHolderI hierarchy is traversed and ID/Parent tags are generated.

Here is a description of the changes I have made:

[unless you're a bioperl hacker you don't really need to read the rest of
this]

You can get the context of what I'm on about from this thread:
http://bioperl.org/pipermail/bioperl-l/2003-December/014150.html

Two new public methods:

  FeatureHolderI->set_ParentIDs_from_hierarchy

    sets both ID and ParentID from FeatureHolder hierarchy

  SeqFeatureI->generate_unique_persistent_id

    this is required by the above method

    Lincoln wanted this to be private, but I think it has
    to be called from outside

  FeatureHolderI->create_hierarchy_from_ParentIDs

    the inverse of set_ParentIDs_from_hierarchy

(note that I have put the implementation in the interface - in the absence
of proper abstract classes, this was deemed the best thing to do in the
previous discussion on this)

Modifications:

  SeqFeatureI->primary_id

This now maps to the tag_value 'ID' (ie the tag that GFF3 uses to uniquely
identify a feature).

Minor modification

  Bio::Tools::GFF now allows the -noparse=>1 option

  this is simply to stop the module waiting on input from STDIN
  when used in write-mode (maybe there's a better way of doing this
  but I didn't want to mess with this module)

Proto-test

  t/FeatureHolder.x

  This unflattens a genbank sequences and roundtrips it to chadoxml
  via GFF3

  This doesn't work yet - if you dump a splitfeature as GFF3 and
  re-import it, it becomes two features. Any volunteers to help
  fix this?

Unique IDs in bioperl:

In the discussion that preceeded this, it seemed that people liked the
idea of persistent unique IDs, but there was no suggestions as to how to
go about it. This is inherently difficult with objects, but I borrowed a
solution from relational modeling.

A persistent unique ID is generated using

  seq_id
  primary_tag
  start
  end

It is assumed that these are all set and comprise a "unique key" over
features. Of course, there's no way to enforce this with objects. The
generated ID is simply these values concatenated with : delimiters. You
can think of this is a skolem function if you're that way inclined.

Another assumption is that seq_id is unique and persistent.

Of course, if you're dealing with data that changes with time, then
changing the coordinates of a feature will change it's id. This is fine.
If you want to use your own IDs rather than the generated ones, you can
simply set the primary_id() field - or if you are using genbank files, add
something like this

  /ID=CG12345-RA

to the feature.

Stuff still to do:

* fix GFF3 to deal with roundtripping splitlocations

* A nest_features() method, as discussed in the previous email. This is
the opposite of set_ParentIDs_from_hierarchy(), for reading in GFF3 (and
then writing to a feature-graph compatible format, or to a database such
as biosql or chado).

* Bio::SeqIO::GFF3

I know a lot of us would like this a lot - is there any plans to implement
this yet?

* A GeneModel factory

This would take the output of the unflattener (a set of feature graphs
typed to SO) and make SeqFeature::Gene objects

Cheers
Chris