[Bioperl-l] new GFF3 support methods added
Chris Mungall
cjm at fruitfly.org
Fri Mar 5 20:52:31 EST 2004
I have committed some new stuff to bioperl-live:
the script seq/unflatten_seq will now generate GFF3 - the unflattener
module is used to build the 'feature graph' connecting genes, transcripts,
exons and CDSs together. This means we can have GFF3 for anything in
genbank!
As far as I'm aware, the only other sensible output formats to use here
(ie formats that support feature graphs/containment hierarchies) are:
chado, chaos, and the write-only asciitree.
This feature graph is written out in the GFF3 using the ID and Parent
tags. To do this there is an extra intermediate step - the bioperl
FeatureHolderI hierarchy is traversed and ID/Parent tags are generated.
Here is a description of the changes I have made:
[unless you're a bioperl hacker you don't really need to read the rest of
this]
You can get the context of what I'm on about from this thread:
http://bioperl.org/pipermail/bioperl-l/2003-December/014150.html
Two new public methods:
FeatureHolderI->set_ParentIDs_from_hierarchy
sets both ID and ParentID from FeatureHolder hierarchy
SeqFeatureI->generate_unique_persistent_id
this is required by the above method
Lincoln wanted this to be private, but I think it has
to be called from outside
FeatureHolderI->create_hierarchy_from_ParentIDs
the inverse of set_ParentIDs_from_hierarchy
(note that I have put the implementation in the interface - in the absence
of proper abstract classes, this was deemed the best thing to do in the
previous discussion on this)
Modifications:
SeqFeatureI->primary_id
This now maps to the tag_value 'ID' (ie the tag that GFF3 uses to uniquely
identify a feature).
Minor modification
Bio::Tools::GFF now allows the -noparse=>1 option
this is simply to stop the module waiting on input from STDIN
when used in write-mode (maybe there's a better way of doing this
but I didn't want to mess with this module)
Proto-test
t/FeatureHolder.x
This unflattens a genbank sequences and roundtrips it to chadoxml
via GFF3
This doesn't work yet - if you dump a splitfeature as GFF3 and
re-import it, it becomes two features. Any volunteers to help
fix this?
Unique IDs in bioperl:
In the discussion that preceeded this, it seemed that people liked the
idea of persistent unique IDs, but there was no suggestions as to how to
go about it. This is inherently difficult with objects, but I borrowed a
solution from relational modeling.
A persistent unique ID is generated using
seq_id
primary_tag
start
end
It is assumed that these are all set and comprise a "unique key" over
features. Of course, there's no way to enforce this with objects. The
generated ID is simply these values concatenated with : delimiters. You
can think of this is a skolem function if you're that way inclined.
Another assumption is that seq_id is unique and persistent.
Of course, if you're dealing with data that changes with time, then
changing the coordinates of a feature will change it's id. This is fine.
If you want to use your own IDs rather than the generated ones, you can
simply set the primary_id() field - or if you are using genbank files, add
something like this
/ID=CG12345-RA
to the feature.
Stuff still to do:
* fix GFF3 to deal with roundtripping splitlocations
* A nest_features() method, as discussed in the previous email. This is
the opposite of set_ParentIDs_from_hierarchy(), for reading in GFF3 (and
then writing to a feature-graph compatible format, or to a database such
as biosql or chado).
* Bio::SeqIO::GFF3
I know a lot of us would like this a lot - is there any plans to implement
this yet?
* A GeneModel factory
This would take the output of the unflattener (a set of feature graphs
typed to SO) and make SeqFeature::Gene objects
Cheers
Chris
More information about the Bioperl-l
mailing list