[Bioperl-l] Bio::DB::SeqFeature treamtent of tags and annotations
Cook, Malcolm
MEC at stowers-institute.org
Mon Jan 29 21:17:24 UTC 2007
Lincoln,
Thanks for your suggestions on approach to my problems augmenting
Flybase annotation. I am trying to follow them and finding the
following oddities
The first issue relates to the intermix of 'annotations' and 'tag
values'. I find that Bio::DB::SeqFeature implements some of the 'tag'
methods and some of the 'Annotation' methods. Here is a perl one-liner
that shows values stored using add_tag_value are not retreived with
get_tag_values, but rather with get_Annotations.
> perl -MBio::DB::SeqFeature -e 'my $f=Bio::DB::SeqFeature->new;
$f->add_tag_value("x",666); print "get_tag_values:\t" .
$f->get_tag_values("x") . "\nget_Annotations:\t" .
$f->get_Annotations("x");'
whose output is:
get_tag_values:
get_Annotations: 666
Tracing this shows me that this results from the fact that:
Bio::DB::SeqFeature uses of Bio::Graphics::FeatureBase (via
Bio::DB::SeqFeature::NormalizedFeature) which does not support -tags in
->new but rather -attributes, viz:
-attributes a hashref of tag value attributes, in which the key is
the tag
and the value is an array reference of values
And though Bio::Graphics::FeatureBase purports to implement
Bio::SeqFeatureI, it only partially implements the 'tag' methods (now
deprecated and relegated to Bio::AnnotatableI). In particular, the '*'
methods Bio::SeqFeatureI are not implemented in
Bio::Graphics::FeatureBase
has_tag
* add_tag_value
get_tag_values
get_all_tags
* remove_tag
get_tagset_values
get_Annotations
As a result, add_tag_value and remove_tag are inherited from different
modules whose understanding of tags is not the same!
This one-liner :
>perl -MClass::ISA -MClass::Inspector -MBio::DB::SeqFeature -e 'my @c =
Class::ISA::self_and_super_path("Bio::DB::SeqFeature"); foreach my $fn
qw(add_tag_value get_tag_values) {print "\n$fn:\t", join "\t", (grep
{Class::Inspector->function_exists($_, $fn)} @c)}'
confirms that they are defined in different packages, namely:
add_tag_value: Bio::AnnotatableI
get_tag_values: Bio::Graphics::FeatureBase Bio::AnnotatableI
Proposed solution... hmmmm ..... I dunno.... maybe the following patch
to Bio::Graphics::FeatureBase->add_tag_value :
sub add_tag_value {
my ($self,$tag, at vals) = @_;
push @{$self->{attributes}{$tag}}, @vals;
}
It fixes my use case for now but I'm still concerned and confused about
this variety of methods.
Suggestions?
------------------------------------------------------------------------
-
Also, I think that any "ID" in column 9 of GFF3 float file should be
preserved through a round-trip through a Bio::DB::SeqFeature store, but
this is not yet possible since any ID attribute in GFF3 column 9 is
being lost by GFF3Loader, causing me to locally patch
GFF3Loader::handle_feature method to add the following:
# mec at stowers-institute.org <mailto:mec at stowers-institute.org> ,
wondering why not all attributes are
# carried forward, adds ID tag in particular service of
# round-tripping ID, which, though present in database as load_id
# attribute, was getting lost as itself
$unreserved->{ID}= $reserved->{ID} if exists $reserved->{ID};
Poised to patch.... what d'you think?
Malcolm Cook
Stowers Institute for Medical Research - Kansas City, Missouri
________________________________
From: lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com]
On Behalf Of Lincoln Stein
Sent: Tuesday, December 19, 2006 3:58 PM
To: Cook, Malcolm
Cc: bioperl list; lstein at cshl.org
Subject: Re: bp_seqfeature_load /
Bio::DB::SeqFeature::Store::GFF3Loader problems augmenting Flybase
annotation
Hi Malcom,
Your second guess was right. The use case of augmenting an
existing gene with additional splice forms isn't provided for. You can
get the functionality by making direct calls to
Bio::DB::SeqFeature::Store methods:
my @genes = $db->get_features_by_name('FBgn0017545');
@genes == 1 or die "Didn't get exactly one gene";
my $parent = $genes[0];
my $parent = $genes[0];
my $chr = $parent->seq_id;
my $start = $parent->start;
my $end = $parent->end;
my $strand = $parent->strand;
my $new_splice_form = $db->new_feature(-primary_tag => 'mRNA',
-source => 'added',
-seq_id => '4r',
-strand => $strand,
-start => $start+10,
-end => $end,
);
$parent->add_SeqFeature($new_splice_form);
for my $pos ([$start+10,$start+100],[$start+200,$end]) {
my ($e_start,$e_end) = @$pos;
my $exon = Bio::DB::SeqFeature->new(-primary_tag => 'exon',
-store => $db,
-seq_id => '4r',
-strand => $strand,
-start => $e_start,
-end => $e_end);
$new_splice_form->add_SeqFeature($exon);
}
I found a bug in updating the seqfeature database when I wrote
this script, so you'll have to get the latest biperl live. I think you
can use this to write a splice form updating script.
In order to support the idea of adding new splice forms to an
existing gene using the GFF3 format, I will have to either modify the
loader, or write a separate script (probably better to do the latter).
It shouldn't be hard if you'd like to give it a try.
Lincoln
On 12/19/06, Cook, Malcolm <MEC at stowers-institute.org
<mailto:MEC at stowers-institute.org> > wrote:
Lincoln and fellow Bio::DB::SeqFeature travelers,
I find that using bp_seqfeature_load.PLS to load
subfeatures of genes
already loaded using bp_seqfeature_load.PLS fails with
------------- EXCEPTION -------------
MSG: FBgn0017545 doesn't have a primary id
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:682
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:663
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::finish_load
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:372
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:345
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:242
STACK toplevel
/home/mec/cvs/bioperl-live/scripts/Bio-SeqFeature-Store/bp_seqfeature_lo
ad.PLS:76
Where FBgn0017545 is the ID of a gene previously loaded.
I am unsure how to remedy my situation and welcome any
advise on correct
or improved approach to my problem.
Here's some detail if it helps. I am developing a
pipeline to design a
microarray probes capable of distinguishing among splice
variants in
drosophila (using latest Flybase dmel_r5.1 annotation).
So I
1) load a filtered selection of Flybase annotation using
bp_seqfeature_load. (for testing purposes, I am using a
single gene's
worth of annotation, FBgn0017545.gff, attached). This
is done as
follows:
> bp_seqfeature_load.PLS --create
FBgn0017545.gff
2) analyze all the genes in the database, and create
GFF3 output each
feature of which has a 'Parent' that is a previously
loaded gene (i.e.
FBgn0017545). (These features represent the unique
introns, splice
sites, and exonic design targets. Output of this
analysis,
FBgn0017545_matd.gff, is also attached)
3) load these analysis results into the same database,
as follows:
> bp_seqfeature_load.PLS
FBgn0017545_matd.gff
It is at this point that I get the above error.
However, I don't get any error and the data loads fine
if I load the two
files together, as follows:
> bp_seqfeature_load.PLS --create <(cat
FBgn0017545.gff
FBgn0017545_matd.gff)
So, I suspect that either I am misunderstanding when/how
to use
bp_seqfeature_load.PLS or else this use case has not yet
arisen and must
be provided for somehow.
I am running against bioperl-live
Thanks for your thoughts and assistance,
Malcolm Cook
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City,
Missouri
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
<mailto:michelse at cshl.edu>
More information about the Bioperl-l
mailing list