[Bioperl-l] Bio::DB::SeqFeature treamtent of tags and annotations
Lincoln Stein
lstein at cshl.edu
Tue Jan 30 22:46:12 UTC 2007
I've fixed the first issue in CVS. Sorry for the inconsistency.
add_tag_value(), delete_tag_value() and get_Annotations() now all work as
expected.
The problem with the ID column is that it is supposed to be LOCAL to the
GFF3 file and is not intended to be stored in the database. In contrast,
Name can survive roundtripping. Perhaps the thing to do is to add a flag to
the GFF3 file that turns on ID round-tripping, e.g.
##round-trip-ids: 1
If you like this idea, I can implement it.
Lincoln
On 1/29/07, Cook, Malcolm <MEC at stowers-institute.org> wrote:
>
> Lincoln,
>
> Thanks for your suggestions on approach to my problems augmenting Flybase
> annotation. I am trying to follow them and finding the following oddities
>
> The first issue relates to the intermix of 'annotations' and 'tag
> values'. I find that Bio::DB::SeqFeature implements some of the 'tag'
> methods and some of the 'Annotation' methods. Here is a perl one-liner that
> shows values stored using add_tag_value are not retreived with
> get_tag_values, but rather with get_Annotations.
>
> > perl -MBio::DB::SeqFeature -e 'my $f=Bio::DB::SeqFeature->new;
> $f->add_tag_value("x",666); print "get_tag_values:\t" .
> $f->get_tag_values("x") . "\nget_Annotations:\t" .
> $f->get_Annotations("x");'
>
> whose output is:
> get_tag_values:
> get_Annotations: 666
>
> Tracing this shows me that this results from the fact that:
>
> Bio::DB::SeqFeature uses of Bio::Graphics::FeatureBase (via
> Bio::DB::SeqFeature::NormalizedFeature) which does not support -tags in
> ->new but rather -attributes, viz:
>
> -attributes a hashref of tag value attributes, in which the key is
> the tag
> and the value is an array reference of values
>
> And though Bio::Graphics::FeatureBase purports to implement
> Bio::SeqFeatureI, it only partially implements the 'tag' methods (now
> deprecated and relegated to Bio::AnnotatableI). In particular, the '*'
> methods Bio::SeqFeatureI are not implemented in Bio::Graphics::FeatureBase
>
>
> has_tag
> * add_tag_value
> get_tag_values
> get_all_tags
> * remove_tag
> get_tagset_values
> get_Annotations
>
> As a result, add_tag_value and remove_tag are inherited from different
> modules whose understanding of tags is not the same!
>
> This one-liner :
>
> >perl -MClass::ISA -MClass::Inspector -MBio::DB::SeqFeature -e 'my @c =
> Class::ISA::self_and_super_path("Bio::DB::SeqFeature"); foreach my $fn
> qw(add_tag_value get_tag_values) {print "\n$fn:\t", join "\t", (grep
> {Class::Inspector->function_exists($_, $fn)} @c)}'
>
> confirms that they are defined in different packages, namely:
>
> add_tag_value: Bio::AnnotatableI
> get_tag_values: Bio::Graphics::FeatureBase Bio::AnnotatableI
> Proposed solution... hmmmm ..... I dunno.... maybe the following patch
> to Bio::Graphics::FeatureBase->add_tag_value :
>
> sub add_tag_value {
> my ($self,$tag, at vals) = @_;
> push @{$self->{attributes}{$tag}}, @vals;
> }
>
> It fixes my use case for now but I'm still concerned and confused about
> this variety of methods.
>
> Suggestions?
>
>
> -------------------------------------------------------------------------
>
> Also, I think that any "ID" in column 9 of GFF3 float file should be
> preserved through a round-trip through a Bio::DB::SeqFeature store, but this
> is not yet possible since any ID attribute in GFF3 column 9 is being lost
> by GFF3Loader, causing me to locally patch GFF3Loader::handle_feature
> method to add the following:
>
> # mec at stowers-institute.org, wondering why not all attributes are
> # carried forward, adds ID tag in particular service of
> # round-tripping ID, which, though present in database as load_id
> # attribute, was getting lost as itself
> $unreserved->{ID}= $reserved->{ID} if exists $reserved->{ID};
>
> Poised to patch.... what d'you think?
>
> Malcolm Cook
> Stowers Institute for Medical Research - Kansas City, Missouri
>
>
>
> ------------------------------
> From: lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com] On Behalf
> Of Lincoln Stein
> Sent: Tuesday, December 19, 2006 3:58 PM
> To: Cook, Malcolm
> Cc: bioperl list; lstein at cshl.org
> Subject: Re: bp_seqfeature_load / Bio::DB::SeqFeature::Store::GFF3Loader
> problems augmenting Flybase annotation
>
> Hi Malcom,
>
> Your second guess was right. The use case of augmenting an existing gene
> with additional splice forms isn't provided for. You can get the
> functionality by making direct calls to Bio::DB::SeqFeature::Store methods:
>
> my @genes = $db->get_features_by_name('FBgn0017545');
> @genes == 1 or die "Didn't get exactly one gene";
> my $parent = $genes[0];
>
> my $parent = $genes[0];
> my $chr = $parent->seq_id;
> my $start = $parent->start;
> my $end = $parent->end;
> my $strand = $parent->strand;
>
> my $new_splice_form = $db->new_feature(-primary_tag => 'mRNA',
> -source => 'added',
> -seq_id => '4r',
> -strand => $strand,
> -start => $start+10,
> -end => $end,
> );
> $parent->add_SeqFeature($new_splice_form);
>
> for my $pos ([$start+10,$start+100],[$start+200,$end]) {
> my ($e_start,$e_end) = @$pos;
> my $exon = Bio::DB::SeqFeature->new(-primary_tag => 'exon',
> -store => $db,
> -seq_id => '4r',
> -strand => $strand,
> -start => $e_start,
> -end => $e_end);
> $new_splice_form->add_SeqFeature($exon);
> }
>
> I found a bug in updating the seqfeature database when I wrote this
> script, so you'll have to get the latest biperl live. I think you can use
> this to write a splice form updating script.
>
> In order to support the idea of adding new splice forms to an existing
> gene using the GFF3 format, I will have to either modify the loader, or
> write a separate script (probably better to do the latter). It shouldn't be
> hard if you'd like to give it a try.
>
> Lincoln
>
> On 12/19/06, Cook, Malcolm <MEC at stowers-institute.org> wrote:
> >
> > Lincoln and fellow Bio::DB::SeqFeature travelers,
> >
> > I find that using bp_seqfeature_load.PLS to load subfeatures of genes
> > already loaded using bp_seqfeature_load.PLS fails with
> >
> > ------------- EXCEPTION -------------
> > MSG: FBgn0017545 doesn't have a primary id
> > STACK
> > Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables
> > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:682
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree
> > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:663
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::finish_load
> > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:372
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
> > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:345
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
> > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:242
> > STACK toplevel
> > /home/mec/cvs/bioperl-live/scripts/Bio-SeqFeature-Store/bp_seqfeature_lo
> >
> > ad.PLS:76
> >
> > Where FBgn0017545 is the ID of a gene previously loaded.
> >
> > I am unsure how to remedy my situation and welcome any advise on correct
> > or improved approach to my problem.
> >
> > Here's some detail if it helps. I am developing a pipeline to design a
> > microarray probes capable of distinguishing among splice variants in
> > drosophila (using latest Flybase dmel_r5.1 annotation). So I
> >
> > 1) load a filtered selection of Flybase annotation using
> > bp_seqfeature_load. (for testing purposes, I am using a single gene's
> > worth of annotation, FBgn0017545.gff, attached). This is done as
> > follows:
> >
> > > bp_seqfeature_load.PLS --create FBgn0017545.gff
> >
> > 2) analyze all the genes in the database, and create GFF3 output each
> > feature of which has a 'Parent' that is a previously loaded gene (i.e.
> > FBgn0017545). (These features represent the unique introns, splice
> > sites, and exonic design targets. Output of this analysis,
> > FBgn0017545_matd.gff, is also attached)
> >
> > 3) load these analysis results into the same database, as follows:
> >
> > > bp_seqfeature_load.PLS FBgn0017545_matd.gff
> >
> > It is at this point that I get the above error.
> >
> > However, I don't get any error and the data loads fine if I load the two
> > files together, as follows:
> >
> > > bp_seqfeature_load.PLS --create <(cat FBgn0017545.gff
> > FBgn0017545_matd.gff)
> >
> > So, I suspect that either I am misunderstanding when/how to use
> > bp_seqfeature_load.PLS or else this use case has not yet arisen and must
> >
> > be provided for somehow.
> >
> > I am running against bioperl-live
> >
> > Thanks for your thoughts and assistance,
> >
> > Malcolm Cook
> > Database Applications Manager - Bioinformatics
> > Stowers Institute for Medical Research - Kansas City, Missouri
> >
> >
>
>
> --
> Lincoln D. Stein
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> (516) 367-8380 (voice)
> (516) 367-8389 (fax)
> FOR URGENT MESSAGES & SCHEDULING,
> PLEASE CONTACT MY ASSISTANT,
> SANDRA MICHELSEN, AT michelse at cshl.edu
>
>
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
More information about the Bioperl-l
mailing list