[Bioperl-l] Bio::DB::SeqFeature treamtent of tags and annotations
Cook, Malcolm
MEC at stowers-institute.org
Wed Jan 31 00:30:35 UTC 2007
Hi Lincoln,
Thanks for the resolution of tag value methods. Your fixes work in my
hands...
I never knew that "ID column is that it is supposed to be LOCAL to the
GFF3 file and is not intended to be stored in the database".
I read in http://www.sequenceontology.org/gff3.shtml that
ID Indicates the name of the feature. IDs must be unique
within the scope of the GFF file.
But that ID is solely for the purpose of linearizing the relationships
among features within the scope of the GFF text file.
Further, in that document, I see lots of examples where the ID is what
is holding together the fabric of the feature relationships; IDs appear
as the value of: Target and Parent attributes.
I'd appreciate it if you can provide a motivating example where the ID=
in a (real life) GFF3 file is in fact ONLY used for the purpose of
linearizing the feature relationships.
ANyway, If it matters, the GFF formated genome I'm wallowing in these
days from Flybase (dmel r5.1) presents their FlyBase IDs in the ID
attribute, like this:
4 FlyBase gene 24068 25621 . + .
ID=FBgn0040037;Name=CG17923;Dbxref=FlyBase:FBan0017923,FlyBase_Annotatio
n_IDs:CG17923...
4 FlyBase mRNA 24068 25621 . + .
ID=FBtr0089155;Name=CG17923-RA;Parent=FBgn0040037;Dbxref=FlyBase_Annotat
ion_IDs:CG17923-RA;
4 FlyBase exon 24068 24477 . + .
ID=CG17923:1;Name=CG17923:1;Parent=FBtr0089155
Switching to using 'Name' might work OK for my application. I'll look
into it. It winds up being the same as the ID in some cases anyway....
Malcolm Cook
________________________________
From: lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com]
On Behalf Of Lincoln Stein
Sent: Tuesday, January 30, 2007 4:46 PM
To: Cook, Malcolm
Cc: bioperl list; lstein at cshl.org
Subject: Re: Bio::DB::SeqFeature treamtent of tags and
annotations
I've fixed the first issue in CVS. Sorry for the inconsistency.
add_tag_value(), delete_tag_value() and get_Annotations() now all work
as expected.
The problem with the ID column is that it is supposed to be
LOCAL to the GFF3 file and is not intended to be stored in the database.
In contrast, Name can survive roundtripping. Perhaps the thing to do is
to add a flag to the GFF3 file that turns on ID round-tripping, e.g.
##round-trip-ids: 1
If you like this idea, I can implement it.
Lincoln
On 1/29/07, Cook, Malcolm < MEC at stowers-institute.org
<mailto:MEC at stowers-institute.org> > wrote:
Lincoln,
Thanks for your suggestions on approach to my problems
augmenting Flybase annotation. I am trying to follow them and finding
the following oddities
The first issue relates to the intermix of 'annotations'
and 'tag values'. I find that Bio::DB::SeqFeature implements some of
the 'tag' methods and some of the 'Annotation' methods. Here is a perl
one-liner that shows values stored using add_tag_value are not retreived
with get_tag_values, but rather with get_Annotations.
> perl -MBio::DB::SeqFeature -e 'my
$f=Bio::DB::SeqFeature->new; $f->add_tag_value("x",666); print
"get_tag_values:\t" . $f->get_tag_values("x") . "\nget_Annotations:\t" .
$f->get_Annotations("x");'
whose output is:
get_tag_values:
get_Annotations: 666
Tracing this shows me that this results from the fact
that:
Bio::DB::SeqFeature uses of Bio::Graphics::FeatureBase
(via Bio::DB::SeqFeature::NormalizedFeature) which does not support
-tags in ->new but rather -attributes, viz:
-attributes a hashref of tag value attributes, in
which the key is the tag
and the value is an array reference of
values
And though Bio::Graphics::FeatureBase purports to
implement Bio::SeqFeatureI, it only partially implements the 'tag'
methods (now deprecated and relegated to Bio::AnnotatableI). In
particular, the '*' methods Bio::SeqFeatureI are not implemented in
Bio::Graphics::FeatureBase
has_tag
* add_tag_value
get_tag_values
get_all_tags
* remove_tag
get_tagset_values
get_Annotations
As a result, add_tag_value and remove_tag are inherited
from different modules whose understanding of tags is not the same!
This one-liner :
>perl -MClass::ISA -MClass::Inspector
-MBio::DB::SeqFeature -e 'my @c =
Class::ISA::self_and_super_path("Bio::DB::SeqFeature"); foreach my $fn
qw(add_tag_value get_tag_values) {print "\n$fn:\t", join "\t", (grep
{Class::Inspector->function_exists($_, $fn)} @c)}'
confirms that they are defined in different packages,
namely:
add_tag_value: Bio::AnnotatableI
get_tag_values: Bio::Graphics::FeatureBase
Bio::AnnotatableI
Proposed solution... hmmmm ..... I dunno.... maybe the
following patch to Bio::Graphics::FeatureBase->add_tag_value :
sub add_tag_value {
my ($self,$tag, at vals) = @_;
push @{$self->{attributes}{$tag}}, @vals;
}
It fixes my use case for now but I'm still concerned and
confused about this variety of methods.
Suggestions?
------------------------------------------------------------------------
-
Also, I think that any "ID" in column 9 of GFF3 float
file should be preserved through a round-trip through a
Bio::DB::SeqFeature store, but this is not yet possible since any ID
attribute in GFF3 column 9 is being lost by GFF3Loader, causing me to
locally patch GFF3Loader::handle_feature method to add the following:
# mec at stowers-institute.org
<mailto:mec at stowers-institute.org> , wondering why not all attributes
are
# carried forward, adds ID tag in particular service
of
# round-tripping ID, which, though present in database
as load_id
# attribute, was getting lost as itself
$unreserved->{ID}= $reserved->{ID} if exists
$reserved->{ID};
Poised to patch.... what d'you think?
Malcolm Cook
Stowers Institute for Medical Research - Kansas City,
Missouri
________________________________
From: lincoln.stein at gmail.com [mailto:
lincoln.stein at gmail.com <mailto:lincoln.stein at gmail.com> ] On Behalf Of
Lincoln Stein
Sent: Tuesday, December 19, 2006 3:58 PM
To: Cook, Malcolm
Cc: bioperl list; lstein at cshl.org
Subject: Re: bp_seqfeature_load /
Bio::DB::SeqFeature::Store::GFF3Loader problems augmenting Flybase
annotation
Hi Malcom,
Your second guess was right. The use case of
augmenting an existing gene with additional splice forms isn't provided
for. You can get the functionality by making direct calls to
Bio::DB::SeqFeature::Store methods:
my @genes =
$db->get_features_by_name('FBgn0017545');
@genes == 1 or die "Didn't get exactly one
gene";
my $parent = $genes[0];
my $parent = $genes[0];
my $chr = $parent->seq_id;
my $start = $parent->start;
my $end = $parent->end;
my $strand = $parent->strand;
my $new_splice_form =
$db->new_feature(-primary_tag => 'mRNA',
-source => 'added',
-seq_id => '4r',
-strand => $strand,
-start => $start+10,
-end => $end,
);
$parent->add_SeqFeature($new_splice_form);
for my $pos
([$start+10,$start+100],[$start+200,$end]) {
my ($e_start,$e_end) = @$pos;
my $exon =
Bio::DB::SeqFeature->new(-primary_tag => 'exon',
-store
=> $db,
-seq_id => '4r',
-strand => $strand,
-start => $e_start,
-end => $e_end);
$new_splice_form->add_SeqFeature($exon);
}
I found a bug in updating the seqfeature
database when I wrote this script, so you'll have to get the latest
biperl live. I think you can use this to write a splice form updating
script.
In order to support the idea of adding new
splice forms to an existing gene using the GFF3 format, I will have to
either modify the loader, or write a separate script (probably better to
do the latter). It shouldn't be hard if you'd like to give it a try.
Lincoln
On 12/19/06, Cook, Malcolm
<MEC at stowers-institute.org <mailto:MEC at stowers-institute.org> > wrote:
Lincoln and fellow Bio::DB::SeqFeature
travelers,
I find that using bp_seqfeature_load.PLS
to load subfeatures of genes
already loaded using
bp_seqfeature_load.PLS fails with
------------- EXCEPTION -------------
MSG: FBgn0017545 doesn't have a primary
id
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:682
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:663
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::finish_load
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:372
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:345
STACK
Bio::DB::SeqFeature::Store::GFF3Loader::load
/home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:242
STACK toplevel
/home/mec/cvs/bioperl-live/scripts/Bio-SeqFeature-Store/bp_seqfeature_lo
ad.PLS:76
Where FBgn0017545 is the ID of a gene
previously loaded.
I am unsure how to remedy my situation
and welcome any advise on correct
or improved approach to my problem.
Here's some detail if it helps. I am
developing a pipeline to design a
microarray probes capable of
distinguishing among splice variants in
drosophila (using latest Flybase
dmel_r5.1 annotation). So I
1) load a filtered selection of Flybase
annotation using
bp_seqfeature_load. (for testing
purposes, I am using a single gene's
worth of annotation, FBgn0017545.gff,
attached). This is done as
follows:
> bp_seqfeature_load.PLS
--create FBgn0017545.gff
2) analyze all the genes in the
database, and create GFF3 output each
feature of which has a 'Parent' that is
a previously loaded gene (i.e.
FBgn0017545). (These features represent
the unique introns, splice
sites, and exonic design targets. Output
of this analysis,
FBgn0017545_matd.gff, is also attached)
3) load these analysis results into the
same database, as follows:
> bp_seqfeature_load.PLS
FBgn0017545_matd.gff
It is at this point that I get the above
error.
However, I don't get any error and the
data loads fine if I load the two
files together, as follows:
> bp_seqfeature_load.PLS
--create <(cat FBgn0017545.gff
FBgn0017545_matd.gff)
So, I suspect that either I am
misunderstanding when/how to use
bp_seqfeature_load.PLS or else this use
case has not yet arisen and must
be provided for somehow.
I am running against bioperl-live
Thanks for your thoughts and assistance,
Malcolm Cook
Database Applications Manager -
Bioinformatics
Stowers Institute for Medical Research -
Kansas City, Missouri
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
<mailto:michelse at cshl.edu>
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
More information about the Bioperl-l
mailing list