[Bioperl-l] Handling discontiguous feature locations in Bio::DB::SeqFeature::Store -- proposed patch to Bio::Graphics::FeatureBase

Cook, Malcolm MEC at stowers-institute.org
Thu May 3 20:19:00 UTC 2007


Lincoln,
 
Ah, yes, round-tripping GFF, the holy grail....
 
Unfortunately, I don't really have a baseline to go against for an
example that roundtrips successfully now.  Do you?
 
For example, after loading test data: 
 
> bp_seqfeature_load.PLS  bioperl-live/t/data/biodbgff/test.gff3
 
the Contig1 portion of which looks like this:
 
##gff-version 3
## sequence-region Contig1 1 37450
Contig1 confirmed transcript 1001 2000 42 + .
ID=Transcript:trans-1;Gene=abc-1;Gene=xyz-2;Note=function+unknown
Contig1 confirmed exon 1001 1100 . + . ID=Transcript:trans-1
Contig1 confirmed exon 1201 1300 . + . ID=Transcript:trans-1
Contig1 confirmed exon 1401 1450 . + . ID=Transcript:trans-1
Contig1 confirmed CDS 1051 1100 . + 0 ID=Transcript:trans-1
Contig1 confirmed CDS 1201 1300 . + 2 ID=Transcript:trans-1
Contig1 confirmed CDS 1401 1440 . + 0 ID=Transcript:trans-1
Contig1 est similarity 1001 1100 96 . . Target=EST:CEESC13F 1 100 +
Contig1 est similarity 1201 1300 99 . . Target=EST:CEESC13F 101 200 +
Contig1 est similarity 1401 1450 99 . . Target=EST:CEESC13F 201 250 +
Contig1 tc1 transposon 5001 6000 . + . ID=Transposon:c128.1
Contig1 tc1 transposon 8001 9000 . - . ID=Transposon:c128.2
Contig1 confirmed transcript 30001 31000 . - .
ID=Transcript:trans-2;Gene=xyz-2;Note=Terribly+interesting
Contig1 confirmed exon 30001 30100 . - .
ID=Transcript:trans-2;Gene=abc-1;Note=function+unknown
Contig1 confirmed exon 30701 30800 . - . ID=Transcript:trans-2
Contig1 confirmed exon 30801 31000 . - . ID=Transcript:trans-2
 
 
and then generating output with
 
>bp_seqfeature_gff3.PLS --gff=1 -- seq_id Contig1  #  using a script I
just committed - I hope you like it.  Note: gff=1 => recurse 
 
we get output gff with problems such as:
 
    1 IDs get turned into Aliases
    2 the seqid of a Target attributes gets copied into the features
Name attribute
    3 supression of parents of homogeneous subfeatures doesn't work when
the parent has other subfeatures that those with its same type (i.e. the
transcript feature also has exon subfeatures)
 
look:
 
Contig1 est similarity 1001 1100 96 . .
Name=EST:CEESC13F;ID=3;Target=EST:CEESC13F 1 100 +
Contig1 est similarity 1201 1300 99 . .
Name=EST:CEESC13F;ID=4;Target=EST:CEESC13F 101 200 +
Contig1 est similarity 1401 1450 99 . .
Name=EST:CEESC13F;ID=5;Target=EST:CEESC13F 201 250 +
Contig1 confirmed transcript 1001 2000 42 + .
ID=2;Alias=Transcript:trans-1;Gene=abc-1,xyz-2;Note=function+unknown
Contig1 confirmed transcript 1001 2000 42 + .
Parent=2;Alias=Transcript:trans-1;Note=function+unknown;Gene=abc-1,xyz-2
Contig1 confirmed exon 1001 1100 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed exon 1201 1300 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed exon 1401 1450 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1051 1100 . + 0 Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1201 1300 . + 2 Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1401 1440 . + 0 Parent=2;Alias=Transcript:trans-1
Contig1 tc1 transposon 5001 6000 . + . ID=6;Alias=Transposon:c128.1
Contig1 tc1 transposon 8001 9000 . - . ID=7;Alias=Transposon:c128.2
Contig1 confirmed transcript 30001 31000 . - .
ID=9;Alias=Transcript:trans-2;Gene=xyz-2;Note=Terribly+interesting
Contig1 confirmed transcript 30001 31000 . - .
Parent=9;Alias=Transcript:trans-2;Note=Terribly+interesting;Gene=xyz-2
Contig1 confirmed exon 30001 30100 . - .
Parent=9;Alias=Transcript:trans-2;Gene=abc-1;Note=function+unknown
Contig1 confirmed exon 30701 30800 . - .
Parent=9;Alias=Transcript:trans-2
Contig1 confirmed exon 30801 31000 . - .
Parent=9;Alias=Transcript:trans-2
Contig1 . region 1 37450 . . . Name=Contig1;ID=1
 
with my new version of gff3_string (not yet commited), only the 3rd
problem is addressed, generating
 
bp_seqfeature_gff3.PLS --gff 1  -- seq_id Contig1
Contig1 est similarity 1001 1100 96 . .
Name=EST:CEESC13F;ID=3;Target=EST:CEESC13F 1 100 +
Contig1 est similarity 1201 1300 99 . .
Name=EST:CEESC13F;ID=4;Target=EST:CEESC13F 101 200 +
Contig1 est similarity 1401 1450 99 . .
Name=EST:CEESC13F;ID=5;Target=EST:CEESC13F 201 250 +
Contig1 confirmed transcript 1001 2000 42 + .
ID=2;Alias=Transcript:trans-1;Gene=abc-1,xyz-2;Note=function+unknown
Contig1 confirmed exon 1001 1100 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed exon 1201 1300 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed exon 1401 1450 . + . Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1051 1100 . + 0 Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1201 1300 . + 2 Parent=2;Alias=Transcript:trans-1
Contig1 confirmed CDS 1401 1440 . + 0 Parent=2;Alias=Transcript:trans-1
Contig1 tc1 transposon 5001 6000 . + . ID=6;Alias=Transposon:c128.1
Contig1 tc1 transposon 8001 9000 . - . ID=7;Alias=Transposon:c128.2
Contig1 confirmed transcript 30001 31000 . - .
ID=9;Alias=Transcript:trans-2;Gene=xyz-2;Note=Terribly+interesting
Contig1 confirmed exon 30001 30100 . - .
Parent=9;Alias=Transcript:trans-2;Gene=abc-1;Note=function+unknown
Contig1 confirmed exon 30701 30800 . - .
Parent=9;Alias=Transcript:trans-2
Contig1 confirmed exon 30801 31000 . - .
Parent=9;Alias=Transcript:trans-2
Contig1 . region 1 37450 . . . Name=Contig1;ID=1
 
 
I had to make another change to get this output though, since I had to
change the behaviour to
 
  # provide special handling to "remove an extraneous level
  # of parentage" (unless $preserveHomegenousParent) for features
  # which have at least one subfeature with the same type as the
  # feature itself (thus redefining Lincoln's "homogenous
  # parent/child" case, which previously required all children to have
  # the same type as parent)
 
 
I think you will agree this is the more desirable behaviour.
 
I would be happy to test any other GFF you suggest might be (more or
less) roundtripped.
 
What think you?
 
--Malcolm
 


________________________________

	From: lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com]
On Behalf Of Lincoln Stein
	Sent: Thursday, May 03, 2007 9:46 AM
	To: Cook, Malcolm
	Subject: Re: Handling discontiguous feature locations in
Bio::DB::SeqFeature::Store -- proposed patch to
Bio::Graphics::FeatureBase
	
	
	Hi Malcolm,
	
	For me, the major use case is that GFF3 files round-trip
correctly through the database. Do any of your use cases cover that?
	
	Lincoln
	
	
	On 5/2/07, Cook, Malcolm <MEC at stowers-institute.org> wrote: 

		Lincoln, 
		 
		Here for your comment and review is a very reworked
version of Bio::Graphics::FeatureBase->gff3_string.
		 
		The main difference is to that homogenous children get
ALL their attributes except for start/stop from the parent, including
the group.  I also provide option as to whether or now to "remove
extraneous level of parentage" called $preserveHomegenousParent.
		 
		There is an in-line comment and question for you in the
code body.
		 
		It works well in my hands to my use cases, but, I'm not
positive it is in the spirit of your intentions.
		 
		Cheers,
		 
		Malcolm
		 
		 
		sub gff3_string {
		  my ($self, $recurse, $preserveHomegenousParent,
		 
		      # Note: the following parameters, whose name
begins with '$_',
		      # are intended for recursive call only.
		 
		      $_parent,
		      $_self_is_hsf,  # is $self the child in a
homogeneous parent/child relationship?
		      $_hsf_parentgroup, # if so, what is the group (GFF
column 9) of the parent
		     ) = @_;
		 
		  # PURPOSE: Return GFF3 format for the feature $self.
Optionally
		  # $recurse to include GFF for any subfeatures of the
feature. If
		  # recursing, provide special handling to "remove an
extraneous level
		  # of parentage" (unless $preserveHomegenousParent) for
features
		  # which have subfeatures all of whose types are the
same as the
		  # feature itself (the "homogenous parent/child" case).
This usage is
		  # a convention for representing discontiguous
features; they may be
		  # created by using the -segment directive without
specifying a
		  # distinct -subtype in to `new` when creating a
		  # Bio::Graphics::FeatureBase (i.e.
Bio::DB::SeqFeature,
		  # Bio::Graphics::Feature).  Such homogenous
subfeatures created in
		  # this fashion DO NOT have the parent (GFF column 9)
attributes
		  # propogated to them; so, since they are all part of
the same
		  # parent, the ONLY difference relevant to GFF
production SHOULD be
		  # the $start and $end coordinates for their segment,
and ALL THIER
		  # OTHER ATTRIBUTES should be copied down from the
parent (including:
		  # strand, score, Name, ID, Parent, etc).
		 
		

		  my $hparentORself = $_self_is_hsf ? $_parent : $self;
# $self's parent, if it is a homogenous child, otherwise $self.
		 
		  if ($recurse &&  (my @ssf = $self->sub_SeqFeature)) {
		    my $homogenous = ! grep {$_->type ne $self->type}
@ssf; # will be TRUE only if  all subfeatures are the same type as
$self.
		    my $mygroup =
		      # compute $self's group if it is needed to be
passed down to
		      # subfeatures, unless it is already being passed
down (in which
		      # case there are (at least) 3 levels of homogenous
parent child
		      # (will this ever happen in practice???))
		      ! $homogenous ? '' : $_self_is_hsf ?
$_hsf_parentgroup : $self->format_attributes($_parent); 
		    return (join("\n", (($preserveHomegenousParent ?
($self->gff3_string(0)) : ()) , map
{$_->gff3_string($recurse,$preserveHomegenousParent,$hparentORself,$homo
genous,$mygroup)} @ssf)));
		  } else {
		    my $name  = $hparentORself->name;
		    my $class = $hparentORself->class;
		    my $group = $_self_is_hsf ? $_hsf_parentgroup :
$self->format_attributes($_parent);
		    my $strand = ('-','.','+')[$self->strand+1]; 
		    # TODO: understand conditions under which this could
be other than
		    # hparentORself->strand.  In particular, why does
add_segment flip
		    # the strand when start > stop?  I thought this was
not allowed!
		    # Lincoln - any ideas?
		    my $p      = join("\t",
	
$hparentORself->ref||'.',$hparentORself->source||'.',$hparentORself->met
hod||'.',
		        $self->start||'.',$self->stop||'.',
		        defined($hparentORself->score) ?
$hparentORself->score : '.',
		        $strand||'.',
		        defined($hparentORself->phase) ?
$hparentORself->phase : '.',
		        $group||'');
		  }
		}
		 
		
		

________________________________

			From: Cook, Malcolm 
			Sent: Friday, April 27, 2007 1:45 PM
			To: 'lincoln.stein at gmail.com'
			Cc: 'lstein at cshl.org'; 'bioperl list'
			Subject: RE: Handling discontiguous feature
locations in Bio::DB::SeqFeature::Store -- proposed patch to
Bio::Graphics::FeatureBase
			
			
			
			Hi Lincoln,
			 
			Cool.
			 
			The principal of what I figured out I still
think holds but the implementation is slightly broke.  Improved patch
forthoming next week.
			 

			Malcolm Cook
			Database Applications Manager - Bioinformatics
			Stowers Institute for Medical Research - Kansas
City, Missouri
			  

			 


________________________________

				From: lincoln.stein at gmail.com
[mailto:lincoln.stein at gmail.com] On Behalf Of Lincoln Stein
				Sent: Friday, April 27, 2007 12:45 PM
				To: Cook, Malcolm
				Cc: lstein at cshl.org; bioperl list
				Subject: Re: Handling discontiguous
feature locations in Bio::DB::SeqFeature::Store -- proposed patch to
Bio::Graphics::FeatureBase
				
				
				
				Hi Malcom,
				
				This is absolutely ok and you can go
ahead and commit. Thanks for figuring this out!
				
				Lincoln
				
				
				On 4/26/07, Cook, Malcolm <
MEC at stowers-institute.org <mailto:MEC at stowers-institute.org> > wrote: 

				Lincoln, et al,
				
				I find that the gff3_string for
Bio::DB::SeqFeature objects retreived 
				from a Bio::DB::SeqFeature::Store that
were initially created with
				-seqments (i.e. whose location was
discontiguous) does not display any
				other attributes in column 9 than
"Name".
				
				What do you think of the following patch
to Bio::Graphics::FeatureBase, 
				whose effect is to "contrive to return
(duplicated) common group values"
				(which otherwise get lost when
"collapsing" "homogenous" parent/child
				features)
				
				Another approach would be to copy the
attributes from the parent to the 
				children when the -seqments are first
created.
				
				Another approach would be to use
Bio::SeqFeature::Generic  as the db's
				-seqfeature_class and save with
-location being a Bio::Location::Split,
				but this was wrougth with other
problems. 
				
				Any other suggestions?  Do you want me
to commit this patch?
				
				Cheers,
				
				Malcolm
				
				Patch follows:
				
				
				
				
				Index: FeatureBase.pm
	
=================================================================== 
				RCS file:
	
/home/repository/bioperl/bioperl-live/Bio/Graphics/FeatureBase.pm,v
				retrieving revision 1.29
				diff -c -r1.29 FeatureBase.pm
				*** FeatureBase.pm      16 Apr 2007
19:55:33 -0000      1.29
				--- FeatureBase.pm       26 Apr 2007
16:30:23 -0000
				***************
				*** 581,587 ****
				      foreach (@children) {
				        s/Parent=/ID=/g;
				      } # replace Parent tag with ID
				!     return join "\n", at children;
				    }
				
				    return join("\n",$p, at children);
				--- 581,589 ----
				      foreach (@children) {
				        s/Parent=/ID=/g;
				      } # replace Parent tag with ID
				!     #return join "\n", at children; 
				!     # Instead of above, additionally,
contrive to return (duplicated)
				common group values
				!     return(join("$group\n", at children)
. $group);
				    }
				
				    return join("\n",$p, at children); 
				




				-- 
				Lincoln D. Stein
				Cold Spring Harbor Laboratory
				1 Bungtown Road
				Cold Spring Harbor, NY 11724
				(516) 367-8380 (voice)
				(516) 367-8389 (fax)
				FOR URGENT MESSAGES & SCHEDULING, 
				PLEASE CONTACT MY ASSISTANT, 
				SANDRA MICHELSEN, AT michelse at cshl.edu 




	-- 
	Lincoln D. Stein
	Cold Spring Harbor Laboratory
	1 Bungtown Road
	Cold Spring Harbor, NY 11724
	(516) 367-8380 (voice)
	(516) 367-8389 (fax)
	FOR URGENT MESSAGES & SCHEDULING, 
	PLEASE CONTACT MY ASSISTANT, 
	SANDRA MICHELSEN, AT michelse at cshl.edu 





More information about the Bioperl-l mailing list