[Bioperl-l] (no subject)

Jason Stajich jason at cgt.duhs.duke.edu
Thu May 27 22:33:17 EDT 2004


So you are testing for overlapping features.

Depending on how many features you are testing, there are easy ways
to code but can run sort of slow or a slightly more complicated ways to
code this which will run fast.

Bio::SeqFeature::Collection was designed as a way to store a collection of
seqfeatures and to quickly extract subset of features based on a location
filter (i.e. get me all feature between x and y).  This is either done by
using Berkeley DB B-Tree (the DB_File module).

You can also do this efficiently if you convert your data GFF and load it
in Bio::DB::GFF which allows you to the same thing as in SF::Collection
but using a relational database like mysql as the storage instead of
DB_File.

I guess you have to deal with the case of multiply overlapping features
and you will want to choose the one which is the longest I assume?

In my naive-about-to-go-to-bed solution:
# I would make all your entries Bio::Range objects and make them a list
use Bio::Range;
use strict;
open( my $in => "filename") || die $!;
my @ranges;
while( <$in> ) {
    if( />(\d+)\:\s*\[(\d+):(\d+)\]/) {
	my ($num,$start,$end) = ($1,$2,$3);
	$ranges[$num] = Bio::Range->new(-start => $start, -end => $end);
    } else { warn($_) }
}

my @skip;
# start at 1 since the first entry is 1 in your set below.
# find cases where $l and $r overlap and take the larger one
for(my $l=1;$l<@ranges;$l++) {
  for(my $r=$l+1;$r < @ranges;$r++ ) {
   if( $ranges[$l]->overlaps($ranges[$r]) ) {
    # choose the larger one
    if( $ranges[$l]->length > $ranges[$r]->length ) {
      $skip[$r]++;
    } else { $skip[$l]++ }
   }
  }
}
# now go back through and only print the ones we didn't say to skip
for( my $i=1; $i < @ranges; $i++ ) {
    next if $skip[$i];
    print ">$i [",$ranges[$i]->start,":",$ranges[$i]->end,"]\n";
}

This doesn't quite handle cases like this:

-----A----       --------C------
       ----B--------

In that if A < B < C then C is the only ORF which would get kept.  You
would have to do a slightly more complicated thing which I don't have time
to describe here, but which is basically single linkage and build up
groups of features which all overlap and then walk a path through them.
In fact in that case I don't really know which ones you want to keep
anyways...


You might be more clever and do something to give a name to each feature -
this might entail using Bio::SeqFeature::Generic obejcts instead of
Bio::Range but all the same code above would work.

Just my musings, not guaranteed to work off the bat, but I bet other
people have ideas about how to code this up too.

-jason


On Thu, 27 May 2004, Nandita Mullapudi wrote:

> Hi all,
> I have a set of automated gene prediction co-ordinates, - the file consists of the
> id of each gene followed by its co-ordinates, line by line in a contiguous
> manner.
> I need to go through these, identify the first instance of each prediction which is
> a gene "within" the previous gene  and throw those out. Are there any bioperl
> scripts that do that?
> e.g:
> >1:[300:700]
> >2:[1200:900]
> >3:[1300:1800]
> >4: [1600:1900]
> >5: [1700:2000]
>
> #4 is a one such example I'd like to find and eliminate.
>
> TIA,
> -Nandita
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu



More information about the Bioperl-l mailing list