[Bioperl-l] (no subject)
Jason Stajich
jason at cgt.duhs.duke.edu
Thu May 27 22:33:17 EDT 2004
So you are testing for overlapping features.
Depending on how many features you are testing, there are easy ways
to code but can run sort of slow or a slightly more complicated ways to
code this which will run fast.
Bio::SeqFeature::Collection was designed as a way to store a collection of
seqfeatures and to quickly extract subset of features based on a location
filter (i.e. get me all feature between x and y). This is either done by
using Berkeley DB B-Tree (the DB_File module).
You can also do this efficiently if you convert your data GFF and load it
in Bio::DB::GFF which allows you to the same thing as in SF::Collection
but using a relational database like mysql as the storage instead of
DB_File.
I guess you have to deal with the case of multiply overlapping features
and you will want to choose the one which is the longest I assume?
In my naive-about-to-go-to-bed solution:
# I would make all your entries Bio::Range objects and make them a list
use Bio::Range;
use strict;
open( my $in => "filename") || die $!;
my @ranges;
while( <$in> ) {
if( />(\d+)\:\s*\[(\d+):(\d+)\]/) {
my ($num,$start,$end) = ($1,$2,$3);
$ranges[$num] = Bio::Range->new(-start => $start, -end => $end);
} else { warn($_) }
}
my @skip;
# start at 1 since the first entry is 1 in your set below.
# find cases where $l and $r overlap and take the larger one
for(my $l=1;$l<@ranges;$l++) {
for(my $r=$l+1;$r < @ranges;$r++ ) {
if( $ranges[$l]->overlaps($ranges[$r]) ) {
# choose the larger one
if( $ranges[$l]->length > $ranges[$r]->length ) {
$skip[$r]++;
} else { $skip[$l]++ }
}
}
}
# now go back through and only print the ones we didn't say to skip
for( my $i=1; $i < @ranges; $i++ ) {
next if $skip[$i];
print ">$i [",$ranges[$i]->start,":",$ranges[$i]->end,"]\n";
}
This doesn't quite handle cases like this:
-----A---- --------C------
----B--------
In that if A < B < C then C is the only ORF which would get kept. You
would have to do a slightly more complicated thing which I don't have time
to describe here, but which is basically single linkage and build up
groups of features which all overlap and then walk a path through them.
In fact in that case I don't really know which ones you want to keep
anyways...
You might be more clever and do something to give a name to each feature -
this might entail using Bio::SeqFeature::Generic obejcts instead of
Bio::Range but all the same code above would work.
Just my musings, not guaranteed to work off the bat, but I bet other
people have ideas about how to code this up too.
-jason
On Thu, 27 May 2004, Nandita Mullapudi wrote:
> Hi all,
> I have a set of automated gene prediction co-ordinates, - the file consists of the
> id of each gene followed by its co-ordinates, line by line in a contiguous
> manner.
> I need to go through these, identify the first instance of each prediction which is
> a gene "within" the previous gene and throw those out. Are there any bioperl
> scripts that do that?
> e.g:
> >1:[300:700]
> >2:[1200:900]
> >3:[1300:1800]
> >4: [1600:1900]
> >5: [1700:2000]
>
> #4 is a one such example I'd like to find and eliminate.
>
> TIA,
> -Nandita
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
More information about the Bioperl-l
mailing list