[Bioperl-l] simplealign, prediction parselist
Richard Adams
Richard.Adams@ed.ac.uk
Wed, 18 Sep 2002 18:22:58 +0100
Hi Heikki,
Here is code for functional Simple::Align purge method. It has the same
algorithm
as the original and is largely based on that code.. but it works ( at least for
me).
In the code that follows it is implemented as a normal subroutine that returns a
reference
to a purgedalignment in the main test script. This can obviously be altered when
it is put in
Bio::SimpleAlign.
I make no claims for the beauty of the program but hope it meets your
requirements.
Best wishes
Richard
#!/bin/perl -w
use strict;
use Bio::AlignIO;
use Purge2 qw(purge2);
my $Alnobj = Bio::AlignIO->new (-file => 'myalignmentfilewithduplicates');
my $aln = $Alnobj->next_aln;
print "there are", $aln->no_sequences, " sequences in original alignment\n";
my $purged_aln = purge2 ($aln, 0.7);
print "after purging, purgedalign_ obj has ", $purged_aln->no_sequences, "
sequences\n";
################ end of test script#######################
#!/usr/bin/perl -w
##############this package has a sub which identifies and removes
###### sequences from an alignment above a given threshold of similarity
###### uses remove_seq method of Bio::SimpleAlign.
###### purge2 should be able to directly replace Bio::Simplealign purge.
package Purge2;
use Exporter;
our @ISA = qw(Exporter);
our @EXPORT_OK = qw(purge2);
sub purge2 {
my ($self,$perc) = @_; #takes alignment ref and threshold for
similarity as parameters, like the original.
my (@seqs,$seq,$i,$j,$count,@one,@two,$seq2,$k,$res,$ratio, %duplicate,
@dups);
@seqs = $self->each_seq();
for ($i=0;$i< @seqs - 1;$i++ ) { #for each seq in alignment
$seq = $seqs[$i];
#skip if already in duplicate hash
next if exists $duplicate{$seq->display_id} ;
my $one = $seq->seq();
@one = split '', $one;#split to get 1aa per array element
for($j=$i+1;$j < @seqs;$j++) {
$seq2 = $seqs[$j];
#skip if already in duplicate hash
next if exists $duplicate{$seq2->display_id} ;
my $two = $seq2->seq();
@two = split '', $two;
#print "\@two has ", scalar @two, "elements\n";
$count = 0;
$res = 0;
for($k=0;$k<@one;$k++) {
if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
$one[$k] eq $two[$k]) {
$count++;
}
if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
$two[$k] ne '.' && $two[$k] ne '-' ) {
$res++;
}
}
if( $res == 0 ) {
$ratio = 0;
}
else {
$ratio = $count/$res;
}
# if above threshold put in duplicate hash and push onto
# duplicate array for returning to get_unique
if( $ratio > $perc ) {
print "duplicate!", $seq2->display_id, "\n";
$duplicate{$seq2->display_id} = 1;
push @dups, $seq2;
}
}
}
foreach my $seq(@dups) {
$self->remove_seq($seq);
}
return $self;
}
Heikki Lehvaslaiho wrote:
> Rickhard,
>
> The recommended way to contribute is to first post descriptions of
> modules, names and methods, for other people to comment on. Then in
> addition to writing the actual code, you are expected to write test
> suite (into t directory) with necessary data files (into t/data) which
> test at least all public methods in the modules. The actual commit can
> be handled by any of the core developer or anyone with a CVS account. If
> your contributions to bioperl are more than one off, we'd eventually
> like to give you a CVS account, so that you can maintain and add to the
> code base yourself.
>
> To start with, it is best that you post everything to the list.
>
> -Heikki
>
> On Tue, 2002-09-17 at 20:57, Richard Adams wrote:
> > Heikki,
> > I'll tidy up the Simple::Align->purge code and send it to you asap.
> > Re parsers for netphos, psort etc. Basically I'm working with about 30-40
> > genes on chromosome 4
> > in a region associated with bipolar disorder. I'm trying to build up a
> > database of possibly important
> > functional residues/regions of the proteins for prioritising the
> > experimental analysis of SNPs within
> > the coding region of the genes. Also I'm hoping to bolster the prediction
> > accuracy of these
> > programs by looking for conservation of prediction between sequence
> > homologues in a Clustal alignment.
> >
> > So that's why I'm developing the parsers. At present they are not in the
> > BioPerl 'format' at all and are standalone
> > modules/subroutines. I've not contributed to BioPerl before - is there a
> > standard way that you want code contributed in
> > or is some assistance likely to be available from someone more experienced
> > with BioPerl guts than myself?
> >
> > Richard
> >
> > Heikki Lehvaslaiho wrote:
> >
> > > Richard,
> > >
> > > Bio::SimpleAlign::purge() has not been functional as far as I know. We
> > > do not even have tests for it. Please send me your fixes and add them in
> > > and add you as an contributor to the module.
> > > It would be even better of you have written tests into t/SimpleAlign.t,
> > > too!
> > >
> > > It has been a while since you posted your message and there has been no
> > > answers to it. I think it can be safely said that no one else is working
> > > on Psort, TargetP or netphos parsers. It would be great to have them.
> > > Let us know what your plans are.
> > >
> > > -Heikki
> > >
> > > On Tue, 2002-08-27 at 11:51, Richard Adams wrote:
> > > > Hello,
> > > > Please ignore this if it's already been fixed but there seems to be 2
> > > > problems in 1.0.2
> > > > with Bio::SimpleAlign purge function.
> > > >
> > > > 1. the get_nse call produces the error
> > > > Use of uninitialized value in numeric eq (==) at
> > > > /packages/perl/lib/site_perl/5.6.0/Bio/SimpleAlign.pm line 373, <GEN0>
> > > > line 242.
> > > > 2.
> > > > the lines
> > > > @one = $seq->seq();
> > > > @two = $seq2->seq();
> > > > I guess are supposed to be split so that each element is an amino
> > > > acid or - or .
> > > > But the way the code works there will only be one element - the
> > > > whole sequence string.
> > > >
> > > > I've fixed these problems for my own use so if my code would be any use
> > > > I'd gladly submit it.
> > > >
> > > >
> > > >
> > > > I'm writing some parsers for various prediction servers, such as
> > > > Expasy's Psort, TargetP, netphos and ultimately plan to develop parsers
> > > > for some secondary
> > > > structure prediction servers. Are people already working on this sort
> > > > of thing or would these be useful additions?
> > > >
> > > > Richard
> > > >
> > > > Dr Richard Adams
> > > > Molecular Medicine Centre
> > > > University of Edinburgh UK
> > > >
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > --
> > > ______ _/ _/_____________________________________________________
> > > _/ _/ http://www.ebi.ac.uk/mutations/
> > > _/ _/ _/ Heikki Lehvaslaiho heikki@ebi.ac.uk
> > > _/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
> > > _/ _/ _/ Wellcome Trust Genome Campus, Hinxton
> > > _/ _/ _/ Cambs. CB10 1SD, United Kingdom
> > > _/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
> > > ___ _/_/_/_/_/________________________________________________________
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> --
> ______ _/ _/_____________________________________________________
> _/ _/ http://www.ebi.ac.uk/mutations/
> _/ _/ _/ Heikki Lehvaslaiho heikki@ebi.ac.uk
> _/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
> _/ _/ _/ Wellcome Trust Genome Campus, Hinxton
> _/ _/ _/ Cambs. CB10 1SD, United Kingdom
> _/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
> ___ _/_/_/_/_/________________________________________________________