[Bioperl-l] longest ORF

Tue, 30 Jul 2002 12:41:57 -0400

its your regular expressions.. The ! does not work as you might expect..
try this instead.
Deep

foreach $direction (keys %strand) {
  my @starts;
  my @stops;
  while ($strand{$direction}=~m/(^(?!taa)|^(?!tga)|^(?!tag)|atg)/gi)
  {
     my $pos = pos($strand{$direction})- 2  ;
     if ($pos <= 0)
     {
       $pos = 1 ;
     }
     push @starts,$pos;
  }

  while ($strand{$direction}=~m/(taa|tga|tag|...$)/gi)
  {
     my $pos = pos($strand{$direction})- 2  ;
     if ($pos <= 0)
     {
        $pos = 1 ;
     }
     push @ends,$pos;
  }

>From: Dan Kortschak <kortschak@rsbs.anu.edu.au>
>Reply-To: Dan Kortschak <kortschak@rsbs.anu.edu.au>
>To: bioperl-l@bioperl.org
>Subject: Re: [Bioperl-l] longest ORF
>Date: Tue, 30 Jul 2002 17:49:50 +1000 (EST)
>MIME-Version: 1.0
>Received: from mc1-f21.law16.hotmail.com ([65.54.236.28]) by 
>mc1-s21.law16.hotmail.com with Microsoft SMTPSVC(5.0.2195.4905); Tue, 30 
>Jul 2002 01:01:06 -0700
>Received: from pw600a.bioperl.org ([199.93.107.70]) by 
>mc1-f21.law16.hotmail.com with Microsoft SMTPSVC(5.0.2195.4905); Tue, 30 
>Jul 2002 00:54:02 -0700
>Received: from pw600a.bioperl.org (localhost [127.0.0.1])by 
>pw600a.bioperl.org (8.12.2/8.12.2) with ESMTP id g6U7qNJD028574;Tue, 30 Jul 
>2002 03:52:23 -0400
>Received: from rsbsimail.anu.edu.au (rsbsimail.anu.edu.au 
>[150.203.38.39])by pw600a.bioperl.org (8.12.2/8.12.2) with ESMTP id 
>g6U7p1JD028541for <bioperl-l@bioperl.org>; Tue, 30 Jul 2002 03:51:02 -0400
>Received: from rsbs37112.anu.edu.au (not verified[150.203.37.112]) by 
>rsbsimail.anu.edu.au with MailMarshal (4,2,5,0) id <B0000628e1>; Tue, 30 
>Jul 2002 17:50:59 +1000
>X-X-Sender: dkortsch@sunya.rsbs.anu.edu.au
>In-Reply-To: 
><Pine.OSF.4.33.0207292002001.18773-100000@alpha10.bioch.virginia.edu>
>Message-ID: 
><Pine.LNX.4.44.0207301742080.17429-100000@sunya.rsbs.anu.edu.au>
>Sender: bioperl-l-admin@bioperl.org
>Errors-To: bioperl-l-admin@bioperl.org
>X-BeenThere: bioperl-l@bioperl.org
>X-Mailman-Version: 2.0.12
>Precedence: bulk
>List-Help: <mailto:bioperl-l-request@bioperl.org?subject=help>
>List-Post: <mailto:bioperl-l@bioperl.org>
>List-Subscribe: 
><http://bioperl.org/mailman/listinfo/bioperl-l>,<mailto:bioperl-l-request@bioperl.org?subject=subscribe>
>List-Id: Bioperl Project Discussion List <bioperl-l.bioperl.org>
>List-Unsubscribe: 
><http://bioperl.org/mailman/listinfo/bioperl-l>,<mailto:bioperl-l-request@bioperl.org?subject=unsubscribe>
>List-Archive: <http://bioperl.org/pipermail/bioperl-l/>
>Return-Path: bioperl-l-admin@bioperl.org
>X-OriginalArrivalTime: 30 Jul 2002 07:54:03.0578 (UTC) 
>FILETIME=[45D591A0:01C2379E]
>
>Hi Arron, I've fiddled some with the barebones perl you posted (there were
>issues with codons being in frame or not as well and some BioPerl syntax
>things). Thanks BTW. As it stands (with exception of the following), it
>works, but only on real ORFs (which sometimes aren't that useful).
>
>What I'm trying to do with the m/(atg|^!taa|^!tga|^!tag)/gi and
>m/(taa|tga|tag|...$)/gi is to let the ORF start at the beginning of the
>dna so long as it is not a stop and allow it to run-off at the end.
>However my tests show that the orfs always start at an atg. Can anyone
>help here?
>
>thanks
>Dan Kortschak
>
>#!/usr/bin/perl
>
>use Getopt::Long;
>use Bio::SeqIO;
>
>my ($sequencefile,$sequenceformat,$help) = (undef, 'fasta',undef);
>
>&GetOptions('input|i=s'              => \$sequencefile,
>             'format|f=s'             => \$sequenceformat,
>             'help|h'                 => \$help,
>             );
>
>if ($help) {
>    exec('perldoc', $0); # when I can be bothered
>    die;
>}
>
>sub longestORF {
>    my $best=0;
>    my ($bests, $beste, $beststrand) = (-1,-1,0);
>
>    my $dna=Bio::Seq->new(-seq => $_[0]);
>    my %strand=('+'=>$dna->seq,
>                '-'=>$dna->revcom->seq);
>
>    foreach $direction (keys %strand) {
>       my @starts;
>       my @stops;
>       while ($strand{$direction}=~m/(atg|^!taa|^!tga|^!tag)/gi) {
>          push @starts,pos($strand{$direction})-2;
>       }
>
>       while ($strand{$direction}=~m/(taa|tga|tag|...$)/gi) {
>          push @ends,pos($strand{$direction})-2;
>       }
>
>       for my $s (@starts) {
>          for my $e (@ends) {
>             if ($e%3==$s%3 and $e>$s) {
>                if ($e-$s>$best) {
>                   $best=$e-$s;
>                   ($bests,$beste,$beststrand) = ($s, $e, $direction);
>                   
>$bestorf=Bio::Seq->new(-seq=>$strand{$direction})->subseq($s,$e);
>                }
>             last
>             } else {
>                next
>             }
>          }
>       }
>       @starts=0;
>       @ends=0;
>    }
>    return ($best, $bests, $beste, $beststrand, $bestorf);
>}
>
>
>my $seqio = new Bio::SeqIO('-format' => $sequenceformat,
>                            '-file'   => $sequencefile );
>
>while (my $dna = $seqio->next_seq) {
>
>    print $dna->display_id," ",$dna->desc,": ";
>    $seq=$dna->seq;
>    ($length,$start,$end,$direction,$sequence)=longestORF($seq);
>    print $length," ",Bio::Seq->new(-seq=>$sequence)->translate->seq,"\n";
>}
>
>
>On Mon, 29 Jul 2002, Aaron J Mackey wrote:
>
> >
> > Something like this:
> >
> > #!/usr/bin/perl -w
> >
> > use strict;
> >
> > # get a $seq Bio::Seq object somehow ...
> > my $dna = $seq->seq;
> >
> > my @starts;
> > my @stops;
> >
> > while ($dna =~ m/atg/gi) {
> >     push @starts, pos($dna) - 2;
> > }
> >
> > while ($dna =~ m/(taa) | (tga) | (tag)/gi) {
> >     push @ends, pos($dna) - 2;
> > }
> >
> > my $best = 0;
> > my ($bests, $beste, $beststrand) = (-1) x 3;
> > for my $s (@starts) {
> >     for my $e (@ends) {
> >         if($e - $b + 1 > $best) {
> >             $best = $e - $b + 1;
> >             ($bests, $beste, $beststrand) = ($s, $e, 1);
> >         }
> >     }
> > }
> >
> > # repeat for reverse strand; left as an exercise for the reader.
> > # report $bests, $beste, $beststrand, perhaps build a new Seq.pm object
> >
> > As always, this is all good fodder for a Tools/ORFCaller.pm or somesuch.
> > Like Tools::IUPAC, Tools::ORFs could return a stream of Seq's 
>representing
> > each orf (with parameters for desired strandedness, minimum length,
> > genetic code (to use with Tools::CodonTable, etc).
> >
> > -Aaron
> >
> >
> > On Tue, 30 Jul 2002, Dan Kortschak wrote:
> >
> > > Hi All, I'm looking to see if anyone has a small script to take a 
>Bio::Seq
> > > object and find the longest ORF ((+) strand or both strands).
> > > Altenatively, is there an easy way to do this using Bio::Seq 
>(translate
> > > doesn't seem to take a frame for translation so I'm guessing that it
> > > translates in frame 1 - is this right?)
> > >
> > > thanks
> > > Dan Kortschak
> > >
> > >
> >
> >
>
>--
>_____________________________________________________________   .`.`o
>                                                          o| ,\__ `./`r
>   Dan Kortschak    kortschak@rsbs.anu.spanner.edu.au     <\/    \_O> O
>                                                           "|`...'.\
>   Before you criticise a man, try to walk a mile in his    `      :\
>   shoes. Then, if he doesn't like what you have to say,           : \
>   you'll be a mile away, and you'll have his shoes.               :  \
>
>   The address above will not work, remove the spanner from the works.
>
>By replying to this email you implicitly accept that your response may
>be forwarded to other recipients.
>Permission is granted for fair use reproduction.
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l@bioperl.org
>http://bioperl.org/mailman/listinfo/bioperl-l

_________________________________________________________________
Send and receive Hotmail on your mobile device: http://mobile.msn.com