[Bioperl-l] FW: ORF finder
Gatherer, D. (Derek)
D.Gatherer@organon.nhe.akzonobel.nl
Tue, 5 Sep 2000 10:45:30 +0200
-----Original Message-----
From: hilmar.lapp@pharma.Novartis.com
[mailto:hilmar.lapp@pharma.Novartis.com]
Sent: 04 September 2000 18:28
To: Gatherer, D. (Derek)
Cc: Ewan Birney; bioperl-l@bioperl.org
Subject: RE: ORF finder
>>what sort of ORF finder would you be interested in? One that takes the
>>sequence as it is and outputs the most likely ORF (based on length and
>>coding potential)?
A _very_ simple ORF finder that could scan a database of cDNAs, and print
out all the potential ORFs and their putative translated products.
>>A couple of weeks ago I wrote to the BioPerl list about ESTScan.....
educated suggestion of how to model the insertions and deletions ESTScan
predicts....
Nothing so sophisticated, as yet!!
Here's the code I have so far:
_________________________________________
#---------------------------------------------------------------------------
--
# PACKAGE : FindORF.pm
# PURPOSE : To find open reading frames in a cDNA sequence
# AUTHOR : Derek Gatherer (D.Gatherer@organon.nhe.akzonobel.nl)
# SOURCE :
# CREATED : 5th September 2000
# MODIFIED :
# DISCLAIMER : I am employed in the pharmaceutical industry but my
# : employers do not endorse or sponsor this module
# : in any way whatsoever. The above email address is
# : given purely for the purpose of easy communication
# : with the author, and does not imply any connection
# : between my employers and anything written below.
# LICENCE : You may distribute this module under the same terms
# : as the rest of BioPerl.
#---------------------------------------------------------------------------
--
=head1 NAME
Bio::Tools::FindORF - Object holding ORF statistics for one cDNA sequence
=head1 SYNOPSIS
Take a sequence object from, say, an inputstream, and creates an object
for the purposes of holding open reading frame (ORF) statistics about
that sequence. The minimum size of the ORF is specified. The ORFs
in only the forward orientation are searched. For reverse orientation,
it is recommended that the reverse complement be generated, by eg.
$seqobj->revcom(), and the FindORF call repeated.
Creating the FindORF object, eg:
my $inputstream = Bio::SeqIO->new( -file => "seqfile", -format =>
'Fasta');
my $seqobj = $inputstream->next_seq();
my $ORF_obj = Bio::Tools::FindORF->new($seqobj);
or:
my $seqobj = Bio::PrimarySeq->new(-seq=>'[cut and paste a sequence
here]', -moltype = 'dna', -id = 'test');
my $ORF_obj = Bio::Tools::FindORF->new($seqobj);
obtain an array of arrays, with each of the constituent arrays holding
the start point, the stop point, and the DNA sequence of the ORF.
The start point is taken to be the position of the A of the starting ATG,
and the stop point is taken to be the position immediately prior to the
first T of the stop codon.
So a sequence ATGAAAAAATGA would have start at 1 and stop at 9.
eg:
my $array_ref = $ORF_obj->get_ORFs(50);
will get all such ORFs of length more than 50 bases.
display array of arrays, eg:
my @array_of_arrays = @$array_ref;
foreach(@array_of_arrays)
{
print "\n@$_";
}
=head1 DESCRIPTION
Bio::Tools::FindORF is a featherweight object for finding ORFs in a single
cDNA sequence. RNA sequences need to be converted to DNA before ORF
finding.
Protein sequences will throw an exception. All ORFs are assumed to begin
with
methionine ATG, so this module applies to cDNA only, as yet.
See Synopsis above for object creation code.
=head1 DEVELOPERS' NOTES
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this
and other Bioperl modules. Send your comments and suggestions preferably
to one of the Bioperl mailing lists.
Your participation is much appreciated.
mailto:vsns-bcd-perl@lists.uni-bielefeld.de - General discussion
mailto:vsns-bcd-perl-guts@lists.uni-bielefeld.de - Technically-oriented
discussion
http://bio.perl.org/MailList.html - About the mailing
lists
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
the bugs and their resolution.
Bug reports can be submitted via email or the web:
mailto:bioperl-bugs@bio.perl.org
http://bio.perl.org/bioperl-bugs/
=head1 AUTHOR
Derek Gatherer
=head1 APPENDIX
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
=cut
package Bio::Tools::FindORF;
use vars qw(@ISA);
use strict;
use Bio::PrimarySeq; # for translating the ORF
use Bio::PrimarySeqI;
# Object preamble - inherits from Bio::Root::Object
@ISA = qw(Bio::Root::Object);
# new() is inherited from Bio::Root::Object
# _initialize is is called from within new()
sub _initialize
{
my($self,@args) = @_;
$self->SUPER::_initialize;
my $seqobj = shift (@args);
unless($seqobj->isa("Bio::PrimarySeqI"))
{
$seqobj->throw("die in _init, FindORF works only on
PrimarySeqI objects\n");
}
unless($seqobj->moltype() eq 'dna') # one of 'dna','rna','protein'
{
$seqobj->throw("die in _init, FindORF works only on DNA
sequences\n");
}
$self->{'_seqref'} = $seqobj;
return $self;
}
=head2 getORFs
Title : getORFs
Usage : my $array_ref = $ORF_obj->get_ORFs($length);
Function: finds ORFs in a single DNA sequence
Example : a sequence TGATGCCCATGCCCCGTAATGACCTAGCCTGA
: will give the array of arrays (each line one array)
: 3 29 ATGCCCATGCCCCGTAATGACCTAGCC and the corresponding protein
seq.
: 9 29 ATGCCCCGTAATGACCTAGCC
: 19 24 ATGACC
Returns : Reference to an array of arrays, as above.
Args : $length, the minimum size of ORF to bother with.
Throws an exception if the submitted sequence is not DNA.
=cut
sub get_ORFs
{
my ($self,$length) = @_;
my $seqobj = $self->{'_seqref'};
if(!$length | $length =~ /\D/ig | $length <= 0)
{
$self->throw("die in get_ORFs, the length variable must be a
positive integer\n");
}
unless ($seqobj->isa("Bio::PrimarySeqI"))
{
$self->throw("die in get_ORFs, FindORF works only on
PrimarySeqI objects\n");
}
my $seqstring = uc $seqobj->seq();
my @all_ORFs = (); # the thing returned at the
end
# now the real business
my $p = 0; # count starts
my $q = 0; # count stops
my @start = (); # store starts
my @stop = (); # store stops
while ($seqstring =~ /(ATG)/g) # got a start?
{
$start[$p++] = pos($seqstring); # note position
}
while ($seqstring =~ /(TGA|TAA|TAG)/g) # got a stop?
{
$stop[$q++] = pos($seqstring); # note position
}
for(my $x=0;$x<=$#start;$x++) # all starts
{
for(my $y=0;$y<=$#stop;$y++) # pair up each with all
stops
{
# test if stop is after start, AND is in same frame, AND is of required
length
if(($stop[$y]>$start[$x]) &&
(($stop[$y]-$start[$x])%3==0) && (($stop[$y]-$start[$x])>=$length))
{
# if we have a start and stop in frame, is there a clear ORF between them?
my $ORFseq =
substr($seqstring,($start[$x]-3),($stop[$y]-$start[$x]));
my $cont=0; # switch to decide
if we have ORF
while($ORFseq =~ /TGA|TAA|TAG/ig) # got an
internal stop?
{
if((pos($ORFseq))%3==0) # does it
interfere with the frame?
{
$cont=1; # if yes,
not an ORF, ignore.
}
}
if($cont==0) # if no,
ORF is clean, proceed
{
# make a new seqobj from the good
ORF sequence
my $ORF_seqobj =
Bio::PrimarySeq->new(-seq => "$ORFseq", -moltype => 'dna');
# make a protein object from it
my $translated_obj =
$ORF_seqobj->translate();
# pull out the protein sequence
my $prot_product =
$translated_obj->seq();
# make up data array for ORF, with start pos., stop pos., ORF DNA seq. and
predicted protein seq.
my @this_ORF =
(($start[$x]-2),($stop[$y]-3),$ORFseq, $prot_product);
# stick this into a big array of results
push @all_ORFs, [@this_ORF];
}
}
}
}
return \@all_ORFs; # return the results array
}
# and that's all the module
=head1 A DRIVER SCRIPT
##!/usr/local/bin/perl -w
use strict;
use lib "/usr/lib/perl5/5.005/"; # or whatever your path is
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::FindORF;
my $seqin = Bio::SeqIO->new( '-format' => 'Fasta' , -file => 'YOUR FILE
HERE');
while((my $seqobj = $seqin->next_seq())) # run through flat
file
{
print "\n".$seqobj->display_id(); # each sequence in
turn
my $ORF_obj = Bio::Tools::FindORF->new($seqobj);# make an ORF object
my $array_ref = $ORF_obj->get_ORFs(2000); # run the method
my @array_of_arrays = @$array_ref; # dereference the
output
foreach(@array_of_arrays)
{
print "\n@$_"; # print results
}
}
=cut
1;