[Bioperl-l] FW: ORF finder

Tue, 5 Sep 2000 10:45:30 +0200

-----Original Message-----
From: hilmar.lapp@pharma.Novartis.com
[mailto:hilmar.lapp@pharma.Novartis.com]
Sent: 04 September 2000 18:28
To: Gatherer, D. (Derek)
Cc: Ewan Birney; bioperl-l@bioperl.org
Subject: RE: ORF finder

>>what sort of ORF finder would you be interested in? One that takes the
>>sequence as it is and outputs the most likely ORF (based on length and
>>coding potential)?

A _very_ simple ORF finder that could scan a database of cDNAs, and print
out all the potential ORFs and their putative translated products.

>>A couple of weeks ago I wrote to the BioPerl list about ESTScan.....
educated suggestion of how to model the insertions and deletions ESTScan
predicts....

Nothing so sophisticated, as yet!!

Here's the code I have so far:
_________________________________________

#---------------------------------------------------------------------------
--
# PACKAGE    : FindORF.pm
# PURPOSE    : To find open reading frames in a cDNA sequence
# AUTHOR     : Derek Gatherer (D.Gatherer@organon.nhe.akzonobel.nl)
# SOURCE     : 
# CREATED    : 5th September 2000
# MODIFIED   : 
# DISCLAIMER : I am employed in the pharmaceutical industry but my 
#	       : employers do not endorse or sponsor this module
#	       : in any way whatsoever.  The above email address is
#	       : given purely for the purpose of easy communication
#            : with the author, and does not imply any connection
#	       : between my employers and anything written below.
# LICENCE    : You may distribute this module under the same terms 
#	       : as the rest of BioPerl.
#---------------------------------------------------------------------------
--

=head1 NAME

Bio::Tools::FindORF - Object holding ORF statistics for one cDNA sequence

=head1 SYNOPSIS

Take a sequence object from, say, an inputstream, and creates an object 
for the purposes of holding open reading frame (ORF) statistics about 
that sequence.  The minimum size of the ORF is specified.  The ORFs 
in only the forward orientation are searched.  For reverse orientation, 
it is recommended that the reverse complement be generated, by eg. 
$seqobj->revcom(), and the FindORF call repeated.

Creating the FindORF object, eg:

	my $inputstream = Bio::SeqIO->new( -file => "seqfile", -format =>
'Fasta');
	my $seqobj = $inputstream->next_seq();
	my $ORF_obj = Bio::Tools::FindORF->new($seqobj);

or:
	my $seqobj = Bio::PrimarySeq->new(-seq=>'[cut and paste a sequence
here]', -moltype = 'dna', -id = 'test');
	my $ORF_obj  =  Bio::Tools::FindORF->new($seqobj);

obtain an array of arrays, with each of the constituent arrays holding 
the start point, the stop point, and the DNA sequence of the ORF.
The start point is taken to be the position of the A of the starting ATG,
and the stop point is taken to be the position immediately prior to the 
first T of the stop codon.

So a sequence ATGAAAAAATGA would have start at 1 and stop at 9. 

eg:
	my $array_ref = $ORF_obj->get_ORFs(50);

will get all such ORFs of length more than 50 bases.

display array of arrays, eg:

	my @array_of_arrays = @$array_ref;
	foreach(@array_of_arrays)
	{
		print "\n@$_";
	}

=head1 DESCRIPTION

Bio::Tools::FindORF is a featherweight object for finding ORFs in a single 
cDNA sequence.  RNA sequences need to be converted to DNA before ORF
finding.
Protein sequences will throw an exception.  All ORFs are assumed to begin
with
methionine ATG, so this module applies to cDNA only, as yet.

See Synopsis above for object creation code.

=head1 DEVELOPERS' NOTES

=head1 FEEDBACK

=head2 Mailing Lists

User feedback is an integral part of the evolution of this
and other Bioperl modules. Send your comments and suggestions preferably
to one of the Bioperl mailing lists.
Your participation is much appreciated.

  mailto:vsns-bcd-perl@lists.uni-bielefeld.de         - General discussion
  mailto:vsns-bcd-perl-guts@lists.uni-bielefeld.de    - Technically-oriented
discussion
  http://bio.perl.org/MailList.html            	      - About the mailing
lists

=head2 Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track
the bugs and their resolution.
Bug reports can be submitted via email or the web:

  mailto:bioperl-bugs@bio.perl.org
  http://bio.perl.org/bioperl-bugs/

=head1 AUTHOR

Derek Gatherer

=head1 APPENDIX

The rest of the documentation details each of the object methods. 
Internal methods are usually preceded with a _

=cut

package Bio::Tools::FindORF;
use vars qw(@ISA);
use strict;
use Bio::PrimarySeq;		# for translating the ORF
use Bio::PrimarySeqI;

# Object preamble - inherits from Bio::Root::Object

@ISA = qw(Bio::Root::Object);

# new() is inherited from Bio::Root::Object

# _initialize is is called from within new()

sub _initialize 
{
    	my($self,@args) = @_;
    	$self->SUPER::_initialize;

	my $seqobj = shift (@args);
    	unless($seqobj->isa("Bio::PrimarySeqI")) 
	{
		$seqobj->throw("die in _init, FindORF works only on
PrimarySeqI objects\n");
    	}

	unless($seqobj->moltype() eq 'dna')   # one of 'dna','rna','protein'
	{
		$seqobj->throw("die in _init, FindORF works only on DNA
sequences\n");
    	}

    	$self->{'_seqref'} = $seqobj;

    	return $self;
}

=head2 getORFs

 Title   : getORFs
 Usage   : my $array_ref = $ORF_obj->get_ORFs($length);
 Function: finds ORFs in a single DNA sequence
 Example : a sequence TGATGCCCATGCCCCGTAATGACCTAGCCTGA
	   : will give the array of arrays (each line one array)
	   : 3 29 ATGCCCATGCCCCGTAATGACCTAGCC and the corresponding protein
seq.
	   : 9 29 ATGCCCCGTAATGACCTAGCC
         : 19 24 ATGACC 
 Returns : Reference to an array of arrays, as above.
 Args    : $length, the minimum size of ORF to bother with.

  Throws an exception if the submitted sequence is not DNA.

=cut

sub get_ORFs
{
	my ($self,$length) = @_;

	my $seqobj =  $self->{'_seqref'};

	if(!$length | $length =~ /\D/ig | $length <= 0) 
	{
		$self->throw("die in get_ORFs, the length variable must be a
positive integer\n");
    	}

	unless  ($seqobj->isa("Bio::PrimarySeqI")) 
	{
		$self->throw("die in get_ORFs, FindORF works only on
PrimarySeqI objects\n");
    	}

	my $seqstring = uc $seqobj->seq();

	my @all_ORFs = ();			# the thing returned at the
end

# now the real business

	my $p = 0;					# count starts
	my $q = 0;					# count stops
	my @start = ();				# store starts
	my @stop = ();				# store stops
	while ($seqstring =~ /(ATG)/g)		# got a start?
	{
		$start[$p++] = pos($seqstring);	# note position
	}

	while ($seqstring =~ /(TGA|TAA|TAG)/g) 	# got a stop?
	{
		$stop[$q++] = pos($seqstring); 	# note position
	}

	for(my $x=0;$x<=$#start;$x++)		# all starts
	{	
		for(my $y=0;$y<=$#stop;$y++)	# pair up each with all
stops
		{
# test if stop is after start, AND is in same frame, AND is of required
length

			if(($stop[$y]>$start[$x]) &&
(($stop[$y]-$start[$x])%3==0) && (($stop[$y]-$start[$x])>=$length))
			{
# if we have a start and stop in frame, is there a clear ORF between them?

				my $ORFseq =
substr($seqstring,($start[$x]-3),($stop[$y]-$start[$x]));
				my $cont=0;		# switch to decide
if we have ORF
				while($ORFseq =~ /TGA|TAA|TAG/ig) # got an
internal stop?
				{
					if((pos($ORFseq))%3==0)	  # does it
interfere with the frame?
					{
						$cont=1;	  # if yes,
not an ORF, ignore.
					}
				}	
				if($cont==0)			  # if no,
ORF is clean, proceed
				{
					# make a new seqobj from the good
ORF sequence
					my $ORF_seqobj =
Bio::PrimarySeq->new(-seq => "$ORFseq", -moltype => 'dna');
					# make a protein object from it
					my $translated_obj =
$ORF_seqobj->translate();
					# pull out the protein sequence
					my $prot_product =
$translated_obj->seq();
# make up data array for ORF, with start pos., stop pos., ORF DNA seq. and
predicted protein seq.
					my @this_ORF =
(($start[$x]-2),($stop[$y]-3),$ORFseq, $prot_product);

# stick this into a big array of results
					push @all_ORFs, [@this_ORF];
				}
			}
		}
	}

	return \@all_ORFs;		# return the results array
}
# and that's all the module

=head1 A DRIVER SCRIPT

##!/usr/local/bin/perl -w

use strict;

use lib "/usr/lib/perl5/5.005/"; # or whatever your path is
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::FindORF;

my $seqin = Bio::SeqIO->new( '-format' => 'Fasta' , -file => 'YOUR FILE
HERE');

while((my $seqobj = $seqin->next_seq()))		# run through flat
file
{
	print "\n".$seqobj->display_id();		# each sequence in
turn
	my $ORF_obj = Bio::Tools::FindORF->new($seqobj);# make an ORF object

	my $array_ref = $ORF_obj->get_ORFs(2000);	# run the method

	my @array_of_arrays = @$array_ref;		# dereference the
output
	foreach(@array_of_arrays)
	{
		print "\n@$_";				# print results
	}
}

=cut
1;