[Bioperl-l] Storing Large Primary Sequences
David Block
dblock@gene.pbi.nrc.ca
Wed, 20 Sep 2000 09:44:52 -0600 (CST)
James et al,
Here's the code. I assume that $self has a 'file' attribute with the path
to the chromosomal sequence file (chr2.clean, in my case). I assume
further that the file has been massaged so all the lines (except for
possibly the last one) have the same linelength. I store that linelength
as another attribute of self.
sub loadseq { #loads chromosomal sequence into memory
my ($self)=@_;
my $count=0;
my @in;
open IN,$self->file or die "Couldn't open Sequence file: $!";
if (defined($_=<IN>)) {
if (/^>/) {$_=<IN>} # get rid of the fasta header if it
exists
chomp;
$self->{linelength}=CORE::length; #assuming constant line
widths (as in .clean files)
$in[0]=$_;
}
while (<IN>) {
$count++;
chomp;
$in[$count]=$_;
}
return (\@in);
}
Here's the retrieval code. $self->_load_a_seq calls loadseq if it hasn't
been done already. I use a closure to store the sequence so it's
available to all members of the class. Therefore I don't have to repeat
loadseq for the life of the executable.
sub getseq { #quickly retrieves sequence from chromosomal
sequence present in memory
#returns sequence in 70 character lines
my ($self)=@_;
my $in=$self->_load_a_seq;
my $startpos=$self->start;
my $endpos=$self->stop;
my $linelength=$self->linelength;
my $length=$endpos-$startpos+1;
my $row=int($startpos/$linelength);
my $count=$row * $linelength + 1;
my $seq = substr($in->[$row],($startpos-$count),$length);
$length -= $linelength - ($startpos-$count);
do {
$row++;
$seq .= substr($in->[$row],0,$length);
$length -= $linelength;
} until ($length <= 0);
$seq .= "\n";
$seq=~ s/([^\n]{70})/$1\n/g;
return $seq;
}