[Bioperl-l] improve speed in extracting Fasta sequence
Rob Edwards
rob at salmonella.org
Tue Dec 28 00:02:18 EST 2004
This is very slow because you are using an array to store the data that
you need and then cycling through the array every time you get a
sequence. You should try using a hash to store the lookup information.
In your code, you trim off the whitespace on every iteration. Why not
just do it the first time that you get the accession number?
If you use a hash, you don't need to loop through the whole array each
time you get a new sequence; a single hash lookup tells you whether you
want it. You can also simplify your code a lot by using Bio::SeqIO for
the output as well.
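To make the difference concrete, here is a small pure-Perl sketch. The accession values and the names @acc_list, %acc_hash and in_array are made up for illustration and are not taken from your script; it only shows the lookup pattern, not the BioPerl part.

```perl
use strict;
use warnings;

# Hypothetical accession list; in the real script these come from acc.txt.
my @acc_list = ('P12345', 'Q67890', 'O11111');

# Array approach: every check walks the list until it finds a match.
sub in_array {
    my ($id, $list) = @_;
    for my $item (@$list) {
        return 1 if $item eq $id;
    }
    return 0;
}

# Hash approach: build the lookup table once, then each check is one lookup.
my %acc_hash = map { $_ => 1 } @acc_list;

for my $id ('Q67890', 'X99999') {
    my $by_array = in_array($id, \@acc_list) ? 'yes' : 'no';
    my $by_hash  = exists $acc_hash{$id}     ? 'yes' : 'no';
    print "$id: array=$by_array hash=$by_hash\n";
}
```

Both approaches give the same answers, but the array check costs one pass over the list per sequence, while the hash check costs about the same no matter how many accessions you have.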
If you want to make this really zippy you should look into the database
functionality in bioperl (for example the indexed FASTA access in
Bio::Index::Fasta or Bio::DB::Fasta), but I suspect that this will suffice.
Rob
========
use strict;
use Bio::SeqIO;
my $file = 'uniprot';
my $format = 'Fasta';
#read in accession no input file
open (ACC, "acc.txt") or die "an error occurred while reading the acc file: $!";
# loop through the accession file and store each accession as a hash key
my %acc; # declare the hash that is used below
while (<ACC>)
{
chomp;
s/\s+//g; # strip the whitespace here and then you only need to do it once
next unless length; # skip blank lines
$acc{$_}=1; # now this is a hash and not an array
}
close ACC;
my $inseq = Bio::SeqIO->new('-file' => "<$file", '-format' => $format);
my $outseq = Bio::SeqIO->new(-file=>">uniprot_fasta.txt",
-format=>'fasta'); # use Bio::SeqIO for output too
# get each sequence and write it out only if its ID is in the hash
while (my $seq = $inseq->next_seq) {
$outseq->write_seq($seq) if $acc{$seq->id}; # print the sequence out if we want that one
}