[Bioperl-l] improve speed in extracting Fasta sequence

Rob Edwards rob at salmonella.org
Tue Dec 28 00:02:18 EST 2004


This is very slow because you store the accession numbers in an array 
and then cycle through the whole array every time you get a sequence. 
You should try using a hash to store the lookup information.

In your code, you trim off the whitespace on every iteration. Why not 
just do it the first time that you get the accession number?

If you use a hash, you don't need to loop through the array each time 
you get a new sequence; you just check whether the ID is a key in the 
hash. You can also simplify your code a lot by using Bio::SeqIO for the 
output as well.
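
A minimal sketch of the two points above (clean each accession number 
once while loading it, then do one hash lookup per sequence), in plain 
Perl with no BioPerl needed. The accession numbers and sequence IDs are 
made up for illustration; in the real script they would come from 
acc.txt and from $seq->id.

```perl
use strict;
use warnings;

# Hypothetical raw lines, standing in for what is read from acc.txt
my @raw_lines = ("P12345 \n", " Q67890\n", "\n", "O11111\n");

# Clean each accession once, while loading it, and store it as a hash key
my %wanted;
for my $line (@raw_lines) {
    (my $acc = $line) =~ s/\s+//g;   # strip all whitespace, including the newline
    next unless length $acc;         # skip blank lines
    $wanted{$acc} = 1;
}

# Hypothetical sequence IDs, standing in for $seq->id from Bio::SeqIO
my @seq_ids = qw(A00001 P12345 B00002 O11111);

# One hash lookup per sequence, instead of a loop over every accession
my @hits = grep { $wanted{$_} } @seq_ids;

print "$_\n" for @hits;
```

With an array, every sequence costs a pass over the whole accession 
list; with the hash, the cost per sequence stays roughly constant no 
matter how many accession numbers you load.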

If you want to make this really zippy you should look into the database 
functionality in bioperl, but I suspect that this will suffice.
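
For the database route, one option is Bio::DB::Fasta, which indexes the 
FASTA file so that records can be fetched by ID without reading the 
whole file each time. This is only a rough sketch, and it assumes the 
accession numbers match the primary IDs in the file exactly. The sub 
name extract_indexed is made up for illustration.

```perl
use strict;
use warnings;
use Bio::DB::Fasta;

# Hypothetical helper: write the wanted records from $fasta to filehandle $out.
# $wanted is a reference to a hash whose keys are the accession numbers.
# Returns the number of records written.
sub extract_indexed {
    my ($fasta, $wanted, $out) = @_;

    # Builds an index file next to $fasta on first use, then reuses it
    my $db = Bio::DB::Fasta->new($fasta);

    my $count = 0;
    for my $id (sort keys %$wanted) {
        my $entry = $db->get_Seq_by_id($id) or next;   # undef if the ID is not indexed
        print {$out} '>', $db->header($id), "\n", $entry->seq, "\n";
        $count++;
    }
    return $count;
}
```

It could be called as extract_indexed('uniprot', \%acc, \*STDOUT). Note 
that the sequence is written on a single line here, not wrapped the way 
Bio::SeqIO would write it.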

Rob


========
  use strict;
  use Bio::SeqIO;
  my $file = 'uniprot';
  my $format = 'Fasta';
  #read in accession no input file
  open (ACC, "acc.txt") or die "an error occurred with reading acc file: $!";
  # loop through the accession file and store each accession number in the hash
  my %acc; # declare the hash that is used below
  while (<ACC>)
  {
   chomp;
   s/\s+//g; # strip the spaces here and then you only need to do it once
   $acc{$_}=1; # now this is a hash and not an array
  }

  my $inseq = Bio::SeqIO->new('-file' => "<$file", '-format' => $format);
  # use Bio::SeqIO for output too
  my $outseq = Bio::SeqIO->new(-file=>">uniprot_fasta.txt", -format=>'fasta');
  # get sequence
  while (my $seq = $inseq->next_seq) {
   # print the sequence out if we want that one
   $outseq->write_seq($seq) if $acc{$seq->id};
  }


