[Bioperl-l] repost the problem --- Re: bl2seq hang and its performace

Mon Dec 15 20:25:37 EST 2003

Bioperl sequence objects aren't particularly robust for huge sequences and
can make you run out of memory - presumably if you run your query on the
cmd line with the files and take bioperl out of the loop, it runs fine?

You may need to rethink your strategy for searching and pre-create your
sequence files to be more IO and memory efficient.

Personally I find StandAloneBlast not the best module for lots of
searches and prepare my pipeline to be leaner when I need to.

You can do this yourself in a simple script.

# pseudocode
-- create your sequence files - see Bio::Seq::LargeSeq or Bio::DB::Fasta
   for more memory efficient ways to manipulate large sequence files.
-- generate unique names for your subsequences and use SeqIO to write the
   files, presuming that will work for you.
-- do the bl2seq
  use Bio::SearchIO;
  open(my $bl2seqfh, "bl2seq -i $file1 -j $file2 -p blastn ... |")
    || die("cannot run bl2seq: $!");
  # Bioperl 1.3.x only code
  my $searchio = Bio::SearchIO->new(-format => 'blast',
                                    -fh     => $bl2seqfh);

  my $r = $searchio->next_result;
  # or use Bio::Tools::BPbl2seq if you have an earlier
  # version of the toolkit.
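
A minimal sketch of the file-creation and pipe steps above, using only core
Perl. The helper names (write_fasta_tempfile, bl2seq_args, open_bl2seq) and
the 60-column line wrapping are illustrative assumptions, not part of
Bioperl; only the -i, -j and -p flags come from the command shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile tempdir);

# Write one sequence to its own uniquely named FASTA file and return the path.
# File::Temp chooses the unique name; the caller unlinks the file when done.
# Assumes $sequence is a plain residue string without embedded newlines.
sub write_fasta_tempfile {
    my ($id, $sequence, $dir) = @_;
    my ($fh, $path) = tempfile('bl2seq_XXXXXX', DIR => $dir, SUFFIX => '.fa');
    print {$fh} ">$id\n";
    # Wrap the residues at 60 columns (a common FASTA convention).
    print {$fh} "$1\n" while $sequence =~ /(.{1,60})/g;
    close($fh) or die "Cannot close $path: $!";
    return $path;
}

# Build the bl2seq argument list for two query files and a program name.
sub bl2seq_args {
    my ($file1, $file2, $program) = @_;
    return ('bl2seq', '-i', $file1, '-j', $file2, '-p', $program);
}

# Start bl2seq and return a read handle on its report.
# The list form of open keeps the file names away from the shell.
sub open_bl2seq {
    my @cmd = bl2seq_args(@_);
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    return $fh;
}
```

The handle returned by open_bl2seq can be passed to Bio::SearchIO through
-fh exactly as in the pseudocode above. Closing the handle after parsing
reaps the bl2seq process, and unlinking the two FASTA files keeps the
temporary directory from filling up over thousands of comparisons.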

This is essentially what StandAloneBlast does for you, but with the added
overhead of assuming you are passing Bio::SeqI objects, creating the
temporary files for you, and cleaning them up afterwards.  One drawback/bug
is that I think it will still open the files and try to create Bio::SeqI
objects even when you are passing filenames - which may be the source of
your problem, I'm not sure.  This may also have been fixed; I've not dug
into the code lately.

-jason
On Mon, 15 Dec 2003, Liu Haifeng wrote:

> Can anyone help?  Really urgent!
>
> Haifeng Liu
> ----- Original Message -----
> From: "Liu Haifeng" <lhaifeng at dso.org.sg>
> To: <bioperl-l at portal.open-bio.org>
> Sent: 12 December 2003 14:49
> Subject: bl2seq hang and its performace
>
>
> > Hi all,
> >
> > I noticed that one of my programs, written using bioperl-1.2.3, runs very
> > slowly and consumes a huge amount of memory, and I suspected that this is
> > due to the call to bl2seq in the program.  So I wrote a small program
> > (below) that runs bl2seq on each sequence in a fasta file against itself,
> > to see if this is true:
> >
> >
> > #!/usr/bin/perl -w
> >        use Bio::SeqIO;
> >       use Bio::Tools::Blast;
> >        use Bio::Tools::Run::StandAloneBlast;
> >        use Bio::Tools::BPlite;
> >
> >        my $infile =shift;
> >        my $sno=0;
> >        my $blastalgo="blastp"; #blastp ,blastx, tblastn, tblastx
> >        my $pin = Bio::SeqIO->new('-file' => "$infile", '-format' =>
> > 'Fasta');
> >       while ( my $proseq = $pin -> next_seq()) {
> >           $sno++;
> >           print "bl2seq $sno ..............................\n";
> >           my @params=('program' => $blastalo);
> >           my $factory= Bio::Tools::Run::StandAloneBlast->new(@params);
> >           $factory->io->_io_cleanup();
> >           my $report=$factory->bl2seq($proseq, $proseq);
> >           while (my $hsp=$report->next_feature) {
> >               #only need the first hsp
> >               $report->close();
> >            }
> >           undef $report;
> >      }
> >       print "running is over\n";
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > The program runs OK for a small fasta file.  However, when I input a
> > fasta file of around 2.6M containing 10,000 protein sequences, the
> > program hangs when it compares the 1782nd sequence.  I also noticed that
> > the program had consumed 12M of memory by that time.  I searched the
> > archive and found that similar bl2seq problems have occurred before.
> > However, they should have been solved in the latest version.
> >
> > Can anyone give me some clues for improving the performance of calling
> > bl2seq?
> > Thank you.
> >
> > Regards
> > Haifeng Liu

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu