[BioRuby] Disk cash on the parse genes

Toshiaki Katayama ktym at hgc.jp
Fri May 20 06:31:13 UTC 2011


Dear Endoh-san,

Thank you for pointing this problem out.

I tried parsing the gbbct12.seq file with example code based on
our tutorial at http://bioruby.open-bio.org/wiki/Tutorial and found
that the actual problem is the repeated calling of the gb.naseq method.

The method is defined as shown below. It doesn't cache the generated
Bio::Sequence::NA object, so every call builds a new one. This takes a
long time if the method is called many times, especially for a long
sequence.

bio/db/genbank/genbank.rb:
  def seq
    unless @data['SEQUENCE']
      origin
    end
    Bio::Sequence::NA.new(@data['SEQUENCE'])
  end
  alias naseq seq
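
To make the cost visible, here is a minimal sketch that does not use
BioRuby at all. FakeEntry is a hypothetical stand-in for a GenBank entry,
and String.new stands in for Bio::Sequence::NA.new; only the shape of the
method (a new object on every call) is taken from the definition above.

```ruby
# Hypothetical stand-in for a GenBank entry; this is not BioRuby code.
class FakeEntry
  attr_reader :builds

  def initialize(raw)
    @data = { 'SEQUENCE' => raw }
    @builds = 0 # counts how many sequence objects were constructed
  end

  # Same shape as the method above: nothing is cached, so each call
  # constructs a fresh object from the raw sequence data.
  def naseq
    @builds += 1
    String.new(@data['SEQUENCE']) # stands in for Bio::Sequence::NA.new
  end
end

entry = FakeEntry.new('atgc' * 1000)
a = entry.naseq
b = entry.naseq
puts a.equal?(b)  # false: two separate objects were built
puts entry.builds # 2
```

With a genome-sized sequence and one call per feature, this construction
cost is paid again for every feature.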

When I stored the object outside the loop that processes the features,
parsing became much faster.

% ruby gbparse.rb gbbct12.seq > gbbct12.out 2> gbbct12.err
Parsed 16125 entries in 1645.838824 sec.

% ruby gbparse_new.rb gbbct12.seq > gbbct12.out_new 2> gbbct12.err_new
Parsed 16125 entries in 39.012607 sec.
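
The change can be sketched roughly as follows. This is not the actual
gbparse.rb or gbparse_new.rb; StubEntry, Feature, slices_uncached and
slices_cached are hypothetical names, and plain strings stand in for
Bio::Sequence::NA objects. The only point is where naseq is called
relative to the feature loop.

```ruby
# Hypothetical stand-ins; a real script would use Bio::GenBank entries.
Feature = Struct.new(:from, :to) # 1-based, inclusive coordinates

class StubEntry
  attr_reader :builds, :features

  def initialize(raw, features)
    @raw = raw
    @features = features
    @builds = 0 # counts how many sequence objects were constructed
  end

  # Uncached, like the method quoted earlier: a new object per call.
  def naseq
    @builds += 1
    String.new(@raw)
  end
end

# Slow pattern: naseq is called inside the loop, once per feature.
def slices_uncached(entry)
  entry.features.map { |f| entry.naseq[(f.from - 1)..(f.to - 1)] }
end

# Faster pattern: call naseq once per entry and reuse the object.
def slices_cached(entry)
  seq = entry.naseq
  entry.features.map { |f| seq[(f.from - 1)..(f.to - 1)] }
end

features = [Feature.new(1, 3), Feature.new(5, 8)]

slow = StubEntry.new('atgcatgcatgc', features)
p slices_uncached(slow) # ["atg", "atgc"]
puts slow.builds        # 2 (one build per feature)

fast = StubEntry.new('atgcatgcatgc', features)
p slices_cached(fast)   # ["atg", "atgc"]
puts fast.builds        # 1 (one build per entry)
```

Both versions return the same subsequences; only the number of times
the sequence object is constructed differs.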

Based on this observation, could you check the algorithm of your code?

Regards,
Toshiaki Katayama

On 2011/05/19, at 10:04, Daiji Endoh wrote:

> Dear all,
> 
> I often download whole GenBank data files from bio at mirror (such as
> gbbct12.seq) and parse them.
> Recently, however, parsing the whole data has become difficult. At
> some steps, the program needs a long time to select the nucleic acid
> sequences of genes or transcripts. The slow part seems to be the
> selection of spliced or partial sequences from a long (genome)
> nucleic acid sequence using feature data.
> 
> Does anyone have strategies or methods for avoiding these heavy steps?
> 
> Daiji Endoh
> Rakuno Gakuen University
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


