[BioRuby] Fwd: Disk cash on the parse genes

Sat May 21 05:33:58 UTC 2011

Dear Katayama-san

I am very grateful for your suggestion. I had been struggling with
this problem for 6 months.
Using your code, I was able to overcome it.

However, the code stopped at one point.
If feature.position refers to another entry, as in
"join(M52614.1:1..1456,5216..5823)", the code raised an error.
So I added the line below to skip such features.

next if position =~ /[A-Z]+\d+\W*\d*\:/
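
To illustrate what this line does, here is a minimal sketch that applies
the same pattern to a few location strings. Only the first string comes
from this message; the other two, and the names REMOTE_REF and
local_only, are made up for the example.

```ruby
# Pattern from the line above: capital letters, digits, an optional
# ".version" part, then a colon (i.e. a reference to another entry).
REMOTE_REF = /[A-Z]+\d+\W*\d*\:/

locations = [
  "join(M52614.1:1..1456,5216..5823)", # from this message: refers to M52614.1
  "join(1..100,200..300)",             # illustrative: local positions only
  "complement(5216..5823)",            # illustrative: local positions only
]

# Keep only the locations that do not point at another entry.
local_only = locations.reject { |position| position =~ REMOTE_REF }
puts local_only
```

Note that the pattern expects the digits to follow the capital letters
directly, so an accession containing an underscore (for example
"NC_000913.2:") would not be matched by it as written.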

With this line inserted, the code now works.
I have attached the modified code.
Thanks again,

Daiji Endoh
************************************************************************
Dear Endoh-san,

Thank you for pointing this problem out.

I tried parsing the gbbct12.seq file with example code based on
our tutorial at http://bioruby.open-bio.org/wiki/Tutorial and found
that the actual problem is that the gb.naseq method is called many times.

The method is defined as shown below. It does not cache the
generated Bio::Sequence::NA object, so every call builds a new one.
This takes a long time when the method is called repeatedly,
especially for a long sequence.

bio/db/genbank/genbank.rb:
 def seq
   unless @data['SEQUENCE']
     origin
   end
   Bio::Sequence::NA.new(@data['SEQUENCE'])
 end
 alias naseq seq

When I stored the object once, outside of the loop that manipulates
the features, parsing became much faster.
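
The idea can be sketched without BioRuby as follows. FakeEntry is a
hypothetical stand-in that only mimics the non-caching naseq shown
above (it is not the real Bio::GenBank class), and the feature
coordinates are invented for the example.

```ruby
# Hypothetical stand-in for a GenBank entry. Like the seq/naseq method
# above, it builds a fresh sequence object on every call.
class FakeEntry
  attr_reader :builds

  def initialize(raw)
    @raw = raw
    @builds = 0
  end

  def naseq
    @builds += 1 # count how many sequence objects were created
    @raw.dup
  end
end

features = [[0, 4], [4, 4], [8, 4]] # [start, length] pairs, illustrative only

# Slow pattern: naseq is called once per feature, inside the loop.
slow = FakeEntry.new("atgcatgcatgc")
slow_parts = features.map { |start, len| slow.naseq[start, len] }

# Faster pattern: call naseq once and reuse the object inside the loop.
fast = FakeEntry.new("atgcatgcatgc")
seq = fast.naseq
fast_parts = features.map { |start, len| seq[start, len] }

puts "slow builds: #{slow.builds}, fast builds: #{fast.builds}"
```

Both loops produce the same subsequences, but the second one creates
the sequence object only once. With the real classes, I assume the
equivalent change is to assign seq = gb.naseq before iterating over
the features and to call seq.splice(position) inside the loop.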

% ruby gbparse.rb gbbct12.seq > gbbct12.out 2> gbbct12.err
Parsed 16125 entries in 1645.838824 sec.

% ruby gbparse_new.rb gbbct12.seq > gbbct12.out_new 2> gbbct12.err_new
Parsed 16125 entries in 39.012607 sec.

Based on this observation, could you check the algorithm of your code?

Regards,
Toshiaki Katayama
****************************************************************************************

On 2011/05/19, at 10:04, 遠藤大二 wrote:

> Dear All
>
> I often download whole GenBank data files from bio at mirror (such as
> gbbct12.seq) and parse them.
> Recently, however, parsing the whole data set has become difficult. At
> some step, the program needs a long time to select the nucleic acid
> sequences of genes or transcripts. The slow part seems to be the
> selection of spliced or partial sequences from a long (genome) nucleic
> acid sequence using the feature data.
>
> Does anyone have strategies or methods for avoiding these heavy steps?
>
> Daiji Endoh
> Rakuno Gakuen University
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

-- 
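Laboratory of Radiology, School of Veterinary Medicine,
Rakuno Gakuen University
Daiji Endoh
582 Bunkyodai-Midorimachi, Ebetsu, Hokkaido 069-8501, Japan
Tel: 011-388-4847
Fax: 011-387-5890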