[BioRuby] Fwd: Disk cash on the parse genes

Mon May 23 03:28:08 UTC 2011

Dear Endoh-san,

> Using your code, I can overcome the problem.

Good!

> "join(M52614.1:1..1456,5216..5823), the code returned a error.

External reference should be detected by the Bio::Location class
and the ID will be stored in an instance variable @xref_id,
however, how to deal with it is up to users, so you need to implement
some code to fetch external entry (in this case @xref_id="M52614.1") from
available services (local DB or web service etc.) and extract
the sub-sequence from the entry.

Please take a look at pattern (G) in the documentation.
  http://bioruby.open-bio.org/rdoc/classes/Bio/Locations.html

Unfortunately, I've got an unexplained "Exception" error from NCBI
when retrieving http://www.ncbi.nlm.nih.gov/nuccore?term=M52614.1
so, I'll use "join(U75473.1:1..293,1..216)" found in a GenBank entry
SMMFD02 (gbbct65.seq) for example.

# obtain a genbank record
bioruby> entry = getobj("genbank:SMMFD02")
 or
bioruby> entry = Bio::GenBank.new(open("http://togows.dbcls.jp/entry/ncbi-genbank/SMMFD02").read)

# cache whole sequence as we learnt in this thread :-)
bioruby> naseq = entry.naseq

# pick up "gene" features only
bioruby> genes = entry.features.select {|x| x.feature == "gene" }
  ==> [#<Bio::Feature:0x102340870
  @feature="gene",
  @position="join(U75473.1:1..293,1..216)",
  @qualifiers=
   [#<Bio::Feature::Qualifier:0x102340280 @qualifier="gene", @value="mfd">]>]

# example to handle external references in a given position
bioruby> genes.each do |gene|
  locations = Bio::Locations.new(gene.position)
  locations.each do |location|
    if xref = location.xref_id
      xref_entry = open("http://togows.dbcls.jp/entry/ncbi-genbank/#{xref}").read
      location.sequence = Bio::GenBank.new(xref_entry).naseq.subseq(location.from, location.to)
    end
  end
  gene.position = locations.to_s    # (*1)
  puts naseq.splice(gene.position)  # (*2)
end

(*1) will generate the following string

join(replace(U75473.1:1..293,"gtcttcttgttggtgatgttggttttggaaaaacggaagtagcgatgagagctgcttttaaagcagttaatgatgataaacaagttgctgttttggtgccaacaacagttcttgctcaacagcactataatacttttaaggagcgctttgaaaattttcctgtcaatgttgccatgatgagtcgttttaaaaccaagactgaacagtctgaaacgttaactaaattagctaagggacaggttgatatcattattggaacacatcgtctactttctaaagatgttacgtttaaa"),1..216)

(*2) will return 293 + 216 = 509 bp sequence

gtcttcttgttggtgatgttggttttggaaaaacggaagtagcgatgagagctgcttttaaagcagttaatgatgataaacaagttgctgttttggtgccaacaacagttcttgctcaacagcactataatacttttaaggagcgctttgaaaattttcctgtcaatgttgccatgatgagtcgttttaaaaccaagactgaacagtctgaaacgttaactaaattagctaagggacaggttgatatcattattggaacacatcgtctactttctaaagatgttacgtttaaaggggttaaacacaaggaaacattgaaagaattaaaaactaaggttgatgtcttgaccttgacagcaactcctattccacggacattacatatgtctatgcttggtatacgagatttatcagttattgaaacacctccaagtaatcgttaccctgtccagacttatgttatggaaacaaatgcaagtgtcattcgtgaagctattatgcgtgaaatt

During this trial, I found a bug in the Bio::Sequence#splice method.

bio/sequence/common.rb:
  def splice(position)
    unless position.is_a?(Locations) then
      position = Locations.new(position)
    end
    s = ''
    position.each do |location|
      if location.sequence
        s << location.sequence
      else  # <----- (*3)
        exon = self.subseq(location.from, location.to)
        begin
          exon.complement! if location.strand < 0
        rescue NameError
        end
        s << exon
      end
    end
    return self.class.new(s)
  end
  alias splicing splice

We need to fix this else block (*3) to mind if @xref_id exists or not.
Currently, "join(U75473.1:1..293,1..216)" will be treated as
"join(1..293,1..216)" and, obviously, it is not feasible.

Toshiaki

On 2011/05/21, at 14:33, 遠藤大二 wrote:

> Dear Katayama-san
> 
> I am very very grateful to your suggestion. I have been struggled on
> this problem for 6 months.
> Using your code, I can overcome the problem.
> 
> But, only one point the code stopped.
> If the feature.position refer to the other entry such as
> "join(M52614.1:1..1456,5216..5823), the code returned a error.
> So I added a line below.
> 
> next if position =~ /[A-Z]+\d+\W*\d*\:/
> 
> The inserting code now working.
> I attached the modified code.
> Thanks again,
> 
> Daiji Endoh
> ************************************************************************
> Dear Endoh-san,
> 
> Thank you for pointing this problem out.
> 
> I tried to parse gbbct12.seq file with the example code based on
> our tutorial at http://bioruby.open-bio.org/wiki/Tutorial and found
> that the actual problem is in the multiple calling of the gb.naseq method.
> 
> The method is defined as shown in below and which doesn't cache
> the generated Bio::Sequence::NA object, therefore, it will take
> long time if called multiple times, especially for a long sequence.
> 
> bio/db/genbank/genbank.rb:
> def seq
>   unless @data['SEQUENCE']
>     origin
>   end
>   Bio::Sequence::NA.new(@data['SEQUENCE'])
> end
> alias naseq seq
> 
> If I store the object outside of the loop of feature manipulation,
> it became much faster.
> 
> % ruby gbparse.rb gbbct12.seq > gbbct12.out 2> gbbct12.err
> Parsed 16125 entries in 1645.838824 sec.
> 
> % ruby gbparse_new.rb gbbct12.seq > gbbct12.out_new 2> gbbct12.err_new
> Parsed 16125 entries in 39.012607 sec.
> 
> Based on this observation, could you check the algorithm of your code?
> 
> Regards,
> Toshiaki Katayama
> ****************************************************************************************
> 
> 
> On 2011/05/19, at 10:04, 遠藤大二 wrote:
> 
>> Dear All
>> 
>> I often download whole genbank data from bio at mirror ( such as
>> gbbct12.seq ) and parse them.
>> But recently, parsing the whole data became to be difficult. On some
>> some step, the program need a long time to select nucleic acid
>> sequences of genes or transcripts. It seems that selection of spliced
>> or partial sequences from a long (genome) nucleic acid sequence using
>> feature data.
>> 
>> Anyone have strategies or methods avoiding these heavy steps ?
>> 
>> Daiji Endoh
>> Rakuno Gakuen University
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
> 
> 
> 
> 
> 
> -- 
> 酪農学園大学　獣医学部　放射線学教室
> 遠藤大二
> 〒069-8501　北海道江別市文京台緑町582
> Tel: 011-388-4847
> Fax:011-387-5890
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby