[BioRuby] concatenate two EMBL sequence files

Wed Jan 18 03:30:02 EST 2006

Just for future reference (including myself): the code to shift features in a GenBank sequence. It might be improved on (a lot?), but it works for me...

jan.

#!/usr/bin/ruby
require 'bio'

module Bio
  class Location
    def shift!(slide_length)
      @from += slide_length.to_i
      @to += slide_length.to_i
    end

    def to_s
      output = ''
      if @lt
        from_str = '<' + @from.to_s
      else
        from_str = @from.to_s
      end
      if @gt
        to_str = '>' + @to.to_s
      else
        to_str = @to.to_s
      end

      output = from_str + '..' + to_str

      if @strand == -1
        output = 'complement(' + output + ')'
      end

      return output
    end
  end

  class Locations
    def shift!(slide_length)
      @locations.each do |x|
        x.shift!(slide_length)
      end
    end

    def to_s
      output = ''
      if @locations.length == 1
        output = @locations[0].to_s
      else
        parts = Array.new()
        @locations.each do |loc|
          parts.push(loc.to_s)
        end
        output = 'join(' + parts.join(',') + ')'
      end
      return output
    end
  end

  class Feature
    def shift!(slide_length)
      locations = Locations.new(@position)
      locations.shift!(slide_length)
      @position = locations.to_s
    end
  end
end

ff = Bio::FlatFile.open(Bio::GenBank, "M33388.gb")
ff.each_entry do |gb|
  gb.features.each do |feature|
    puts "BEFORE: " + feature.position
    feature.shift!(5)
    puts "AFTER:  " + feature.position
  end
end
ff.close

-----Original Message-----
From: Toshiaki Katayama [mailto:ktym at hgc.jp]
Sent: Tue 1/17/2006 2:16 PM
To: jan aerts (RI)
Subject: Re: [BioRuby] concatenate two EMBL sequence files

Hello,

On 2006/01/17, at 21:57, jan aerts (RI) wrote:

> I've tried BioPerl, but it gets pretty difficult when you want to shift
> features that have positions like
> "join(<5..10,complement(55..70),13..14)"...

I have never tried but I believed that BioPerl is good for these situation
as it seems to have various sequence and location models...

Note that, the location including '<' notation is not well supported in BioRuby.
Please keep in mind that '<5..10' will be treated as '5..10'.
In your case, I suppose you understand what you are doing. :)

> It was _really_ straightforward to add such a Bio::Locations#shift
> method to my bioruby installation as you suggested, but now I'm trying
> to figure out where to add the code for exporting to EMBL format. There
> is a to_fasta method in bio/sequence.rb, but a newly added to_embl
> method in that file can not be found by my script. Can anyone tell me
> where I should add such a method?
>
> My code to read a genbank file, shift all features by 5 and print in
> EMBL format:
> $  ff = Bio::FlatFile.new(Bio::GenBank, ARGF)
> $  ff.each_entry do |gb|
> $    gb.features.each do |feature|
> $      locs = Bio::Locations.new(feature.position)
> $      locs.shift(5)                ####Uses new method to shift all locs by 5

I realized, after I hit send button, that the 'shift' method should be
added to Bio::Features class to shift all the features of the sequence.

  gb2 = gb.features.shift(5)
  gb2.features.each ...

     or

  gb.features.each do |feature|
    feature.shift!(5)
  end
  puts gb.to_embl

However, ideally, having a rich sequence model like BioPerl which converts
sequence with annotations from any sequence entry (GenBank, EMBL etc.),
would give more sophisticated solution.

If you have two genbank entries in variables gb1 and gb2, and if there were
a 'to_richseq' method which converts Bio::GenBank data model to Bio::RichSeq
or something,

  seq = gb1.to_richseq + gb2.to_richseq.shift(5)

will give you a new sequence which has a merged sequence with shifted annotations.

> $    end
> $    puts gb.to_embl                ####Doesn't work yet!! Where do I
> put the to_embl method?
> $  end
>
> Two problems still exist:
> (1) in what file from the bioruby libs should I put the to_embl code?

In Ruby, classes are open, so you can put

module Bio
  class GenBank
    def to_embl
      ...
    end
  end
end

before you use the 'to_embl' method against your Bio::GenBank object
(say, on top of your script).

Again, I think it would be better to have this method in the new
not-yet-implemented Bio::RichSeq class.
The Bio::Sequence#to_fasta method should also be moved to this class.
I don't like the name 'RichSeq', though.

> (2) I suppose that, in the code above, the features themselves don't get
> shifted (the locs object has been detached from the gb object). How can
> I do that?

Yes, as I mentioned above, my previous suggestion was not good.
The new shift/shift! methods should be in Bio::Features or Bio::Feature
(or in Bio::RichSeq would be better).

> Many thanks,
> Jan Aerts

Good luck!

Thanks,
Toshiaki Katayama

>
>> -----Original Message-----
>> From: bioruby-bounces at portal.open-bio.org 
>> [mailto:bioruby-bounces at portal.open-bio.org] On Behalf Of 
>> Toshiaki Katayama
>> Sent: 16 January 2006 15:29
>> To: BioRuby ML Discussion List Project
>> Subject: Re: [BioRuby] concatenate two EMBL sequence files
>>
>> Hello,
>>
>> For your purpose, using BioPerl would be straight forward.
>>
>> BioRuby doesn't have feature re-location mechanism and no one 
>> is working on writing out in EMBL format for now.
>>
>> I might consider to do the similar job by converting both 
>> sequence to GFF.
>> By the way, having Bio::Locations#shift(slide_length) method 
>> would be helpful.
>>
>> Regards,
>> Toshiaki Katayama
>>
>> On 2006/01/16, at 22:46, jan aerts (RI) wrote:
>>
>>> Hello all,
>>>
>>> I'm trying to concatenate two EMBL files (_with_ features) 
>> into one larger sequence. Of course, positions of features of 
>> the second sequence have to recalculated in the new sequence.
>>> e.g.:
>>>   sequence 1 = 100 bp
>>>     feature A: 5..10
>>>   sequence 2 = 200 bp
>>>     feature B: 15..20
>>>   => concatenated sequence 3 = 300 bp
>>>        feature A: 5..10
>>>        feature B: 115..120
>>>
>>> I'd like to write out the new sequence as an EMBL file.
>>>
>>> Can anyone point me in the right direction? Many thanks.
>>>
>>> A possible perl-convert,
>>> Jan Aerts, PhD
>>> Bioinformatics Group
>>> Roslin Institute
>>> Roslin
>>> Scotland, UK
>>>
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby at open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioruby
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioruby
>>