[BioRuby] Patch for Bug 18019.

Anurag Priyam anurag08priyam at gmail.com
Fri Apr 16 22:30:18 UTC 2010


This also works:

Check that the string read from the GenBank file is not whitespace-only
(patch below).

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| c += 1 }; p c' sequences.gb
3

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| c += 1 }; p c' gbvrt21.seq
1991

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); ff.each { |e| c += 1 }; p c' sample.gb # this was my sample file
1

From 94e70a98a0643caf13acc0417b677073b8f7968d Mon Sep 17 00:00:00 2001
From: Anurag Priyam <anurag08priyam at gmail.com>
Date: Sat, 17 Apr 2010 03:50:48 +0530
Subject: [PATCH] fixed bug 18019; redundant nil entry

---
 lib/bio/io/flatfile/splitter.rb |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/bio/io/flatfile/splitter.rb b/lib/bio/io/flatfile/splitter.rb
index a07b016..f068bc2 100644
--- a/lib/bio/io/flatfile/splitter.rb
+++ b/lib/bio/io/flatfile/splitter.rb
@@ -191,7 +191,7 @@ module Bio
           self.entry_start_pos = p0
           self.entry = e
           self.entry_ended_pos = p1
-          return entry
+          return entry unless entry =~ /^\s$/
         end
       end #class Defalult

-- 
1.7.0
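
A note on the pattern used in the patch above: in Ruby, ^ and $ anchor at line boundaries, not at the ends of the string, and /^\s$/ asks for exactly one whitespace character on a line. A whole-string test such as /\A\s*\z/ may be closer to the intent. The sketch below compares the two on a few hand-made strings; whitespace_only? is a hypothetical helper name and is not part of BioRuby.

```ruby
# Hypothetical helper (not BioRuby API): true when +text+ is empty or
# consists solely of whitespace characters.
def whitespace_only?(text)
  !!(text =~ /\A\s*\z/)
end

# Hand-made strings for illustration; none come from a real GenBank file.
samples = {
  "lone newline"           => "\n",
  "two spaces"             => "  ",
  "entry with a ' ' line"  => "LOCUS X\n \n//\n",
}

samples.each do |label, text|
  line_anchored = !!(text =~ /^\s$/)   # pattern used in the patch
  puts "#{label}: patch pattern=#{line_anchored}, whole-string=#{whitespace_only?(text)}"
end
```

On these inputs the two patterns agree on the lone newline and disagree on the other two strings: the patch pattern misses the two spaces, and it matches the entry that merely contains a line holding a single space.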


On Fri, Apr 16, 2010 at 8:04 AM, Tomoaki NISHIYAMA <
tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi Goto-san,
>
> How would you feel about changing the DELIMITER to "\nLOCUS" with
> DELIMITER_OVERRUN = 5, like the BLAST parsers?
>
> This is not as dirty as checking for empty lines, and it works for both
> the GenBank release files and the files obtained through Entrez.
>
> $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
> Bio::GenBank::DELIMITER_OVERRUN = 5; \
> c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' gbvrt21.seq
> 1991
> $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
> Bio::GenBank::DELIMITER_OVERRUN = 5; \
> c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' sequences.gb
> 3
> -- Tomoaki NISHIYAMA
>
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/04/15, at 15:32, Naohisa GOTO wrote:
>
>  Hi Anurag,
>>
>> Parsing of GenBank files is primarily tested with official
>> GenBank releases. (But currently no unit tests. I hope they
>> would be added during the GSoC project "Ruby 1.9.2 support of
>> BioRuby".)
>>
>> The test is something like:
>>
>> # preparation of test data
>>
>>  % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
>>  % gzip -dc gbvrt21.seq.gz > gbvrt21.seq
>>
>> # Counts the number of entries
>>
>>  % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>>                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
>>   #==> 1991
>>
>> # Checks if the number of entries is correct.
>>
>>  % grep -c '^LOCUS' gbvrt21.seq
>>
>>   #==> 1991
>>
>> # Executes with the monkey patch.
>> # Be careful: this takes a very long time and a lot of memory!
>>
>>  % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
>>                  c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>>                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
>>  #==> 1
>>
>> It is apparent that the patch is wrong.
>>
>> Splitting entries with such a delimiter is simple and performs well,
>> but it only works with correct data, which must always end with the
>> delimiter. Characters after the last delimiter in the file are regarded
>> as a single entry because we don't want to lose data.
>>
>> The behavior can be changed: for example, when only whitespace is read
>> before the end of the file, with no delimiter, it could be ignored and
>> treated as EOF with no entries.
>>
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>
>> On Thu, 15 Apr 2010 10:34:53 +0900
>> Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:
>>
>>  On Wed, 14 Apr 2010 21:44:49 +0100
>>> Jan Aerts <jan.aerts at gmail.com> wrote:
>>>
>>>  Thanks for that, Anurag. Contributions to bioruby very much appreciated
>>>> :-)
>>>>
>>>> @Goto-san: can you merge that fix?
>>>>
>>>
>>> No, because the patch fails to read entries in the middle of the file.
>>> To parse files distributed from NCBI, the delimiter should be "\n//\n",
>>> and cannot be "\n//\n\n".
>>>
>>> Naohisa Goto
>>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>>
>>>
>>>> Cheers,
>>>> jan.
>>>>
>>>> On 14 April 2010 21:41, Anurag Priyam <anurag08priyam at gmail.com> wrote:
>>>>
>>>>  Hello all,
>>>>>
>>>>> This is my start at being a part of the BioRuby developer community.
>>>>>
>>>>> The RubyForge bug tracking page shows bug 18019 (GenBank each_entry,
>>>>> last entry is always nil) [1] to be open. I am attaching a patch for
>>>>> it. It's very tiny. The fix was already suggested in a comment by
>>>>> Raoul Jean Pierre Bonnal (the submitter of the bug). I have verified
>>>>> the solution and created a patch for it. Or should I send a pull
>>>>> request on github?
>>>>>
>>>>> Patch( git format-patch ):
>>>>>
>>>>> From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
>>>>> From: Anurag Priyam <anurag08priyam at gmail.com>
>>>>> Date: Wed, 14 Apr 2010 22:58:45 +0530
>>>>> Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil
>>>>>
>>>>> ---
>>>>>  lib/bio/db/genbank/common.rb |    2 +-
>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
>>>>> index 545eac1..eaa760c 100644
>>>>> --- a/lib/bio/db/genbank/common.rb
>>>>> +++ b/lib/bio/db/genbank/common.rb
>>>>> @@ -24,7 +24,7 @@ class NCBIDB
>>>>>  #
>>>>>  module Common
>>>>>
>>>>> -  DELIMITER = RS = "\n//\n"
>>>>> +  DELIMITER = RS = "\n//\n\n"
>>>>>  TAGSIZE = 12
>>>>>
>>>>>  def initialize(entry)
>>>>> --
>>>>> 1.7.0
>>>>>
>>>>>
>>>>> [1] http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037
>>>>>
>>>>> --
>>>>> Anurag Priyam
>>>>> 2nd Year,Mechanical Engineering,
>>>>> IIT Kharagpur.
>>>>> +91-9775550642
>>>>>
>>>>> _______________________________________________
>>>>> BioRuby Project - http://www.bioruby.org/
>>>>> BioRuby mailing list
>>>>> BioRuby at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>>>
>
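
For comparison, the "\nLOCUS" delimiter with an overrun of 5 suggested in the quoted message can be pictured with a small plain-Ruby sketch. This is only my reading of the idea: split_entries and its arguments are hypothetical and do not mirror BioRuby's internals, and the two-record text is hand-made, with a blank line after the last "//".

```ruby
# Hypothetical splitter for illustration only; the method name and its
# arguments are assumptions and do not mirror BioRuby's internals.
# Assumes 0 <= overrun < delimiter.size so every step consumes input.
def split_entries(text, delimiter, overrun = 0)
  entries = []
  rest = text
  until rest.empty?
    idx = rest.index(delimiter)
    if idx.nil?
      # No delimiter left: keep the remainder as one entry so no data is lost.
      entries << rest
      break
    end
    # Cut after the delimiter, minus the overrun characters, which stay at
    # the head of the next entry.
    cut = idx + delimiter.size - overrun
    entries << rest[0, cut]
    rest = rest[cut..-1]
  end
  entries
end

# Hand-made two-record text with a blank line after the last "//".
text = "LOCUS A\n//\nLOCUS B\n//\n\n"

p split_entries(text, "\n//\n").size      # trailing "\n" is counted as an entry
p split_entries(text, "\nLOCUS", 5).size  # "LOCUS" is handed to the next entry
```

With "\n//\n" the leftover newline becomes a third, spurious entry; with "\nLOCUS" and an overrun of 5 the same text yields two entries, and the trailing blank line simply stays attached to the last record.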



-- 
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

