[BioRuby] Patch for Bug 18019.

Fri Apr 16 02:34:14 UTC 2010

Hi Goto-san,

How would you feel about changing the DELIMITER to "\nLOCUS" with
DELIMITER_OVERRUN = 5, as the BLAST parsers do?

This is not as dirty as checking for empty lines, and it works for both
the GenBank release files and the files obtained through Entrez.

$ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
Bio::GenBank::DELIMITER_OVERRUN = 5; \
c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
ff.each { |e| c += 1 }; p c' gbvrt21.seq
1991
$ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
Bio::GenBank::DELIMITER_OVERRUN = 5; \
c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
ff.each { |e| c += 1 }; p c' sequences.gb
3
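
To illustrate the idea without BioRuby, here is a minimal sketch in plain
Ruby. The helper name split_with_overrun and the toy data are made up for
this example, and it only mimics what I understand DELIMITER_OVERRUN to
mean (the last N characters of the delimiter are handed back to the next
entry); it is not the actual Bio::FlatFile splitter.

```ruby
require 'stringio'

# Hypothetical helper for illustration; not BioRuby's own splitter.
# Reads chunks up to and including `delimiter`, then hands the last
# `overrun` characters of the delimiter back to the following entry.
def split_with_overrun(io, delimiter, overrun)
  entries = []
  carry = ""                         # characters pushed back for the next entry
  while (chunk = io.gets(delimiter))
    entry = carry + chunk
    if entry.end_with?(delimiter)
      carry = entry.slice!(-overrun, overrun)  # e.g. "LOCUS"
    else
      carry = ""                     # last chunk: nothing to push back
    end
    entries << entry
  end
  entries << carry unless carry.empty?
  entries
end

# Toy data (assumed layout): two records, with an extra blank line at EOF.
toy = "LOCUS       X1\nORIGIN\n//\nLOCUS       X2\nORIGIN\n//\n\n"

entries = split_with_overrun(StringIO.new(toy), "\nLOCUS", 5)
p entries.size                                  # 2
p entries.all? { |e| e.start_with?("LOCUS") }   # true
```

The trailing blank line simply stays attached to the last entry, so no
extra empty entry appears at the end.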
--  
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2010/04/15, at 15:32, Naohisa GOTO wrote:

> Hi Anurag,
>
> Parsing of GenBank files is primarily tested with official
> GenBank releases. (But there are currently no unit tests. I hope
> they will be added during the GSoC project "Ruby 1.9.2 support of
> BioRuby".)
>
> The test is something like:
>
> # Preparation of test data
>
>  % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
>  % gzip -dc gbvrt21.seq.gz > gbvrt21.seq
>
> # Counts the number of entries
>
>  % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>                   ff.each { |e| c += 1 }; p c' gbvrt21.seq
>    #==> 1991
>
> # Checks if the number of entries is correct.
>
>  % grep -c '^LOCUS' gbvrt21.seq
>
>    #==> 1991
>
> # Executes with the monkey patch.
> # Be careful: this takes a very long time and a lot of memory!
>
>  % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
>                   c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>                   ff.each { |e| c += 1 }; p c' gbvrt21.seq
>   #==> 1
>
> It is apparent that the patch is wrong.
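
The counts above can be reproduced on a tiny string in plain Ruby. This
is only a sketch: the toy data and the helper name count_chunks are made
up, and it assumes (as in the release files) that only the final "//"
line is followed by a blank line.

```ruby
# Toy data (assumed layout): two records; only the last "//" line is
# followed by an empty line.
TOY_GENBANK = "LOCUS       X1\nORIGIN\n//\nLOCUS       X2\nORIGIN\n//\n\n"

# Hypothetical helper: number of chunks obtained when the text is cut
# after every occurrence of `delimiter`.
def count_chunks(text, delimiter)
  text.each_line(delimiter).count
end

p count_chunks(TOY_GENBANK, "\n//\n")    # 3 (two records plus a lone "\n")
p count_chunks(TOY_GENBANK, "\n//\n\n")  # 1 (whole file read as one chunk)
```
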
>
> Splitting entries by using such a delimiter is simple and performs
> well, but it only works with correct data, which must always end with
> the delimiter. Characters after the last delimiter in the file are
> regarded as a single entry because we don't want to lose data.
>
> The behavior can be changed. For example, when only whitespace is
> read and then the end of file is reached without a delimiter, it could
> be ignored and treated as EOF with no entries.
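
A rough sketch of that alternative in plain Ruby, keeping "\n//\n" as the
delimiter and dropping a chunk that contains nothing but whitespace. The
method name nonblank_entries is hypothetical; this is not how
Bio::FlatFile is implemented, just the idea.

```ruby
# Hypothetical helper: split on the delimiter, then ignore any chunk
# that consists of whitespace only (e.g. a stray "\n" before EOF).
def nonblank_entries(text, delimiter = "\n//\n")
  text.each_line(delimiter).reject { |chunk| chunk.strip.empty? }
end

# Toy data (assumed layout): two records and an extra blank line at EOF.
toy = "LOCUS       X1\nORIGIN\n//\nLOCUS       X2\nORIGIN\n//\n\n"

p nonblank_entries(toy).size   # 2
```

Data after the last delimiter that is not pure whitespace would still be
returned as an entry, so nothing real is lost.
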
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
> On Thu, 15 Apr 2010 10:34:53 +0900
> Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:
>
>> On Wed, 14 Apr 2010 21:44:49 +0100
>> Jan Aerts <jan.aerts at gmail.com> wrote:
>>
>>> Thanks for that, Anurag. Contributions to bioruby very much
>>> appreciated :-)
>>>
>>> @Goto-san: can you merge that fix?
>>
>> No, because the patch ignores reading of entries in the middle of the
>> file. To parse files distributed from NCBI, the delimiter should be
>> "\n//\n", and cannot be "\n//\n\n".
>>
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>
>>>
>>> Cheers,
>>> jan.
>>>
>>> On 14 April 2010 21:41, Anurag Priyam <anurag08priyam at gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> This is my start at being a part of the BioRuby developer community.
>>>>
>>>> The RubyForge bug tracking page shows bug 18019 (GenBank each_entry,
>>>> last entry is always nil) [1] to be open. I am attaching a patch for
>>>> it. It's very tiny. The fix was already suggested in a comment by
>>>> Raoul Jean Pierre Bonnal (the submitter of the bug). I have verified
>>>> the solution and created a patch for it. Or should I send a pull
>>>> request on github?
>>>>
>>>> Patch( git format-patch ):
>>>>
>>>> From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
>>>> From: Anurag Priyam <anurag08priyam at gmail.com>
>>>> Date: Wed, 14 Apr 2010 22:58:45 +0530
>>>> Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil
>>>>
>>>> ---
>>>>  lib/bio/db/genbank/common.rb |    2 +-
>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
>>>> index 545eac1..eaa760c 100644
>>>> --- a/lib/bio/db/genbank/common.rb
>>>> +++ b/lib/bio/db/genbank/common.rb
>>>> @@ -24,7 +24,7 @@ class NCBIDB
>>>>  #
>>>>  module Common
>>>>
>>>> -  DELIMITER = RS = "\n//\n"
>>>> +  DELIMITER = RS = "\n//\n\n"
>>>>   TAGSIZE = 12
>>>>
>>>>   def initialize(entry)
>>>> --
>>>> 1.7.0
>>>>
>>>>
>>>> [1]
>>>>
>>>> http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037
>>>>
>>>> --
>>>> Anurag Priyam
>>>> 2nd Year,Mechanical Engineering,
>>>> IIT Kharagpur.
>>>> +91-9775550642
>>>>
>>>> _______________________________________________
>>>> BioRuby Project - http://www.bioruby.org/
>>>> BioRuby mailing list
>>>> BioRuby at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>>
>>>>