[BioRuby] Patch for Bug 18019.

Sat Apr 17 03:09:07 UTC 2010

Hi Anurag,

If you change that code, you need to check whether it is right for all
kinds of files processed with FlatFile.

The regular expression
> /^\s$/

can match a blank line anywhere in an entry, whether that line is the
whole entry or just a part of it.

I suspect this change will break parsing of BLAST output or any file
that contains internal blank lines.  Did you check them?

/\A\s*\z/
might work as you intend, though I feel this is a dirty hack.
The anchors \A and \z are explained in
http://ruby-doc.org/docs/ProgrammingRuby/html/language.html
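
To make the difference between the two patterns concrete, here is a small
sketch in plain Ruby (BioRuby is not needed). The sample strings are made up
for illustration and are not real BLAST or GenBank data, and the helper name
blank_by? is hypothetical.

```ruby
# ^ and $ anchor at every line boundary; \A and \z anchor only at the
# start and end of the whole string.
LINE_ANCHORED   = /^\s$/
STRING_ANCHORED = /\A\s*\z/

# Hypothetical helper: true when the pattern matches somewhere in the text.
def blank_by?(pattern, text)
  !(text =~ pattern).nil?
end

# Made-up entry containing a line that holds a single space.
entry_with_blank = "Query= seq1\n \nDatabase: nr\n"
# Whitespace-only leftover, as found after the last delimiter of a file.
leftover = "\n"

RESULTS = {
  line_on_entry:      blank_by?(LINE_ANCHORED, entry_with_blank),
  string_on_entry:    blank_by?(STRING_ANCHORED, entry_with_blank),
  line_on_leftover:   blank_by?(LINE_ANCHORED, leftover),
  string_on_leftover: blank_by?(STRING_ANCHORED, leftover)
}

RESULTS.each { |name, hit| puts "#{name}: #{hit}" }
```

Under these assumptions the line-anchored pattern hits both strings, so a
valid entry would be dropped along with the leftover; the string-anchored
pattern hits only the whitespace-only leftover.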
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2010/04/17, at 7:30, Anurag Priyam wrote:

> This also works:
>
> Check that the string read from the GenBank file is not a whitespace-only
> sequence (patch below).
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' sequences.gb
> 3
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' gbvrt21.seq
> 1991
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' sample.gb #this was my sample file
> 1
>
> From 94e70a98a0643caf13acc0417b677073b8f7968d Mon Sep 17 00:00:00 2001
> From: Anurag Priyam <anurag08priyam at gmail.com>
> Date: Sat, 17 Apr 2010 03:50:48 +0530
> Subject: [PATCH] fixed bug 18019; redundant nil entry
>
> ---
>  lib/bio/io/flatfile/splitter.rb |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/lib/bio/io/flatfile/splitter.rb b/lib/bio/io/flatfile/splitter.rb
> index a07b016..f068bc2 100644
> --- a/lib/bio/io/flatfile/splitter.rb
> +++ b/lib/bio/io/flatfile/splitter.rb
> @@ -191,7 +191,7 @@ module Bio
>            self.entry_start_pos = p0
>            self.entry = e
>            self.entry_ended_pos = p1
> -          return entry
> +          return entry unless entry =~ /^\s$/
>          end
>        end #class Defalult
>
> -- 
> 1.7.0
>
>
> On Fri, Apr 16, 2010 at 8:04 AM, Tomoaki NISHIYAMA  
> <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> Hi Goto-san,
>
> How would you feel about changing the DELIMITER to "\nLOCUS" with
> DELIMITER_OVERRUN = 5, like the BLAST parsers?
>
> This is not as dirty as checking for empty lines, and it works for both
> the GenBank release files and the files obtained through Entrez.
>
> $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
> Bio::GenBank::DELIMITER_OVERRUN = 5; \
> c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' gbvrt21.seq
> 1991
> $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
> Bio::GenBank::DELIMITER_OVERRUN = 5; \
> c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
> ff.each { |e| c += 1 }; p c' sequences.gb
> 3
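
A rough sketch of why a "\nLOCUS" delimiter with an overrun of 5 avoids the
extra trailing entry. This is a simplified stand-in written for illustration,
not the code in Bio::FlatFile; the function name split_with_overrun and the
two-record sample text are assumptions.

```ruby
# Split text at each delimiter. The last `overrun` characters of the
# delimiter are not consumed, so they begin the next entry.
def split_with_overrun(text, delimiter, overrun)
  entries = []
  pos = 0
  while pos < text.length
    hit = text.index(delimiter, pos)
    if hit.nil?
      entries << text[pos..-1]   # no delimiter left: keep the remainder
      break
    end
    stop = hit + delimiter.length - overrun
    entries << text[pos...stop]
    pos = stop
  end
  entries
end

# Made-up two-record file that ends with an extra blank line.
SAMPLE = "LOCUS A\n//\nLOCUS B\n//\n\n"

BY_SLASHES = split_with_overrun(SAMPLE, "\n//\n", 0)
BY_LOCUS   = split_with_overrun(SAMPLE, "\nLOCUS", 5)

puts BY_SLASHES.size   # 3: the last chunk is just "\n"
puts BY_LOCUS.size     # 2: the blank line stays attached to record B
```

With the "\n//\n" delimiter the trailing blank line becomes a third,
whitespace-only chunk; with "\nLOCUS" every chunk starts at a LOCUS line and
no separate leftover is produced.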
> -- Tomoaki NISHIYAMA
>
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/04/15, at 15:32, Naohisa GOTO wrote:
>
> Hi Anurag,
>
> Parsing of GenBank files is primarily tested with official
> GenBank releases. (But there are currently no unit tests. I hope they
> will be added during the GSoC project "Ruby 1.9.2 support of
> BioRuby".)
>
> The test is something like:
>
> # preparation of test data
>
>  % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
>  % gzip -dc gbvrt21.seq.gz > gbvrt21.seq
>
> # Counts the number of entries
>
>  % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
>   #==> 1991
>
> # Checks if the number of entries is correct.
>
>  % grep -c '^LOCUS' gbvrt21.seq
>
>   #==> 1991
>
> # Executes with the monkey patch.
> # Be careful: this takes a very long time and a lot of memory!
>
>  % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
>                  c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
>                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
>  #==> 1
>
> It is apparent that the patch is wrong.
>
> Splitting entries with such a delimiter is simple and performs well,
> but it only works with correct data, which must always end with the
> delimiter. Characters after the last delimiter in the file are
> regarded as a single entry because we don't want to lose data.
>
> The behavior can be changed. For example, when only whitespace is
> read and then the end of file is reached without a delimiter, it can be
> ignored and treated as EOF with no entries.
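
The behavior change described above could look roughly like the following.
It is only a sketch under stated assumptions: the function name
split_ignoring_blank_tail and the sample strings are hypothetical, and the
real splitter reads from a stream rather than from a string.

```ruby
# Split text on the delimiter. Undelimited data at the end is kept so
# that nothing is lost, unless it is whitespace only, in which case it
# is treated as a plain end of file.
def split_ignoring_blank_tail(text, delimiter)
  entries = []
  pos = 0
  while pos < text.length
    hit = text.index(delimiter, pos)
    if hit.nil?
      tail = text[pos..-1]
      entries << tail unless tail =~ /\A\s*\z/
      break
    end
    stop = hit + delimiter.length
    entries << text[pos...stop]
    pos = stop
  end
  entries
end

DELIM = "\n//\n"

# Made-up inputs: extra blank line at EOF, a missing final delimiter,
# and blank lines inside a single record.
BLANK_TAIL   = split_ignoring_blank_tail("LOCUS A\n//\nLOCUS B\n//\n\n", DELIM)
UNTERMINATED = split_ignoring_blank_tail("LOCUS A\n//\nLOCUS B\n", DELIM)
INNER_BLANKS = split_ignoring_blank_tail("LOCUS A\n\n\nORIGIN\n//\n", DELIM)

puts BLANK_TAIL.size, UNTERMINATED.size, INNER_BLANKS.size
```

Because the whitespace check is applied only to the undelimited remainder,
blank lines inside a record are left untouched.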
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
> On Thu, 15 Apr 2010 10:34:53 +0900
> Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:
>
> On Wed, 14 Apr 2010 21:44:49 +0100
> Jan Aerts <jan.aerts at gmail.com> wrote:
>
> Thanks for that, Anurag. Contributions to bioruby very much  
> appreciated :-)
>
> @Goto-san: can you merge that fix?
>
> No, because the patch breaks reading of entries in the middle of the file.
> To parse files distributed from NCBI, the delimiter should be "\n//\n",
> and cannot be "\n//\n\n".
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>
> Cheers,
> jan.
>
> On 14 April 2010 21:41, Anurag Priyam <anurag08priyam at gmail.com>  
> wrote:
>
> Hello all,
>
> This is my start at being a part of the BioRuby developer community.
>
> The RubyForge bug tracking page shows bug 18019 (GenBank each_entry, last
> entry is always nil) [1] to be open. I am attaching a patch for it. It's
> very tiny. The fix was already suggested in a comment by Raoul Jean Pierre
> Bonnal (the submitter of the bug). I have verified the solution and
> created a patch for it. Or should I send a pull request on GitHub?
>
> Patch (git format-patch):
>
> From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
> From: Anurag Priyam <anurag08priyam at gmail.com>
> Date: Wed, 14 Apr 2010 22:58:45 +0530
> Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil
>
> ---
>  lib/bio/db/genbank/common.rb |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
> index 545eac1..eaa760c 100644
> --- a/lib/bio/db/genbank/common.rb
> +++ b/lib/bio/db/genbank/common.rb
> @@ -24,7 +24,7 @@ class NCBIDB
>  #
>  module Common
>
> -  DELIMITER = RS = "\n//\n"
> +  DELIMITER = RS = "\n//\n\n"
>  TAGSIZE = 12
>
>  def initialize(entry)
> --
> 1.7.0
>
>
> [1]
>
> http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037
>
> --
> Anurag Priyam
> 2nd Year,Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>
> -- 
> Anurag Priyam
> 2nd Year,Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> <0001-fixed-bug-18019-redundant-nil-entry.patch>