[BioRuby] Patch for Bug 18019.

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Thu Apr 15 06:32:09 UTC 2010


Hi Anurag,

Parsing of GenBank files is primarily tested with official
GenBank releases. (There are currently no unit tests for it, though.
I hope they will be added during the GSoC project "Ruby 1.9.2
support of BioRuby".)

The test is something like:

# Preparation of test data

 % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
 % gzip -dc gbvrt21.seq.gz > gbvrt21.seq

# Counts the number of entries

 % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
   #==> 1991

# Checks if the number of entries is correct.

 % grep -c '^LOCUS' gbvrt21.seq
   #==> 1991
   
# Executes with the monkey patch.
# Be careful: this takes a very long time and a lot of memory!

 % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
                  c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
                  ff.each { |e| c += 1 }; p c' gbvrt21.seq
  #==> 1

It is apparent that the patch is wrong.

Splitting entries on such a delimiter is simple and performs well, but
it only works with correct data, in which every entry ends with the
delimiter. Any characters after the last delimiter in the file are
regarded as a single entry, because we don't want to lose data.
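
To illustrate the point without downloading a GenBank release, here is a
minimal sketch in plain Ruby. It assumes that entries are read with
IO#gets(delimiter); the helper split_entries and the tiny SAMPLE data are
made up for this mail and are not BioRuby code.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): reads chunks from the text
# with IO#gets(delimiter), which returns everything up to and including
# the delimiter, or the remaining text if no delimiter is found.
def split_entries(text, delimiter)
  io = StringIO.new(text)
  entries = []
  while (chunk = io.gets(delimiter))
    entries << chunk
  end
  entries
end

# Three made-up entries, each terminated by "//" on its own line,
# with no blank line between entries (as in NCBI releases).
SAMPLE = "LOCUS       A\n//\nLOCUS       B\n//\nLOCUS       C\n//\n"

# With the original delimiter, each entry is returned separately.
puts split_entries(SAMPLE, "\n//\n").size

# With the patched delimiter, nothing matches, so the whole text
# comes back as a single chunk.
puts split_entries(SAMPLE, "\n//\n\n").size

# A stray newline after the last delimiter becomes an extra chunk,
# which is the situation reported in bug 18019.
p split_entries(SAMPLE + "\n", "\n//\n").last
```

The same effect explains the count of 1 above: the patched delimiter never
occurs between entries in the NCBI file, so everything is read as one entry.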

This behavior could be changed. For example, when only whitespace is
read before the end of file, with no delimiter, it could be ignored and
treated as EOF with no further entries.
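
A rough sketch of that idea in plain Ruby follows. The method name
each_entry_text is hypothetical, and this is not how Bio::FlatFile is
actually implemented; it only assumes that entries are read with
IO#gets(delimiter).

```ruby
require 'stringio'

# Hypothetical sketch (not BioRuby's implementation): yields each chunk
# read with the delimiter, but skips a final chunk that has no delimiter
# and contains only whitespace.
def each_entry_text(io, delimiter = "\n//\n")
  return enum_for(:each_entry_text, io, delimiter) unless block_given?
  while (chunk = io.gets(delimiter))
    # A chunk without the delimiter can only be the last one in the file.
    # If it is whitespace only, treat it as EOF instead of as an entry.
    next if !chunk.end_with?(delimiter) && chunk.strip.empty?
    yield chunk
  end
end

# Two made-up entries followed by trailing whitespace.
data = "LOCUS       A\n//\nLOCUS       B\n//\n\n  \n"
p each_entry_text(StringIO.new(data)).to_a.size

# Non-whitespace data after the last delimiter is still kept,
# so nothing is lost from a truncated file.
truncated = "LOCUS       A\n//\nLOCUS       B\n"
p each_entry_text(StringIO.new(truncated)).to_a.last
```

With this approach the delimiter stays "\n//\n", so entries in the middle
of the file are still split correctly.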

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 15 Apr 2010 10:34:53 +0900
Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp> wrote:

> On Wed, 14 Apr 2010 21:44:49 +0100
> Jan Aerts <jan.aerts at gmail.com> wrote:
> 
> > Thanks for that, Anurag. Contributions to bioruby very much appreciated :-)
> > 
> > @Goto-san: can you merge that fix?
> 
> No, because the patch ignores reading of entries in the middle of the file.
> To parse files distributed from NCBI, the delimiter should be "\n//\n",
> and cannot be "\n//\n\n".
> 
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> 
> > 
> > Cheers,
> > jan.
> > 
> > On 14 April 2010 21:41, Anurag Priyam <anurag08priyam at gmail.com> wrote:
> > 
> > > Hello all,
> > >
> > > This is my start at being a part of the BioRuby developer community.
> > >
> > > The RubyForge bug tracking page shows bug 18019 (GenBank each_entry, last
> > > entry is always nil) [1] to be open. I am attaching a patch for it. It's
> > > very tiny. The fix was already suggested in a comment by Raoul Jean Pierre
> > > Bonnal (the submitter of the bug). I have verified the solution and
> > > created a patch for it. Or should I send a pull request on GitHub?
> > >
> > > Patch( git format-patch ):
> > >
> > > From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
> > > From: Anurag Priyam <anurag08priyam at gmail.com>
> > > Date: Wed, 14 Apr 2010 22:58:45 +0530
> > > Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil
> > >
> > > ---
> > >  lib/bio/db/genbank/common.rb |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
> > > index 545eac1..eaa760c 100644
> > > --- a/lib/bio/db/genbank/common.rb
> > > +++ b/lib/bio/db/genbank/common.rb
> > > @@ -24,7 +24,7 @@ class NCBIDB
> > >  #
> > >  module Common
> > >
> > > -  DELIMITER = RS = "\n//\n"
> > > +  DELIMITER = RS = "\n//\n\n"
> > >   TAGSIZE = 12
> > >
> > >   def initialize(entry)
> > > --
> > > 1.7.0
> > >
> > >
> > > [1]
> > >
> > > http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037
> > >
> > > --
> > > Anurag Priyam
> > > 2nd Year,Mechanical Engineering,
> > > IIT Kharagpur.
> > > +91-9775550642
> > >
> > > _______________________________________________
> > > BioRuby Project - http://www.bioruby.org/
> > > BioRuby mailing list
> > > BioRuby at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioruby
> > >
> > >
