[BioRuby] Patch for Bug 18019.

Thu Apr 15 08:30:42 UTC 2010

To parse GenBank files, BioRuby ultimately calls IO#gets(sep_string=$/). I
did a File.read on a small sequence file[1]. Its last characters are:
".... tctaga\n//\n\n". This is how Ruby sees the file. Note
the two "\n" at the end. That was my rationale for the patch.

Now, with the current delimiter "\n//\n", when we call gets(delimiter)
repeatedly, it returns "\n" as the last entry and nil thereafter. This
"\n" is the root cause of the problem, as it is returned to
Bio::FlatFile#next_entry and Bio::FlatFile#each_entry from either
Bio::Splitter::Default#get_entry or Bio::Splitter::Default#get_parsed_entry.
The checks applied later to the return value include a check for nil (
return nil unless r;; in next_entry ). I think we could add a check for
whitespace-only entries to avoid this? I believe Goto-san's mail also
implied something along the same lines?
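
To illustrate, here is a minimal sketch of the behaviour and of the kind of
whitespace check I have in mind. It is not BioRuby code: the helper names
raw_entries and entries_skipping_blank and the sample data are made up for
this mail, and StringIO stands in for the real file.

```ruby
require 'stringio'

# Hypothetical helper, not part of BioRuby: collect everything that
# repeated gets(delimiter) calls return until gets returns nil.
def raw_entries(text, delimiter = "\n//\n")
  io = StringIO.new(text)
  entries = []
  while (chunk = io.gets(delimiter))
    entries << chunk
  end
  entries
end

# Hypothetical variant of the proposed check: drop any chunk that
# contains nothing but whitespace.
def entries_skipping_blank(text, delimiter = "\n//\n")
  raw_entries(text, delimiter).reject { |chunk| chunk.strip.empty? }
end

# Made-up two-entry data ending in "\n//\n\n", like the file above.
data = "LOCUS A\n tctaga\n//\nLOCUS B\n tctaga\n//\n\n"

p raw_entries(data).size             # 3
p raw_entries(data).last             # "\n"
p entries_skipping_blank(data).size  # 2
```

In the real code the check would presumably belong in the splitter or in
next_entry rather than in a separate helper; that placement is an
assumption on my part.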

[1] http://home.cc.umanitoba.ca/~psgendb/X54090.gen.html

On Thu, Apr 15, 2010 at 12:56 PM, Tomoaki NISHIYAMA <
tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi Goto-san,
>
>
>> Splitting entries by using such delimiter is simple and the performance
>> is well, but it can only work with correct data which should always be
>> ended with the delimiter. Characters after the last delimiter in the
>> file is regarded as a single entry because we don't want to lose data.
>>
>> The behavior can be changed, for example, when getting only white
>> spaces and then the end of file without delimiter, it is ignored and
>> treated as EOF with no entries.
>>
>
>
> Because GenBank and GenPept format files downloaded from NCBI with Entrez
> usually end with two newline characters,
> the latter behavior is really desirable.
>
> $ wget -O sequences.gb \
>   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
>  ff.each { |e| c += 1 }; p c' sequences.gb
> #==> 4
> I hope it becomes 3, as there are 3 entries.
> $ grep LOCUS sequences.gb
> LOCUS       A00002                   194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       A00003                   194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       X17276                   556 bp    DNA     linear   MAM 26-FEB-1992
>
> Actually, this file has an excess newline at the end of each entry,
> and his patch will work in this case, although it is not right, as you
> mentioned.
>
> Although no error is reported in this example because we don't do anything
> with the entry, accessing the last entry (the fourth in this case) will
> cause an error.
> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan

-- 
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642