[BioRuby] Patch for Bug 18019.

Tomoaki NISHIYAMA tomoakin at kenroku.kanazawa-u.ac.jp
Thu Apr 15 07:26:42 UTC 2010


Hi Goto-san,

> Splitting entries by using such a delimiter is simple and the
> performance is good, but it only works with correct data, which should
> always end with the delimiter. Characters after the last delimiter in
> the file are regarded as a single entry because we don't want to lose
> data.
>
> The behavior can be changed: for example, when only whitespace is
> found before the end of file without a delimiter, it is ignored and
> treated as EOF with no entries.


Because GenBank and GenPept format files downloaded from NCBI with Entrez
usually end with two newline characters,
the latter behavior is really desirable.
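
To make the desired behavior concrete, here is a minimal sketch in plain
Ruby. It is not BioRuby's actual implementation: the helper name
split_entries is hypothetical, and the "\n//\n" delimiter is an assumption
based on GenBank entries ending with a "//" line. A remainder after the
last delimiter is kept only when it contains something other than
whitespace.

```ruby
# Hypothetical helper (not part of BioRuby): split flat-file text into
# entries on a delimiter, ignoring a whitespace-only remainder at EOF.
# The default delimiter is an assumption for GenBank-style files.
def split_entries(text, delimiter = "\n//\n")
  entries = []
  rest = text
  until rest.empty?
    head, sep, tail = rest.partition(delimiter)
    if sep.empty?
      # No delimiter left: keep the remainder only if it holds real data,
      # so that a truncated final entry is not silently lost.
      entries << head unless head.strip.empty?
      break
    end
    entries << head + sep
    rest = tail
  end
  entries
end

# Two entries, each followed by an extra newline, as in the Entrez output.
sample = "LOCUS A\n//\n\nLOCUS B\n//\n\n"
p split_entries(sample).size  # prints 2
```

With this rule the trailing newline no longer counts as an entry, while
real data after the last delimiter is still returned.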

$ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
  ff.each { |e| c += 1 }; p c' sequences.gb
#==> 4
I hope it becomes 3, since there are 3 entries.
$ grep LOCUS sequences.gb
LOCUS       A00002                   194 bp    DNA     linear   PAT 10-FEB-1993
LOCUS       A00003                   194 bp    DNA     linear   PAT 10-FEB-1993
LOCUS       X17276                   556 bp    DNA     linear   MAM 26-FEB-1992

Actually, this file has an extra newline at the end of each entry.
His patch will work in this case, although it is not the right approach,
as you mentioned.

Although no error is reported in this example because we don't do
anything with the entries, accessing the last entry (the fourth in this
case) will cause an error.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan




