[BioRuby] GFF3

Mon Aug 30 09:10:28 UTC 2010

On Wed, Aug 18, 2010 at 05:21:24PM +0900, Tomoaki NISHIYAMA wrote:
> The gene number within a genome doesn't grow so much.  So, the
> memory becomes problematic only if you are dealing with multiple
> genomes or more fine features.
>
> Saving memory is another kind of optimization.  It's good if we can
> achieve to do with less memory.  I just don't care much as far as
> the problem fit in the memory I can use and run in a reasonable
> time.

Well, interesting news. The low memory version is actually 50% faster
than the InMemory BioRuby edition.

On a decent 15 GB server with fast drives (and ruby 1.8.7 (2010-08-16
patchlevel 302) [x86_64-linux]):

When I parse a 500 MB GFF3 file, without FASTA information, with
BioRuby it consumes 8.5 GB RAM and takes 20 minutes.  My NoCache
version takes 1 GB RAM and 13 minutes. On my 2 GB laptop the native
BioRuby version never completed (which, in my opinion, is
unacceptable).

Mine is the naive version - i.e. I only store file seek positions in
memory, and reload and parse a record from disk every time. The record
parser is BioRuby's, not mine. There are no optimizations. Even this
is faster than BioRuby's default in-memory model - which takes 19 of
its 20 minutes just to load and parse the data file; only the last
minute goes to digesting the information and assembling sequences.
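
To make the seek-position idea concrete, here is a minimal sketch. It
is not the plugin's code: the class name SeekIndex and its methods are
made up for illustration, and a plain tab split stands in for BioRuby's
record parser. It also assumes every feature ID sits on a single line
(a later line with the same ID would overwrite the earlier offset).

```ruby
# Hypothetical sketch: keep only byte offsets in memory and re-read
# a record from disk whenever it is requested.
class SeekIndex
  def initialize(path)
    @path = path
    @offsets = {} # feature ID => byte offset of its line
    File.open(path, 'rb') do |f|
      until f.eof?
        pos = f.pos
        line = f.gets
        # Skip directives, comments and empty lines.
        next if line.start_with?('#') || line.strip.empty?
        # Column 9 holds the attributes; pull out ID=... if present.
        id = line.chomp.split("\t")[8].to_s[/(?:^|;)ID=([^;]+)/, 1]
        @offsets[id] = pos if id
      end
    end
  end

  # IDs in file order.
  def ids
    @offsets.keys
  end

  # Seek to the stored offset and parse that one line again.
  # Returns the nine GFF3 columns as an array, or nil if unknown.
  def fetch(id)
    return nil unless @offsets.key?(id)
    File.open(@path, 'rb') do |f|
      f.seek(@offsets[id])
      f.gets.chomp.split("\t")
    end
  end
end
```

The memory cost is one small hash entry per feature, whatever the size
of the records themselves; the price is a seek and a re-parse on every
fetch.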

I am not 100% sure why this is, but I know that BioRuby reads the
whole file into memory first, splits it by line and only then starts
parsing GFF. Probably memory allocation and regex are expensive with
really large buffers.

I think BioRuby needs to provide iterators for on demand parsing of
files, rather than big memory blobs. I also do it for FASTA in my
BigBio project. It can be done transparently, as both InMemory and
NoCache versions use the same algorithm.
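
As a rough sketch of what on-demand parsing could look like (an
illustration only, not BioRuby's or the plugin's API: GFFStream,
GFFRecord and each_record are hypothetical names, and the attribute
parsing is simplified, with no URL-unescaping):

```ruby
require 'stringio'

# Hypothetical record type; a real version would use BioRuby's parser.
GFFRecord = Struct.new(:seqid, :source, :type, :start, :stop, :attributes)

module GFFStream
  # Yield one parsed record at a time; only the current line is held.
  # Without a block, return an Enumerator so callers can chain
  # select/map/first and stop early.
  def self.each_record(io)
    return enum_for(:each_record, io) unless block_given?
    io.each_line do |line|
      line = line.chomp
      break if line.start_with?('##FASTA') # sequence section follows
      next if line.empty? || line.start_with?('#')
      cols = line.split("\t")
      next if cols.size < 9 # not a feature line
      pairs = cols[8].split(';').map { |kv| kv.split('=', 2) }
      attrs = pairs.select { |pair| pair.size == 2 }.to_h
      yield GFFRecord.new(cols[0], cols[1], cols[2],
                          cols[3].to_i, cols[4].to_i, attrs)
    end
  end
end

# Usage: works the same on a File or a StringIO.
gff = "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
GFFStream.each_record(StringIO.new(gff)) do |rec|
  puts "#{rec.type} #{rec.attributes['ID']}"
end
```

Because the caller only sees an each-style interface, the same calling
code could sit on top of an in-memory array or a disk-backed reader.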

It will take me some time to complete a write-up on how to approach
this for BioRuby, as I am keeping my head low next month. Note that
BioJava provides iteration too, as a default model, though I think
their visitor pattern introduces too much complexity.

In short: We can use simple Ruby iterators - it will work - and
potentially even provide transparent LRU caching. I'll have numbers
on that later, as that is my route to speed optimization. I know GFF3
components get reloaded and re-parsed many times.
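
For the caching part, a minimal LRU sketch could look like this. It is
an assumption about how one might do it, not measured or existing
code: LRUCache is a made-up name, and it relies on Hash keeping
insertion order, which holds for Ruby 1.9 and later but not for 1.8.7.

```ruby
# Hypothetical LRU cache: the block passed to fetch does the expensive
# work (e.g. seek + parse) and only runs on a cache miss.
class LRUCache
  def initialize(capacity)
    raise ArgumentError, 'capacity must be positive' unless capacity > 0
    @capacity = capacity
    @store = {} # oldest entry first, most recently used last
  end

  def fetch(key)
    if @store.key?(key)
      # Hit: delete and re-insert to mark as most recently used.
      value = @store.delete(key)
      @store[key] = value
      return value
    end
    # Miss: load through the block, then evict the oldest if over size.
    value = yield(key)
    @store[key] = value
    @store.shift if @store.size > @capacity
    value
  end

  def size
    @store.size
  end
end

# Usage: count how often the loader actually runs.
loads = 0
cache = LRUCache.new(2)
%w[gene1 mrna1 gene1].each do |id|
  cache.fetch(id) { |k| loads += 1; "record for #{k}" }
end
puts loads
```

Wrapped around a disk-backed fetch, this would keep the iterator
interface unchanged while avoiding repeated re-parsing of hot records.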

If you want to try, my code is at 

  http://github.com/pjotrp/bioruby-gff3-plugin

the current report is at

  http://thebird.nl/bioruby/BioRuby_GFF3.html

Note: you may need my empty line patch for BioRuby to run the InMemory
edition (my BioRuby GFF3 branch on github).

Pj.