[BioRuby] GFF3

Pjotr Prins pjotr.public14 at thebird.nl
Tue Aug 31 06:53:09 UTC 2010


On Tue, Aug 31, 2010 at 11:12:37AM +0900, Tomoaki NISHIYAMA wrote:
> During the conversation on "Benchmarking FASTA file parsing", I
> realized that GC takes quite a lot of time when a large amount of
> memory is in use.  The mark-and-sweep algorithm in Matz's Ruby
> implementation scans all allocated objects every time the GC runs
> (the GC is not invoked from Ruby code, but runs implicitly unless
> suppressed).

Yup. No GC is perfect. They all have trade-offs. And, like you say, in
particular when you run out of memory it starts to hurt. 
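
A quick way to see how much time the collector eats on
allocation-heavy code is to time the same work with the GC on and
off. A minimal sketch:

  require 'benchmark'

  # Time an allocation-heavy loop with the GC enabled and disabled
  work = lambda { 1_000_000.times { "x" * 100 } }

  t1 = Benchmark.realtime(&work)   # GC enabled
  GC.disable
  t2 = Benchmark.realtime(&work)   # GC disabled
  GC.enable

  puts "GC on: #{t1}s, GC off: #{t2}s"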

> Since ruby-1.9.2 seems to have much better GC performance, I am
> interested in how the performance compares under ruby-1.9.2.  (I am
> also interested in running with GC.disable, though that may not
> work with 15 Gbytes.)

The GC should really run on a separate thread (read: core). Not sure
Ruby 1.9 does that now. The JVM does, so JRuby probably does. When I
implement an LRU cache it could also easily run on a separate thread,
as the returned data is immutable.  I may do that if I find something
similar to Erlang's actors for Ruby. This may be it:

  http://on-ruby.blogspot.com/2008/01/ruby-concurrency-with-actors.html

It is something to do later. Parallelized cache handling would really
be nice for big data. And if it looks like a standard Hash to outside
users, it will be easy to drop in transparently throughout BioRuby.
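
Something along these lines could work (a rough sketch, not what
would land in BioRuby), where the cache behaves like a Hash and
evicts the least recently used key, assuming Ruby 1.9 where Hashes
remember insertion order:

  class LRUCache
    def initialize(max_size = 10_000)
      @max_size = max_size
      @store = {}   # insertion-ordered in Ruby 1.9
    end

    # Reading a key re-inserts it, marking it most recently used
    def [](key)
      return nil unless @store.key?(key)
      @store[key] = @store.delete(key)
    end

    # Writing evicts the oldest (least recently used) key when full
    def []=(key, value)
      @store.delete(key)
      @store[key] = value
      @store.delete(@store.first[0]) if @store.size > @max_size
      value
    end
  end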

Anyway, let me add a cache first and see what it means for
performance.

> Running your script with ruby 1.9 caused several errors, all
> related to case/when: removing the colon at the end of each "when"
> line, and replacing the colon with a newline where it was not at
> the end of a line, was sufficient to run with ruby 1.9.2 (diff at
> the end). A newline, a semicolon, or "then" all seem to work.

I still have to migrate to 1.9. Thanks for trying! Next time please
fix it on github so I can merge it in more easily. I may migrate in
order to use actors.
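
For the archives, the change is essentially this:

  feature = 'gene'

  case feature
  when 'gene' : puts 'found a gene'     # 1.8 only; syntax error in 1.9
  end

  case feature
  when 'gene' then puts 'found a gene'  # newline, ';' or 'then' work in both
  end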

> The other good reason is that the data is perhaps not read from
> disk many times, but cached by the operating system and retained in
> memory.  So this is not as bad as it sounds. With 15 Gbytes of RAM,
> a 500 Mbyte file presumably never needs to be flushed.

Yes. And that is why I started experimenting with NoCache. Seeks are
cheap. Even without the OS buffers, disk reads are very, very
optimized these days (I did some work on that last year, together
with a student, Konstantin Tretjakov). Most seeks in GFF3 even fall
within the standard hardware disk cache (8/16 MB) boundary, and are
therefore not a problem, even on small machines!  With NoCache the
file gets read twice, so the penalty should be 2x at most, which is
totally acceptable if it means we can handle data of any size on any
machine. Then we can offer both InMemory and NoCache and handle any
kind of big data. Our users win.  BioRuby wins.
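
To make the NoCache idea concrete, here is an illustrative sketch
(not the actual implementation): scan the file once to record byte
offsets, then seek back on demand:

  class NoCache
    # First pass: remember the byte offset of every line
    def initialize(fn)
      @fn = fn
      @offsets = []
      File.open(@fn) do |f|
        until f.eof?
          @offsets << f.pos
          f.readline
        end
      end
    end

    # Second pass, on demand: seek straight to record n and re-read
    def line(n)
      File.open(@fn) do |f|
        f.seek(@offsets[n])
        f.readline
      end
    end
  end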

Next to do: I want an LRU cache to prevent *parsing* every record
twice.  Parsing is the one expensive step in NoCache.
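
With a Hash-like LRU in place, wiring that in should be a one-liner
(sketch; record and parse_line are illustrative names):

  # Seek and parse only on a cache miss; hits skip the parse entirely
  def record(n)
    @lru[n] ||= parse_line(line(n))
  end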

One thing will be interesting: seeing how the LRU cache interacts
with the GC.

Pj.



