[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Pjotr Prins pjotr.public14 at thebird.nl
Sun Jun 27 08:43:22 UTC 2010


On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> Hi,
> 
> I think the ability to handle large data and the question of
> whether or not to load all data into memory at once are essentially
> independent.  Not loading everything into memory does not guarantee
> the ability to handle large data, due to disk I/O bottlenecks and
> memory management overhead.

Well, that depends on what you plan to do with the data :). I think
you are saying that streaming data may not be efficient, for example
when processing alignments. That could be true. However, I think the
default strategy should not be memory-bound, if possible. Throughout
BioRuby the strategy is currently the opposite. For example, by
default FASTA files are loaded into RAM, and the same goes for BLAST
XML. I regularly have files that exceed RAM and have to work around
these limitations. I don't think this should be the *default*
strategy.
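
To illustrate the difference, here is a minimal sketch using
Bio::FlatFile (the file name huge.fa is made up):

  require 'bio'

  # Memory-bound: materialise every record in an Array first.
  ff = Bio::FlatFile.open(Bio::FastaFormat, 'huge.fa')
  entries = ff.to_a   # whole file held in RAM at once
  ff.close

  # Streaming: yields one record at a time; memory use stays flat.
  Bio::FlatFile.open(Bio::FastaFormat, 'huge.fa') do |ff|
    ff.each_entry do |entry|
      puts "#{entry.definition}\t#{entry.seq.length}"
    end
  end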

I prefer the Unix way of using pipes. Only use memory when it is
available.
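
Roughly something like this (the script name filter.rb is made up):

  # filter.rb -- reads FASTA from STDIN, so it composes with pipes:
  #   zcat huge.fa.gz | ruby filter.rb > lengths.txt
  require 'bio'

  ff = Bio::FlatFile.new(Bio::FastaFormat, $stdin)
  ff.each_entry do |entry|
    puts "#{entry.entry_id}\t#{entry.seq.length}"
  end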

With new code we should design for big data. If it is done from the
start, it takes little real effort.
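
Purely as a hypothetical sketch of what I mean (each_record and its
record layout are invented here, not a BioRuby API): return an
Enumerator instead of an Array, so the caller decides whether
anything is ever held in RAM:

  def each_record(io)
    return enum_for(:each_record, io) unless block_given?
    record = nil
    io.each_line do |line|
      if line.start_with?('>')
        yield record if record
        record = { definition: line[1..-1].strip, seq: String.new }
      elsif record
        record[:seq] << line.strip
      end
    end
    yield record if record
  end

  # Lazy by default; call .to_a only if you really want it all in RAM.
  each_record($stdin).each { |rec| puts rec[:seq].length }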

> I think it is currently OK to depend on memory. The price of memory
> is gradually going down, and I think buying a machine with a huge
> amount of memory could be a solution for handling large data.

We cannot all afford big machines; that would hamper many groups and
students. RAM is getting cheaper, but data is growing faster.

Anurag, how much RAM do you have access to?

Pj.


