[Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue May 25 13:52:26 UTC 2010


Hi,

> My main concern with the current tools is the memory issue. For instance when
> I try to create a distribution of sequence lengths or qualities using NGS
> data I end up with millions of numbers. That is too much for any reasonable
> computer.

Several million numbers aren't all that much, though, right?

To simulate your example, I created a 100,000,000-element vector
(which, depending on what type of NGS data you have, should count as a
large number of reads) of faux read lengths. It only takes up ~382
MB[1], and gathering basic statistics on it (variance, mean,
histograms, etc.) isn't painful at all.

Once you start attaching more metadata to those 100,000,000 elements,
though, I can see how you would start running into problems.
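To make that concrete in Python (just a rough NumPy sketch of the same
experiment, not Biopython code; the read-length range is made up),
100,000,000 lengths stored as 32-bit integers fit in roughly 400 MB,
and the summary statistics are a couple of array calls:

    import numpy as np

    # ~10^8 faux read lengths as 32-bit ints: roughly 400 MB of RAM.
    lengths = np.random.randint(30, 150, size=10**8).astype(np.int32)

    print(lengths.nbytes / 1e6, "MB")       # memory footprint
    print(lengths.mean(), lengths.var())    # basic statistics
    counts, edges = np.histogram(lengths, bins=50)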

> I've solved the problem by using disk caches that work as
> iterators. I'm sure that this is not the most performant solution. It's
> just a hack and I would like to use better tools for sure.

Have you tried looking at something like PyTables? Might be something
to consider ...
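In case it helps, here's the kind of thing I have in mind (a minimal
sketch only; the file name, chunk size, and fake length range are all
made up, and the running-variance bookkeeping is just one way to do
it). The lengths live in an extendable HDF5 array on disk, and the
statistics are computed one chunk at a time:

    import numpy as np
    import tables  # PyTables

    # Append read lengths to an on-disk, extendable HDF5 array instead
    # of holding them all in RAM.
    h5 = tables.open_file("read_lengths.h5", mode="w")
    lengths = h5.create_earray(h5.root, "lengths",
                               tables.Int32Atom(), shape=(0,))

    for _ in range(100):  # stand-in for parsing 100 batches of reads
        batch = np.random.randint(30, 150, size=10**6).astype(np.int32)
        lengths.append(batch)

    # Summary statistics, read back chunk-by-chunk from disk.
    n, total, total_sq = 0, 0.0, 0.0
    for start in range(0, lengths.nrows, 10**6):
        chunk = lengths[start:start + 10**6]
        n += chunk.size
        total += chunk.sum(dtype=np.float64)
        total_sq += (chunk.astype(np.float64) ** 2).sum()

    mean = total / n
    variance = total_sq / n - mean ** 2
    print(mean, variance)
    h5.close()

Since the data only ever lives on disk, the in-memory cost is one chunk
at a time, and PyTables can also keep the array compressed on disk if
you want to save space.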

Just a thought,

-steve

[1] I'm using R, which only uses 32-bit integers, but the language
itself isn't really the point, since we're all going to be running
into a wall with respect to NGS-sized datasets.

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
