[Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues

Tue May 25 19:19:35 UTC 2010

On Mon, May 24, 2010 at 11:53 PM, Jose Blanca <jblanca at btc.upv.es> wrote:

> Hi:
>
> My main concern with the current tools is the memory issue. For instance,
> when I try to create a distribution of sequence lengths or qualities using
> NGS data I end up with millions of numbers. That is too much for any
> reasonable computer. I've solved the problem by using disk caches that work
> as iterators. I'm sure that this is not the most performant solution. It's
> just a hack and I would like to use better tools for sure.
> If you want to take a look at my current solution go to:
>
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py
> http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py

Please feel free to add some of the comments to the wiki.
I also cross-posted this to the StatsModels list, as I thought it might be of
interest there. Although I believe Steve Lianoglou's comments are correct,
data set size is an issue in bio and is only getting bigger.
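
As a rough illustration of the memory point, here is a minimal sketch that
builds a length distribution from an iterator without ever holding all the
values in memory. It is not the code from the franklin links above; the
function names (iter_lengths, length_distribution, summarize) and the toy
reads are made up for this example, and it assumes the values are integers
with relatively few distinct values (as read lengths or qualities usually
are).

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_lengths(sequences: Iterable[str]) -> Iterator[int]:
    """Yield one length at a time so the full list is never materialised."""
    for seq in sequences:
        yield len(seq)


def length_distribution(lengths: Iterable[int]) -> Counter:
    """Count how often each length occurs.

    Memory grows with the number of distinct values, not with the
    number of reads.
    """
    counts = Counter()
    for length in lengths:
        counts[length] += 1
    return counts


def summarize(counts: Counter) -> dict:
    """Derive n, min, max, mean and (lower) median from the histogram alone."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty distribution")
    mean = sum(value * n for value, n in counts.items()) / total
    # Walk the sorted values until at least half of the observations are covered.
    seen = 0
    median = None
    for value in sorted(counts):
        seen += counts[value]
        if seen * 2 >= total:
            median = value
            break
    return {
        "n": total,
        "min": min(counts),
        "max": max(counts),
        "mean": mean,
        "median": median,
    }


# Hypothetical toy input; in practice this would be a parser iterating a file.
reads = ["ACGT", "ACGTAC", "AC", "ACGT", "ACGTACGT"]
dist = length_distribution(iter_lengths(reads))
stats = summarize(dist)
```

The trade-off is that only statistics computable from counts are available;
anything that needs the original order of the reads would still require a
second pass or a disk cache.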

>
> Best regards,
>
> Jose Blanca
>
> > Also, if you know of other groups that would be interested, please share
> > this link/information.
> >
> > > Thanks,
> > > --Michiel.
> > >
> > > --- On Mon, 5/24/10, Vincent Davis <vincent at vincentdavis.net> wrote:
> > > > From: Vincent Davis <vincent at vincentdavis.net>
> > > > Subject: [Biopython] SciPy paper: documenting statistical data
> > > > structure design issues
> > > > To: "biopython" <biopython at lists.open-bio.org>
> > > > Date: Monday, May 24, 2010, 3:45 PM
> > > > "see the message below, cross posted
> > > > from pystatsmodels"
> > > >
> > > > We have been having some discussion on the pystatsmodels mailing
> > > > list about data objects, numpy arrays... I think it would be
> > > > valuable for some biopython users to contribute some comments,
> > > > examples or ideas to the scipy wiki that has been set up for this.
> > > > I think at the heart of this is that although almost anything can
> > > > be done with a numpy array, we run into many problems that are
> > > > difficult to solve with the current tools for numpy arrays. Because
> > > > of this I think some nice examples of the data design problems that
> > > > you have faced in biopython, and how they have been solved, would
> > > > be valuable.
> > > >
> > > > Thanks
> > > > Vincent
> > > >
> > > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> > > > > For my SciPy talk and paper in a little over a month, I was
> > > > > hoping to render a somewhat coherent discussion of the design
> > > > > needs of statistical data structures, based on my experience
> > > > > developing pandas for quant finance research. I think these
> > > > > broadly fall into a few categories: implementation ease,
> > > > > usability (for the non-developer IPython-based console user),
> > > > > performance, and flexibility. Hopefully this will be useful
> > > > > information that will help guide future development efforts.
> > > > > What do you folks think?
> > > > >
> > > > > As part of this, I was thinking maybe we should start a wiki
> > > > > page (or pages) somewhere to start listing out the various
> > > > > design issues (big and small) where people can write their
> > > > > opinions and we can have a structured discussion (e-mail is a
> > > > > bit hard for this sort of thing). I'd also like to spend some
> > > > > time reading through other people's code (e.g. all of the larry
> > > > > code) and writing down what I think about their design choices
> > > > > in a constructive way.
> > > > >
> > > > > Part of what prompted my idea for a wiki was reading some of
> > > > > the larry code and wanting to share my thoughts on various
> > > > > parts of it. Of course I'm also prepared for other people to
> > > > > attack (and for me to have to defend) my own code. For most of
> > > > > these things there isn't a "right" and "wrong" and I am only
> > > > > interested in having constructive discussions and hearing
> > > > > people's perspectives. Here's an example: in pandas when adding
> > > > > two different-labeled 2d arrays, the result has the *union* of
> > > > > all the labels. In la you get the intersection. Certainly there
> > > > > are pros and cons for either approach (in my case I don't want
> > > > > to lose information, even if it's nulled out).
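
To make the union versus intersection distinction concrete, here is a small
sketch in plain Python. It deliberately uses dicts rather than the pandas or
la APIs, and it simplifies the 2d case to one-dimensional labeled values;
add_union and add_intersection are hypothetical helper names for this
example only.

```python
import math


def add_union(left: dict, right: dict) -> dict:
    """Keep every label from either operand.

    A label missing on one side contributes NaN, so the label survives
    in the result but its value is nulled out.
    """
    labels = sorted(set(left) | set(right))
    return {k: left.get(k, math.nan) + right.get(k, math.nan) for k in labels}


def add_intersection(left: dict, right: dict) -> dict:
    """Keep only the labels present in both operands."""
    labels = sorted(set(left) & set(right))
    return {k: left[k] + right[k] for k in labels}


a = {"x": 1.0, "y": 2.0}
b = {"y": 10.0, "z": 20.0}

union_sum = add_union(a, b)         # labels x, y, z; x and z are NaN
inter_sum = add_intersection(a, b)  # label y only
```

The union result keeps a record that "x" and "z" existed, at the cost of
NaN handling downstream; the intersection result is always fully populated
but silently drops labels.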
> > > > >
> > > > > We should also have a place where we document differences in
> > > > > performance for various operations. I spent a lot of time even
> > > > > before pandas was open-source obsessing over speed-- I'd like
> > > > > to think I learned a few things but I was operating in a bubble
> > > > > so I might have missed really obvious speedups. I also learned
> > > > > lots of odd things about NumPy (did you know fancy indexing is
> > > > > a LOT slower than ndarray.take?). We should probably establish
> > > > > some apples-to-apples performance benchmarks to help people
> > > > > decide what to use for their applications if speed matters.
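
An apples-to-apples check of the fancy indexing versus ndarray.take remark
could look roughly like the sketch below. The array sizes, seed and repeat
count are arbitrary choices for illustration, and the timings will depend on
the NumPy version and the machine, so the script only reports the numbers
and makes no claim about which one wins.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000)
idx = rng.integers(0, data.size, size=100_000)

# Both forms select the same elements; check that before timing them.
fancy = data[idx]
taken = data.take(idx)
same = np.array_equal(fancy, taken)

# Time each form on identical inputs with the same number of repeats.
t_fancy = timeit.timeit(lambda: data[idx], number=20)
t_take = timeit.timeit(lambda: data.take(idx), number=20)

print(f"fancy indexing: {t_fancy:.4f} s, ndarray.take: {t_take:.4f} s")
```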
> > > > >
> > > > > Best,
> > > > > Wes
> > > >
> > > >    *Vincent Davis
> > > > 720-301-3003 *
> > > > vincent at vincentdavis.net
> > > >  my blog <http://vincentdavis.net> |
> > > > LinkedIn<http://www.linkedin.com/in/vincentdavis>
> > > > _______________________________________________
> > > > Biopython mailing list  -  Biopython at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>
>
>
> --
> Jose M. Blanca Postigo
> Instituto Universitario de Conservacion y
> Mejora de la Agrodiversidad Valenciana (COMAV)
> Universidad Politecnica de Valencia (UPV)
> Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
> 46022 Valencia (SPAIN)
> Tlf.:+34-96-3877000 (ext 88473)
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>