[Bioperl-l] the roundup (long)

Thu Aug 14 08:33:10 EDT 2003

I've added a bunch of new things and fixed some bugs, and wanted to
summarize them before I get too busy and forget the details.

Here is the roundup.

[Bio::PopGen]

New code in Bio::PopGen implements several statistics for use in testing
the neutrality of mutations in a population.  These include Tajima's D, Fu
and Li's D, Fu and Li's F, as well as some utilities like Theta and Pi.
These are still being put through their paces to ensure that they are
calculating everything properly.
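
As a rough illustration of what these statistics measure, here is a
plain-Perl sketch of Pi, Watterson's Theta and Tajima's D for a set of
aligned, gap-free sequences.  This is not the Bio::PopGen::Statistics
code; the function names neutrality_stats and count_diffs are made up for
this example, and the constants follow the standard Tajima (1989)
definitions.

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions at which two equal-length
# strings differ.
sub count_diffs {
    my ($x, $y) = @_;
    my $d = 0;
    for my $k (0 .. length($x) - 1) {
        $d++ if substr($x, $k, 1) ne substr($y, $k, 1);
    }
    return $d;
}

# Hypothetical helper: returns (pi, S, theta_W, Tajima's D) for a list of
# equal-length aligned sequences.  D is undef when there is no variation.
sub neutrality_stats {
    my @seqs = @_;
    my $n    = scalar @seqs;
    die "need at least two sequences\n" if $n < 2;

    # S: number of segregating (polymorphic) sites
    my $S = 0;
    for my $pos (0 .. length($seqs[0]) - 1) {
        my %alleles;
        $alleles{ substr($_, $pos, 1) } = 1 for @seqs;
        $S++ if scalar(keys %alleles) > 1;
    }

    # pi: average number of pairwise differences
    my ($diffs, $pairs) = (0, 0);
    for my $i (0 .. $n - 2) {
        for my $j ($i + 1 .. $n - 1) {
            $diffs += count_diffs($seqs[$i], $seqs[$j]);
            $pairs++;
        }
    }
    my $pi = $diffs / $pairs;

    # Watterson's theta and the constants for Tajima's D
    my ($a1, $a2) = (0, 0);
    for my $i (1 .. $n - 1) {
        $a1 += 1 / $i;
        $a2 += 1 / $i**2;
    }
    my $theta = $S / $a1;
    my $b1 = ($n + 1) / (3 * ($n - 1));
    my $b2 = 2 * ($n**2 + $n + 3) / (9 * $n * ($n - 1));
    my $c1 = $b1 - 1 / $a1;
    my $c2 = $b2 - ($n + 2) / ($a1 * $n) + $a2 / $a1**2;
    my $e1 = $c1 / $a1;
    my $e2 = $c2 / ($a1**2 + $a2);

    my $var = $e1 * $S + $e2 * $S * ($S - 1);
    my $D   = $var > 0 ? ($pi - $theta) / sqrt($var) : undef;
    return ($pi, $S, $theta, $D);
}

my ($pi, $S, $theta, $D) = neutrality_stats(qw(ACGTA ACGTT ACCTA ACGTA));
printf "pi=%.3f S=%d theta=%.3f D=%.3f\n", $pi, $S, $theta, $D;
```

In this toy data set both variable sites are singletons, so Pi falls below
Theta and D comes out negative.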

A basic Coalescent simulator was already in Bioperl named
Bio::Tree::RandomTree.  This has been renamed
Bio::PopGen::Simulation::Coalescent and uses the revamped
Bio::Tree::AlleleNode objects.

Added an LD calculation implementation in composite_LD for unphased data.
I will also have D-prime for haplotype data by the end of the week.  These
are in Bio::PopGen::Statistics.  Bio::PopGen::PopStats has an
implementation of Fst to test for population structure.
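
For readers unfamiliar with D-prime, the following is a minimal plain-Perl
sketch of D and |D'| for phased, biallelic two-locus haplotypes.  It is
only an illustration under those assumptions and is not the
Bio::PopGen::Statistics code; the function name ld_from_haplotypes and the
two-character haplotype encoding are invented for this example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: each haplotype is a two-character string, one
# allele per locus (e.g. "AB").  The alleles of the first haplotype are
# used as the reference alleles.  Returns (D, |D'|).
sub ld_from_haplotypes {
    my @haps = @_;
    my $n    = scalar @haps;
    my ($refA, $refB) = split //, $haps[0];

    my ($nA, $nB, $nAB) = (0, 0, 0);
    for my $hap (@haps) {
        my ($x, $y) = split //, $hap;
        $nA++  if $x eq $refA;
        $nB++  if $y eq $refB;
        $nAB++ if $x eq $refA && $y eq $refB;
    }
    my ($pA, $pB, $pAB) = ($nA / $n, $nB / $n, $nAB / $n);

    # D: observed haplotype frequency minus its expectation under
    # independence of the two loci
    my $D = $pAB - $pA * $pB;

    # Dmax: the largest |D| possible given the allele frequencies
    my $Dmax = $D >= 0
        ? min($pA * (1 - $pB), (1 - $pA) * $pB)
        : min($pA * $pB, (1 - $pA) * (1 - $pB));

    my $Dprime = $Dmax > 0 ? abs($D) / $Dmax : 0;
    return ($D, $Dprime);
}

my ($D, $Dprime) = ld_from_haplotypes(qw(AB AB ab ab Ab));
printf "D=%.3f |D'|=%.3f\n", $D, $Dprime;
```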

The PopGen::Individual, PopGen::Population, PopGen::Genotype interface and
implementations seem to be reaching a stable point.  I've also unified
these with the bioperl-pedigree code (a separate CVS module).  Some more
small tweaks will probably go in over the coming months as things get put
through the paces, but I hope this can become stable code for a while.

In order to get allele/genotype data into Bioperl I have added
Bio::PopGen::IO, which can parse comma-delimited (CSV) files as well as
prettybase format.  I expect to have the code to take SimpleAlign objects
and turn them into PopGen::Individuals written shortly.

To be sure to give credit - all the PopGen stuff is in collaboration with
Matthew Hahn.  We are preparing a tutorial on these objects, which should
be out in the fall.

[Bio::Matrix]
I added Bio::Matrix::IO to implement a framework for simple matrix
parsing.  This is only meant to simplify things, even though the different
types of matrices are not equivalent.  I added a scoring matrix parser
(IO::scoring) for BLOSUM/PAM matrix parsing.  Unsurprisingly, it produces
Bio::Matrix::Scoring objects.  I also added IO::phylip to parse phylip
distance matrices and thus produce Bio::Matrix::PhylipDist objects.  I also
added a general purpose object, Bio::Matrix::Generic, which is a starting
place for putting column and row data.  This is probably NOT the object you
want to use for PWMs and PSSMs - perhaps Stefan Kirov's stuff will fit in
here.

[memory leaks - Bio::Tree and Bio::SeqFeature]
Some memory leaks have been fixed for Trees and SeqFeatures.  Perl won't
clean up memory cycles unless you explicitly break them in the DESTROY
code.  However, with our object hierarchy DESTROY is not necessarily
called for every subclass unless we do the whole chained destructor
(analogous to our chained constructors).  The way I solved it is to use
the already written _register_for_cleanup method, which is called in the
constructor to specify the cleanup method, instead of relying on DESTROY.
This seems to work.  The only downside is in the case of things like a
Bio::Tree::Tree.  In this case the tree structure is implicit in the Nodes
and their pointers to children/parents.  The problem comes in if we reuse
all or part of the tree to do some test, like this:

    use Bio::Tree::Tree;

    sub foo {
        my @nodes = @_;
        my $tree = Bio::Tree::Tree->new(-nodes => \@nodes);
        ...

        # $tree is destroyed at the end of this scope
    }

The problem is that by default destroying the tree also destroys the nodes
it contains, which is a problem if you want to use those nodes for
something later.  The solution is to set the root_node_pointer to undef at
the end, so that the tree has no way to destroy the underlying nodes.
However, this might be hard to remember to do all the time.

I introduced a -nodelete option (method name nodelete) to the constructor
(default value is false) which if true will not destroy the underlying
nodes.
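
To make the idea concrete, here is a small self-contained sketch of the
pattern in plain Perl.  It is not the Bio::Tree code: My::Node, My::Tree,
break_links and the -root argument are made up for this example, and it
uses DESTROY directly rather than _register_for_cleanup to keep things
short.  It only aims to show how a nodelete-style flag can stop the
container from tearing down nodes the caller still needs.

```perl
use strict;
use warnings;

package My::Node;    # hypothetical stand-in for a tree node class

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{-id}, parent => undef, children => [] },
        $class;
}

sub add_child {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;
    $child->{parent} = $self;    # parent <-> child references form a cycle
    return $child;
}

# Explicitly break the cycles at and below this node so that Perl's
# reference counting can free the memory.
sub break_links {
    my $self = shift;
    $_->break_links for @{ $self->{children} };
    $self->{children} = [];
    $self->{parent}   = undef;
    return;
}

package My::Tree;    # hypothetical stand-in for a tree container

sub new {
    my ($class, %args) = @_;
    return bless {
        root     => $args{-root},
        nodelete => $args{-nodelete} ? 1 : 0,
    }, $class;
}

sub DESTROY {
    my $self = shift;
    return if $self->{nodelete};    # caller still wants the nodes
    $self->{root}->break_links if $self->{root};
    return;
}

package main;

my $root = My::Node->new(-id => 'root');
my $kid  = $root->add_child(My::Node->new(-id => 'kid'));

{ my $tree = My::Tree->new(-root => $root, -nodelete => 1); }
my $kept = scalar @{ $root->{children} };    # links survive the tree

{ my $tree = My::Tree->new(-root => $root); }
my $left = scalar @{ $root->{children} };    # links were torn down

print "children with -nodelete: $kept, without: $left\n";
```

This prints "children with -nodelete: 1, without: 0".  The first tree goes
out of scope without touching the nodes, while the second one dismantles
the structure that $root was still pointing at.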

A similar problem existed for SeqFeature FeaturePairs.  I've added the code
in SeqFeature::Generic, SeqFeature::FeaturePair, and
SeqFeature::Gene::GeneStructure & SeqFeature::Gene::Transcript
which should take care of this now.  I was able to successfully parse a
large number of genewise reports, each of which generated gene/transcript
sets, which had previously caused my perl to crash by running out of
memory.  So I feel we have removed some (probably not all) of the leaks
that get introduced when there are cycles.

The memleak bugs were also fixed on the 1.2 branch, for what it's worth.

[Bio::SearchIO]
Added some more SearchIO parsers.  Borrowing from Bala's Tools::Blat
implementation, I made a SearchIO::psl parser which can parse PSL output.
It needs to be tweaked a little more to skip the header lines if they are
produced, but it works for me for output from Jim's lav2Psl code.
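
For illustration only, one way to skip PSL header lines is to accept just
the lines that have 21 tab-separated columns with an integer in the first
one.  The sketch below is plain Perl and is not the SearchIO::psl code;
the function name parse_psl_lines is invented here, and the field names
follow the column names used in the PSL format description.

```perl
use strict;
use warnings;

# The 21 columns of a PSL record, in order.
my @PSL_FIELDS = qw(
    matches misMatches repMatches nCount
    qNumInsert qBaseInsert tNumInsert tBaseInsert
    strand qName qSize qStart qEnd
    tName tSize tStart tEnd
    blockCount blockSizes qStarts tStarts
);

# Hypothetical helper: turn PSL lines into a list of hashrefs, silently
# skipping anything that does not look like a data line (the psLayout
# header, column titles, the dashed separator, blank lines).
sub parse_psl_lines {
    my @records;
    for my $line (@_) {
        (my $text = $line) =~ s/[\r\n]+\z//;
        my @cols = split /\t/, $text;
        next unless @cols == 21 && $cols[0] =~ /^\d+$/;

        my %rec;
        @rec{@PSL_FIELDS} = @cols;

        # The last three columns are comma-terminated lists, one entry
        # per alignment block.
        for my $list (qw(blockSizes qStarts tStarts)) {
            $rec{$list} = [ split /,/, $rec{$list} ];
        }
        push @records, \%rec;
    }
    return @records;
}

my @lines = (
    "psLayout version 3\n",
    "\n",
    "-" x 40 . "\n",
    join("\t", 50, 0, 0, 0, 0, 0, 1, 80, '+', 'q1', 50, 0, 50,
               'chr1', 1000, 100, 230, 2, '20,30,', '0,20,', '100,200,') . "\n",
);
my @psl = parse_psl_lines(@lines);
printf "%d record(s); first hit %s -> %s with %d block(s)\n",
    scalar @psl, $psl[0]{qName}, $psl[0]{tName}, scalar @{ $psl[0]{blockSizes} };
```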

Additionally a SearchIO::blasttable has been added which can parse NCBI -m
8 or -m 9 output for those just needing some minimal information.
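
As a reminder of what that tabular output holds, here is a plain-Perl
sketch that reads the 12 standard columns and ignores the '#' comment
lines that -m 9 adds.  It is not the SearchIO::blasttable code; the
function name parse_blast_table and the hash keys are made up for this
example.

```perl
use strict;
use warnings;

# The 12 columns of NCBI BLAST -m 8 / -m 9 output, in order.
my @BLAST_COLS = qw(
    query subject pct_identity aln_length mismatches gap_opens
    q_start q_end s_start s_end evalue bit_score
);

# Hypothetical helper: turn tabular BLAST lines into a list of hashrefs.
sub parse_blast_table {
    my @hits;
    for my $line (@_) {
        (my $text = $line) =~ s/[\r\n]+\z//;
        next if $text =~ /^#/ || $text !~ /\S/;   # -m 9 comments, blanks
        my @cols = split /\t/, $text;
        next unless @cols == 12;                  # not a hit line
        my %hit;
        @hit{@BLAST_COLS} = @cols;
        push @hits, \%hit;
    }
    return @hits;
}

my @report = (
    "# BLASTN\n",
    "# Fields: Query id, Subject id, % identity, alignment length, ...\n",
    join("\t", 'q1', 'gi|123', '98.50', 200, 3, 0, 1, 200, 501, 700,
               '1e-100', 380) . "\n",
);
my @hits = parse_blast_table(@report);
printf "%s vs %s: %s%% identity, E=%s\n",
    @{ $hits[0] }{qw(query subject pct_identity evalue)};
```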

[other bugs - from changelog]
    o Bio::SearchIO
      - Fixed bugs in BLAST parsing which prevented NCBI gapped blast
        from being parsed properly (was losing hit significance values
        due to the extra unexpected column).
      - Parsing of blastcl3 (netblast from NCBI) can now handle the case
        of integer overflow (# of letters in nt seq dbs is > MAX_INT),
        although it doesn't try to correct it - you will get the negative
        number.  Added a test for this as well.
      - Fixed HMMER parsing bug which prevented parsing when a hmmpfam report
        has no top-level family classification scores but does have scores and
        alignments for individual domains.

On the 1.2 branch I also fixed a couple of places in SeqIO::genbank
and SeqIO::bsml where we weren't dereferencing the arrayref for keywords.

-jason
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu