[Bioperl-l] Bioperl List Summary - January 2003

Aaron J Mackey ajm6q at virginia.edu
Tue Mar 4 10:58:06 EST 2003


    As I seem to be volunteering for more and more BioPerl documentation
    jobs recently, I thought I'd pool my resources and recycle some of my
    tuits to write a list summary. Expect these to be sporadic and
    incomplete; my goal is to highlight important questions, changes, fixes,
    and proposals, not recapitulate all list traffic. I'll try to include
    appropriate links to specific messages, or at least to the parent
    message. It'll probably take me a while to get good at this, so please
    bear with me (and do send any suggestions).

    To play a bit of catch up, I'm now going to loosely summarize the entire
    month of January (leaving a few topics untouched that are better
    addressed in February). February's summary will be ready soon, after
    which you'll see more easily digestible weekly (or perhaps bi-weekly)
    summaries. I'll also be posting the HTML-ized summaries on my O'Reilly
    weblog with active hyperlinks.

    One item from December 31 of 2002 bears mentioning: Ewan Birney released
    stable version 1.2, with significant new functionality, and important
    updates to code that makes use of NCBI web services; upgrading is highly
    recommended, although some of the January list activity reflects small
    trials and tribulations with this release.



    * Searching the mailing list archives
        This seemed like an appropriate topic to put at the top of my list.
        The Bioperl-l mailing list isn't exactly as high-traffic as
        perl5-porters or the linux kernel mailing list, but it is a mixture
        of both deeply technical development issues and novice user
        questions. While the BioPerl tutorial and documentation are the
        first places one should look for answers, the second place must be
        the archives of the mailing list. Brian Osborne pointed out that
        "the Search box is hidden below the Thanks link at www.bioperl.org".

        It wasn't mentioned, but the "htdig" link Hilmar Lapp pointed out
        (which is also below the search box) does not actually index the
        bioperl mailing list, but seems to search all other OBF-affiliated
        lists (biojava, biopython, etc) ...


        Michal Kurowski pointed out that "the quickest way of accessing old
        postings seems to be a group archive from the mailman pages" and
        that "you can even download the whole thing and use it as a local
        mailbox", which happens to be very useful if you want to write list
        summaries. Mailman archives are at:


    * Bioperl 1.2 builds under cygwin
        John Nash reports that he was able to build the 1.2 distribution
        under cygwin once MakeMaker issues were overcome (in his case by
        upgrading to perl 5.8.0). Other tips are provided:


    * Getting/untarring the 1.2 distribution
        Some people had trouble either FTPing the 1.2 distribution, or with
        successfully untarring the tarball. These problems seem to have
        resolved themselves, and may have been related to router issues
        at the server. For the record, bioperl-1.2 can be found at:


    * man pages with bioperl-1.2
        People may have noticed that the "make" process for bioperl-1.2
        neither generates nor installs man pages. Ewan Birney explains, "In
        1.2 we had to drop the manifyfication stage of the makefile because
        it was triggering a line-too-long error on some OSs due to shell
        constraints". If you wish to get them back, comment out (or delete)
        the MY::manifypods sub in Makefile.PL.


    * Converting ABI trace to Phred format
        When asked why an ABI trace file read via SeqIO::abi didn't generate
        a Bio::Seq::SeqWithQuality (a sequence with associated quality
        values), Aaron Mackey replied, "I'm not sure why abi.pm in the
        bioperl distribution doesn't set it's sequence factory to
        SeqWithQuality"; I'm still not sure why. See the fix at:


    * biocorba status
        When asked about the status of the biocorba project, Jason Stajich
        replied, "We have working bindings in java,perl,python and bridges
        to the respective Bio* toolkits from these bindings for servers and
        clients based on a slightly modified BSANE IDL spec from OMG". He
        qualified that statement with "none of the original developers are
        using it in any of their work so development and final rounds of
        testing have not really happened".


    * DNA Smith-Waterman
        Yee Man has reimplemented the classic Smith-Waterman algorithm, with
        algorithmic improvements as suggested by Gotoh (affine gaps) and
        Myers & Miller (linear space), and wondered whether it would be a
        good addition to the BioPerl C-coded extension library (which
        currently contains a protein-only Smith-Waterman implementation by
        Ewan Birney, pSW.pm). Some discussion about classic (and novel)
        dynamic programming algorithms ensued, which eventually boiled down
        to a desire to have the generic (but extremely fast) Smith-Waterman
        code (written by Webb Miller) used by Bill Pearson's SSEARCH
        implementation made more widely available as a linkable C library
        (which BioPerl could then subsume). Interested parties should
        contact me. Relatedly, to answer one of our FAQs yet again, if you
        currently want to do Smith-Waterman on DNA sequences, you should use
        BioPerl's bindings to the EMBOSS suite of sequence utilities.
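
        For readers unfamiliar with the algorithm under discussion, here is
        a minimal sketch of Smith-Waterman local alignment scoring on DNA
        strings. It is not Yee Man's code, pSW, or anything from EMBOSS: it
        uses a simple linear gap penalty (not Gotoh's affine gaps), keeps
        only two rows of the matrix to get the score (which is not the
        Myers & Miller linear-space alignment recovery), and the function
        name and scoring values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not recommended defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

        Only the score is returned; recovering the alignment itself needs
        either the full matrix or a divide-and-conquer traceback.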


    * using AUTOLOAD for get/set accessors
        The BioPerl code is full of explicitly coded accessor methods; often
        we are asked why we don't use more code-efficient methods of
        autogenerating these identical functions (via AUTOLOAD or
        Class::MakeMethod). The discussion is long-ranging, but it boils
        down to wanting every accessor to have the same functionality with
        respect to undef values and return value behavior, as dictated by
        our accessor "boilerplate" (which we kindly ask everyone to use).
        Yes, we know we can achieve that via sophisticated Class::MakeMethod
        usage, but we have bigger fish to fry at the moment. There's
        another, subtler issue about interfaces and implementation method
        introspection, but I'll leave that to a later discussion.


    * Bio::Seq no longer a RangeI (bug in Bio::Graphics::Panel)
        Much to the consternation of Lincoln Stein (and his legions of
        Bio::Graphics users), BioPerl 1.2 introduced a change to Bio::Seq in
        that it no longer complies with the Bio::RangeI interface; see
        Heikki Lehvaslaiho's "This has to be cruft!" message from November:


        Unfortunately, Bio::Graphics::Panel relied on Bio::Seq having a
        "start" method, so lots of existing code broke. A number of fixes
        were recommended, including a) using a Bio::Seq::SeqFactory to
        generate Bio::LocatableSeq objects (which do implement RangeI
        methods), b) patching your Bio::Graphics::Panel, and c) upgrading
        BioPerl 1.2 to the live CVS development version. A BioPerl 1.2.1
        release is forthcoming for this and other reasons.


    * complement(join(e1, e2)) vs. join(complement(e1), complement(e2))
        Periodically, people ask "Is it possible to have bioperl output
        features in Genbank format of the form
        'complement(join(1..50,60..100))' rather than
        'join(complement(1..50),complement(60..100))'?" This time it
        degenerated a little into a discussion about whether these two
        representations were semantically equivalent (short answer: yes).
        The answer to the original question is that BioPerl parses either
        representation into the same structure, which can only be "dumped"
        in one representation (presently, the latter).
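
        To illustrate what "parses either representation into the same
        structure" can mean, here is a toy sketch that flattens a location
        string into (start, end, strand) tuples. It is not BioPerl's parser:
        the function names are hypothetical, it handles only plain
        "start..end" ranges with "join" and "complement", and it keeps
        sub-locations in the order written, ignoring any ordering
        conventions.

```python
def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def flatten_location(text, strand=1):
    """Flatten a location string into a list of (start, end, strand)."""
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        # complement() flips the strand of everything inside it
        return flatten_location(text[len("complement("):-1], -strand)
    if text.startswith("join(") and text.endswith(")"):
        segments = []
        for part in split_top_level(text[len("join("):-1]):
            segments.extend(flatten_location(part, strand))
        return segments
    start, end = text.split("..")
    return [(int(start), int(end), strand)]


if __name__ == "__main__":
    a = flatten_location("complement(join(1..50,60..100))")
    b = flatten_location("join(complement(1..50),complement(60..100))")
    print(a == b, a)
```

        Once both strings collapse to the same list, nothing records which
        spelling was read, so a writer has to pick one form on output.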


    * GenBank bond() FT operator
        Recent GenBank files have begun to exhibit a new feature location
        operator, "bond", to identify dicysteine bonds in proteins and mRNA
        splice sites in RefSeq sequences. BioPerl has no concept of this
        location operator (which is really more of a feature, and would be
        better represented as a /bond feature table entry), and so currently
        dies when parsing a record containing it. A brute force fix is
        provided, but a better answer is yet to appear:



    * SearchIO now has megablast parser
        Jason Stajich writes, "The oft requested megablast parser has now
        been implemented in SearchIO". This should be available in the
        upcoming bioperl 1.2.1 bug-fix release, as well as CVS.


    * bl2seq parser needs to know report type to get strand right
        No matter how hard he tried, Dave Arenillas couldn't retrieve HSP
        strand information from a bl2seq (BLAST two sequences against each
        other) report. After "a little bit of detective work", Jason Stajich
        found that the Bio::Tools::BPbl2seq report object needs to be told
        the program type (e.g. "blastn") since it's not smart enough to
        guess it by context alone. The patch to BPbl2seq.pm is available via


    * bioperl.rpm in biolinux.org distribution
        Marc Logghe reports that "A couple of friends of mine have started
        up www.biolinux.org [ ... and] are offering a number of rpm packages
        for free download like e.g. emboss, sim4, phylip, ncbiblast, ...".
        After some discussion about what a bugbear packaging BioPerl can be
        (most dependencies are not critical for the entire package, only
        certain subparts that may or may not be useful for a given person),
        Hunter Matthews chimed in that he'd likely be able to make a bioperl
        1.2 rpm (he had previously made a 1.0 rpm); Hunter adds "7.3 would
        be the most likely target platform". Marc Logghe later reported that
        "the bioperl rpm's for RedHat and Suse are on line at


    * example scripts reorganization for installation as "production" code
        Spurred on by an earlier conversation regarding the perl scripts
        scattered between examples/ and scripts/, Brian Osborne has taken up
        the challenge to reorganize and reshape these so that those
        functional scripts with adequate POD and a .PLS suffix all live in
        scripts/ and get installed for "production" use. Scripts should
        remain in examples/ if they are simply "proof-of-concept" code, or
        just poorly documented. Everyone liked this, and from the CVS
        activity, it looks like the work is progressing.


    * Bio::Seq::SequenceTrace
        Chad Matsalla has added a Bio::Seq::SequenceTrace object, to "mimic
        the information available in a scf 'Sequence Chromatogram File'". It
        slices and dices. In the process, Chad ended up rewriting his
        Bio::SeqIO::scf code, "because the old module was somewhat ...


    * MLAGAN/LAGAN support
        Stephen Montgomery has supplied both MLAGAN and LAGAN wrappers and
        parsers (the LAGAN Toolkit is a set of alignment programs for
        comparative genomics).



    * SeqIO/scf.pm bug
        Tony Cox "finally got around to checking in a fix for the SeqIO/scf
        module when it has to deal with 8-bit encoded trace data". It's not
        yet clear where this fix stands with Chad Matsalla's rewrite of

    * Bio::Tools::Run::WrapperBase.pm missing from 1.2
        Because of some code migration between bioperl subprojects,
        Bio/Tools/Run/WrapperBase.pm went missing in the 1.2 release,
        causing a wide variety of failures. The 1.2.1 release will address
        this, or you can retrieve the missing file and install it manually
        from here:


    * bug fixes in Blast HSP tiling code
        After finding that for certain BLAST reports the "blast" and
        "psiblast" SearchIO parsers gave mildly differing values for
        "frac_identical" and "frac_conserved", Jason Stajich did some
        auditing of the HSP tiling code and found a few inconsistencies
        which have since been fixed. End result: frac_identical and
        frac_conserved should be better behaved and actually correct.



    * Project ideas for the aspiring biohacker
        Periodically, we're asked "I'd like to get involved, do you have any
        project ideas a newbie could work on?". Jason Stajich shot out a few
        choice ideas including "blastz" SearchIO parsing, which was briefly
        discussed. Get involved!


    * Bio::Perl namespace export groups
        The Bio::Perl module is a top-level, "novice" interface to a few
        small tidbits of BioPerl functionality. Many first-time users
        appreciate the simplicity of the Bio::Perl interface, and so we'd
        like to extend its reach into other "meaty" areas of BioPerl
        functionality. Here we talk about how we might achieve this using
        custom export tags (a la CGI.pm and others). Another area where
        someone could make a dramatic impact without writing any new



    Well, that's it for this installment. Stay tuned for February (a much
    busier month!).
