Bioperl: Project volunteers

Thu, 13 Jan 2000 18:31:53 +0000 (GMT)

On Tue, 11 Jan 2000, John Peden wrote:

> 
> Hi,
> 
> 
> I expect there may be a number of lurkers on this list who are slightly
> confused as to where they should start supporting bioperl. I regularly write
> perl but most of it is rather crude and I realise that there must be more
> elegant/efficient ways to code much of it, I rarely use objects and I am
> unfamiliar with the literature on Project or Object design.
> 

I think many people are in your position John, and it doesn't mean you are
a bad programmer at all: generally it is because people are isolated and
so don't share their tricks of the trade, and a lot of modules look quite
daunting to understand, so people seem to be stuck. Take heart - a lot of
people (myself included) were in your position, and we aren't *THAT* bad
at programming.

I would first start to write scripts using the bioperl objects. In
particular, try to write real, fully featured scripts (with options) that
drive one or two objects (for example, what about a script to use the
RestrictionEnzyme class? Or the BLAST parser?). This will teach you about
objects. Don't be afraid to post problems to the list (once you have a
problem - do try to solve it yourself for 5 minutes before you post here).

From that point I think a lot of how the modules work will become clearer
to you. This means that writing your own module will be easy. Once you've
got the knack it is extremely easy...

> 
> If I am to contribute to bioperl I will need to improve my Perl. Is there
> any documentation available on what standards bioperl modules should be
> written to, or any naming conventions? Are there any modules that represent
> good/perfect bioperl code? Perhaps more controversially are there modules
> which are not so good ? would anyone be willing to briefly explain the main
> faults of the module.

There are bits and bobs of documentation. I have included at the end here
something I wrote on an enforced 4 hour stay over at JFK. It is not quite
what you want but it helps somewhat.

Good or "modern" bioperl code I think is probably best found in
Bio::SeqFeature::Generic, which is not in the 0.5 release, but if you
connect by anonymous cvs (go to http://cvs.bioperl.org and read the
instructions - it is very easy) then it is there. Bio::Seq is *not* good.
Bio::Index::Abstract and Bio::Index::EMBL are both nice code and do
something reasonably complicated, with the Abstract class doing the
indexing. They might blow your mind though (there is a particularly sneaky
upcast...).

Bad bioperl code: Bio::Seq, which I have rewritten (it is Bio::NewSeq in
the anonymous cvs stuff) but which will need some massaging to get it ready
for the new things. Other - simple - stuff we need: molecular weight
modules and a codon table object which can then be used by translate
(the codon table should probably be able to store codon frequency,
optionally). A very simple ORF finder would be nice as well.

Make sure when you develop code you develop against the cvs
repository. This keeps your code working with the current
version. Eventually, if it looks like you have good things to contribute
on a regular basis you can get a read/write cvs login - we don't want to
give them out to people who just "want one" but as soon as there is a
level of trust about the code we are happy.

Does this help?

> 
> John
> 
> John Peden			Tel. +44-(0)1865-222350
> MRC Haematology		Fax. +44-(0)1865-222500
> IMM, Oxford OX3 9DS
> 

Here is the biodesign.pod I wrote:

=head1 NAME

Bioperl - Design Documentation

=head1 SYNOPSIS

Not appropriate. Read on...

=head1 DESCRIPTION

Bioperl is a coordinated project which has a number of design features
to allow bioperl to be well used, extended and collaborate with other
packages. This design can be focused in a number of areas.

  bioperl etiquette and learning about it
  bioperl root object - exception throwing, exceptions etc.
  bioperl interface design
  bioperl sequence object design notes

=head1 AUTHOR

This was written by Ewan Birney in a variety of airports across the US.

=head1 Reusing code and working in collaborative projects

Often the biggest problem in reusing a code base like bioperl is that
it requires both the people using it and the people contributing to
it to change their attitude towards code. Generally people in bioinformatics
are more likely to be self-taught, single programmers, who put together most
of their scripts/programs as individuals. Bioperl is a truly collaborative
project (the core code is the product of about 15 individuals) and any one
person will only be contributing some part of it in the future.

Here are some notes about how my coding style has changed to work in
collaborative projects.

=head2 Learn to read documentation

Reading documentation is sometimes as tough as writing the
documentation. Try to read documentation before you ask a question -
not only might it answer your question, but more importantly it will
give you an idea of why the person who wrote the module wrote it - and this
will be the framework in which you can understand his or her answer.

=head2 Respect people's code (in particular if it works)

If the code does what you want, the fact that it is not written the
way you would write it should not be a big issue. Of course, if there is
some glaring error then that is worth pointing out to
someone. Dismissing a module on the basis of its coding style is a
tremendously wrong thing to do.

=head2 Learn how to provide good feedback

This ranges from giving very accurate bug reports (this script -->
makes this error, giving all data), through to pointing out design
issues in a constructive manner (not - this *sucks*). If you find
a problem, then providing a patch using diff, or a workaround, is
a great thing to do - the author/maintainer of the module will
love you for it.

Providing "I used XXX and it did just what I wanted it to do" feedback
is also really great. Developers generally only hear about their mistakes.
To hear about successes gives everyone a warm glow.

One trick I have learnt is that when I download a new project/code or
use a new module I open up a fresh buffer in emacs and keep a mini diary
of everything that I do or think as I start to use the package. After
I have used it I can go back, edit the buffer and then send it to the author,
with anything from "it was great - it did just what I wanted, but I found that
the documentation here was misleading" to "to get it to install I had
to incant the following things..."

=head2 Taking on a project

When you want to get involved, hopefully it will be because you want to
extend something or provide better facilities for something. The important
thing here is not to work in a vacuum. Providing the main list with
a good proposal about what you are going to do before you start (and
listening to the responses) is a must. I have been pulled up so many times by
other people looking at my design that I can't imagine coding stuff now without
feedback.

=head2 Designing good tests

Sadly, you might think that you have written good code, but you don't know
that until you manage to test it! The CPAN style perl modules have a wonderful
test suite system (delve around into the t/ directories) and I have extended
the makefile system so that the test script which you write to test the module
can be part of the t/ system from the start. Once a test is in the t/ system it
will be run millions of times worldwide when bioperl is downloaded, providing
incredible and continual regression testing of your module (for free!).

=head2 Having fun

The coding process should be enjoyable, and I get very proud of people who tell
me that they picked up bioperl and it worked for them, even if they don't use
a single module that I wrote. There is a brilliant sense of community in bioperl
about providing useful, stable code and it should be a pleasure to contribute to it.

So - I am always looking forward to people posting on the guts list
with their feedback/questions/proposals, as well as to the long standing fun we
have making new releases.

=head1 Bioperl Root Object

All objects in bioperl (except interfaces - see the next section) inherit
from the Root Object. The bioperl root object allows a number of very useful
concepts to be provided. In particular:

=over

=item exceptions

The bioperl root object allows exceptions to be thrown on the object, with
very nice debugging output.

=item context

The bioperl root object has a context which allows, in particular, exceptions
that are thrown to say which object is throwing the exception.

=item rearrange

The bioperl root object has some helper methods, in particular rearrange, to
help functions which take hash inputs.

=back
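As a rough illustration of what a rearrange-style helper does for functions
taking hash inputs, here is a minimal standalone sketch. The function name
rearrange_args and its exact behaviour are assumptions made for this example;
the real helper in Bio::Root::Object may differ in name and detail.

```perl
use strict;
use warnings;

# Hypothetical helper, written for illustration only: it takes the wanted
# parameter order plus a list of named arguments, and returns the values
# in that order (undef for anything that was not supplied).
sub rearrange_args {
    my ($order, %named) = @_;
    my %by_name;
    for my $key (keys %named) {
        (my $name = uc $key) =~ s/^-//;   # -Start, -start and START all match
        $by_name{$name} = $named{$key};
    }
    return map { $by_name{ uc $_ } } @$order;
}

# The caller may give the named arguments in any order
my ($start, $end, $strand) =
    rearrange_args([qw(START END STRAND)], -end => 20, -start => 5);

print "start=$start end=$end\n";
```

The point of the design is that callers can use readable named arguments
while the function body still unpacks them into plain variables in one line.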

=head2 Using the root object.

To use the root object, the object has to inherit from it. This means
the @ISA array should have Bio::Root::Object in it and that the
module goes "use Bio::Root::Object". The root object provides the
->new function. This new function builds a hash, sets up some root object
management details and then calls the _initialize function. It is this
function which your object needs to implement.  The full code is given
below.

   # convention is that if you are using the Bio::Root object you should put it
   # inside the Bio namespace

   package Bio::MyNewObject;
   use vars qw(@ISA);
   use strict;

   use Bio::Root::Object;
   @ISA = qw(Bio::Root::Object);

   # new() is inherited from Bio::Root::Object
   # _initialize is where the heavy stuff will happen when new is called

   sub _initialize {
      my($self,@args) = @_;
      # call superclasses initialize

      my $make = $self->SUPER::_initialize(@args);

      # do your own argument processing here
      # set default attributes etc...

      return $make; # success - we hope!
   }
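To see the new/_initialize hand-off run without bioperl installed, the
following sketch uses a made-up base class, My::Root, standing in for
Bio::Root::Object. The class names, the -id argument and the hash keys are
all assumptions for illustration; the real root object does more
bookkeeping than this.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the root object; NOT the real Bio::Root::Object
package My::Root;

sub new {
    my ($class, @args) = @_;
    my $self = bless {}, $class;    # build the hash
    $self->_initialize(@args);      # hand over to the (sub)class hook
    return $self;
}

sub _initialize {
    my ($self, @args) = @_;
    $self->{initialized_by_root} = 1;
    return $self;
}

# A subclass only has to provide _initialize; new() is inherited
package My::NewObject;
our @ISA = ('My::Root');

sub _initialize {
    my ($self, %args) = @_;
    my $make = $self->SUPER::_initialize(%args);   # let the parent go first
    $self->{id} = defined $args{-id} ? $args{-id} : 'unknown';
    return $make;
}

sub id { return $_[0]->{id} }

package main;

my $thing = My::NewObject->new(-id => 'seq1');
print $thing->id(), "\n";
```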

=head2 Throwing Exceptions

Exceptions are die calls, in which the $@ variable (a scalar) is
used to indicate how the code died. The exceptions can be caught using the
eval {} system. The bioperl root object has a method called "->throw"
which calls die but also provides a full stack trace of where the
throw happened (and also which object the exception was thrown on -
see the context section). So an exception like

  $obj->throw("I am throwing an exception");

provides the following output on STDERR if it is not caught:

  -------------------- EXCEPTION --------------------
  MSG: I am throwing an exception
  CONTEXT: Error in object Bio::Root::Object "anonymous Bio::Root::Object"
  SCRIPT: myscript.pl
  STACK:
  main::my_subroutine(7)
  main::(3)
  ---------------------------------------------------

indicating that this exception was thrown at line 7 of subroutine my_subroutine,
in myscript.pl

Exceptions can be caught using an eval block, such as

 my $obj = Bio::SomeObject->new();
 my $obj2;
 eval {
   $obj2 = $obj->method1();
   $obj2->method2(10);
 };

 if( $@ ) {
   # exception was thrown
   &tell_user("Exception was thrown, preventing whatever I wanted to do. Actual exception $@");
   exit(1); # non-zero exit status signals failure
 }

 # else - use $obj2

Notice that the eval block can have multiple statements in it, and
also that if you want to use variables outside of the eval block, they
must be declared with my outside of the eval block (you are planning
to use strict in your scripts, aren't you?).
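The same eval/die pattern can be tried in plain Perl with no bioperl
objects at all. In this sketch parse_length is a made-up routine, and
testing the return value of eval (rather than only $@) is a slightly more
defensive variant of the idiom, not something the root object requires.

```perl
use strict;
use warnings;

# Hypothetical routine that throws (dies) on bad input
sub parse_length {
    my ($text) = @_;
    die "not a number: [$text]\n" unless $text =~ /^\d+$/;
    return $text + 0;
}

my $length;              # declared outside the eval so it is visible afterwards
my $ok = eval {
    $length = parse_length("12x");
    1;                   # only reached if nothing died
};

my $error = $ok ? '' : $@;
if ($error) {
    # exception was thrown; $length was never assigned
    print STDERR "Exception was thrown: $error";
}
```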

=head2 context

Each bioperl object has a context, which is given by the name
attribute (name is a method defined in the Bio::Root::Object
package). This context is displayed when the exception is made, so
that the following script:

  use Bio::Root::Object;
  my $obj = Bio::Root::Object->new;

  $obj->name("Context-A");
  &my_subroutine($obj);

  sub my_subroutine {
        my $self = shift;
        $self->throw("I am throwing an exception");
  }

Produces the following exception

  -------------------- EXCEPTION --------------------
  MSG: I am throwing an exception
  CONTEXT: Error in object Bio::Root::Object "Context-A"
  SCRIPT: test2.pl
  STACK:
  main::my_subroutine(10)
  main::test2.pl(6)
  ---------------------------------------------------

Notice that the object now says that it is Context-A.

This context is particularly useful when objects are produced from a
database. This is because some exceptions are really due to problems
with the data in an object rather than the code. These sorts of
exceptions are better tracked down when you know where the object came
from, not where in the code the exception is thrown.

One of the drawbacks to this scheme is that the attribute ->name is
"special" from bioperl's perspective. I believe it is best to stay
away from using $obj->name() to mean anything from the object's
perspective (for example ->id() ), leaving it free to be used as a
context for debugging purposes. You might prefer to overload the name
attribute to be "useful" for the object.
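How a name can travel with an object into its error report can be sketched
in a few lines. My::Named, its throw method and the identifier used as the
name are all invented for this example; the real root object produces a
fuller report including the stack trace.

```perl
use strict;
use warnings;

# Hypothetical class for illustration; NOT the real Bio::Root::Object
package My::Named;

sub new {
    my ($class) = @_;
    return bless { name => 'anonymous' }, $class;
}

# Combined getter/setter for the debugging context
sub name {
    my ($self, $value) = @_;
    $self->{name} = $value if defined $value;
    return $self->{name};
}

# die with a message that says which object is complaining
sub throw {
    my ($self, $msg) = @_;
    die sprintf(qq{MSG: %s\nCONTEXT: Error in object %s "%s"\n},
                $msg, ref($self), $self->name());
}

package main;

my $record = My::Named->new;
$record->name("EMBL:X12345");    # made-up identifier, e.g. a database key
eval { $record->throw("sequence has zero length") };
my $report = $@;
print $report;
```

Setting the name at the point where the object is built from the database
is what lets the report point at the offending record rather than at a
line of code.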

=head1 Bioperl Interface design

Bioperl has been moving to a split between B<interface> and
B<implementation> definitions.  An interface is solely the definition
of what methods one can call on an object, without any knowledge of
how it is implemented. An implementation is an actual, working
implementation of an object. In languages like Java, interface
definition is part of the language. In Perl, as with many aspects of
Perl, you have to roll your own.

In bioperl, the interface names are called Bio::MyObjectI, with the
trailing I indicating it is an interface definition of an object. The
interface files (sometimes nicknamed the 'I files') provide mainly
documentation on what the interface is, and how to use (and implement)
it. All the functions which the implementation is expected to provide
are defined as subroutines which simply die with an informative
message. The exceptions to this rule are the implementation independent
functions (see later).

Objects which want to implement this interface should inherit from
Bio::MyObjectI by putting it in their @ISA array. This means that if the
implementation does not provide a method which the interface defines,
rather than getting a "method not found" error the user gets a
"mymethod was not defined in MyObjectI, but should have been" error, which
makes it clearer that whoever provided the implementation was to
blame, and not the caller/script writer.
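A minimal sketch of such an 'I file' and an incomplete implementation is
given here. My::ShapeI, My::Square and the wording of the message are
invented for illustration and are not taken from any bioperl module.

```perl
use strict;
use warnings;

# Hypothetical interface file: mostly documentation, plus stubs that die
package My::ShapeI;

sub area {
    my ($self) = @_;
    die "area was not defined in " . ref($self)
      . ", but My::ShapeI says it should have been\n";
}

# An incomplete implementation: it forgot to provide area()
package My::Square;
our @ISA = ('My::ShapeI');

sub new {
    my ($class, $side) = @_;
    return bless { side => $side }, $class;
}

package main;

my $square = My::Square->new(3);
print "is a shape\n" if $square->isa('My::ShapeI');

# Calling the missing method falls through to the interface stub
eval { $square->area() };
my $complaint = $@;
print $complaint;
```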

When people want to check they have valid objects being passed to
their functions they should test for the presence of the interface, not
the implementation. For example:

  sub my_sequence_routine {
    my($seq,$other_argument) = @_;

    $seq->isa('Bio::SeqI') || die "[$seq] is not a sequence. Cannot process";

    # do stuff

  }

This is in contrast to 

  sub my_incorrect_sequence_routine {
    my($seq,$other_argument) = @_;

    # this line is INCORRECT
    $seq->isa('Bio::Seq') || die "[$seq] is not a sequence. Cannot process";

    # do stuff

  }

=head2 Rationale of interface design

Some people might justifiably ask "why do this?". The main reason is
to support objects from outside bioperl, and allow them to masquerade
as real bioperl objects. For example you might have your own quite
intricate sequence object which you want to use in bioperl functions,
but don't want to lose your own neat coding. One option would be to
have a function which built a bioperl sequence object from your
object, but then you would be endlessly building temporary objects and
destroying them, in particular if the script yo-yoed between your code
and bioperl code.

A better solution would be to implement the Bio::SeqI interface. You
would read the Bio::SeqI documentation, and then provide the methods
which it requires, and put Bio::SeqI in your @ISA array. Then you
could pass your object into bioperl routines and et voila - you
B<are> a bioperl sequence object.

(A problem might arise if your object has the same methods as the
Bio::SeqI methods but uses them differently - your $obj->id() might
mean provide the raw memory location of the object, whereas the
documentation for Bio::SeqI $obj->id() says it should return the
human-readable name. If so you need to look into providing an
'Adaptor' class, as suggested in the Gang-of-Four book).
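An Adaptor of this kind might look roughly as follows. All the class names
here (My::SeqI, Home::Sequence, Home::SeqAdaptor) and the label method are
assumptions for the sake of the example; a real adaptor would cover every
method the Bio::SeqI documentation asks for.

```perl
use strict;
use warnings;

# Hypothetical interface: id() must return a human-readable name
package My::SeqI;
sub id { die "id not implemented by " . ref($_[0]) . "\n" }

# Existing home-grown class whose id() means something else entirely
package Home::Sequence;
sub new   { my ($class, %attr) = @_; return bless {%attr}, $class }
sub id    { return 42 }                  # say, an internal record number
sub label { return $_[0]->{label} }

# Adaptor: holds a Home::Sequence and answers in My::SeqI terms
package Home::SeqAdaptor;
our @ISA = ('My::SeqI');
sub new {
    my ($class, $wrapped) = @_;
    return bless { wrapped => $wrapped }, $class;
}
sub id { return $_[0]->{wrapped}->label() }

package main;

my $mine    = Home::Sequence->new(label => 'HSU12345');
my $adapted = Home::SeqAdaptor->new($mine);
print $adapted->id(), "\n";
```

The adaptor is a thin, long-lived wrapper, so the home-grown object keeps
its own meaning of id() and no temporary copies are built.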

Interface classes really come into their own when we start leaving
Perl and enter extensions wrapped over C or over databases, or
through systems like CORBA to other languages, like Java/Python
etc. Here the "object" is often a very thin wrapper over
a DBI interface, or an XS interface, and how it stores the object
is really different. By providing a very clear, implementation free
interface with good documentation there is a very clear target
to hit.

Some people might complain that we are doing something very "un-perl-like"
by providing these separate interface files. They are 90% documentation,
and could be provided anywhere; in many ways they could be merged with
the actual implementation classes, making it clear that if someone
wants to mimic a class they should override the following methods. However,
we (and in particular myself - Ewan) prefer a clear separation of the
interface. It gives us a much clearer way of defining what is going on.
It is in many ways just "design sugar" (as opposed to syntactic sugar)
to help us, but it really helps, so that's good enough justification to me.

=head2 Implementation functions in Interface files

One of the issues we discovered early on in using Interface files was
that there were methods that we would like to provide for classes
which were independent of their implementation. A good example is
a "Range" interface, which might define the following methods

   $obj->start()
   $obj->end()

Now a client to the object might want to use a $obj->length() method,
because it is much easier than retrieving the two attributes and
subtracting them. However, the ->length() method is just a pain for
someone providing the implementation to provide - once start() and
end() are defined, length is too. There seems to be a catch-22 here: to
make an object definition good for a B<client> one needs to have
additional, helper methods "on top of" the interface, however to make
life easier for the B<object implementation> one wants to have the
bare minimum of functions defined which the implementer has to
provide.

In the Range interface this became more than an annoyance, as a lot of the
"smarts" of the Range system was that we wanted to have the ability to
say

  if( $range->intersection($someother_range) ) 

We wanted a generic RangeI interface that we could apply to many
objects, with definitions required only for ->start, ->end and
->strand. However we wanted the ->intersection, and ->union methods to
be on all ranges, without us having to reimplement this every time.

Our (Matt Pocock and Ewan Birney's) solution was to allow
implementation into the RangeI interface file, but only when these
implementations sat "on top" of the interface definition and therefore
provided helper client operations. In a language like Java, we would
clearly have two classes, with a composition/delegation method:

   MyPublicSomethingClass has-a MyInternalSomethingInterface, with

   ADifferentImplementation implements MyInternalSomethingInterface

However this is really heavy handed in Perl (and people were
complaining about having different implementation/interface
classes). We were quite happy about merging the implementation
independent functions with the interface definition, and I (Ewan) have used
this in other interfaces since then. The documentation has to be clear
about what is going on, but I think in general it is.
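The idea of helper methods sitting "on top of" an interface can be sketched
like this. My::RangeI and My::Exon are made-up names, overlaps is a simple
stand-in for the intersection/union helpers, and the length calculation
assumes inclusive start and end coordinates; none of this is the actual
RangeI code.

```perl
use strict;
use warnings;

# Hypothetical interface file mixing stubs and implementation independent code
package My::RangeI;

# Required: every implementation must provide these two
sub start { die "start not implemented by " . ref($_[0]) . "\n" }
sub end   { die "end not implemented by "   . ref($_[0]) . "\n" }

# Helpers written once, purely in terms of start() and end()
sub length {
    my ($self) = @_;
    return $self->end() - $self->start() + 1;   # assumes inclusive coordinates
}

sub overlaps {
    my ($self, $other) = @_;
    return $self->start() <= $other->end()
        && $other->start() <= $self->end();
}

# A bare-minimum implementation: only start and end, stored in an array
package My::Exon;
our @ISA = ('My::RangeI');
sub new   { my ($class, $s, $e) = @_; return bless [$s, $e], $class }
sub start { return $_[0]->[0] }
sub end   { return $_[0]->[1] }

package main;

my $first  = My::Exon->new(10, 20);
my $second = My::Exon->new(15, 30);
print $first->length(), "\n";
print "ranges overlap\n" if $first->overlaps($second);
```

The implementer writes two trivial methods and the client still gets the
convenient ones for free, which is the trade-off described above.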

=head2 IDL (Interface Definition Language)

There is an idl definition of bioperl in bioperl.idl. This is the start
of a new era of interoperability in this field, so please read it and
see if you can comment on it.
