[Bioperl-l] Experiences from a newbie

Andreas Bernauer andreas.bernauer at gmx.de
Fri Jan 23 12:17:09 EST 2004


Hi Brian and all other Bioperlers!

Brian Osborne wrote:
> I've started writing a HOWTO on Features and Annotations, I would appreciate
> your taking a look at it and telling me what you think, here's the URL:
> 
> http://bioperl.org/HOWTOs/html/Feature-Annotation.html

Finally I'm back at Bioperl, and I have read your HOWTO.  Here are my
comments.  This mail is quite long and I criticize a lot.  Don't hit
me for that: the fact that I am still using Bioperl proves that its
functionality outweighs the drawbacks of its documentation, although
the latter makes it harder for me to use.  I want my comments to be
understood as constructive, as impulses for improvements you might
want to make if you agree with (parts of) my opinion.

First of all, I want to mention that Brian's HOWTO helped me to
understand what's going on in Bioperl, although I still can't
understand some design decisions of Bioperl (not of the HOWTO).  In
conclusion, I must say that Brian's HOWTO is understandable in itself,
but what remains hard to understand is the way Bioperl is built and
_why_ it was built that way.  Thus, my comments are mostly about the
meta level.

Concerning your HOWTO, I very much like the examples you give,
e.g. how to extract the CDS information from a GenBank file (as this
is exactly what I wanted to do when I used BioPerl the first time :-)
or the link to the Feature Table Document that gives some more
detailed information.  The collection of links at the end of the
document is also very helpful.

What I missed were links to the module descriptions at the places
where you mention the modules.  Sometimes the doc even tells the
reader to look into the module definition without providing a link.
But that's a minor issue.

For example, you briefly explain what a SeqFeature is (associated
with a sequence and has a location there) and what an Annotation is
(associated with a sequence without a location).  So both are
associated with a sequence.  I was asking myself: why do they not
inherit from some parent object if they share this common property?
Why doesn't Bioperl just have an object for a simple sequence and then
derive one specialty after another from that simple object, e.g. one
that has a location and one that doesn't?  Why are there two distinct
objects for that?

I know this sounds totally stupid to you, and now that I have read
the whole HOWTO, including chapter 6, which explains Annotations in
detail, it sounds so to me, too.  This shows that the HOWTO is really
helpful, but it also shows that a small example at the beginning
would spare the reader from wondering all the way to chapter 6 what
the difference is.
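
The distinction described above (a feature has a location on the
sequence, an annotation does not) can be sketched in plain Perl without
any Bioperl classes.  This is only an illustration: the hashes, their
keys and all values are made up and are not Bioperl's object model.

```perl
use strict;
use warnings;

# Plain hashes standing in for the two concepts.  The field names and
# values are illustrative assumptions, not Bioperl's API.
my %sequence = (id => 'example_seq', seq => 'ATGGCCATTGTAATGGGCCGCTGA');

# A feature describes a region: it carries start and end coordinates.
my %feature = (primary_tag => 'CDS', start => 4, end => 9);

# An annotation describes the sequence as a whole: no coordinates.
my %annotation = (tagname => 'comment', text => 'made-up example entry');

# Only the feature can be used to cut a piece out of the sequence
# (coordinates are 1-based and inclusive here).
my $region = substr $sequence{seq}, $feature{start} - 1,
                    $feature{end} - $feature{start} + 1;

print "$feature{primary_tag} $feature{start}..$feature{end}: $region\n";
print "$annotation{tagname}: $annotation{text}\n";
```

The point of the sketch is only that the feature answers "where on the
sequence?", while the annotation has nothing to say about position.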

Furthermore, I don't understand the naming of the objects.  SeqFeature
and Annotation seem closely related, but have completely different
names.  Maybe this is because of some gap in my biological
background, but what do you consider an "Annotation" and what a
"Feature"?  Where can I find definitions for these, written in plain
English?  This would help me a lot, I guess, as I am constantly
guessing what an object represents in the real world.

Again, after reading the HOWTO, and especially after writing this
email, I know the difference.  But no single document I have read so
far tells me that.  I concluded it only after reading a couple of
documents.

This goes on with lots of other objects and goes hand in hand with the
module descriptions.  Just randomly picking one out of Brian's HOWTO:
Bio::Seq::RichSeq.  I'd like to know what a RichSeq is and go to the
module description page.  I quote from that page
(http://doc.bioperl.org/releases/bioperl-1.4/Bio/Seq/RichSeq.html):

"This module implements Bio::Seq::RichSeqI, an interface for sequences
created from or created for entries from/of rich sequence databanks,
like EMBL, GenBank, and SwissProt. Methods added to the Bio::SeqI
interface therefore focus on databank-specific information. Note that
not every rich databank format may use all of the properties
provided."

OK, a RichSeq object represents sequences from a rich sequence
database.  Guess what, that's what I thought!  But what is a "rich
sequence (database)"?  The description gives examples, but this does
not help me at all, as this is just the list of the databanks I know.
What I want to say is, I still don't know what is _not_ a rich
sequence database, i.e. what's the difference between a plain sequence
and a rich sequence?  Aren't all sequences stored as in Swissprot,
etc.?  Or do you consider a FASTA file with only sequence information
a database?  OK, I guess you mean that the sequences contain lots
of annotations or features or comments etc. and that's what you
consider "rich".  This might look a little bit picky, but this is the
kind of missing definition that makes it so hard for me to get into
BioPerl, as I encounter it again and again.


Another example that happened to me:  The HOWTO explains in Chapter 5,
"Some Other Objects", how to read the species information from a
GenBank file.  The file looked like this:

SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

and the HOWTO gives this example code:

my $classification = $seq_object->species->classification;
# "sapiens Homo Hominidae Catarrhini Primates Eutheria Mammalia
# Euteleostomi Vertebrata Craniata Chordata Metazoa Eukaryota"

I guess the comment tells me what's in $classification now.  This
looks strange to me and I don't really understand how this string was
built.  The first two words are the words of the first line after
"ORGANISM", but in reverse order.  Then, without any marker, the
content of the next two lines follows, without the semicolons, without
the last word ("Homo") and again in reverse order.  So I looked into
the documentation of the module Bio::Species to figure out how I am
supposed to find the information I want (namely the name of the
organism and its classification).

I went to http://doc.bioperl.org/releases/bioperl-1.4/ and looked in
the upper left frame to find Bio::Species, which I couldn't.  I don't
know why.  There is bioperl-1.4::Bio::SeqIO,
bioperl-1.4::Bio::SeqIO::game and bioperl-1.4::Bio::Structure, but
Bio::Species is missing.  So I searched around and found "Species" in
the lower left frame, which turned out to be the page I was looking
for (I guess).

Here's what the documentation gives me:

,--http://doc.bioperl.org/releases/bioperl-1.4/Bio/Species.html#POD2----
|Title   : classification
| Usage   : $self->classification(@class_array);
|           @classification = $self->classification();
| Function: Fills or returns the classification list in
|           the object.  The array provided must be in
|           the order SPECIES, GENUS ---> KINGDOM.
|           Checks are made that species is in lower case,
| Example : $obj->classification(qw( sapiens Homo Hominidae
|           Catarrhini Primates Eutheria Mammalia Vertebrata
|           Chordata Metazoa Eukaryota));
| Returns : Classification array
| Args    : Classification array 
|                 OR
|           A reference to the classification array. In the latter case
|           if there is a second argument and it evaluates to true,
|           names will not be validated.
`---------

OK, the Title and Usage sections help me.  The Usage section does not
reveal all possible arguments, but that's why there is an Args
section, I guess.

If I read the Function section correctly, it looks like this function
both sets and reads the classification of the object.  That looks kind
of weird to me, remembering what I've learned in OO about mutators and
accessors, but maybe that's the way it works in Perl.  Never mind.
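
A method that both sets and reads a value is a common Perl idiom.  The
following is a minimal sketch of that idiom only; My::Species is a
hypothetical class invented for illustration, not the Bio::Species
implementation, and it ignores the array-reference form that the Args
section mentions.

```perl
use strict;
use warnings;

package My::Species;    # hypothetical class, not Bio::Species

sub new {
    my ($class) = @_;
    return bless { classification => [] }, $class;
}

# One method, two roles: called with arguments it stores them
# (mutator); called without arguments it only reads (accessor).
# Either way it returns the currently stored list.
sub classification {
    my ($self, @ranks) = @_;
    $self->{classification} = [@ranks] if @ranks;
    return @{ $self->{classification} };
}

package main;

my $obj = My::Species->new;
$obj->classification(qw(sapiens Homo Hominidae));    # set
my @ranks = $obj->classification;                    # get
print "@ranks\n";
```

In this sketch the array that comes back has exactly the structure of
the array that went in.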

The Example section shows me how I could call the function.
Unfortunately, it does not tell me what the new state of the object
will be (OK, maybe that's clear to you, but hey, why do you give an
example anyway?).  Nor does it tell me what I wanted to know:  what I
get as a result if I use the second form, the call without arguments.

The Returns section tells me I get a "Classification array".  Hm, is
this another object?  Or just an array?  Why call it "Classification
array" then, and not "array representing classification"?  It does
not tell me, however, what information this array carries.  Well,
another guess makes me believe that the returned array will have the
same structure as the array passed as argument.  And now I kind of
understand what's in $classification: it's not a string but an array,
and the HOWTO just gave me the string representation of that array.

But why is the species printed in the wrong order ("sapiens Homo"
instead of "Homo sapiens")?  And why is the last word ("Homo")
missing?  I did not come up with an answer for this.  Maybe it is a
bug in the HOWTO, a bug in the implementation of the classification
function, or a bug in my understanding.
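
One possible reading, assuming the list simply follows the order
SPECIES, GENUS ---> KINGDOM stated in the Function section: the
lineage lines already end with the genus ("Homo"), so reversing them
and putting the species name in front reproduces the string from the
HOWTO.  Under that assumption no "Homo" is missing; the genus is just
not listed twice.  The helper build_classification below is
hypothetical and written only to test this reading; it is not taken
from Bioperl's GenBank parser.

```perl
use strict;
use warnings;

# The ORGANISM line and the lineage lines from the HOWTO's example.
my $organism = 'Homo sapiens';
my $lineage  = 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; '
             . 'Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; '
             . 'Hominidae; Homo.';

# Hypothetical helper: build a species-first list from the two lines.
sub build_classification {
    my ($organism, $lineage) = @_;
    my ($genus, $species) = split ' ', $organism;
    (my $clean = $lineage) =~ s/\.\s*$//;    # drop the trailing period
    my @ranks = split /;\s*/, $clean;        # kingdom ... genus
    # Append the genus only if the lineage does not already end with it.
    push @ranks, $genus unless @ranks && $ranks[-1] eq $genus;
    return ($species, reverse @ranks);       # species, genus ... kingdom
}

my @classification = build_classification($organism, $lineage);
print join(' ', @classification), "\n";

# With this order the organism name is the second element followed by
# the first.
my ($species, $genus) = @classification[0, 1];
print "$genus $species\n";
```

If this reading is right, the first two elements are species and genus,
which is why they appear as "sapiens Homo".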


So what is this guy complaining about, you might think.  Wasn't
everything there in the documentation (well, except for this missing
"Homo" word)?  What I am criticizing is that although it might be
documented, it is documented in a way that takes me (and I am talking
only about myself) a lot of time to understand:

* I don't know what you call which thing.  (A big exception to this
  is the SearchIO HOWTO, which states in its 4th chapter exactly which
  attribute of the object contains what kind of data and where this
  data comes from.)  What's an annotation?  A rich sequence?  A
  feature?  Different people might understand different things by
  these names, and I think it's good to make sure that everybody is
  talking about the same thing.
  (Btw, what's funny is that the doc for Bio::Seq tells me the
  following: "Bio::SeqFeatureI - a location on a sequence, potentially
  with a sequence and annotation", which is not quite the same as what
  the HOWTO told me (associated with a sequence, not potentially).)

* I can't just read the documentation and use a function.  That's the
  major drawback for me.  I almost always have to draw conclusions
  from the documentation.  In the example above: Why does the Returns
  section not tell me what the structure of the resulting array is?  I
  have to conclude that from what is stated in the Function section.
  So why is there a Returns section at all?


Thank you for your time.  Again, I am still using Bioperl as it serves
my purpose, but I wish the documentation were more useful.  I
don't want to discourage anybody; instead, I'd like this to be
understood as constructive criticism.  I hope I've stated everything
in a clear way.


Best regards,

Andreas.




