[Biojava-l] Newbie questions

Thomas Down td2@sanger.ac.uk
Thu, 27 Jul 2000 10:28:27 +0100

On Wed, Jul 26, 2000 at 02:35:27PM -0600, Michael Giddings wrote:
> 1. I'm looking for a fast DNA sequence implementation for things like
> sequence searches, Smith-Waterman alignments, etc.  It seems to me that even
> SimpleSequence, being based on SimpleSymbolList, would not be as efficient
> as what I'm looking for, since it must create and manage in memory a whole
> list of Symbol objects (tell me if I'm wrong).  What I'm thinking of (and I
> have such a beast implemented in Objective-C already*) is essentially a
> wrapper around a String object (or some kind of contiguous, linear memory
> block).  It would have functionality such as reverse, complement, and could
> be extended with tools to do SmithWaterman searches against other sequences,
> etc. (I also have Objective-C code for SW alignments).  It would be easy to
> then have a constructor for the heavier-weight Sequence objects which would
> accept one of these "LightweightSequence" objects as input.

The SimpleSymbolList objects (and other SymbolList implementations)
may not be as inefficient as you think.  BioJava /doesn't/ have lots
of Symbol objects floating around (with all the memory overhead that
would imply).  If we just concentrate on DNA sequences for now, there
are four `singleton' Symbol objects representing the DNA bases, and
a singleton Alphabet which wraps them up.  References to these objects
can be easily obtained via the DNATools class.

A SimpleSymbolList is an array of references to these objects.  All
the simple operations you might want to do on one of these sequences
just compile down to comparisons between pointers, which aren't
computationally any more expensive than comparison between ASCII
characters (indeed, I can think of one or two processor architectures
where this might be quicker!).
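
A minimal plain-Java sketch of that idea, for illustration only.  It
does not use the BioJava classes: Base, complement() and
SingletonSymbolDemo are made-up names standing in for the singleton
Symbol objects and the array of references described above.

```java
// Hypothetical stand-in for a Symbol: exactly one shared instance per DNA base.
final class Base {
    static final Base A = new Base('a');
    static final Base C = new Base('c');
    static final Base G = new Base('g');
    static final Base T = new Base('t');

    final char token;

    private Base(char token) { this.token = token; }

    // Complement lookup uses reference equality (==), never equals().
    static Base complement(Base b) {
        if (b == A) return T;
        if (b == T) return A;
        if (b == C) return G;
        if (b == G) return C;
        throw new IllegalArgumentException("not a DNA base");
    }
}

class SingletonSymbolDemo {
    // Reverse-complement an array of references; no new Base objects are created.
    static Base[] reverseComplement(Base[] seq) {
        Base[] out = new Base[seq.length];
        for (int i = 0; i < seq.length; i++) {
            out[seq.length - 1 - i] = Base.complement(seq[i]);
        }
        return out;
    }

    // Count one base using a single pointer comparison per position.
    static int count(Base[] seq, Base target) {
        int n = 0;
        for (Base b : seq) {
            if (b == target) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Base[] seq = { Base.G, Base.A, Base.T, Base.C, Base.G, Base.G, Base.A };
        StringBuilder sb = new StringBuilder();
        for (Base b : reverseComplement(seq)) sb.append(b.token);
        System.out.println(sb);
        System.out.println(count(seq, Base.G));
    }
}
```

However long the sequence gets, only the four Base instances ever
exist; the array just holds references to them.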

The only real downside of SimpleSymbolLists is that they do use
one pointer (4 bytes, on most systems) per symbol.  So far, this
hasn't really turned out to be a problem, but there's no reason
why we can't write more space-efficient implementations, for
instance using an array of bytes instead.

Let me know if you want to see ByteArraySymbolList,
or something similar.
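
As a rough idea of what a byte-backed list could look like, here is a
sketch in plain Java.  It is not BioJava code and not the
ByteArraySymbolList mentioned above: PackedDnaList, tokenAt() and the
0..3 encoding are assumptions made up for this example, and indexing
here is 0-based.

```java
// Hypothetical byte-backed DNA list: one byte per symbol instead of one reference.
final class PackedDnaList {
    // Code i stands for TOKENS[i]; with this ordering the complement is 3 - code.
    private static final char[] TOKENS = { 'a', 'c', 'g', 't' };
    private final byte[] codes;

    PackedDnaList(String dna) {
        codes = new byte[dna.length()];
        for (int i = 0; i < dna.length(); i++) {
            codes[i] = encode(dna.charAt(i));
        }
    }

    private static byte encode(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default: throw new IllegalArgumentException("not a DNA base: " + c);
        }
    }

    int length() { return codes.length; }

    // 0-based access; decodes the stored byte back to its character token.
    char tokenAt(int index) { return TOKENS[codes[index]]; }

    // Walk the bytes backwards, complementing each code arithmetically.
    String reverseComplement() {
        StringBuilder sb = new StringBuilder(codes.length);
        for (int i = codes.length - 1; i >= 0; i--) {
            sb.append(TOKENS[3 - codes[i]]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PackedDnaList seq = new PackedDnaList("gatcgga");
        System.out.println(seq.length() + " " + seq.reverseComplement());
    }
}
```

Since DNA only needs two bits per base, the same approach could be
packed four bases to a byte, at the cost of some shifting and masking
on every access.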

Incidentally, if you want to get Smith-Waterman (or any other
dynamic programming algorithm) up and running quickly, grab
the latest BioJava CVS and take a look at the org.biojava.bio.dp package.
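
For reference, the core Smith-Waterman recurrence is small enough to
sketch in plain Java.  This is a textbook version with a linear gap
penalty that returns only the best local score; it is not how
org.biojava.bio.dp does things, and SmithWatermanSketch, localScore()
and the scoring parameters are names invented for this example.

```java
// Textbook Smith-Waterman local alignment score (linear gap penalty, score only).
final class SmithWatermanSketch {
    // match > 0, mismatch <= 0 and gap <= 0 are the caller's scoring choices.
    static int localScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];   // row i-1 of the DP matrix
        int[] curr = new int[b.length() + 1];   // row i
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = 0;
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int score = Math.max(0, prev[j - 1] + sub);   // diagonal, floored at 0
                score = Math.max(score, prev[j] + gap);       // gap in b
                score = Math.max(score, curr[j - 1] + gap);   // gap in a
                curr[j] = score;
                if (score > best) best = score;
            }
            int[] tmp = prev;   // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(localScore("ttgatcaa", "ccgatcgg", 2, -1, -2));
    }
}
```

Keeping only two rows makes memory proportional to the shorter of the
two loops' inner sequence, but it means the alignment itself cannot be
traced back; for that you need the full matrix or a pointer table.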

> 2. Why doesn't SimpleSequence or SimpleSymbolList have a constructor which
> takes a string as input to initialize the sequence?  Was this intentional
> (i.e. am I missing something), or just a feature nobody has needed yet?

The main reason for this is that there's a certain amount of
complexity inherent in parsing the string representation of a
sequence.  In BioJava, the `T' which is thymine in DNA is represented
by a different Symbol object from the `T' which is a threonine in
protein.  But you can use the same SimpleSymbolList class to
store both kinds of sequence.
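
To see why the alphabet has to be part of the parse, here is a toy
version in plain Java.  None of these classes are BioJava's: Sym,
ToyAlphabet and the two example alphabets (the protein one is only a
four-letter subset) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol: identity matters, the name is just a label.
final class Sym {
    final String name;
    Sym(String name) { this.name = name; }
}

// Hypothetical alphabet: owns its symbols and knows how to parse its own tokens.
final class ToyAlphabet {
    private final Map<Character, Sym> byToken = new HashMap<>();

    ToyAlphabet add(char token, String name) {
        byToken.put(token, new Sym(name));
        return this;
    }

    // The same character can yield different symbols depending on the alphabet.
    Sym[] parse(String text) {
        Sym[] out = new Sym[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Sym s = byToken.get(Character.toLowerCase(text.charAt(i)));
            if (s == null) {
                throw new IllegalArgumentException("unknown token: " + text.charAt(i));
            }
            out[i] = s;
        }
        return out;
    }
}

class AlphabetParsingDemo {
    static final ToyAlphabet DNA = new ToyAlphabet()
        .add('a', "adenine").add('c', "cytosine")
        .add('g', "guanine").add('t', "thymine");

    static final ToyAlphabet PROTEIN_SUBSET = new ToyAlphabet()
        .add('a', "alanine").add('c', "cysteine")
        .add('g', "glycine").add('t', "threonine");

    public static void main(String[] args) {
        System.out.println(DNA.parse("t")[0].name);
        System.out.println(PROTEIN_SUBSET.parse("t")[0].name);
    }
}
```

Both parses return a plain Sym[] that a single list class could hold,
which is why the string constructor would have nowhere to get the
alphabet from unless you passed it in as well.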

The easiest way to get your string-like sequences into BioJava
is to use something like:

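  String seqString = "gatcgga";
  // The alphabet decides which Symbol each character maps to.
  Alphabet dnaAlpha = DNATools.getAlphabet();
  // Ask the alphabet for its "token" parser, then parse the string.
  SymbolParser parser = dnaAlpha.getParser("token");
  SymbolList seq = parser.parse(seqString);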

Does that make some kind of sense?

> 3. It would be really nice if there was an overview document for all the
> classes, giving a basic rundown on the class hierarchy, where to find
> various types of Objects, what the philosophy is behind the inheritance
> structure, and so on.  For example, it seems like there is a dual
> inheritance structure, one for interfaces and one for the implementations
> themselves.  This makes figuring out what's going on doubly complicated to a
> newbie.

Yes, it would be nice...  We're gradually working on improving
the quality of the JavaDoc (are you using a version out of CVS,
or the version on the website?  All the non-CVS versions are 
currently a bit out of date, sorry...).  I've also started writing
a tutorial:


This isn't anywhere near perfect yet, but it might answer some more
of your questions.  I'm hoping to expand this as and when I get the
time.  (Obviously, if anyone else is feeling at a loose end, feel
free to contribute some extra sections!)

Some more formal design documents would also be great, but again
there's an issue of time.

But to briefly answer your question, I'd suggest you don't worry
too much about the `double inheritance' issue (which is common to
a lot of Java frameworks).  Most of the time, you'll be working
with the interfaces, so just concentrate on the interface hierarchy.
When implementations inherit off one another, that's usually just
a behind the scenes convenience, rather than an issue which is
important for end users.
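
A small sketch of that pattern in plain Java, with invented names
(TokenList, AbstractTokenList and friends are not BioJava types): the
interface is what client code sees, and the abstract class exists only
so the implementations can share a method.

```java
// What users program against.
interface TokenList {
    int length();
    char tokenAt(int index);   // 0-based in this sketch
}

// Behind-the-scenes convenience: shared code for implementations.
abstract class AbstractTokenList implements TokenList {
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) sb.append(tokenAt(i));
        return sb.toString();
    }
}

// One implementation backed directly by a String.
final class StringTokenList extends AbstractTokenList {
    private final String text;
    StringTokenList(String text) { this.text = text; }
    @Override public int length() { return text.length(); }
    @Override public char tokenAt(int index) { return text.charAt(index); }
}

// Another implementation: a reversed view over any TokenList, with no copying.
final class ReversedTokenList extends AbstractTokenList {
    private final TokenList inner;
    ReversedTokenList(TokenList inner) { this.inner = inner; }
    @Override public int length() { return inner.length(); }
    @Override public char tokenAt(int index) {
        return inner.tokenAt(inner.length() - 1 - index);
    }
}

class InterfaceDemo {
    // Client code mentions only the interface, never a concrete class.
    static int countToken(TokenList list, char token) {
        int n = 0;
        for (int i = 0; i < list.length(); i++) {
            if (list.tokenAt(i) == token) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        TokenList forward = new StringTokenList("gatcgga");
        TokenList backward = new ReversedTokenList(forward);
        System.out.println(backward + " " + countToken(backward, 'g'));
    }
}
```

countToken() works unchanged on either implementation, and it would
not notice if the abstract base class were removed or reshuffled.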

> 4. Does anyone on this list know about the Bio-OpenSource meeting at ISMB
> '00, and what will be taking place there?  The ISMB web-site is kind of
> vague about it, but the registration deadline is fast approaching.

I unfortunately won't be there (wish I was...), but Matt Pocock will
be flying the BioJava flag.  As I understand it, there will be shortish
presentations about a number of projects (including BioJava), then
an opportunity for open discussion sessions.  Matt might be able to
add some more detail to this.

> * For those not familiar, Objective-C inherits most (if not all) its
> semantics from Smalltalk, and it is also semantically quite similar to Java.  

Good to see another ObjC hacker about!  I'd quite likely
still be using it myself, if Java VMs hadn't got up to a usable

Happy hacking (and hope you enjoy BioJava),

One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne