[Biojava-l] Re: the current discussion

Thomas Down td2@sanger.ac.uk
Mon, 24 Jan 2000 09:59:45 +0000


On Sun, Jan 23, 2000 at 11:17:46PM -0600, Mike Marsh wrote:
> My code:
> I have just started writing.  I currently have working classes for
> ProteinSequence, DNASequence, DNASequenceList, and others.  My objects are
> pretty smart.  For example DNASequence has a method transcribe() that
> returns an RNASequence. 

Cool -- that's certainly the kind of functionality that's
nice to have.  But I'd rather not have anything hardcoded:
you can just about get away with this for transcribe()
(although what happens for tRNA genes which contain lots of
unusual nucleotide residues -- where do these get put in?)
but think about converting an RNASequence to a ProteinSequence:
you need to know whether to use the universal genetic code
or some weird mitochondrial variant (you might like to look
at what the OMG IDLs do to handle this).  There's also the
issue (especially when these objects are being used to back
GUI applications) of wanting to do things like three-frame
translations.

What I'd really like to see for BioJava is a framework for
converting between different types of Sequence objects, so
that the converters (transcribers and translators) can be
plugged in as needed.  This is something that the Sanger
centre core doesn't have right now, but it really ought to.
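
As a rough sketch of what such a pluggable framework might look
like (every name here is hypothetical, none of it is existing
BioJava or Sanger code, and plain Strings stand in for real
Sequence objects), the genetic code and the reading frame become
parameters of the translator rather than something hardcoded.
The codon tables are deliberately tiny subsets, just enough to
show the universal and vertebrate mitochondrial codes disagreeing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plug-in point: converts one kind of sequence to another.
interface SequenceConverter<S, T> {
    T convert(S source);
}

// Transcribes a DNA coding strand to mRNA by replacing T with U.
class SimpleTranscriber implements SequenceConverter<String, String> {
    public String convert(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }
}

// Translates RNA using whichever codon table and frame it was given.
class TableTranslator implements SequenceConverter<String, String> {
    private final Map<String, Character> codonTable;
    private final int frame; // offset of the first codon: 0, 1 or 2

    TableTranslator(Map<String, Character> codonTable, int frame) {
        if (frame < 0 || frame > 2) {
            throw new IllegalArgumentException("frame must be 0, 1 or 2");
        }
        this.codonTable = codonTable;
        this.frame = frame;
    }

    public String convert(String rna) {
        StringBuilder protein = new StringBuilder();
        for (int i = frame; i + 3 <= rna.length(); i += 3) {
            Character aa = codonTable.get(rna.substring(i, i + 3));
            // Codons missing from the (partial) table are written as X.
            protein.append(aa == null ? 'X' : aa.charValue());
        }
        return protein.toString();
    }
}

// Illustrative, incomplete codon tables ('*' marks a stop codon).
class CodonTables {
    static Map<String, Character> universalSubset() {
        Map<String, Character> t = new HashMap<>();
        t.put("AUG", 'M');
        t.put("UGG", 'W');
        t.put("UUU", 'F');
        t.put("AUA", 'I');
        t.put("UGA", '*');
        return t;
    }

    // Vertebrate mitochondria read UGA as Trp and AUA as Met.
    static Map<String, Character> vertebrateMitoSubset() {
        Map<String, Character> t = universalSubset();
        t.put("UGA", 'W');
        t.put("AUA", 'M');
        return t;
    }
}
```

A three-frame translation is then just three TableTranslator
instances (frames 0, 1 and 2) applied to the same RNA.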

> DNASequenceList has a great static method that
> takes a FASTA file of the genome as an argument and returns a
> DNASequenceList encapsulating all DNASequences (genes) for that genome. 

Does this kind of functionality belong in a static method?  I
tend to be a bit wary of statics in general, since they make
extensibility hard -- this becomes especially important if
we ever want to see BioJava components being plugged together
for rapid application development in an IDE.
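
To illustrate the alternative, here is a minimal sketch with the
FASTA reading put behind an interface and done by an ordinary
instance, so another format can be substituted without changing
the callers.  The names SequenceSource and FastaSource are
invented for this example, and a name-to-residues map of Strings
stands in for a real sequence list:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: anything that can produce named sequences.
interface SequenceSource {
    Map<String, String> read(Reader in);
}

// One implementation among possibly many; callers depend only on the
// interface, so an EMBL or GenBank reader could be swapped in.
class FastaSource implements SequenceSource {
    public Map<String, String> read(Reader in) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = null;
        StringBuilder residues = new StringBuilder();
        try {
            BufferedReader br = new BufferedReader(in);
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                if (line.startsWith(">")) {
                    // A header line closes the previous record, if any.
                    if (name != null) {
                        result.put(name, residues.toString());
                    }
                    name = line.substring(1).trim();
                    residues.setLength(0);
                } else if (name != null) {
                    residues.append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (name != null) {
            result.put(name, residues.toString());
        }
        return result;
    }
}
```

Because the reader is an object rather than a static call, it can
carry configuration, be subclassed, or be handed to a component
that only knows about SequenceSource.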

> My code is not yet javadoc'd.  And won't be before next weekend.  But you
> can have a look at a UML diagram of my protein classes.  It communicates a
> lot.  
> condor.bcm.tmc.edu/~mm692227/biosequence/Protein.html   //scaled to fit
> condor.bcm.tmc.edu/~mm692227/biosequence/Protein.gif    //full size
> condor.bcm.tmc.edu/~mm692227/biosequence/Protein.ps     //printable

Out of interest, why do you use Vectors when Collection API List
implementations are much nicer and (depending on how you use
them) faster?  [Being stuck with a Java1 runtime is a reasonable
answer, but high quality Java2 implementations are now finally being
rolled out.  You can also download a Collections implementation for
Java1 if you want to experiment -- let me know if you want me to
dig out a URL]
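
A small sketch of the point, with a made-up helper class: code
written against the List interface accepts any implementation,
including a legacy Vector, so callers can move to ArrayList
(whose methods, unlike Vector's, are not synchronized) without
the helper changing.

```java
import java.util.List;

// Hypothetical helper, written against the List interface only.
class SequenceStats {
    // Sums the lengths of all sequences in the list.
    static int totalLength(List<String> sequences) {
        int total = 0;
        for (String s : sequences) {
            total += s.length();
        }
        return total;
    }
}
```

Both new ArrayList<String>() and new Vector<String>() can be
passed to totalLength(), since Vector implements List as of
Java2.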

> On GUIS:  agree with most of what's been said.  Should definitely keep the
> GUI isolated from the implementation of the model, in accordance with
> Model-View-Contoller paradigm

Yes, I think we all seem to agree on this one.  Java provides a
nice framework for building components which follow this pattern.

> On licenses:  Without a doubt, open source for academic use.  But I have
> no idea what those acronyms stand for.  GPL = gnu public license; LGPL =
> ??? ; MPL = ???.

LGPL == (Lesser|Library) General Public Licence.  See
http://www.fsf.org/.  Similar to GPL, but includes (limited)
provisions for linking against non-GPLed applications.

MPL == Mozilla Public Licence.  See http://www.mozilla.org/MPL/

The term `Open Source for academic use' is a bit ambiguous.
`Open Source' really means following the Open Source Definition
(http://www.opensource.org/osd.html), which forbids excluding
any field of endeavour.  Personally I'd welcome commercial
players to the community, so long as they aren't taking things
away.  And don't rule out useful contributions from commercial
sources: it does happen.

> On String implementation of Sequences:
> Ewan says that the Sequence class should implement the internal data as a
> string.  I really have to disagree with this.  It makes much more sense to
> model the data structure like the real thing.  For example,
> ProteinSequence is a linear sequence of Amino Acids.  In my
> implementation, I do exactly this.  ProteinSequence is a linear list of
> Objects which implement ProteinChar interface.  The ProteinChar interface
> defines all of the state properties we have for amino acids (e.g. charge,
> aromaticity).  

There's been some discussion on this topic internally at the
Sanger centre, but I think it's good to see it out on the mailing
list, too.  I'd never thought in terms of recording properties
like charge, but that could be another argument in favour
of Residue -> first class object, rather than residue ->
character.

My own argument is that first class objects are preferable
on the basis that they fit in much more naturally with Java's
strongly typed principles -- you want objects which represent
real-world concepts.  Using a residue/alphabet concept (as in
the Sanger biojava core) makes it easy to ensure that sequences
with a particular alphabet (DNA, protein, or whatever) only
contain valid residues.
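
A minimal sketch of the residue/alphabet idea follows.  These
classes are invented for illustration and are not the Sanger
core's actual API; the charge field is only an example of the
kind of property a first-class residue could carry.  The
sequence constructor rejects any residue that is not a member of
its alphabet:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical first-class residue carrying its own properties.
final class Residue {
    private final char symbol;
    private final String name;
    private final int charge; // illustrative: side-chain charge near pH 7

    Residue(char symbol, String name, int charge) {
        this.symbol = symbol;
        this.name = name;
        this.charge = charge;
    }

    char symbol() { return symbol; }
    String name() { return name; }
    int charge() { return charge; }
}

// An alphabet is the set of Residue objects a sequence may contain.
final class Alphabet {
    private final String name;
    private final Map<Character, Residue> bySymbol = new HashMap<>();

    Alphabet(String name, Residue... residues) {
        this.name = name;
        for (Residue r : residues) {
            bySymbol.put(r.symbol(), r);
        }
    }

    String name() { return name; }

    // Identity check: a same-lettered residue from another alphabet fails.
    boolean contains(Residue r) {
        return bySymbol.get(r.symbol()) == r;
    }
}

// A sequence validates every residue against its alphabet on creation.
final class TypedSequence {
    private final Alphabet alphabet;
    private final List<Residue> residues;

    TypedSequence(Alphabet alphabet, List<Residue> residues) {
        for (Residue r : residues) {
            if (!alphabet.contains(r)) {
                throw new IllegalArgumentException(
                    r.name() + " is not in alphabet " + alphabet.name());
            }
        }
        this.alphabet = alphabet;
        this.residues = Collections.unmodifiableList(new ArrayList<>(residues));
    }

    Alphabet alphabet() { return alphabet; }
    List<Residue> residues() { return residues; }

    // Properties on residues make whole-sequence queries straightforward.
    int netCharge() {
        int total = 0;
        for (Residue r : residues) {
            total += r.charge();
        }
        return total;
    }
}
```

Note that glycine and guanine both print as 'G', yet as objects
they are distinct, which a character-based representation cannot
express.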

That said, we do need to think carefully about how to
interact with the rest of the world, who are still using
characters (which make far more sense in less type-aware
languages, such as Perl).  I'm coming round to the idea
of having a parallel set of interfaces based on the OMG's
BioObjects module, and some bridge implementations which
allow you to construct a BioJava sequence from an OMG
(string based) one, and vice versa.
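
As a sketch of what such a bridge might involve (this does not
reproduce the OMG BioObjects interfaces; a plain String stands
in for the string-based side, and Nucleotide and StringBridge
are hypothetical names), the conversion is a validated mapping
in each direction:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object-based residue type for the typed side.
enum Nucleotide {
    A('a'), C('c'), G('g'), T('t');

    private final char symbol;

    Nucleotide(char symbol) {
        this.symbol = symbol;
    }

    char symbol() { return symbol; }

    // Maps a character to its residue, rejecting anything unknown.
    static Nucleotide fromChar(char c) {
        char lower = Character.toLowerCase(c);
        for (Nucleotide n : values()) {
            if (n.symbol == lower) {
                return n;
            }
        }
        throw new IllegalArgumentException("not a DNA residue: " + c);
    }
}

// Bridge between a string-based sequence and the typed representation.
class StringBridge {
    static List<Nucleotide> fromString(String text) {
        List<Nucleotide> seq = new ArrayList<>(text.length());
        for (int i = 0; i < text.length(); i++) {
            seq.add(Nucleotide.fromChar(text.charAt(i)));
        }
        return seq;
    }

    static String toText(List<Nucleotide> seq) {
        StringBuilder sb = new StringBuilder(seq.size());
        for (Nucleotide n : seq) {
            sb.append(n.symbol());
        }
        return sb.toString();
    }
}
```

Validation happens once, at the boundary, so code on the typed
side never has to re-check characters.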

> This discussion is great.
> Don't let it die!

Doesn't seem to be too much risk of that -- this mailing
list has really heated up this last week...

It's good to see lots of ideas flying around -- let's
get all the ideas on the table, and see if we can bash
out a really good way of representing biological concepts
in Java.

Happy hacking,

Thomas.
-- 
``Science is magic that works''  -- Kurt Vonnegut.