[Biojava-l] Re: Biojava-l digest, Vol 1 #6 - 3 msgs

David Martin david.martin@biotek.uio.no
Mon, 24 Jan 2000 18:58:33 +0100


A few comments. 
I should mention that my background is in biology, so if I can't
understand what is going on (however smart it may be) you may be limiting
yourself to a few expert comp sci types to write more code. My hope for
biojava is that it is readily usable by relative novices as well.

Having said that..


> I have just started writing.  I currently have working classes for
> ProteinSequence, DNASequence, DNASequenceList, and others.  My objects are
> pretty smart.  For example DNASequence has a method transcribe() that
> returns an RNASequence.  DNASequenceList has a great static method that
> takes a FASTA file of the genome as an argument and returns a
> DNASequenceList encapsulating all DNASequences (genes) for that genome. 

I hate to throw a spanner in the works but that doesn't make much
biological sense when we can't even predict accurately the genes in a
genome. And FASTA isn't exactly what I would call a highly annotated
sequence format.

> On String implementation of Sequences:
> Ewan says that the Sequence class should implement the internal data as a
> string.  I really have to disagree with this.  It makes much more sense to
> model the data structure like the real thing.  For example,
> ProteinSequence is a linear sequence of Amino Acids.  In my
> implementation, I do exactly this.  ProteinSequence is a linear list of
> Objects which implement ProteinChar interface.  The ProteinChar interface
> defines all of the state properties we have for amino acids (e.g. charge,
> aromaticity).  
> 
> Because all of my ProteinChar objects are smart (i.e. they know their
> internal state), I can write some simple methods really easily.

Either your ProteinChar objects are static or they carry an awful lot of
overhead that is not needed in most cases, i.e. every glutamate is
identical wrt charge (and if it is derivatised it is no longer a glutamate).
So an alphabet/lookup scheme would work better.

> For example, the ProteinSequence class can include such methods as:
> 
> public int CountChargedResidues ()
> {
>   int chargedCount=0;
> 
>   for (int i=0; i< this.getLength();i++)
>     if ( this.getCharAt(i).isCharged() )
>       chargedCount++;
> 
>   return chargedCount;
> }
> 
> See how many lines it takes to do that if your Sequence is a string.  You
> can do it, but you need to develop a hashtable for every property
> (ChargedHashTable, AromaticHashTable, etc.)
These are static properties kept in an external class (or a particular
instance of an AminoAcidTable class). So replacing the 
this.getCharAt(i).isCharged() 
with
AminoAcidTable.isCharged(this.getCharAt(i))
is the only substantive change.
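
A minimal sketch of that lookup scheme, to show how little changes. The
class names AminoAcidTable and StringProteinSequence, the method names, and
the choice of D, E, K and R as the charged residues are all assumptions made
for illustration; nothing here is existing biojava code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical lookup class: residue properties are stored once here,
// not in every residue object.
final class AminoAcidTable {
    // Assumption: D, E, K and R count as charged; histidine is left out.
    private static final Set<Character> CHARGED = new HashSet<>();
    static {
        for (char c : "DEKR".toCharArray()) {
            CHARGED.add(c);
        }
    }

    private AminoAcidTable() {}

    static boolean isCharged(char residue) {
        return CHARGED.contains(Character.toUpperCase(residue));
    }
}

// String-backed sequence; the name and methods are illustrative only.
final class StringProteinSequence {
    private final String residues;

    StringProteinSequence(String residues) {
        this.residues = residues;
    }

    int getLength() {
        return residues.length();
    }

    char getCharAt(int i) {
        return residues.charAt(i);
    }

    // Same loop as the quoted method, with the property looked up externally.
    int countChargedResidues() {
        int chargedCount = 0;
        for (int i = 0; i < getLength(); i++) {
            if (AminoAcidTable.isCharged(getCharAt(i))) {
                chargedCount++;
            }
        }
        return chargedCount;
    }
}
```

Adding another property (aromatic, say) would mean one more set and one more
static method in the table, with no change to the sequence class itself.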

Other reasons to use strings would be:
Biologists think of sequences as strings.
Plenty of methods readily available for substring searching and
manipulation.

And reasons to use an object list:
Extensible to other properties, i.e. an amino acid object could also have
3D coordinates associated with it.

Which leads me to two conclusions:

1. Remember OOP 101? Encapsulation.
Basically I don't care how the sequence is implemented internally because
I am not messing about with that. All I need is an interface with suitable
methods (encapsulation). Those who write the underlying objects can worry
about whether a string representation or an object list representation is
better for a given purpose.

2. Make Sequence an interface. 
Sequence is a property of a larger object. It would be great to be able to
cast a gene to sequence or a PDB structure to sequence so one is working
on the original object rather than having to call a getSequence() method
each time you want to do a sequence search on a PDB database.
eg Wise2Align(myPDB, myGenome);
for a method that takes two Sequence objects as arguments (and is smart
enough to work out which one is which type).

This brings some limitations. One cannot then have the Sequence interface
allowing modification of the underlying object but one can work around
that in many ways (use the original object type for instance).
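
A rough sketch of what I mean, assuming a read-only Sequence interface with
just two methods. Gene, PdbStructure and SequenceTools are hypothetical
names invented for this example, and the two classes deliberately store
their data differently to show that the caller never needs to know.

```java
// Read-only view: no mutators, so callers cannot alter the underlying object.
interface Sequence {
    int getLength();
    char getCharAt(int index);
}

// Hypothetical gene object backed by a String.
final class Gene implements Sequence {
    private final String name;
    private final String bases;

    Gene(String name, String bases) {
        this.name = name;
        this.bases = bases;
    }

    String getName() {
        return name;
    }

    public int getLength() {
        return bases.length();
    }

    public char getCharAt(int index) {
        return bases.charAt(index);
    }
}

// Hypothetical structure object backed by a residue array (coordinates omitted).
final class PdbStructure implements Sequence {
    private final char[] residues;

    PdbStructure(char[] residues) {
        this.residues = residues.clone();
    }

    public int getLength() {
        return residues.length;
    }

    public char getCharAt(int index) {
        return residues[index];
    }
}

final class SequenceTools {
    private SequenceTools() {}

    // Works on anything that is a Sequence, whatever its internals.
    static int countMatches(Sequence seq, char symbol) {
        int n = 0;
        for (int i = 0; i < seq.getLength(); i++) {
            if (seq.getCharAt(i) == symbol) {
                n++;
            }
        }
        return n;
    }
}
```

A Gene or a PdbStructure can be passed straight to countMatches() with no
getSequence() call, and anything that needs to modify the object goes through
the original type.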

It also has advantages in that all objects can be OMG sequences, Sanger
sequences, and biojava sequences at the same time. Bit more heavyweight
but it will fit into all the right bits and not require any nasty
translators (except in the constructor).

> 
> Cool -- that's certainly the kind of functionality that's
> nice to have.  But I'd rather not have anything hardcoded:
> you can just about get away with this for transcribe()
> (although what happens for tRNA genes which contain lots of
> unusual nucleotide residues -- where do these get put in?)
> but think about converting an RNASequence to a ProteinSequence:
> you need to know whether to use the universal genetic code
> or some weird mitochondrial variant (you might like to look
> at what the OMG IDLs do to handle this).  

This is where a 'lookup table' works far better than an object list (IMHO)
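
As a sketch of the lookup-table approach: the GeneticCode class below is
hypothetical and holds only a handful of codons, enough to show the standard
code and the vertebrate mitochondrial code disagreeing on UGA, AUA and AGA.
Marking unknown codons as 'X' and stops as '*' is my assumption, not
anything taken from the OMG IDLs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pluggable codon table; only a few codons are filled in.
final class GeneticCode {
    private final Map<String, Character> codons;

    private GeneticCode(Map<String, Character> codons) {
        this.codons = codons;
    }

    // Partial standard table, for illustration only ('*' marks a stop).
    static GeneticCode standardSubset() {
        Map<String, Character> m = new HashMap<>();
        m.put("AUG", 'M');
        m.put("UGG", 'W');
        m.put("UGA", '*');
        m.put("AUA", 'I');
        m.put("AGA", 'R');
        return new GeneticCode(m);
    }

    // Vertebrate mitochondrial differences applied on top of the subset.
    static GeneticCode vertebrateMitochondrialSubset() {
        Map<String, Character> m = new HashMap<>(standardSubset().codons);
        m.put("UGA", 'W');
        m.put("AUA", 'M');
        m.put("AGA", '*');
        return new GeneticCode(m);
    }

    // Translates frame 0 only; codons missing from the table become 'X'.
    String translate(String rna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= rna.length(); i += 3) {
            Character aa = codons.get(rna.substring(i, i + 3));
            protein.append(aa == null ? 'X' : aa.charValue());
        }
        return protein.toString();
    }
}
```

The translation loop is the same for every code. Only the table passed in
changes, so nothing about the genetic code is hardcoded into the sequence
objects.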

I wasn't terribly happy with the earlier drafts of the OMG document in
this respect as they didn't allow ambiguity codes (which made the objects
nigh on useless) and haven't had time to read the final version yet. (Yes
I did make some suggestions because I had already had to solve that
problem and the answer was trivial)

> Also there's this
> issue of (especially when these objects are being used to
> back GUI applications) wanting to do things like three-frame
> translations.
> 
> What I'd really like to see for BioJava is a framework for
> converting between different types of Sequence objects, so
> that the converters (transcribers and translators) can be
> plugged in as needed.  This is something that the Sanger
> centre core doesn't have right now, but it really ought to.
> 
> > DNASequenceList has a great static method that
> > takes a FASTA file of the genome as an argument and returns a
> > DNASequenceList encapsulating all DNASequences (genes) for that genome. 
> 
> Does this kind of functionality belong in a static method?  I
> tend to be a bit wary of statics in general, since they make
> extensibility hard -- this becomes especially important if
> we ever want to see BioJava components being plugged together
> for rapid application development in an IDE.
> 
> > My code is not yet javadoc'd.  And won't be before next weekend.  But you
> > can have a look at a UML diagram of my protein classes.  It communicates a
> > lot.  
> > condor.bcm.tmc.edu/~mm692227/biosequence/Protein.html   //scaled to fit
> > condor.bcm.tmc.edu/~mm692227/biosequence/Protein.gif    //full size
> > condor.bcm.tmc.edu/~mm692227/biosequence/Protein.ps     //printable
> 
> Out of interest, why do you use Vectors when Collection API List
> implementations are much nicer, and (depending on how you use
> them) faster.  [Being stuck with a Java1 runtime is a reasonable
> answer, but high quality Java2 implementations are now finally being
> rolled out.  You can also download a Collections implementation for
> Java1 if you want to experiment -- let me know if you want me to
> dig out a URL]

I am still in the days of Java 1(.1) as most of what I have been doing had
to work round bugs in web browsers. IMHO we should aim to be as backward
compatible as reasonably possible. Again this comes down to implementation
so shouldn't affect the API presented by biojava. (Possible to have two
streams, 1.1 and up to date? Ugly but maybe necessary.)

> 
> > On GUIS:  agree with most of what's been said.  Should definitely keep the
> > GUI isolated from the implementation of the model, in accordance with
> > Model-View-Controller paradigm
> 
> Yes, I think we all seem to agree on this one.  Java provides a
> nice framework for building components which follow this pattern.

It's so obviously the right way to go that I'm surprised anyone even
thought there might be dissent.


Anyway, just another cat amongst the pigeons to keep one happy.

..d

---------------------------------------------------------------------
*  Dr. David Martin                  Biotechnology Centre of Oslo   *
*  Node Manager                      Gaustadalleen 21               *
*  The Norwegian EMBNet Node         P.O. box 1125 Blindern         *
*  tel +47 22 95 87 56               N-0317 Oslo                    *
*  fax +47 22 69 41 30               Norway                         * 
---------------------------------------------------------------------