[Biojava-l] Re: Biojava-l digest, Vol 1 #6 - 3 msgs

Mike Marsh mm692227@bcm.tmc.edu
Mon, 24 Jan 2000 13:21:19 -0600 (CST)


David,

Thanks for the feedback.  My specific replies are below. 
What is OMG?



-------------------------------------------------------------
Mike Marsh
Graduate Student in Structural and Computational Biology
Baylor College of Medicine.  Houston, TX

FON: 713/798-6034
Permanent Email:  mikemarsh@bigfoot.com
-------------------------------------------------------------

On Mon, 24 Jan 2000, David Martin wrote:

> A few comments. 
> I should mention that I am a biologist as background so if I don't
> understand what is going on (however smart it may be) you may be limiting
> yourself to a few expert comp sci types to write more code. My hopes for
> biojava is that it is readily useable by relative novices as well.

My intention is to make a large, useful biosequence package with an easy
and intuitive API so that people without CS training can use it without
problems.  (By the way, I am not a computer scientist either.  I just got
lucky by getting to take some really good CS courses.)

> 
> Having said that..
> 
> 
> > I have just started writing.  I currently have working classes for
> > ProteinSequence, DNASequence, DNASequenceList, and others.  My objects are
> > pretty smart.  For example DNASequence has a method transcribe() that
> > returns an RNASequence.  DNASequenceList has a great static method that
> > takes a FASTA file of the genome as an argument and returns a
> > DNASequenceList encapsulating all DNASequences (genes) for that genome. 
> 
> I hate to throw a spanner in the works but that doesn't make much
> biological sense when we can't even predict accurately the genes in a
> genome. And FASTA isn't exactly what I would call a highly annotated
> sequence format.
> 

What is a 'spanner'?

I think it makes a great deal of biological sense to do this.  Gene
prediction on bacterial genomes has been very successful (especially
compared to eukaryotic genomes).  There are currently 24 microbial genomes
available for download.  If you have an algorithm you want to test on a
genomic scale, my method is perfect for you.  With one line of code you
can make an object that encapsulates each coding sequence of a genome for
subsequent testing of your algorithm.  Then you can run the same test on
the other 23 genomes.
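
The one-line genome load described above can be sketched roughly as
follows.  This is only an illustration, not the actual DNASequenceList
code: the names GeneRecord, GeneList, and fromFasta are hypothetical, and
it assumes a plain FASTA input in which every '>' header line starts one
coding sequence.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a single DNASequence (one gene).
final class GeneRecord {
    final String header;
    final String bases;

    GeneRecord(String header, String bases) {
        this.header = header;
        this.bases = bases;
    }
}

// Hypothetical stand-in for DNASequenceList.
final class GeneList {
    private final List<GeneRecord> records = new ArrayList<>();

    // Reads every FASTA record from the given source into one list.
    static GeneList fromFasta(Reader source) {
        GeneList list = new GeneList();
        String header = null;
        StringBuilder bases = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                if (line.startsWith(">")) {
                    // A new header closes the previous record, if any.
                    if (header != null) {
                        list.records.add(new GeneRecord(header, bases.toString()));
                    }
                    header = line.substring(1).trim();
                    bases.setLength(0);
                } else if (header != null) {
                    bases.append(line.toUpperCase(Locale.ROOT));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (header != null) {
            list.records.add(new GeneRecord(header, bases.toString()));
        }
        return list;
    }

    int size() {
        return records.size();
    }

    GeneRecord get(int index) {
        return records.get(index);
    }
}
```

Under these assumptions, loading a genome is a single call such as
GeneList.fromFasta(reader), and the same algorithm can then be run over
every record in the list.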

> These are static properties kept in an external class (or a particular
> instance of an AminoAcidTable class). So replacing the 
> this.getCharAt(i).isCharged() 
> with
> AminoAcidTable.isCharged(this.getCharAt(i))
> is the only substantive change.

What you say is true enough, but it violates encapsulation.  Instead of
the amino acid's state being defined internally, you have to use an
external table to look it up.  
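
To make the two styles concrete, here is a minimal sketch of both.  The
AminoAcid interface and the Lys and Gly classes are hypothetical
illustrations, not the real package, and treating only Asp, Glu, Lys, and
Arg as charged is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Style A: the property lives in an external lookup table.
final class AminoAcidTable {
    // Simplifying assumption: only D, E, K, and R count as charged.
    private static final Set<Character> CHARGED =
            new HashSet<>(Arrays.asList('D', 'E', 'K', 'R'));

    static boolean isCharged(char code) {
        return CHARGED.contains(Character.toUpperCase(code));
    }
}

// Style B: each residue object carries its own state.
interface AminoAcid {
    char code();
    boolean isCharged();
}

final class Lys implements AminoAcid {
    static final Lys instance = new Lys();

    private Lys() {}

    public char code() { return 'K'; }
    public boolean isCharged() { return true; }
}

final class Gly implements AminoAcid {
    static final Gly instance = new Gly();

    private Gly() {}

    public char code() { return 'G'; }
    public boolean isCharged() { return false; }
}
```

With style A the caller writes AminoAcidTable.isCharged('K'); with style B
it writes Lys.instance.isCharged(), and changing what "charged" means for
lysine touches only the Lys class.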

> 
> Other reasons to use strings would be:
> Biologists think of sequences as strings.
> Plenty of methods readily available for substring searching and
> manipulation.


My Sequence objects can return Strings, substrings, etc., that work fine
with other string-manipulating methods.
All object implementations can have wrappers that let you treat them like
strings.  This is really easy.  And for users like you who need an API but
don't need to worry about the internals, it will be perfect.  As I
explained to Thomas earlier, you can build constructors that use strings:

ProteinSequence myName = new ProteinSequence("MIKEMARSH")
constructs an object whose data is stored internally as 
data.elementAt(0) = Met.instance
data.elementAt(1) = Ile.instance
data.elementAt(2) = Lys.instance
data.elementAt(3) = Glu.instance 
...
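
A rough sketch of such a string-based constructor follows.  It is an
assumption-laden illustration rather than the real ProteinSequence: the
class names SharedResidue and ProteinSequenceSketch are hypothetical, and
a single cached class stands in for the per-residue classes (Met, Ile,
Lys, ...) named above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-residue singletons (Met.instance, ...).
// One shared object exists per one-letter code. Not thread-safe.
final class SharedResidue {
    private static final String STANDARD = "ACDEFGHIKLMNPQRSTVWY";
    private static final Map<Character, SharedResidue> CACHE = new HashMap<>();

    private final char code;

    private SharedResidue(char code) {
        this.code = code;
    }

    static SharedResidue of(char letter) {
        char c = Character.toUpperCase(letter);
        if (STANDARD.indexOf(c) < 0) {
            throw new IllegalArgumentException("Not a standard amino acid code: " + letter);
        }
        SharedResidue residue = CACHE.get(c);
        if (residue == null) {
            residue = new SharedResidue(c);
            CACHE.put(c, residue);
        }
        return residue;
    }

    char code() {
        return code;
    }
}

// Hypothetical sequence class: built from a String, stored as objects.
final class ProteinSequenceSketch {
    private final List<SharedResidue> data = new ArrayList<>();

    ProteinSequenceSketch(String letters) {
        for (char c : letters.toCharArray()) {
            data.add(SharedResidue.of(c));
        }
    }

    SharedResidue elementAt(int index) {
        return data.get(index);
    }

    int length() {
        return data.size();
    }

    // String view, so ordinary string methods still work on the result.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(data.size());
        for (SharedResidue residue : data) {
            sb.append(residue.code());
        }
        return sb.toString();
    }
}
```

Callers who think in strings can use toString() and then indexOf(),
substring(), and so on, while the internal storage stays a list of
residue objects.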

> And reasons to use an object list:
> extensible to other properties, ie an Amino Acid object could also have 3D
> coordinates associated with it..
> 
> Which leads me to two conclusions:
> 
> 1. Remember OOP 101? Encapsulation.
> Basically I don't care how the sequence is implemented internally because
> I am not messing about with that. All I need is an interface with suitable
> methods (Encapsulation) and letting those who write the underlying objects
> worry about whether for a given instance it is better to use a string

I agree 100%.



> 2. Make Sequence an interface. 
> Sequence is a property of a larger object. It would be great to be able to
> cast a gene to sequence or a PDB structure to sequence so one is working
> on the original object rather than having to call a getSequence() method
> each time you want to do a sequence search on a PDB database.
> eg Wise2Align(myPDB, myGenome);
> for a method that takes two Sequence objects as arguments (and is smart
> enough to work out which one is which type).

That is precisely why I have defined Sequence as an abstract class.  My
ProteinSequence is specifically not 'final' for this reason.  You can
extend ProteinSequence with ProteinStructure, a sequence of objects which 
implement both the ProteinChar interface and the Drawable interface.
AlaStructure extends Ala and implements both the ProteinChar and Drawable
interfaces, so it inherits all of the intrinsic properties of Ala (charge,
aromaticity, etc.) and adds coordinates (or Atom objects) for Calpha,
Cbeta, etc.
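
The inheritance arrangement can be sketched as below.  The method names on
ProteinChar and Drawable, the coordinate field, and the SequenceTools
helper are all assumptions made for illustration; only the class and
interface names come from the description above.

```java
import java.util.List;

// Assumed shape of the ProteinChar interface.
interface ProteinChar {
    char code();
    boolean isCharged();
}

// Assumed shape of the Drawable interface.
interface Drawable {
    String draw();
}

// Plain residue: intrinsic properties only. Not final, so it can be extended.
class Ala implements ProteinChar {
    static final Ala instance = new Ala();

    protected Ala() {}

    public char code() { return 'A'; }
    public boolean isCharged() { return false; }
}

// Structural residue: inherits Ala's properties and adds coordinates.
final class AlaStructure extends Ala implements Drawable {
    private final double[] calpha; // hypothetical Calpha position (x, y, z)

    AlaStructure(double x, double y, double z) {
        this.calpha = new double[] {x, y, z};
    }

    double[] calpha() {
        return calpha.clone();
    }

    public String draw() {
        return "Ala CA at (" + calpha[0] + ", " + calpha[1] + ", " + calpha[2] + ")";
    }
}

// Hypothetical helper: any code written against ProteinChar accepts both kinds.
final class SequenceTools {
    static String codes(List<? extends ProteinChar> residues) {
        StringBuilder sb = new StringBuilder();
        for (ProteinChar residue : residues) {
            sb.append(residue.code());
        }
        return sb.toString();
    }
}
```

Because AlaStructure is still an Ala, sequence-level code such as
SequenceTools.codes() works on structure residues without a separate
getSequence() step.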



> 
> This brings some limitations. One cannot then have the Sequence interface
> allowing modification of the underlying object but one can work around
> that in many ways (use the original object type for instance).
> 
> It also has advantages in that all objects can be OMG sequences, Sanger
> sequences, and biojava sequences at the same time. Bit more heavyweight
> but it will fit into all the right bits and not require any nasty
> translators (except in the constructor).
> 
> > 
> > Cool -- that's certainly the kind of functionality that's
> > nice to have.  But I'd rather not have anything hardcoded:
> > you can just about get away with this for transcribe()
> > (although what happens for tRNA genes which contain lots of
> > unusual nucleotide residues -- where do these get put in?)
> > but think about converting an RNASequence to a ProteinSequence:
> > you need to know whether to use the universal genetic code
> > or some weird mitochondrial variant (you might like to look
> > at what the OMG IDLs do to handle this).  
> 
> This is where a 'lookup table' works far better than an object list (IMHO)

Every nucleotide object knows its own internal state.  So
DAde.instance.getComplement() returns Thy.instance.  This keeps all of the
data neatly encapsulated without relying on tables.  I strongly believe in
encapsulation: when I need to change my objects, I don't want to have to
change a bunch of tables.  I'd rather change it in one place, i.e. the
class.
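
A minimal sketch of self-describing nucleotides, assuming a Nucleotide
interface and a Complementer helper that are hypothetical; only
DAde.instance, Thy.instance, and getComplement() come from the text above,
and only the A/T pair is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed common interface for nucleotide objects.
interface Nucleotide {
    char code();
    Nucleotide getComplement();
}

final class DAde implements Nucleotide {
    static final DAde instance = new DAde();

    private DAde() {}

    public char code() { return 'A'; }

    // The pairing rule lives inside the class, not in a table.
    public Nucleotide getComplement() { return Thy.instance; }
}

final class Thy implements Nucleotide {
    static final Thy instance = new Thy();

    private Thy() {}

    public char code() { return 'T'; }

    public Nucleotide getComplement() { return DAde.instance; }
}

// Hypothetical helper that complements a strand base by base.
final class Complementer {
    static List<Nucleotide> complement(List<? extends Nucleotide> strand) {
        List<Nucleotide> result = new ArrayList<>(strand.size());
        for (Nucleotide base : strand) {
            result.add(base.getComplement());
        }
        return result;
    }
}
```

The helper never consults a lookup table; it simply asks each base for its
partner.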

> I am still in the days of Java 1(.1) as most of what I have been doing had
> to work round bugs in web browsers. IMHO we should aim to be as backward
> compatible as reasonably possible. Again this comes down to implementation
> so shouldn't affect the API presented by biojava. (possible to have two
> streams, 1.1 and up-to-date? ugly but maybe necessary)
> 

I disagree totally with this.  New extensions to Java simplify the life of
programmers.  New classes in Java 2 like Collections, Sets, Lists, etc.
should be embraced because they define APIs which are easier to use than
having all programmers come up with their own incarnations of Sets, Lists,
etc.  Additionally, Java is a rapidly evolving language, and that is good.
Backwards compatibility is not an issue.  The browsers already have
plug-ins available to bring them up to speed, and it is only a matter of
time before new versions are released with built-in up-to-date JVMs.


-mike