[Biojava-l] Re: Biojava-l digest, Vol 1 #6 - 3 msgs

Mon, 24 Jan 2000 22:06:01 +0100

On Mon, 24 Jan 2000, Mike Marsh wrote:

> David,
> 
> Thanks for the feedback.  My specific replies are below. 
> What is OMG?
> 

The Object Management Group, specifically the Life Sciences Research Domain
Task Force's Biomolecular Sequence Analysis proposals.
(How's that for a short, snappy title?)

> > A few comments. 
> > I should mention that I am a biologist as background so if I don't
> > understand what is going on (however smart it may be) you may be limiting
> > yourself to a few expert comp sci types to write more code. My hopes for
> > biojava is that it is readily useable by relative novices as well.
> 
> My intention is to make a large, useful biosequence package with an easy
> and intuitive API so that people without CS training can use it without
> problems.  (By the way, I am not a computer scientist either.  I just got
> lucky by getting to take some really good CS courses.)
> 
> > 
> > Having said that..
> > 
> > 
> > > I have just started writing.  I currently have working classes for
> > > ProteinSequence, DNASequence, DNASequenceList, and others.  My objects are
> > > pretty smart.  For example DNASequence has a method transcribe() that
> > > returns an RNASequence.  DNASequenceList has a great static method that
> > > takes a FASTA file of the genome as an argument and returns a
> > > DNASequenceList encapsulating all DNASequences (genes) for that genome. 
> > 
> > I hate to throw a spanner in the works but that doesn't make much
> > biological sense when we can't even predict accurately the genes in a
> > genome. And FASTA isn't exactly what I would call a highly annotated
> > sequence format.
> > 
> 
> What is a 'spanner'?

A British wrench.

> 
> I think it makes very much biological sense to do this.  Gene prediction
> on bacterial genomes has been very successful (especially compared to
> eukaryotic genomes).  There are currently 24 microbial genomes available
> for download.  If you have an algorithm you want to test on a genomic
> scale, my method is perfect for you.  You can with one line of code make
> an object that encapsulates each coding sequence of a genome for
> subsequent testing of your algorithm.  Then you can do the same test on
> the other 23 genomes.

That holds if you have a specific genetic code for each organism and if you
make it quite clear that this is a PREDICTION, not a statement of fact.
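
A minimal sketch in Java of what those two conditions could look like: a
coding sequence that carries its own genetic code and is explicitly typed
as a prediction. All names here (GeneticCode, PredictedGene, isPrediction,
predictedBy) are hypothetical and not taken from any existing biojava code,
and the codon tables hold only the two codons needed for the illustration.
The one biological fact relied on is that TGA is a stop codon in the
standard code but encodes tryptophan in Mycoplasma.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from existing biojava code.
final class GeneticCode {
    private final String name;
    private final Map<String, Character> codons = new HashMap<>();

    GeneticCode(String name) {
        this.name = name;
    }

    // Register one codon; returns this so that calls can be chained.
    GeneticCode with(String codon, char aminoAcid) {
        codons.put(codon, aminoAcid);
        return this;
    }

    String name() {
        return name;
    }

    // Translate codon by codon. Codons missing from the table become 'X';
    // a trailing partial codon is ignored.
    String translate(String dna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            Character aminoAcid = codons.get(dna.substring(i, i + 3));
            protein.append(aminoAcid == null ? 'X' : aminoAcid.charValue());
        }
        return protein.toString();
    }
}

// A coding sequence that never pretends to be more than a prediction.
final class PredictedGene {
    private final String dna;
    private final GeneticCode code;
    private final String predictedBy; // which method produced the call

    PredictedGene(String dna, GeneticCode code, String predictedBy) {
        this.dna = dna;
        this.code = code;
        this.predictedBy = predictedBy;
    }

    // Always true: the type itself says "this is a prediction".
    boolean isPrediction() {
        return true;
    }

    String predictedBy() {
        return predictedBy;
    }

    // Translation uses the code attached to this gene, not a global default.
    String translate() {
        return code.translate(dna);
    }
}
```

With this shape, the same DNA string "ATGTGA" attached to a standard-code
table and to a Mycoplasma table translates differently, which is the point
of keeping the genetic code with the organism.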

> 
> > These are static properties kept in an external class (or a particular
> > instance of an AminoAcidTable class). So replacing the 
> > this.getCharAt(i).isCharged() 
> > with
> > AminoAcidTable.isCharged(this.getCharAt(i))
> > is the only substantive change.
> 
> What you say is true enough, but it violates encapsulation.  Instead of
> the amino acid's state being defined internally, you have to use an
> external table to look it up.  

If each sequence has an instance of AminoAcidTable as a sequence property
(much like a nucleotide sequence would have a 'Genetic Code' object) then
encapsulation is not violated.
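
A minimal sketch of that arrangement, in Java. The class names follow the
ones used in this thread, but the constructor, the isChargedAt() and
countCharged() methods, and the choice of D, E, K and R as the charged
residues (histidine left out for simplicity) are assumptions made for
illustration only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: residue properties live in one table object.
final class AminoAcidTable {
    private final Set<Character> charged = new HashSet<>();

    // chargedResidues lists the one-letter codes to treat as charged.
    AminoAcidTable(String chargedResidues) {
        for (char c : chargedResidues.toCharArray()) {
            charged.add(c);
        }
    }

    boolean isCharged(char residue) {
        return charged.contains(residue);
    }
}

// The sequence owns a reference to its table, so callers never see the
// lookup: they ask the sequence, and the sequence asks the table.
final class ProteinSequence {
    private final String residues;
    private final AminoAcidTable table;

    ProteinSequence(String residues, AminoAcidTable table) {
        this.residues = residues;
        this.table = table;
    }

    char getCharAt(int i) {
        return residues.charAt(i);
    }

    boolean isChargedAt(int i) {
        return table.isCharged(residues.charAt(i));
    }

    int countCharged() {
        int count = 0;
        for (int i = 0; i < residues.length(); i++) {
            if (isChargedAt(i)) {
                count++;
            }
        }
        return count;
    }
}
```

Callers write mySequence.isChargedAt(i) and never touch the table, so the
table stays an implementation detail of the sequence.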

> 
> > 
> > Other reasons to use strings would be:
> > Biologists think of sequences as strings.
> > Plenty of methods readily available for substring searching and
> > manipulation.
> 
> 
> My Sequence objects can return Strings, substrings, etc, that work fine
> with other string-manipulating methods.
> All object implementations can have wrappers that let you treat them like
> strings.  This is really easy.  And for users like you who need an API but
> don't need to worry about the internals, it will be perfect.  Like I
> explained to Thomas earlier, you can build constructors that use strings:

> 
> ProteinSequence myName = new ProteinSequence("MIKEMARSH")
> constructs an object whose data is stored internally as 
> data.elementAt(0) = Met.instance
> data.elementAt(1) = Ile.instance
> data.elementAt(2) = Lys.instance
> data.elementAt(3) = Glu.instance 
> ...
> 
> > And reasons to use an object list:
> > extensible to other properties, ie an Amino Acid object could also have 3D
> > coordinates associated with it..
> > 
> > Which leads me to two conclusions:
> > 
> > 1. Remember OOP 101? Encapsulation.
> > Basically I don't care how the sequence is implemented internally because
> > I am not messing about with that. All I need is an interface with suitable
> > methods (Encapsulation) and letting those who write the underlying objects
> > worry about whether for a given instance it is better to use a string
> 
> I agree 100%.
> 
> 
> 
> > 2. Make Sequence an interface. 
> > Sequence is a property of a larger object. It would be great to be able to
> > cast a gene to sequence or a PDB structure to sequence so one is working
> > on the original object rather than having to call a getSequence() method
> > each time you want to do a sequence search on a PDB database.
> > eg Wise2Align(myPDB, myGenome);
> > for a method that takes two Sequence objects as arguments (and is smart
> > enough to work out which one is which type).
> 
> That is precisely why I have defined Sequence as an abstract class.  My
> ProteinSequence is specifically not 'final' for this reason.  You can
> extend ProteinSequence with ProteinStructure, a sequence of objects which 
> implement both the ProteinChar interface and the Drawable interface.
> AlaStructure extends Ala implements both ProteinChar interface and
> Drawable interface so it will inherit all of the
> intrinsic properties of Ala (charge, aromaticity, etc) and add coordinates
> (or Atom objects) for Calpha, Cbeta, etc.

So we have some ancestral abstract sequence class that goes on to be
split into protein and nucleic acid, and then we start to duplicate
properties such as three-dimensional structure and near contacts (a DALI-type
plot for proteins, a similar plot for RNA folding, etc.).

That is why I like interfaces because then I can (to use a biological
metaphor) have horizontal transfer and convergent evolution as well as
divergent.
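
As an illustration of that "horizontal transfer", here is a small Java
sketch in which the structural property is an interface of its own, so a
protein entry and a folded RNA can both pick it up without sharing an
ancestor that owns it. Every name below (Sequence, Structured,
ProteinRecord, PdbEntry, FoldedRna, SequenceTools) is hypothetical, and
identicalPositions() is only a trivial stand-in for a real alignment
method.

```java
// Hypothetical sketch: capabilities are interfaces, not positions in a
// single inheritance tree.
interface Sequence {
    String residues();
}

interface Structured {
    int atomCount();
}

// A plain protein sequence with no structural data.
final class ProteinRecord implements Sequence {
    private final String aminoAcids;

    ProteinRecord(String aminoAcids) {
        this.aminoAcids = aminoAcids;
    }

    public String residues() {
        return aminoAcids;
    }
}

// A structure entry that can also be used directly as a sequence.
final class PdbEntry implements Sequence, Structured {
    private final String aminoAcids;
    private final int atomCount;

    PdbEntry(String aminoAcids, int atomCount) {
        this.aminoAcids = aminoAcids;
        this.atomCount = atomCount;
    }

    public String residues() {
        return aminoAcids;
    }

    public int atomCount() {
        return atomCount;
    }
}

// An RNA acquires the same Structured capability "horizontally".
final class FoldedRna implements Sequence, Structured {
    private final String bases;
    private final int atomCount;

    FoldedRna(String bases, int atomCount) {
        this.bases = bases;
        this.atomCount = atomCount;
    }

    public String residues() {
        return bases;
    }

    public int atomCount() {
        return atomCount;
    }
}

final class SequenceTools {
    private SequenceTools() {
    }

    // Counts positions carrying the same residue, up to the shorter length.
    // Accepts anything that is a Sequence, whatever else it may be.
    static int identicalPositions(Sequence a, Sequence b) {
        int n = Math.min(a.residues().length(), b.residues().length());
        int same = 0;
        for (int i = 0; i < n; i++) {
            if (a.residues().charAt(i) == b.residues().charAt(i)) {
                same++;
            }
        }
        return same;
    }
}
```

A PdbEntry can be handed straight to identicalPositions() without a
getSequence() call, and code that only cares about structure can accept
Structured and receive either a PdbEntry or a FoldedRna.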

Is a PDB structure best derived from a simple sequence object or from
something else?

How does a gene object (which by dint of current environment we think of
as a sequence) fit into a metabolic pathway or regulome object and link
with expression data?

(OK, the comp scis on the list are probably going crazy now as I have
uttered profanities in the temple or something equivalent)

But if biojava is to implement bio in Java then we have to be aware of
the constraints of Java and the messy nature of the biological 'omes. (If
only biology had been designed by an object-oriented programmer.)

> > 
> > This brings some limitations. One cannot then have the Sequence interface
> > allowing modification of the underlying object but one can work around
> > that in many ways (use the original object type for instance).
> > 
> > It also has advantages in that all objects can be OMG sequences, Sanger
> > sequences, and biojava sequences at the same time. Bit more heavyweight
> > but it will fit into all the right bits and not require any nasty
> > translators (except in the constructor).
> > 
> > > 
> > > Cool -- that's certainly the kind of functionality that's
> > > nice to have.  But I'd rather not have anything hardcoded:
> > > you can just about get away with this for transcribe()
> > > (although what happens for tRNA genes which contain lots of
> > > unusual nucleotide residues -- where do these get put in?)
> > > but think about converting an RNASequence to a ProteinSequence:
> > > you need to know whether to use the universal genetic code
> > > or some weird mitochondrial variant (you might like to look
> > > at what the OMG IDLs do to handle this).  
> > 
> > This is where a 'lookup table' works far better than an object list (IMHO)
> 
> Every nucleotide object knows its own internal state.  So
> DAde.instance.getComplement() returns Thy.instance.  This keeps all of the
> data neatly encapsulated without relying on tables.  I strongly believe in
> encapsulation; when I need to change my objects, I don't want to have to
> change a bunch of tables, I'd rather change it in one place, i.e. the
> class.

I also believe in nonduplication of redundant data (I've spent the last
few months playing with RDBMS for various admin and other reasons) so
where there is no specific need to have each entity in the sequence keep
its own data we should move to a lighter weight approach.
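
One possible way to have both properties, sketched in Java under stated
assumptions: if each nucleotide is a single shared constant, the complement
rule lives inside the nucleotide, yet a sequence can still be stored as a
plain string with no per-residue data. The names Nucleotide, complement()
and DnaStrings are hypothetical, and the enum construct used here arrived
in later Java versions than the ones discussed in this thread.

```java
// Hypothetical sketch: one shared constant per nucleotide, each knowing
// its own complement.
enum Nucleotide {
    A, C, G, T;

    Nucleotide complement() {
        switch (this) {
            case A:
                return T;
            case T:
                return A;
            case C:
                return G;
            default:
                return C; // remaining case is G
        }
    }
}

final class DnaStrings {
    private DnaStrings() {
    }

    // The sequence stays a lightweight String; nucleotide constants are
    // looked up only while computing. Characters other than A, C, G, T
    // make Nucleotide.valueOf throw IllegalArgumentException.
    static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            Nucleotide base = Nucleotide.valueOf(String.valueOf(dna.charAt(i)));
            out.append(base.complement().name());
        }
        return out.toString();
    }
}
```

Changing the complement rule still means editing one place, the enum, and
no sequence has to hold a separate object per residue.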

In other words, there is no need to use a PDB structure object to describe
a short peptide.

A range of suitable inherited objects should be available to allow the
coder to choose anywhere from the comprehensive (i.e. the nucleotide
sequence object with linked expression data, 3D structure and a map of all
the internal interactions) through to the lightweight (a simple string of
characters for doing a rapid sequence alignment).
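
A small Java sketch of such a range, with only two points on it: a
lightweight class that is little more than a string, and a heavier
subclass that adds linked data. The names SimpleSequence,
AnnotatedSequence and QuickCompare are hypothetical, expression data is
reduced to a list of numbers, and mismatches() is a deliberately crude
stand-in for a rapid alignment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the lightweight end of the range.
class SimpleSequence {
    private final String residues;

    SimpleSequence(String residues) {
        this.residues = residues;
    }

    String residues() {
        return residues;
    }

    int length() {
        return residues.length();
    }
}

// A heavier variant that adds linked data but is still a SimpleSequence.
class AnnotatedSequence extends SimpleSequence {
    private final List<Double> expressionLevels = new ArrayList<>();

    AnnotatedSequence(String residues) {
        super(residues);
    }

    void addExpressionLevel(double level) {
        expressionLevels.add(level);
    }

    List<Double> expressionLevels() {
        return expressionLevels;
    }
}

final class QuickCompare {
    private QuickCompare() {
    }

    // Counts differing positions between two equal-length sequences.
    // Written against the lightweight type, so it accepts either variant.
    static int mismatches(SimpleSequence a, SimpleSequence b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences differ in length");
        }
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.residues().charAt(i) != b.residues().charAt(i)) {
                count++;
            }
        }
        return count;
    }
}
```

Fast routines written against SimpleSequence pay nothing for annotations
they do not use, while code that needs the linked data asks for an
AnnotatedSequence.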

> 
> > I am still in the days of Java 1(.1) as most of what I have been doing had
> > to work round bugs in web browsers. IMHO we should aim to be as backward
> > compatible as reasonably possible. Again this comes down to implementation
> > so shouldn't affect the API presented by biojava. (possible to have two
> > streams, 1.1 and up-to-date? Ugly but maybe necessary.)
> > 
> 
> I disagree totally with this.  New extensions to java simplify the life of
> programmers.  New classes in java2 like Collections, Sets, Lists, etc
> should be embraced because they define API's which are easier to use than
> having all programmers come up with their own incarnations of Sets, Lists,
> etc.  Additionally, java is a rapidly evolving language, and that is good.
> Backwards compatibility is not an issue.  The browsers already have
> plug-ins available to bring them up to speed, and it is only a matter of
> time before new versions are released with built-in up-to-date JVMs.

Note the 'reasonably possible'. I think we should aim to be fairly
conservative if 'reasonably possible' ie, not using java2 bells and
whistles for the sake of it. However there are many things that are
better done in java2..

..d

---------------------------------------------------------------------
*  Dr. David Martin                  Biotechnology Centre of Oslo   *
*  Node Manager                      Gaustadalleen 21               *
*  The Norwegian EMBNet Node         P.O. box 1125 Blindern         *
*  tel +47 22 95 87 56               N-0317 Oslo                    *
*  fax +47 22 69 41 30               Norway                         * 
---------------------------------------------------------------------