[Biojava-l] Re: Biojava-l digest, Vol 1 #6 - 3 msgs

Mark Schreiber mark_s@sanger.otago.ac.nz
Tue, 25 Jan 2000 09:58:15 +1300 (NZDT)


> 
> On Mon, 24 Jan 2000, David Martin wrote:
> 
> > A few comments. 
> > I should mention that I am a biologist as background so if I don't
> > understand what is going on (however smart it may be) you may be limiting
> > yourself to a few expert comp sci types to write more code. My hopes for
> > biojava is that it is readily useable by relative novices as well.
> 
> My intention is to make a large, useful biosequence package with an easy
> and intuitive API so that people without CS training can use it without
> problems.  (By the way, I am not a computer scientist either.  I just got
> lucky by getting to take some really good CS courses.)

Most biologists aren't computer scientists. Personally I am self-taught
with no formal education, which is why I violate encapsulation whenever I
think it makes sense, i.e. when I want to make pluggable tools that leave
my sequence class lightweight.


> > > I have just started writing.  I currently have working classes for
> > > ProteinSequence, DNASequence, DNASequenceList, and others.  My objects are
> > > pretty smart.  For example DNASequence has a method transcribe() that
> > > returns an RNASequence.  DNASequenceList has a great static method that
> > > takes a FASTA file of the genome as an argument and returns a
> > > DNASequenceList encapsulating all DNASequences (genes) for that genome. 
> > 
> > I hate to throw a spanner in the works but that doesn't make much
> > biological sense when we can't even predict accurately the genes in a
> > genome. And FASTA isn't exactly what I would call a highly annotated
> > sequence format.
> > 
> 
> I think it makes very much biological sense to do this.  Gene prediction
> on bacterial genomes has been very successful (especially compared to
> eukaryotic genomes).  There are currently 24 microbial genomes available
> for download.  If you have an algorithm you want to test on a genomic
> scale, my method is perfect for you.  You can with one line of code make
> an object that encapsulates each coding sequence of a genome for
> subsequent testing of your algorithm.  Then you can do the same test on
> the other 23 genomes.

Why not use the GenBank annotation? Personally I didn't because I didn't
feel like having a feature table in my sequence class. I will probably
extend it at some stage to a more heavyweight class that includes such
things, but I like to keep overhead down when I can. (I can't always get
on the UltraSPARC.)

> 
> > These are static properties kept in an external class (or a particular
> > instance of an AminoAcidTable class). So replacing the 
> > this.getCharAt(i).isCharged() 
> > with
> > AminoAcidTable.isCharged(this.getCharAt(i))
> > is the only substantive change.
> 
> What you say is true enough, but it violates encapsulation.  Instead of
> the amino acid's state being defined internally, you have to use an
> external table to look it up.  

Like I said, if violating encapsulation makes a class smaller and more
adaptable to other packages, violate away.
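
As a rough sketch of the table approach: only the name AminoAcidTable and
the isCharged(char) call come from the thread. LightSequence, ChargeCounter
and the choice of D, E, K and R as the charged residues are assumptions made
for illustration, not existing biojava code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical external table; only the name and isCharged(char) come from the thread.
final class AminoAcidTable {
    // Assumption: D, E, K and R count as charged; histidine is left out.
    private static final Set<Character> CHARGED = new HashSet<>();
    static {
        for (char c : "DEKR".toCharArray()) {
            CHARGED.add(c);
        }
    }

    private AminoAcidTable() {}

    static boolean isCharged(char residue) {
        return CHARGED.contains(Character.toUpperCase(residue));
    }
}

// Hypothetical lightweight sequence: it stores the string and nothing else.
final class LightSequence {
    private final String residues;

    LightSequence(String residues) {
        this.residues = residues;
    }

    char getCharAt(int i) {
        return residues.charAt(i);
    }

    int length() {
        return residues.length();
    }
}

// A pluggable tool that lives outside the sequence class.
final class ChargeCounter {
    private ChargeCounter() {}

    static int countCharged(LightSequence seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (AminoAcidTable.isCharged(seq.getCharAt(i))) {
                count++;
            }
        }
        return count;
    }
}
```

Under these assumptions, ChargeCounter.countCharged(new LightSequence("MIKEMARSH"))
counts K, E and R, and the sequence class itself never learns anything about
charge.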

> 
> > 
> > Other reasons to use strings would be:
> > Biologists think of sequences as strings.
> > Plenty of methods readily available for substring searching and
> > manipulation.
> 
> 
> My Sequence objects can return Strings, substrings, etc, that work fine
> with other string-manipulating methods.
> All object implementations can have wrappers that let you treat them like
> strings.  This is really easy.  And for users like you who need an API but
> don't need to worry about the internals, it will be perfect.  Like I
> explained to Thomas earlier, you can build constructors that use strings:
> 
> ProteinSequence myName = new ProteinSequence("MIKEMARSH")
> constructs an object whose data is stored internally as 
> data.elementAt(0) = Met.instance
> data.elementAt(1) = Ile.instance
> data.elementAt(2) = Lys.instance
> data.elementAt(3) = Glu.instance 
> ...
> 

This is going to be useful because a number of people will only need
strings most of the time but occasionally will want to have the kind of
internal info your classes have.
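
A minimal sketch of how such a string constructor might look. Met.instance,
Ile.instance, Lys.instance, Glu.instance, ProteinSequence and elementAt come
from the quoted mail; the Residue base class, its methods and the toString()
wrapper are assumptions, and only four residues are filled in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: each residue carries its own state.
abstract class Residue {
    private final char symbol;
    private final boolean charged;

    protected Residue(char symbol, boolean charged) {
        this.symbol = symbol;
        this.charged = charged;
    }

    char symbol() { return symbol; }
    boolean isCharged() { return charged; }
}

// One singleton per residue type, as in the quoted Met.instance example.
final class Met extends Residue {
    static final Met instance = new Met();
    private Met() { super('M', false); }
}

final class Ile extends Residue {
    static final Ile instance = new Ile();
    private Ile() { super('I', false); }
}

final class Lys extends Residue {
    static final Lys instance = new Lys();
    private Lys() { super('K', true); }
}

final class Glu extends Residue {
    static final Glu instance = new Glu();
    private Glu() { super('E', true); }
}

final class ProteinSequence {
    private final List<Residue> data = new ArrayList<>();

    // Build the object list from a plain one-letter string.
    ProteinSequence(String letters) {
        for (char c : letters.toCharArray()) {
            data.add(lookup(c));
        }
    }

    // Only four residues are mapped in this sketch.
    private static Residue lookup(char c) {
        switch (Character.toUpperCase(c)) {
            case 'M': return Met.instance;
            case 'I': return Ile.instance;
            case 'K': return Lys.instance;
            case 'E': return Glu.instance;
            default:
                throw new IllegalArgumentException("Residue not in this sketch: " + c);
        }
    }

    Residue elementAt(int i) { return data.get(i); }

    int length() { return data.size(); }

    // String view, so ordinary String methods still work on the result.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Residue r : data) {
            sb.append(r.symbol());
        }
        return sb.toString();
    }
}
```

Someone who only wants strings can call toString() and carry on with
indexOf() or substring(), while elementAt(i).isCharged() is still there for
the occasions when the internal info is needed.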

> 
> > 2. Make Sequence an interface. 
> > Sequence is a property of a larger object. It would be great to be able to
> > cast a gene to sequence or a PDB structure to sequence so one is working
> > on the original object rather than having to call a getSequence() method
> > each time you want to do a sequence search on a PDB database.
> > eg Wise2Align(myPDB, myGenome);
> > for a method that takes two Sequence objects as arguments (and is smart
> > enough to work out which one is which type).
> 
> That is precisely why I have defined Sequence as an abstract class.  My
> ProteinSequence is specifically not 'final' for this reason.  You can
> extend ProteinSequence with ProteinStructure, a sequence of objects which 
> implement both the ProteinChar interface and the Drawable interface.
> AlaStructure extends Ala implements both ProteinChar interface and
> Drawable interface so it will inherit all of the
> intrinsic properties of Ala (charge, aromaticity, etc) and add coordinates
> (or Atom objects) for Calpha, Cbeta, etc.
> 
>

Good job. More abstract and fewer final classes will make the "final
biojava standard" easier to integrate with existing and future
developments.
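
For comparison, David's interface suggestion quoted above could be sketched
roughly like this. Everything here is hypothetical: the Sequence method
residues(), the PdbEntry and Gene classes, the identifiers passed to them
and the toy sharedPrefixLength() tool are illustrations only, not the
abstract-class design Mike describes.

```java
// Hypothetical interface: anything that has a sequence can be used as one.
interface Sequence {
    String residues();
}

// A structure record that is also usable wherever a Sequence is expected.
final class PdbEntry implements Sequence {
    private final String id;
    private final String chain;

    PdbEntry(String id, String chain) {
        this.id = id;
        this.chain = chain;
    }

    String id() { return id; }

    @Override
    public String residues() { return chain; }
}

// A gene record exposing its translated product through the same interface.
final class Gene implements Sequence {
    private final String name;
    private final String product;

    Gene(String name, String product) {
        this.name = name;
        this.product = product;
    }

    String name() { return name; }

    @Override
    public String residues() { return product; }
}

final class SequenceTools {
    private SequenceTools() {}

    // Toy stand-in for a real alignment: length of the common prefix.
    static int sharedPrefixLength(Sequence a, Sequence b) {
        String x = a.residues();
        String y = b.residues();
        int n = Math.min(x.length(), y.length());
        int i = 0;
        while (i < n && x.charAt(i) == y.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

The tool takes the original objects directly, so there is no getSequence()
call at each use.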

 
> 
> > 
> > This brings some limitations. One cannot then have the Sequence interface
> > allowing modification of the underlying object but one can work around
> > that in many ways (use the original object type for instance).
> > 
> > It also has advantages in that all objects can be OMG sequences, Sanger
> > sequences, and biojava sequences at the same time. Bit more heavyweight
> > but it will fit into all the right bits and not require any nasty
> > translators (except in the constructor).
> > 
> > > 
> > > Cool -- that's certainly the kind of functionality that's
> > > nice to have.  But I'd rather not have anything hardcoded:
> > > you can just about get away with this for transcribe()
> > > (although what happens for tRNA genes which contain lots of
> > > unusual nucleotide residues -- where do these get put in?)
> > > but think about converting an RNASequence to a ProteinSequence:
> > > you need to know whether to use the universal genetic code
> > > or some weird mitochondrial variant (you might like to look
> > > at what the OMG IDLs do to handle this).  
> > 
> > This is where a 'lookup table' works far better than an object list (IMHO)
> 
> Every nucleotide object knows its own internal state.  So
> DAde.instance.getComplement() returns Thy.instance.  This keeps all of the
> data neatly encapsulated without relying on tables.  I strongly believe in
> encapsulation; when I need to change my objects, I don't want to have to
> change a bunch of tables, I'd rather change it in one place, i.e. the
> class.
>

You shouldn't need to change the object at all, just pass it the
appropriate table. Be very careful when changing an object: as you
probably know, you can change the internal implementation but you must
keep the method calls the same.
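
A small sketch of what passing the table could look like for translation.
The CodonTable class, its factory methods and translate(CodonTable) are
assumptions for illustration, and each table holds only a handful of codons.
The differences shown (UGA and AUA) are real differences between the
universal code and the vertebrate mitochondrial code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical codon table that is handed to the sequence at call time.
final class CodonTable {
    private final Map<String, Character> codons;

    CodonTable(Map<String, Character> codons) {
        this.codons = new HashMap<>(codons);
    }

    // 'X' marks codons missing from these deliberately partial tables.
    char translate(String codon) {
        Character aa = codons.get(codon);
        return aa == null ? 'X' : aa;
    }

    // A few codons from the universal genetic code ('*' is stop).
    static CodonTable universalSubset() {
        Map<String, Character> m = new HashMap<>();
        m.put("AUG", 'M');
        m.put("AUA", 'I');
        m.put("UGG", 'W');
        m.put("UGA", '*');
        m.put("AAA", 'K');
        return new CodonTable(m);
    }

    // The same codons in the vertebrate mitochondrial code:
    // UGA reads as Trp and AUA reads as Met.
    static CodonTable vertebrateMitochondrialSubset() {
        Map<String, Character> m = new HashMap<>();
        m.put("AUG", 'M');
        m.put("AUA", 'M');
        m.put("UGG", 'W');
        m.put("UGA", 'W');
        m.put("AAA", 'K');
        return new CodonTable(m);
    }
}

// The sequence class stays the same whichever table is used.
final class RNASequence {
    private final String bases;

    RNASequence(String bases) {
        this.bases = bases;
    }

    String translate(CodonTable table) {
        StringBuilder protein = new StringBuilder();
        // Walk the sequence in whole codons; a trailing partial codon is ignored.
        for (int i = 0; i + 3 <= bases.length(); i += 3) {
            protein.append(table.translate(bases.substring(i, i + 3)));
        }
        return protein.toString();
    }
}
```

The same RNASequence("AUGAUAUGA") gives "MI*" with the universal subset and
"MMW" with the mitochondrial one, and the sequence class is not touched in
either case.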
 
> > I am still in the days of Java 1(.1) as most of what I have been doing had
> > to work round bugs in web browsers. IMHO we should aim to be as backward
> > compatible as reasonably possible. Again this comes down to implementation
> > so shouldn't affect the API presented by biojava. (possible to have two
> > streams, 1.1 and up-to-date? ugly but maybe necessary)
> > 
> 
> I disagree totally with this.  New extensions to java simplify the life of
> programmers.  New classes in java2 like Collections, Sets, Lists, etc
> should be embraced because they define API's which are easier to use than
> having all programmers come up with their own incarnations of Sets, Lists,
> etc.  Additionally, java is a rapidly evolving langauge, and that is good.
> Backwards compatiblity is not an issue.  The browsers already have
> plug-ins available to bring them up to speed, and it is only a matter of
> time before new versions are relased with built in up-to-date JVMs.
> 
> 
> -mike
> 

I think this will depend on who you want to distribute to. For a
computationally savvy crowd you could use java2, but for a more general
release use java1.1 or 1.2 (personally I use 1.2 as I can't afford to
update my IDE program and it has made me too lazy to use the JDK
(sigh)).

Mark

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mark Schreiber			Ph: 64 3 4797875
Rm 218				email mark_s@sanger.otago.ac.nz
Department of Biochemistry	email m.schreiber@clear.net.nz
University of Otago		
PO Box 56
Dunedin
New Zealand
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~