[Biojava-dev] Contributing chromatogram support to BioJava

Rhett Sutphin rhett-sutphin at uiowa.edu
Mon Mar 10 16:22:20 EST 2003


I sent this message this morning, but it got held for having 
"suspicious headers" and hasn't been approved by the moderator.  In the 
hopes that the suspicious part was the attachment, I'm resending the 
message without it.

----

Hi Matthew,

Thanks for the quick reply.  I still have some questions.

On Monday, March 10, 2003, at 05:22  AM, Matthew Pocock wrote:
> Back in the good old days, we made pretty much everything public. Then 
> we realised that was bad. Unfortunately, the really old packages have 
> not been totally spring-cleaned for cruftily exposed API. Implementing 
> symbols properly is hard, which is why we attempt to provide all the 
> tools for creating your own without writing new classes. Hey ho.

I'm guessing from this that the reason you want to keep some things 
package-level is to avoid them being "published API" and thereby avoid 
being required to keep their interfaces stable.  That could very well 
be a good reason.  On the other hand, making the Simple*Symbol classes 
public and defining their APIs could make implementing symbols a lot 
easier.  For instance, subclassing from SimpleBasisSymbol I was able to 
create a functioning BasisSymbol by creating a pair of alphabets and 
then using them to fill in the SimpleBasisSymbol#symbols and 
SimpleBasisSymbol#matches fields.

BTW, I think that the tools for creating and using Alphabets and 
Symbols are well thought out and nicely documented.  I just think that 
they aren't sufficient for my needs in this case, as I'll explain in a 
moment.

> Ok - so you want an alphabet that contains symbols that are a DNA 
> nucleotide and an integer. You can do that with some variant of the 
> following:
> <useful Alphabet creation/use examples snipped>

I did do this, but I did it in the context of defining an Alphabet for 
this new type of BasisSymbol called BaseCall.  The reason why I did 
this instead of just defining the Alphabet and using getSymbol (as you 
suggest) is twofold:

1) BaseCalls need to be annotatable (upon creation).  SCFs, for 
instance, contain seven quality values associated with each call.  The 
most natural way (to me) to associate those values with each base call 
is through an Annotation.  Is there another way that would be better?

2) I wanted to provide a way to get at the two halves of each base call 
by name.  That is, instead of doing:

   BasisSymbol basecall = (BasisSymbol) chromat.getBaseCalls().get(3);
   // getSymbols() returns a java.util.List, so indices start at 0
   Symbol callDNA = (Symbol) basecall.getSymbols().get(0);
   int callOffset = ((IntegerAlphabet.IntegerSymbol)
       basecall.getSymbols().get(1)).intValue();

You could just do:

   BaseCall basecall = (BaseCall) chromat.getBaseCalls().get(3);
   Symbol callDNA = basecall.getNucleotide();
   int callOffset = basecall.getOffset();

The problem I am most trying to avoid is requiring users of the class 
to know that the first subsymbol of a base call is the nucleotide and 
the second is the peak offset.  It seems like that information should 
be abstracted away.  Since you suggested that subclassing is not the 
way to go, I thought of an alternative.  I could define a class called 
ChromatogramTools and give it methods like these:

   public static int getBaseCallOffset(Symbol basecall)
       throws IllegalSymbolException;
   public static Symbol getBaseCallNucleotide(Symbol basecall)
       throws IllegalSymbolException;

Which would turn the example above into:

   Symbol basecall = chromat.getBaseCalls().get(3);
   try {
     Symbol callDNA = ChromatogramTools.getBaseCallNucleotide(basecall);
     int callOffset = ChromatogramTools.getBaseCallOffset(basecall);
   } catch (IllegalSymbolException ise) {
     throw new BioError(ise, "Can't happen unless there is a problem "
         + "with the chromatogram implementation");
   }

The thing I don't like about the alternative method is that those 
"tools" methods will have to throw IllegalSymbolExceptions since the 
basecall parameter's type is just Symbol (and so might not be a member 
of the base call alphabet).  Therefore you have to wrap every 
invocation of them in a try block, even though (with a well-behaved 
Chromatogram implementation) you are guaranteed the exception won't be 
thrown.

The basic OO way around this is to give the parameter a strictly 
defined type.  That way the run-time IllegalSymbolException becomes a 
compile-time error instead.
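
To make the compile-time versus run-time distinction concrete, here is 
a minimal self-contained sketch.  It deliberately uses stand-in types 
rather than the real BioJava classes: Sym, BadSymbolException, 
LooseTools and TypedCall are all made-up names for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for a general symbol type (hypothetical, not BioJava's Symbol).
interface Sym {
    List<Object> parts();
}

// Stand-in for a checked exception such as IllegalSymbolException.
class BadSymbolException extends Exception {
    BadSymbolException(String message) {
        super(message);
    }
}

// Loosely typed helper: it accepts any Sym, so it has to validate the
// argument at run time and declare a checked exception.
final class LooseTools {
    private LooseTools() {}

    static int offsetOf(Sym s) throws BadSymbolException {
        List<Object> parts = s.parts();
        if (parts.size() != 2 || !(parts.get(1) instanceof Integer)) {
            throw new BadSymbolException("not a base call: " + parts);
        }
        return (Integer) parts.get(1);
    }
}

// Strictly typed alternative: the accessors live on the type itself, so
// passing anything that is not a TypedCall is rejected by the compiler.
final class TypedCall implements Sym {
    private final char nucleotide;
    private final int offset;

    TypedCall(char nucleotide, int offset) {
        this.nucleotide = nucleotide;
        this.offset = offset;
    }

    char nucleotide() {
        return nucleotide;
    }

    int offset() {
        return offset;
    }

    public List<Object> parts() {
        return Arrays.<Object>asList(nucleotide, offset);
    }
}
```

With the typed version a caller writes call.offset() and needs no try 
block, whereas every call to LooseTools.offsetOf(...) has to handle 
BadSymbolException even when the argument is known to be valid.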

So it seems to me that the best way to handle this is a 
BasisSymbol-implementing class for BaseCalls.  It is the only way I see 
to handle these two issues.  Do you have another suggestion?
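
As a rough illustration of how one class could cover both issues 
(annotation at creation time plus named accessors), here is a small 
self-contained sketch.  It does not implement the real BasisSymbol or 
Annotation interfaces; AnnotatedBaseCall, the plain Map standing in for 
an Annotation, and any key names passed to it are assumptions made up 
for this example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a base call symbol. A real version would
// implement BioJava's BasisSymbol; this one only shows the shape of the API.
final class AnnotatedBaseCall {
    private final char nucleotide;
    private final int offset;
    // Plain map used in place of a BioJava Annotation (assumption).
    private final Map<String, Integer> annotation;

    AnnotatedBaseCall(char nucleotide, int offset,
                      Map<String, Integer> qualities) {
        this.nucleotide = nucleotide;
        this.offset = offset;
        // Copy the values at creation time and expose them read-only, so
        // later changes to the caller's map do not affect this base call.
        this.annotation = Collections.unmodifiableMap(
            new LinkedHashMap<String, Integer>(qualities));
    }

    // Named accessors: callers never need to know subsymbol positions.
    char getNucleotide() {
        return nucleotide;
    }

    int getOffset() {
        return offset;
    }

    Map<String, Integer> getAnnotation() {
        return annotation;
    }
}
```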

Rhett

BTW: I've attached the code for BaseCall in case my prose argument 
above wasn't clear.


--
Rhett Sutphin
Research Assistant (Software)
Coordinated Laboratory for Computational Genomics
   and the Center for Macular Degeneration
University of Iowa - Iowa City, IA 52242 - USA
4111 MEBRF - email: rhett-sutphin at uiowa.edu


