[Biojava-l] Serialization fallout from the Grand Symbol Change

Thomas Down td2@sanger.ac.uk
Tue, 19 Dec 2000 14:06:59 +0000


Hi.

A couple of weeks on from the Grand Symbol Changes, everything
seemed to be going smoothly until we hit a problem:
serialization of arbitrary symbols (be they AtomicSymbols,
BasisSymbols, or something else).

One of the important characteristics we've always tried to
preserve in BioJava is that the symbols contained within
FiniteAlphabets should always be singletons.  This means:

  - There is always exactly ONE Symbol object within
    your java virtual machine representing, say, the 
    DNA `T' symbol.

  - You can always compare symbols from FiniteAlphabets
    using object identity (== operator), rather than
    using the equals method.

Back in the dark ages, I was able to implement a system
whereby simple, atomic symbols (the only kind we had then)
like the DNA `T' could be serialized and deserialized 
while preserving this object identity property (see
AlphabetManager.WellKnownSymbol if you're interested --
the idea is that instead of serializing the Symbol object
directly, we serialize a special `place holder' object.  Upon
deserialization, this replaces itself with the corresponding
canonical symbol).  For a while everything worked nicely...
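
For anyone who hasn't met the place-holder trick, here is a minimal
sketch of it in plain Java.  This is NOT the actual
AlphabetManager.WellKnownSymbol code: the class names (Sym,
PlaceHolder), the "DNA.T" naming scheme and the static lookup table
are all made up for illustration.  The only real API it relies on is
the standard writeReplace/readResolve hooks of java.io serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class PlaceHolderSketch {
    /** Hypothetical stand-in for a singleton atomic symbol. */
    static final class Sym implements Serializable {
        // Illustrative global table of canonical symbols, keyed by name.
        private static final Map<String, Sym> CANONICAL = new HashMap<>();
        static final Sym T = register("DNA.T");
        static final Sym A = register("DNA.A");

        private final String name;

        private Sym(String name) { this.name = name; }

        private static Sym register(String name) {
            Sym s = new Sym(name);
            CANONICAL.put(name, s);
            return s;
        }

        static Sym forName(String name) { return CANONICAL.get(name); }

        // Serialization hook: write a place holder instead of this object.
        private Object writeReplace() { return new PlaceHolder(name); }
    }

    /** What actually goes into the stream. */
    static final class PlaceHolder implements Serializable {
        private final String name;

        PlaceHolder(String name) { this.name = name; }

        // Deserialization hook: swap ourselves for the canonical symbol.
        private Object readResolve() throws ObjectStreamException {
            Sym s = Sym.forName(name);
            if (s == null) {
                throw new InvalidObjectException("Unknown symbol: " + name);
            }
            return s;
        }
    }

    /** Serialize to a byte array and read the object straight back. */
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Object identity survives the round trip.
        System.out.println(roundTrip(Sym.T) == Sym.T); // prints true
    }
}
```

Note that this only works because every symbol can be found again
from a single global table, which is exactly the assumption that
stops holding once symbols can be arbitrarily complex.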

Having more complex symbol objects makes matters a /lot/
harder, though.  Just as there is a single `T' symbol
in the DNA alphabet, there should be a singleton (T T)
symbol in the alphabet (DNA x DNA).  And so on.  We're
looking for a new way of canonicalizing arbitrary symbols.
Possible options are:

  - Keep a pool of all known symbols in the AlphabetManager,
    and use that for canonicalization.  We've done this in
    the past but it seems a /bad idea/ -- especially if
    people start working with very complicated cross-product
    alphabets.  If the alphabet/symbol system is to scale,
    the pools of canonical symbols need to be associated with
    their containing alphabets, so they can be garbage-collected
    once the alphabet is no longer in use.

Options which keep the canonical symbol pool in the Alphabet instead:

  - Symbol objects keep a reference to their `primary' containing
    alphabet.  This makes serialization/deserialization relatively
    easy.  There has been resistance to this plan in the past,
    though -- does anyone have any trouble with it now?  From
    where I'm standing this looks like the simplest solution which
    is scalable and doesn't break anything too radically.  But
    it /does/ change the idea of what a symbol is slightly...

  - The standard symbol implementations are no longer Serializable.
    Objects which use Symbols (e.g. SymbolLists, Distributions)
    have to provide explicit serialization code.  In support of
    this, we add a method to AlphabetManager to construct a
    `place holder' object which encapsulates all the information
    necessary to reconstitute a given symbol (including the
    Alphabet to use for canonicalization purposes).

  - Something else I've missed?
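
To make the "symbols keep a reference to their primary containing
alphabet" option a bit more concrete, here is a rough sketch in plain
Java.  Everything in it is an assumption for illustration, not
existing BioJava API: the Alphabet and Sym classes, the by-name
alphabet registry, and the use of plain strings as symbol names.  The
point is only that the pool of canonical symbols lives in the
alphabet, and the place holder records which alphabet to ask.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class AlphabetPoolSketch {
    /** Hypothetical alphabet which owns the pool of its canonical symbols. */
    static final class Alphabet {
        // Illustrative registry of alphabets by name.  A real version would
        // probably hold these weakly, so that an unused alphabet (and its
        // symbol pool) can be garbage-collected.
        private static final Map<String, Alphabet> REGISTRY = new HashMap<>();

        final String name;
        private final Map<String, Sym> pool = new HashMap<>();

        private Alphabet(String name) { this.name = name; }

        static synchronized Alphabet forName(String name) {
            return REGISTRY.computeIfAbsent(name, Alphabet::new);
        }

        /** Return the one canonical symbol with this name, creating it if needed. */
        synchronized Sym symbol(String symbolName) {
            return pool.computeIfAbsent(symbolName, n -> new Sym(this, n));
        }
    }

    /** Hypothetical symbol which knows its primary containing alphabet. */
    static final class Sym implements Serializable {
        final Alphabet alphabet;
        final String name;

        private Sym(Alphabet alphabet, String name) {
            this.alphabet = alphabet;
            this.name = name;
        }

        // Write a place holder naming both the alphabet and the symbol.
        private Object writeReplace() {
            return new PlaceHolder(alphabet.name, name);
        }
    }

    static final class PlaceHolder implements Serializable {
        private final String alphabetName;
        private final String symbolName;

        PlaceHolder(String alphabetName, String symbolName) {
            this.alphabetName = alphabetName;
            this.symbolName = symbolName;
        }

        // Canonicalize through the containing alphabet's own pool.
        private Object readResolve() {
            return Alphabet.forName(alphabetName).symbol(symbolName);
        }
    }

    /** Serialize to a byte array and read the object straight back. */
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Alphabet dnaPairs = Alphabet.forName("DNA x DNA");
        Sym tt = dnaPairs.symbol("(T T)");
        System.out.println(roundTrip(tt) == tt); // prints true
    }
}
```

The sketch dodges the hard part, which is how the alphabet itself gets
found again on deserialization; here that is just a lookup by name.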

Any more thoughts on this?  It's a tough one...

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett