[Biojava-l] Analysis output on sequences (calculating properties of SymbolLists)

David Martin david.martin@biotek.uio.no
Mon, 15 May 2000 19:52:29 +0200


Some while ago I started a project, now on the back burner, that was
designed to take generic analysis output and map it onto sequences.

There are a number of different aspects of Gerald's requests to consider:

A single value for a calculation is fine (e.g. Gribskov statistic, GC
content, amino acid content, etc.). That can be represented quite easily
by a generic 'property value' object interface.

When you have other properties that relate to a sequence, such as AA
composition calculated in a sliding window over the sequence, then you run
into problems. It is not a property of the whole sequence but a property
of a subsequence, often much larger than a single position in the
sequence.

One would probably want a heavier weight object than just a single
analysis. GC content for the whole sequence is a double and there isn't
much else one can add.
GC content over a sliding window has a minimum of two parameters, one of
which varies over the sequence length.
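To make the difference concrete, here is a minimal sketch in plain Java. It
uses a String rather than a BioJava SymbolList to stay self-contained, and
the class and method names (GcWindow, gcContent, slidingGc) are made up for
illustration; nothing here is existing BioJava API.

```java
final class GcWindow {
    /** Whole-sequence GC content: one double, nothing else to carry around. */
    static double gcContent(String seq) {
        if (seq.isEmpty()) {
            throw new IllegalArgumentException("empty sequence");
        }
        int gc = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / seq.length();
    }

    /**
     * GC content in a sliding window. The window size is fixed for the run,
     * the window start varies along the sequence, so the result is one value
     * per start position (0-based) rather than a single number.
     */
    static double[] slidingGc(String seq, int window) {
        if (window < 1 || window > seq.length()) {
            throw new IllegalArgumentException("window must be in 1..length");
        }
        double[] out = new double[seq.length() - window + 1];
        for (int start = 0; start < out.length; start++) {
            out[start] = gcContent(seq.substring(start, start + window));
        }
        return out;
    }
}
```

The first method gives a bare double; the second only makes sense if the
window size travels with the array of values, which is the argument for a
heavier result object.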

If there was to be a generic interface for an analysis it should probably
return some generic analysis object, and then we start to head towards
something that looks like the analysis section of the OMG CORBA spec for
Biomolecular Sequence Analysis.

I would want an analysis to carry with it suitable information on the
program, parameters and so on used to create the result. These can easily
be bundled into a fairly distinct set of analysis types (about 4 or 5)
that can be treated generically with the program parameters as a
Collection.
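As a rough sketch of what I mean by carrying the program and its parameters
along with a result, something like the following would do. The names
(Parameter, AnalysisInfo, describe) and the choice of a version field are
assumptions for illustration only, not part of BioJava or the OMG spec.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Hypothetical name/value pair describing one program parameter. */
final class Parameter {
    final String name;
    final String value;

    Parameter(String name, String value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String toString() {
        return name + "=" + value;
    }
}

/** Hypothetical record of which program produced a result, and how. */
final class AnalysisInfo {
    private final String program;
    private final String version;
    private final List<Parameter> parameters;

    AnalysisInfo(String program, String version, Collection<Parameter> parameters) {
        this.program = program;
        this.version = version;
        // Defensive copy so the record cannot change after the analysis ran.
        this.parameters = Collections.unmodifiableList(new ArrayList<>(parameters));
    }

    String program() { return program; }

    String version() { return version; }

    /** The parameters as a plain Collection, so they can be handled generically. */
    Collection<Parameter> parameters() { return parameters; }

    String describe() {
        return program + " " + version + " " + parameters;
    }
}
```

Because the parameters are just a Collection, code that displays or stores
results does not need to know which analysis type it is dealing with.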

So we have a generic 
SequenceAnalysis interface (probably really a result factory)


From which we derive a variety of subtypes depending on the input sequence
type and return type

SingleValueAnalysis
takes a sequence and returns an analysis result with two components:
A parameter object of some sort and a value object of some sort.

ContinuousValueAnalysis
returns a result object that can give a value for every point in the
sequence, as well as holding its parameters.

and so on.
Probably a bit heavier weight than Gerald had in mind.
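One possible shape for this, as a sketch only: the interface names follow
the ones above where they exist, everything else (AnalysisResult,
SingleValueResult, ContinuousValueResult, the two GC classes, centring the
window on each position) is assumed for illustration. It takes a String
where real code would take a SymbolList, and uses a Map for the parameters.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical base result: remembers the program and parameters used. */
interface AnalysisResult {
    String program();
    Map<String, Object> parameters();
}

interface SingleValueResult<V> extends AnalysisResult {
    V value();
}

interface ContinuousValueResult extends AnalysisResult {
    int length();
    double valueAt(int position); // 0-based position in the sequence
}

/** The generic interface; in effect a factory for results. */
interface SequenceAnalysis<R extends AnalysisResult> {
    R analyse(String sequence);
}

final class Gc {
    /** Fraction of G/C characters in sequence[from, to). */
    static double fraction(String s, int from, int to) {
        int gc = 0;
        for (int i = from; i < to; i++) {
            char c = Character.toUpperCase(s.charAt(i));
            if (c == 'G' || c == 'C') gc++;
        }
        return to > from ? (double) gc / (to - from) : 0.0;
    }
}

/** A SingleValueAnalysis: one value for the whole sequence. */
final class GcContentAnalysis implements SequenceAnalysis<SingleValueResult<Double>> {
    @Override
    public SingleValueResult<Double> analyse(String sequence) {
        final double value = Gc.fraction(sequence, 0, sequence.length());
        return new SingleValueResult<Double>() {
            public String program() { return "gc-content"; }
            public Map<String, Object> parameters() { return Collections.<String, Object>emptyMap(); }
            public Double value() { return value; }
        };
    }
}

/** A ContinuousValueAnalysis: a value at every position, window clipped at the ends. */
final class WindowedGcAnalysis implements SequenceAnalysis<ContinuousValueResult> {
    private final int window;

    WindowedGcAnalysis(int window) {
        if (window < 1) throw new IllegalArgumentException("window must be >= 1");
        this.window = window;
    }

    @Override
    public ContinuousValueResult analyse(String sequence) {
        final double[] values = new double[sequence.length()];
        for (int i = 0; i < values.length; i++) {
            int from = Math.max(0, i - window / 2);
            int to = Math.min(sequence.length(), from + window);
            values[i] = Gc.fraction(sequence, from, to);
        }
        return new ContinuousValueResult() {
            public String program() { return "gc-window"; }
            public Map<String, Object> parameters() {
                return Collections.<String, Object>singletonMap("window", window);
            }
            public int length() { return values.length; }
            public double valueAt(int position) { return values[position]; }
        };
    }
}
```

A caller that only wants to display provenance can work against
AnalysisResult, while one that plots values along the sequence works against
ContinuousValueResult.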

Sorry to be so vague but it is late here, and I am adding a note from home
before I forget.

..d


---------------------------------------------------------------------
*  Dr. David Martin                  Biotechnology Centre of Oslo   *
*  Node Manager                      Gaustadalleen 21               *
*  The Norwegian EMBNet Node         P.O. box 1125 Blindern         *
*  tel +47 22 95 87 56               N-0317 Oslo                    *
*  fax +47 22 69 41 30               Norway                         * 
---------------------------------------------------------------------