[Biojava-l] Cytogenetics in biojava

Thu, 11 Oct 2001 19:41:51 +0200

Hi out there!

Until now I have discussed this off-list with Matthew, Thomas and
David, but I think the conversation is best continued here.
I am working on a parser for cytogenetic data notated in ISCN (the
nomenclature for cytogenetic data). The ultimate goal is to handle
cytogenetic data the same way as other sequence data. Cytogenetic data
is a wealth of information: consider just the existing cytogenetic
records combined with cell (mostly cancer) phenotypes. One of the
outputs of this parser should be annotated biojava sequence objects. One
of the intermediate products should be something like a 'CytoML', for
which I have a draft layout that is not published yet.
As I see it, this CytoML will, via some base-URI mechanism, reference
instances of an AlphabetML: an XML dialect for describing alphabets and
their symbol-substitution logic (ambiguity and abbreviation). This way
CytoML could be independent of the band resolution and of the organism
being described. For example, take the human cytogenetic loci and make
a symbol of every locus. We then get symbols named '1', '2' and so on,
and further symbols named '1p', '1cen', '1q', '1p1' and so on. I decided
to have every cytogenetic locus be a symbol in this alphabet, rather
than the product of a number alphabet and a {'p','cen','q'} alphabet, to
reflect the biological nature of the loci: they are not a combination of
anything, but each is unique in its sequence.
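
To make the flat-alphabet idea concrete, here is a minimal sketch in
plain Java. It does not use the biojava API; the class names
LocusAlphabetSketch and Locus and the method buildAlphabet are
hypothetical and exist only for illustration. Every locus name, at any
band resolution, is registered as one atomic symbol.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LocusAlphabetSketch {

    /** Hypothetical atomic symbol for one locus; NOT the biojava Symbol interface. */
    static final class Locus {
        private final String name;

        Locus(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    /**
     * Builds a flat alphabet: one symbol per locus name, keyed by that name.
     * '1', '1p' and '1p1' are three independent symbols, not composites.
     */
    static Map<String, Locus> buildAlphabet(String... names) {
        Map<String, Locus> alphabet = new LinkedHashMap<>();
        for (String name : names) {
            alphabet.put(name, new Locus(name));
        }
        return alphabet;
    }

    public static void main(String[] args) {
        Map<String, Locus> alphabet =
                buildAlphabet("1", "2", "1p", "1cen", "1q", "1p1");
        // Prints the symbol names in insertion order.
        System.out.println(alphabet.keySet());
    }
}
```

In this sketch nothing relates '1p1' to '1p' or '1'; such relations
would have to come from the substitution logic described in the
AlphabetML instance.
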
An extra benefit of having an AlphabetML is that biojava Alphabet
objects could be generated from the very same AlphabetML instances that
are referenced from a specific CytoML instance (or an instance of some
other format).
And this is where biojava comes in, along with my problems with Symbols
and Alphabets.
Specific example:
Cytogenetic locus symbol '1' is an ambiguity over the two sequences
{'1p','1cen','1q'} and {'1q','1cen','1p'}.
As I understand biojava, a BasisSymbol has two methods to
specify this:
- getMatches() - to reflect ambiguity
- and getSymbols() - to reflect 'abbreviation' of sequences.
Both return a set (in this case an alphabet) or a list of references to
other symbols.
What I would have to return from getMatches() for the cytogenetic locus
symbol '1' is an alphabet that has two symbols: one representing the
first sequence, the other the second. But a symbol in biojava needs to
have a token.
In summary: to get this working, I could
- specify extra tokens for the 'anonymous' sequence-representing
symbols, or
- change the BasisSymbol interface so that a BasisSymbol can reflect
ambiguity over other sequences?
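
As a rough illustration of the second option, the following plain-Java
sketch lets a symbol enumerate the alternative sequences it is ambiguous
over, so the sequences themselves need no token. This is not the biojava
BasisSymbol interface; LocusSymbol, getMatchingSequences, atomic,
chromosome1 and describe are hypothetical names invented for this
sketch.

```java
import java.util.List;

public class AmbiguitySketch {

    /** Hypothetical symbol type; NOT the biojava BasisSymbol interface. */
    static final class LocusSymbol {
        private final String name;
        // Each inner list is one ordered sequence this symbol may stand for.
        private final List<List<LocusSymbol>> matchingSequences;

        LocusSymbol(String name, List<List<LocusSymbol>> matchingSequences) {
            this.name = name;
            this.matchingSequences = matchingSequences;
        }

        String getName() {
            return name;
        }

        List<List<LocusSymbol>> getMatchingSequences() {
            return matchingSequences;
        }
    }

    /** An atomic symbol is ambiguous over nothing. */
    static LocusSymbol atomic(String name) {
        return new LocusSymbol(name, List.of());
    }

    /** Locus '1' as an ambiguity over both orientations of its arms. */
    static LocusSymbol chromosome1() {
        LocusSymbol p = atomic("1p");
        LocusSymbol cen = atomic("1cen");
        LocusSymbol q = atomic("1q");
        return new LocusSymbol("1",
                List.of(List.of(p, cen, q), List.of(q, cen, p)));
    }

    /** Renders a symbol and its alternative sequences as text. */
    static String describe(LocusSymbol symbol) {
        StringBuilder sb = new StringBuilder(symbol.getName()).append(" =");
        String separator = " ";
        for (List<LocusSymbol> sequence : symbol.getMatchingSequences()) {
            sb.append(separator).append("{");
            for (int i = 0; i < sequence.size(); i++) {
                if (i > 0) {
                    sb.append(" ");
                }
                sb.append(sequence.get(i).getName());
            }
            sb.append("}");
            separator = " | ";
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: 1 = {1p 1cen 1q} | {1q 1cen 1p}
        System.out.println(describe(chromosome1()));
    }
}
```

The two alternatives here are anonymous lists, so only '1', '1p', '1cen'
and '1q' carry names, all of which would be present in the alphabet.
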

The first option is not acceptable, since I would create symbols with
tokens that are not contained in the alphabet described in the
AlphabetML instance. The second could require a large number of code
changes spread throughout the biojava API.

Maybe I got something wrong here. Maybe I do not really understand the
theory. Does anybody have explanations or suggestions? I would much
appreciate it.

Regards,

Armin