[Biojava-l] diploid alphabet

Thomas Down td2@sanger.ac.uk
Mon, 21 Oct 2002 19:52:01 +0100


On Mon, Oct 21, 2002 at 09:20:40AM -0700, Doug Passey wrote:
> hi all,
> we are faced with the problem of representing heterozygous indels in diploid
> resequenced data.  normal heterozygotes (SNPs) in a diploid sequence can be
> represented with the various ambiguity symbols, but in my cursory look at
> the symbol/alphabet stuff in the biojava API docs, i did not see any way of
> representing ambiguities of the form: A/-, C/-, G/-, or T/- ... which are
> the four forms of a single base heterozygous indel in diploid data.  is
> somebody working on this, and if not, does someone have suggestion about how
> to add this to the whole alphabet/symbol scheme of biojava?  i am a relative
> novice at biojava; so if i have to implement this, i might need a little
> guidance to make sure that it is implemented in the correct way.

Hi...

You can't represent an ambiguity matching either a nucleotide
or a `standard' gap in the BioJava scheme.  The reason is that
gaps are represented as the empty set, in a world where normal
symbols are singleton sets, and ambiguities are sets with more
than one member.  The gap symbol is an explicit `there's nothing
here', as you would get in a gapped alignment.
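To make the set picture concrete, here is a minimal sketch in plain
Java.  It does not use the BioJava classes at all: the class name
GapAsEmptySet and the ambiguity() helper are made up for illustration,
and symbols are modelled simply as sets of names.  It only shows why
combining a nucleotide with the gap gives you the nucleotide back.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the set view described above (not BioJava code).
class GapAsEmptySet {
    // An ambiguity matches everything either input matches,
    // i.e. the union of the two sets of atomic symbols.
    static Set<String> ambiguity(Set<String> a, Set<String> b) {
        Set<String> union = new TreeSet<>(a);
        union.addAll(b);
        return union;
    }

    public static void main(String[] args) {
        Set<String> adenine = Collections.singleton("adenine"); // normal symbol
        Set<String> guanine = Collections.singleton("guanine"); // normal symbol
        Set<String> gap = Collections.emptySet();               // gap: nothing here

        // Two nucleotides combine into a genuine two-member ambiguity.
        System.out.println(ambiguity(adenine, guanine)); // [adenine, guanine]

        // The gap contributes no members, so "A or gap" collapses to plain A.
        System.out.println(ambiguity(adenine, gap));     // [adenine]
    }
}
```

In this model there is simply no set that means `adenine or gap',
because the gap has no member to contribute.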

One way to represent indel polymorphisms is by making the sequence
a profile hidden Markov model.  That's the `ideal' of what you're
really trying to represent, but I can see that it may not be
terribly efficient or practical for your application.

A reasonable alternative would be to create a new 5-symbol alphabet
containing the 4 DNA symbols plus an extra one called `indel'
(an easy way to create alphabets is to edit the AlphabetManager.xml
file in the resources/ tree of the biojava source code).  In this
alphabet, you can then create ambiguity symbols such as
[adenine indel].
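As a rough sketch of why the extra symbol helps, again in plain Java
rather than the real BioJava API: the class IndelAlphabet, its
ambiguity() helper and isHeterozygousIndel() are hypothetical names
invented here, and symbols are modelled as sets of names.  Because
`indel' is an ordinary member of the alphabet, it can take part in an
ambiguity like any nucleotide.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a five-symbol alphabet (not BioJava code).
class IndelAlphabet {
    // The four DNA symbols plus the extra `indel' symbol.
    static final Set<String> SYMBOLS = new TreeSet<>(
        Arrays.asList("adenine", "cytosine", "guanine", "thymine", "indel"));

    // Build an ambiguity symbol as the set of its members,
    // rejecting anything that is not in the alphabet.
    static Set<String> ambiguity(String... members) {
        Set<String> result = new TreeSet<>();
        for (String m : members) {
            if (!SYMBOLS.contains(m)) {
                throw new IllegalArgumentException("not in alphabet: " + m);
            }
            result.add(m);
        }
        return result;
    }

    // A single-base heterozygous indel: one nucleotide plus `indel'.
    static boolean isHeterozygousIndel(Set<String> symbol) {
        return symbol.size() == 2 && symbol.contains("indel");
    }

    public static void main(String[] args) {
        Set<String> aOrIndel = ambiguity("adenine", "indel");
        System.out.println(aOrIndel);                      // [adenine, indel]
        System.out.println(isHeterozygousIndel(aOrIndel)); // true
    }
}
```

The four cases A/-, C/-, G/- and T/- then become the four two-member
sets that contain `indel'.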

Does this make sense?

    Thomas.