[Biojava-l] Alignment objects

Tue Aug 15 10:02:13 UTC 2006

>OK, so this is where I'm going wrong - I thought a symbol was 1
>residue/amino acid/gap....why is this not the case? it seems rather
>counter intuitive to ask for the gap symbol for an alignment and be
>returned [] x n, where n=the number of sequences in the alignment.
>

It's all really down to inhertitence and the object model. Alignments 
implement SymbolList. Which makes sense really even though it is not 
intuitive. An alignment is really a sequence of Symbols that is N deep. 
It's alphabet therefore must be N deep. It is a compound Alphabet. This 
allows for enormous flexibility. It is perfectly possible to make an 
alignment which is ((DNAxDNAxDNA)xPROTEIN-TERM) which is an alignment of a 
codon alphabet to a protein. You can even do it as 
((DNAxDNAxDNA)x(DNAxDNAxDNA)xPROTEIN-TERM) which is two dna sequences in 
codon alphabet against a single protein!! A gap symbol from any alphabet 
is really N gaps where N is the number of Alphabets (typically 1 as most 
Alphabets are not compound). Unfortunately this very flexible design is as 
you say not intuitive.

Why would we even want this (apart from the trivial DNA to protein 
example). The main reason comes from HMMs. The result of a pairwise 
alignment using an HMM is a 3 part alignment of the query the state path 
and the match. The query and match are probably from the same Alphabet but 
the state path is a SymbolList made from the Alphabet of states in the 
HMM. Similarly the alignment of a profile HMM to a protein is an alignment 
of the HMM alphabet to the protein Alphabet.

>I may be getting a little confused because I'm new to this, and i have
>in mind what i would like to do, think it should be straight forward but
>am finding it not so easy. How then, do I get the gap symbol e.g. []
>from an alignment (which may be DNA, RNA or protein) so I can check if
>any column in the alignment contains 1 or more gaps?

The reason why there cannot be a gap symbol that would work for all 
occasions is because you can ask mulitple questions of gaps in an 
alignment. Eg, is position i composed only of gaps?, does sequence j at 
position i in the alignment contain a gap?, is there any gap in position 
i?

To answer the sequence j position i example the best (only really) way is 
to get the symbol for position i (with label j) and test if it is equal to 
j.getAlphabet.getGapSymbol(). To answer your question it is simply a 
matter of performing the same operation at position i for all SymbolLists 
at that position.

>I am also thinking of using distributions for doing this job in a more
>generic approach so that any symbol could be used, but i think i might
>end up with the same problems as I have here, so I think I should try to
>figure out this "simpler" problem first.

There are methods in DistributionTools for calculating an array of 
Distributions for an Alignment. This is not efficient for your purposes 
because it also counts all the other residues and divides by the total 
(weighted by the background model). If you only want to find gaps that is 
a few more operations than you need. The approach above would be faster. 
Also this DistributionTools method will not work for alignments with more 
than one alphabet like the codon, protein one above (not a problem in your 
case but not generic).

>It seems to me, that if i could have
>alignment.getAlphabet().getGapSymbol() to return the same thing as would
>proteinTools.getAlphabet().getGapSymbol() then my problem would be 
solved.

This can't be done as you would have to assume that all the SymbolLists in 
the Alignment would have to be from the same Alphabet which is not 
required (and not even desirable, especially for HMMs).

- Mark

mark.schreiber at novartis.com wrote:
> Hi Nathan -
>
> You are on the right track, almost.
>
> The alphabet of the alignment is PROTEIN x PROTEIN (possibly it is 
> PROTEIN-TERM x PROTEIN-TERM). PROTEIN-TERM is the same as protein but 
> contains a * symbol to represent a translated stop codon. Useful if 
> someone translates the wrong reading frame.
>
>
> Thus the gap symbol of your alignment is gapxgap or [] [] as you found. 
> The first symbol of your alignment is ([] Ala). The reason you find 
> nothing with the gap symbol of the alignment is that there are no 
columns 
> with only gaps. It is always gap x something or something x gap. To 
check 
> for gaps in columns you could iterate like you have done with each 
> individual sequence. In this case you would need to check for the gap 
> symbol from the alphabet PROTEIN-TERM, or equivalently the gap symbol of 

> the Alphabet of one of the SymbolLists from the alignment (specifically 
> the one you are checking).
>
> You could also search make ambiguity symbols from the Alignment alphabet 

> that contain gaps ([] X) gap with anything  (X []) anything with gap and 

> ([] []) gap with gap or the gap symbol of the Alignment. This approach 
is 
> faster but for larger alignments requires more Symbols to check. It 
would 
> be pretty easy to construct them recursively though.
>
> Hope this helps,
>
> - Mark
>