[Biojava-l] Alignment objects

mark.schreiber at novartis.com mark.schreiber at novartis.com
Wed Aug 16 01:07:19 UTC 2006


You've got it!

There is actually one other way ...

If you have an alignment 3 deep for DNA you can create and test for the 
following symbols ([], N, N) (N, [], N) (N, N, []). These will find sites 
with exactly one gap. You can then make ([], [], N) and (N. [] []) and 
([], N, []) and ([],[],[]) which is the alignment gap symbol. This means 
you can test each site of the alignment for each of the symbols. It 
requires some work up front to make the symbols (possibly recursion) but 
if you put them in a HashMap then you can simply iterate over each site in 
the Alignment and see if the DNAxDNAxDNA symbol returned is contained in 
the HashMap. This may actually be more efficient.

- Mark





"Nathan S. Haigh" <n.haigh at sheffield.ac.uk>
08/15/2006 08:17 PM
Please respond to n.haigh

 
        To:     mark.schreiber at novartis.com
        cc: 
        Subject:        Re: [Biojava-l] Alignment objects


mark.schreiber at novartis.com wrote:
>> OK, so this is where I'm going wrong - I thought a symbol was 1
>> residue/amino acid/gap....why is this not the case? it seems rather
>> counter intuitive to ask for the gap symbol for an alignment and be
>> returned [] x n, where n=the number of sequences in the alignment.
>>
>> 
>
> It's all really down to inhertitence and the object model. Alignments 
> implement SymbolList. Which makes sense really even though it is not 
> intuitive. An alignment is really a sequence of Symbols that is N deep. 
> It's alphabet therefore must be N deep. It is a compound Alphabet. This 
> allows for enormous flexibility. It is perfectly possible to make an 
> alignment which is ((DNAxDNAxDNA)xPROTEIN-TERM) which is an alignment of 
a 
> codon alphabet to a protein. You can even do it as 
> ((DNAxDNAxDNA)x(DNAxDNAxDNA)xPROTEIN-TERM) which is two dna sequences in 

> codon alphabet against a single protein!! A gap symbol from any alphabet 

> is really N gaps where N is the number of Alphabets (typically 1 as most 

> Alphabets are not compound). Unfortunately this very flexible design is 
as 
> you say not intuitive.
>
> Why would we even want this (apart from the trivial DNA to protein 
> example). The main reason comes from HMMs. The result of a pairwise 
> alignment using an HMM is a 3 part alignment of the query the state path 

> and the match. The query and match are probably from the same Alphabet 
but 
> the state path is a SymbolList made from the Alphabet of states in the 
> HMM. Similarly the alignment of a profile HMM to a protein is an 
alignment 
> of the HMM alphabet to the protein Alphabet.
>
> 
>> I may be getting a little confused because I'm new to this, and i have
>> in mind what i would like to do, think it should be straight forward 
but
>> am finding it not so easy. How then, do I get the gap symbol e.g. []
>> 
> >from an alignment (which may be DNA, RNA or protein) so I can check if
> 
>> any column in the alignment contains 1 or more gaps?
>> 
>
> The reason why there cannot be a gap symbol that would work for all 
> occasions is because you can ask mulitple questions of gaps in an 
> alignment. Eg, is position i composed only of gaps?, does sequence j at 
> position i in the alignment contain a gap?, is there any gap in position 

> i?
>
> To answer the sequence j position i example the best (only really) way 
is 
> to get the symbol for position i (with label j) and test if it is equal 
to 
> j.getAlphabet.getGapSymbol(). To answer your question it is simply a 
> matter of performing the same operation at position i for all 
SymbolLists 
> at that position.
>
> 
>> I am also thinking of using distributions for doing this job in a more
>> generic approach so that any symbol could be used, but i think i might
>> end up with the same problems as I have here, so I think I should try 
to
>> figure out this "simpler" problem first.
>> 
>
> There are methods in DistributionTools for calculating an array of 
> Distributions for an Alignment. This is not efficient for your purposes 
> because it also counts all the other residues and divides by the total 
> (weighted by the background model). If you only want to find gaps that 
is 
> a few more operations than you need. The approach above would be faster. 

> Also this DistributionTools method will not work for alignments with 
more 
> than one alphabet like the codon, protein one above (not a problem in 
your 
> case but not generic).
>
> 
>> It seems to me, that if i could have
>> alignment.getAlphabet().getGapSymbol() to return the same thing as 
would
>> proteinTools.getAlphabet().getGapSymbol() then my problem would be 
>> 
> solved.
>
> This can't be done as you would have to assume that all the SymbolLists 
in 
> the Alignment would have to be from the same Alphabet which is not 
> required (and not even desirable, especially for HMMs).
>
> - Mark
>
> 
Fantastic!

Thanks for taking time to explaining this to me - it does make much more
sense to do things this way now i understand things a little better. So,
effectively, an alignment may be made from sequences that are composed
of symbols from different alphabets. The solution I have is that I loop
over the positions in the alignment, and then over the labels for that
position, get the gap symbol for each label using:
alignment.symbolListForLabel(label).getAlphabet().getGapSymbol()
and then test if this symbol is the same as the symbol at position i for
label j. :o)

I think the main reason for my confusion, is that i'm trying to make a
move from Bioperl to Biojava, and Bioperl has an alphabet for the whole
alignment, therefore has a restriction that an alignment can only be
comprised of sequences from the same alphabet.

Thanks very much for your help!
Nathan







More information about the Biojava-l mailing list