[Biojava-l] Alignment objects

mark.schreiber at novartis.com mark.schreiber at novartis.com
Tue Aug 15 01:18:08 UTC 2006


Hi Nathan -

You are almost on the right track.

The alphabet of the alignment is PROTEIN x PROTEIN (possibly it is 
PROTEIN-TERM x PROTEIN-TERM). PROTEIN-TERM is the same as PROTEIN but 
contains a * symbol to represent a translated stop codon. That is useful 
if someone translates the wrong reading frame.


Thus the gap symbol of your alignment is gap x gap, or ([] []), as you found. 
The first symbol of your alignment is ([] Ala). The reason you find 
nothing with the gap symbol of the alignment is that there are no columns 
with only gaps. It is always gap x something or something x gap. To check 
for gaps in columns you could iterate over each individual sequence, as 
you have done. In this case you would need to check for the gap 
symbol from the alphabet PROTEIN-TERM, or equivalently the gap symbol of 
the Alphabet of one of the SymbolLists from the alignment (specifically 
the one you are checking).
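
As a rough illustration of the point above, here is a toy model in plain 
Java. It deliberately does not use any BioJava classes: a column symbol is 
modelled as a list with one entry per sequence, and the string "[]" stands 
in for the per-sequence gap. The class and method names (GapColumnSketch, 
column, isAlignmentGap, hasGap, columnsWithGaps) are made up for this 
sketch. It only shows why a whole-column comparison against gap x gap finds 
nothing, while a per-sequence check does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model, NOT the BioJava API: a column is a tuple (list) of
// per-sequence symbols, and "[]" stands in for the per-sequence gap.
class GapColumnSketch {
    static final String GAP = "[]";

    // Builds the tuple for a 1-based column, mapping '-' to the gap.
    static List<String> column(List<String> rows, int col) {
        List<String> tuple = new ArrayList<>();
        for (String row : rows) {
            char c = row.charAt(col - 1);
            tuple.add(c == '-' ? GAP : String.valueOf(c));
        }
        return tuple;
    }

    // Analogue of comparing with the alignment-level gap (gap x gap):
    // true only when every component of the tuple is a gap.
    static boolean isAlignmentGap(List<String> tuple) {
        for (String s : tuple) {
            if (!s.equals(GAP)) {
                return false;
            }
        }
        return true;
    }

    // Analogue of the per-sequence check: true when any component is a gap.
    static boolean hasGap(List<String> tuple) {
        return tuple.contains(GAP);
    }

    // Collects the 1-based columns that contain at least one gap.
    static List<Integer> columnsWithGaps(List<String> rows) {
        List<Integer> hits = new ArrayList<>();
        int length = rows.get(0).length();
        for (int col = 1; col <= length; col++) {
            if (hasGap(column(rows, col))) {
                hits.add(col);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("----FGHIKLMNPQRST",
                                          "ACDEFGHIKLMNPQRST");
        System.out.println(column(rows, 1));                 // first column tuple
        System.out.println(isAlignmentGap(column(rows, 1))); // whole-column test
        System.out.println(columnsWithGaps(rows));           // per-sequence test
    }
}
```

With the two test sequences from your message, no column is gap x gap, so 
the whole-column test never fires, whereas the per-sequence test picks up 
the first four columns.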

You could also make ambiguity symbols from the Alignment alphabet that 
contain gaps and search for those: ([] X), gap with anything; (X []), 
anything with gap; and ([] []), gap with gap, which is the gap symbol of 
the Alignment. This approach is faster, but for larger alignments it 
requires more Symbols to check. It would be pretty easy to construct them 
recursively though.
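
A minimal sketch of that recursive construction, again as a plain-Java toy 
model rather than BioJava code: "[]" stands for the gap, "X" for "anything", 
and the names GapPatternSketch, gapPatterns and build are hypothetical. It 
enumerates every tuple of length n that holds at least one gap. In this 
model the number of patterns is 2^n - 1, which is why larger alignments 
need more Symbols to check.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT the BioJava API: enumerates the gap-containing
// patterns for an alignment of n sequences.
class GapPatternSketch {
    static final String GAP = "[]";
    static final String ANY = "X";

    // Returns all length-n tuples over {GAP, ANY} with at least one GAP.
    static List<List<String>> gapPatterns(int n) {
        List<List<String>> out = new ArrayList<>();
        build(n, new ArrayList<String>(), false, out);
        return out;
    }

    // Extends the prefix one position at a time; a finished tuple is
    // kept only if a gap was placed somewhere along the way.
    private static void build(int remaining, List<String> prefix,
                              boolean sawGap, List<List<String>> out) {
        if (remaining == 0) {
            if (sawGap) {
                out.add(new ArrayList<>(prefix));
            }
            return;
        }
        prefix.add(GAP);                        // branch 1: gap here
        build(remaining - 1, prefix, true, out);
        prefix.set(prefix.size() - 1, ANY);     // branch 2: anything here
        build(remaining - 1, prefix, sawGap, out);
        prefix.remove(prefix.size() - 1);       // restore prefix for caller
    }

    public static void main(String[] args) {
        // For two sequences: gap with gap, gap with anything,
        // anything with gap.
        System.out.println(gapPatterns(2));
    }
}
```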

Hope this helps,

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910





"Nathan S. Haigh" <n.haigh at sheffield.ac.uk>
08/14/2006 04:00 PM

 
        To:     mark.schreiber at novartis.com
        cc: 
        Subject:        Re: [Biojava-l] Alignment objects


Hi Mark - this doesn't seem to be working as I'd expected/hoped. Let me
just recap what I've got so far:

I create an alignment (for testing purposes) like this:

String alnString =
            ">seq1\n" +
            "----FGHIKLMNPQRST\n" +
            ">seq2\n" +
            "ACDEFGHIKLMNPQRST\n";
BufferedReader br = new BufferedReader(new StringReader(alnString));
FastaAlignmentFormat faf = new FastaAlignmentFormat();
alignment = faf.read( br );

I loop over the columns of the alignment and test whether there are any
gaps in each column. I have shown 2 alternative if statements which are
supposed to test if a gap is present. One of these works (but is a bit of
a hack) and the other (which seems like the correct way to do things)
doesn't work:
for (int col = 1; col <= alignment.length(); col++) {
    for (Iterator labels = alignment.getLabels().iterator(); labels.hasNext(); ) {
        Symbol sym = alignment.symbolAt(labels.next(), col);
        // Alternative 1: this currently works
        if (sym.getName().contains("[]")) {
        // Alternative 2: this doesn't work
        // if (sym.equals(alignment.getAlphabet().getGapSymbol())) {
            // add this col to a Location object
        }
    }
}

If I do:
System.out.println(alignment.getAlphabet().getGapSymbol());

I get:
org.biojava.bio.symbol.SimpleBasisSymbol: ([] [])

I'm unsure exactly what I'm supposed to get here, but I suspect that the
gap symbol isn't getting set correctly when I create the alignment. I
really want to use the getGapSymbol method of the alignment, since the
alignment a user may load in practice could be either nucleotide or
amino acid.

Cheers
Nathan

mark.schreiber at novartis.com wrote:
> Sorry, that should be getGapSymbol().
>
> - Mark
>
>
>
>
>
>
> Nathan Haigh <n.haigh at sheffield.ac.uk>
> 08/11/2006 06:12 PM
> Please respond to n.haigh
>
> 
>         To:     mark.schreiber at novartis.com
>         cc: 
>         Subject:        Re: [Biojava-l] Alignment objects
>
>
> mark.schreiber at novartis.com wrote:
> 
>> Hi -
>>
>> There is a difference between the gap returned by
>> AlphabetManager.getGapSymbol and the gap returned by an
>> alphabet.getGapSymbol(). There are some very complex reasons for this,
>> which could make up a large part of a thesis (literally, take a look at
>> Matthew Pocock's thesis some time). Simply speaking, dynamic programming
>> and HMMs wouldn't work without it.
>>
>> It becomes especially obvious when you have an alignment. The alphabet
>> of an alignment of 3 DNA sequences is DNAxDNAxDNA. Thus a gap from that
>> alphabet is really gap x gap x gap.
>>
>> Depending on what you are trying to do you would want to test for
>>
>> Symbol s == align.getAlphabet().getGap()
>>
>> or
>>
>> Symbol s == DNATools.getDNA().getGap().
>>
>> - Mark
>>
>>
>> 
> Is the getGap method part of the Biojava-live API but not the 1.4 API?
>
> Cheers
> Nath

