[Biojava-l] merged sequence alphabet?

Thu, 24 Oct 2002 19:11:15 +0100

On Thu, Oct 24, 2002 at 01:27:01PM -0400, Dave Barkan wrote:
> Hi all,
> 
> I was wondering if there is an easily-retrievable alphabet that includes
> all symbols from RNA and DNA sequences; a sort of 'global nucleotide
> sequence' alphabet that would include g, a, c, t, and u.  This would be
> helpful for my application that does not know what kind of sequence it is
> going to be working with.  So far I have been using the pre-defined
> sequence alphabets as it looks tricky to create your own with the full
> functionality that the predefined ones give you, (eg, the tokenization
> features), but if there is no available 'merged' alphabet then I can
> try to create my own.

There's one possible solution to this in some code I checked
in a couple of days ago (so you'll need to use a very recent
CVS version).  As you may have seen, the built-in alphabets are
all defined in an XML file
(resources/org/biojava/bio/symbol/AlphabetManager.xml).  When I
rewrote the parser for this file, I made it publicly accessible.
So you could define your merged alphabet in that format, then do
something like:

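    import org.xml.sax.InputSource;
    import org.biojava.bio.symbol.AlphabetManager;

    // The String argument is the system ID (a URI) of the XML file.
    InputSource mergeAlphabetFile = new InputSource("mergealphabet.xml");
    // Parse the file and load the alphabets it defines.
    AlphabetManager.loadAlphabets(mergeAlphabetFile);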

You should be able to cut and paste one of the existing
alphabets and add the extra symbol.
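
One detail that may be worth spelling out: the String passed to the
InputSource constructor is a system ID, i.e. a URI, not a plain file
path.  A minimal sketch of building it from a java.io.File, so that a
relative path resolves against the current directory, could look like
the following.  The class name MergedAlphabetSource and the helper
forFile are hypothetical names used only for illustration, and the
BioJava call is left as a comment so that the sketch compiles with
the JDK alone.

```java
import java.io.File;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of BioJava.
public class MergedAlphabetSource {

    // Build an InputSource whose system ID is an absolute file: URI.
    public static InputSource forFile(File xmlFile) {
        return new InputSource(xmlFile.toURI().toString());
    }

    public static void main(String[] args) {
        // "mergealphabet.xml" is the example file name from above.
        InputSource src = forFile(new File("mergealphabet.xml"));
        System.out.println(src.getSystemId());

        // With a recent BioJava on the classpath, the source would
        // then be handed over as shown earlier:
        // AlphabetManager.loadAlphabets(src);
    }
}
```

Going through File.toURI() is just one way to get an unambiguous
system ID; passing the file name directly, as in the snippet above,
relies on the XML parser to resolve it.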

Note that if you use the loadAlphabets method, you'll be able
to access (using symbolref elements) all the symbols defined
in the core AlphabetManager.xml file.  No need to redefine
them.

Hope this helps,

     Thomas.