[Biojava-dev] [Bug 2854] New: Selection of protein alphabet is hardcoded in ProteinTools class

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Jun 10 21:59:30 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2854

           Summary: Selection of protein alphabet is hardcoded in
                    ProteinTools class
           Product: BioJava
           Version: live (CVS source)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: seq
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: mdharsee at ocbn.ca


In our application we are calling createProtein() in class
org.biojava.bio.seq.ProteinTools to generate SymbolList objects to encapsulate
peptide sequences that are composed of the 20 common amino acid symbols, as
well as the 'X' ambiguity symbol. 

However, createProtein() forces the selection of the PROTEIN-TERM alphabet from
AlphabetManager.xml through its call to getTAlphabet(), as copied below:

  public static SymbolList createProtein(String theProtein)
          throws IllegalSymbolException
  {
    SymbolTokenization p = null;
    try {
      p = getTAlphabet().getTokenization("token");
    } catch (BioException e) {
      throw new BioError("Something has gone badly wrong with Protein", e);
    }
    return new SimpleSymbolList(p, theProtein);
  }

This selection should be made based on the symbol content of the input
sequence(s) rather than being hardcoded. The PROTEIN-TERM alphabet should be
selected only if the input data contains the symbol 'TER' (terminus) or some
ambiguity symbol that covers the PROTEIN-TERM alphabet. Otherwise the simpler
PROTEIN alphabet should be selected.
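
As a rough sketch of the content-based selection described above (plain Java,
not BioJava code): the class and method names are hypothetical, and it assumes
that '*' is the token for the TER symbol. It does not attempt to detect
ambiguity symbols that cover PROTEIN-TERM.

```java
// Hypothetical helper, not part of BioJava: picks an alphabet name by
// inspecting the input text instead of hardcoding PROTEIN-TERM.
final class ProteinAlphabetChooser {
    // Assumption: '*' is the single-character token for the TER symbol.
    static final char TERMINATION_TOKEN = '*';

    private ProteinAlphabetChooser() {}

    // Returns "PROTEIN-TERM" only when the termination token is present;
    // otherwise returns the simpler "PROTEIN".
    static String chooseAlphabetName(String sequence) {
        return sequence.indexOf(TERMINATION_TOKEN) >= 0 ? "PROTEIN-TERM" : "PROTEIN";
    }
}
```

Inside createProtein(), the returned name would decide between
ProteinTools.getAlphabet() and ProteinTools.getTAlphabet() before the
tokenization is requested.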

On a related note, the PROTEIN alphabet defined in AlphabetManager.xml consists
of 22 residues and includes the less commonly found 'SEC' (selenocysteine, U)
and 'PYR' (pyroglutamic acid, O). However, many applications only require the
common 20-symbol alphabet that excludes the latter two residues. It would be
useful to include a new alphabet in AlphabetManager.xml that defines the
simpler 20-symbol set of common amino acids. Perhaps this point should be a
feature request?
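
Until such an alphabet exists, an application could screen its input itself.
A minimal sketch in plain Java, assuming one-letter codes and a hypothetical
helper class that is not part of BioJava:

```java
// Hypothetical helper, not part of BioJava: checks a sequence against the
// 20 common amino acids (no U, no O, no ambiguity or termination tokens).
final class CommonAminoAcids {
    // One-letter codes of the 20 common amino acids.
    static final String TOKENS = "ACDEFGHIKLMNPQRSTVWY";

    private CommonAminoAcids() {}

    // Returns true if every character, ignoring case, is one of the 20 codes.
    static boolean usesOnlyCommonResidues(String sequence) {
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (TOKENS.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this check rejects the 'X' ambiguity symbol as well, so a caller
that accepts 'X' would need to allow it separately.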

Cheers,
Moyez

