[Biojava-dev] [Bug 2854] New: Selection of protein alphabet is hardcoded in ProteinTools class

Mark Schreiber markjschreiber at gmail.com
Thu Jun 11 01:30:00 UTC 2009


This raises an interesting point for the development of
biojava3. Do we actually need separate protein alphabets? I can't
remember why PROTEIN and PROTEIN-TERM were split in the first place.
Is there a good argument for keeping them separate?

- Mark

On Thu, Jun 11, 2009 at 5:59 AM, <bugzilla-daemon at portal.open-bio.org> wrote:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2854
>
>           Summary: Selection of protein alphabet is hardcoded in
>                    ProteinTools class
>           Product: BioJava
>           Version: live (CVS source)
>          Platform: All
>        OS/Version: All
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: seq
>        AssignedTo: biojava-dev at biojava.org
>        ReportedBy: mdharsee at ocbn.ca
>
>
> In our application we are calling createProtein() in class
> org.biojava.bio.seq.ProteinTools to generate SymbolList objects to encapsulate
> peptide sequences that are composed of the 20 common amino acid symbols, as
> well as the 'X' ambiguity symbol.
>
> However createProtein() forces the selection of the PROTEIN-TERM alphabet from
> AlphabetManager.xml, through the call to 'getTAlphabet()' as copied below:
>
>  public static SymbolList createProtein(String theProtein)
>          throws IllegalSymbolException
>  {
>    SymbolTokenization p = null;
>    try {
>      p = getTAlphabet().getTokenization("token");
>    } catch (BioException e) {
>      throw new BioError("Something has gone badly wrong with Protein", e);
>    }
>    return new SimpleSymbolList(p, theProtein);
>  }
>
> This selection should be made based on the symbol content of the input
> sequence(s) rather than being hardcoded. The PROTEIN-TERM alphabet should be
> selected only if the input data contains the symbol 'TER' (terminus) or an
> ambiguity symbol that spans the PROTEIN-TERM alphabet. Otherwise the simpler
> PROTEIN alphabet should be selected.
>
> On a related note, the PROTEIN alphabet defined in AlphabetManager.xml consists
> of 22 residues and includes the less commonly found 'SEC' (selenocysteine, U)
> and 'PYR' (pyrrolysine, O). However, many applications only require the
> common 20-symbol alphabet that excludes the latter two residues. It would be
> useful to include a new alphabet in AlphabetManager.xml that defines the
> simpler 20-symbol set of common amino acids. Perhaps this point should be a
> feature request?
>
> Cheers,
> Moyez
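
The content-based selection proposed in the report could be sketched roughly as follows. This is an illustration, not BioJava code: the class and method names are hypothetical, it assumes '*' is the one-letter token for the TER symbol, and it does not handle ambiguity symbols that span PROTEIN-TERM. It also includes a check for the proposed 20-symbol subset (plus the 'X' ambiguity token).

```java
import java.util.Locale;

/**
 * Sketch only: picks an alphabet name from the symbols in the input.
 * Class and method names are hypothetical, not part of the BioJava API.
 */
final class ProteinAlphabetChooser {

    // Assumption: '*' is the one-letter token for the TER (terminus) symbol.
    private static final char TERMINUS_TOKEN = '*';

    // The 20 common amino acids; U (selenocysteine) and O are excluded.
    private static final String COMMON_TWENTY = "ACDEFGHIKLMNPQRSTVWY";

    private ProteinAlphabetChooser() { }

    /** Returns "PROTEIN-TERM" only if the sequence contains the terminus token. */
    static String chooseAlphabetName(String sequence) {
        return sequence.indexOf(TERMINUS_TOKEN) >= 0 ? "PROTEIN-TERM" : "PROTEIN";
    }

    /** True if every token is one of the 20 common residues or 'X'. */
    static boolean fitsCommonTwenty(String sequence) {
        String upper = sequence.toUpperCase(Locale.ROOT);
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if (c != 'X' && COMMON_TWENTY.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(chooseAlphabetName("MKVLAX"));
        System.out.println(chooseAlphabetName("MKVLA*"));
        System.out.println(fitsCommonTwenty("MKUVLA"));
    }
}
```

In BioJava 1.x the returned name would presumably decide between ProteinTools.getAlphabet() and ProteinTools.getTAlphabet() before the getTokenization("token") call in createProtein().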