[Biojava-dev] [Bug 2854] New: Selection of protein alphabet is hardcoded in ProteinTools class

Thu Jun 11 02:50:42 UTC 2009

Hi Mark,

The way I see the protein structure modules developing, I will try
to get rid of the dependency on the alphabets and replace it with support
for the Chemical Component Dictionary (http://www.wwpdb.org/ccd.html).
The dictionary contains a list of standard and modified residues as well
as small-molecule ligands. Where applicable, it provides the parent/child
relationship between compounds. There are too many modified residues
to list in an alphabet, and sometimes the boundary between residues and
ligands is also not straightforward to draw...
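
To illustrate the parent/child idea, here is a minimal sketch in plain
Java. It is not BioJava code: the class ModifiedResidueParents and its
hand-written table are hypothetical, and a real implementation would
load the relationships from the dictionary itself rather than hardcode
four entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: maps a modified residue's
// three-letter code to the standard residue it is derived from.
final class ModifiedResidueParents {
    private static final Map<String, String> PARENT = new HashMap<String, String>();
    static {
        PARENT.put("MSE", "MET"); // selenomethionine
        PARENT.put("SEP", "SER"); // phosphoserine
        PARENT.put("TPO", "THR"); // phosphothreonine
        PARENT.put("PTR", "TYR"); // phosphotyrosine
    }

    // Returns the parent code, or the input unchanged when no parent is listed.
    static String parentOf(String code) {
        String parent = PARENT.get(code);
        return parent != null ? parent : code;
    }

    public static void main(String[] args) {
        System.out.println(parentOf("MSE"));
        System.out.println(parentOf("ALA"));
    }
}
```

With a lookup like this, a modified residue can be treated as its parent
when a sequence view is needed, without adding a new alphabet symbol for
every modification.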

Andreas

On Wed, Jun 10, 2009 at 6:30 PM, Mark Schreiber<markjschreiber at gmail.com> wrote:
> This raises an interesting point for the development of
> biojava3. Do we actually need separate protein alphabets? I can't
> remember the reason these are separate. Is there a good
> argument for this?
>
> - Mark
>
> On Thu, Jun 11, 2009 at 5:59 AM, <bugzilla-daemon at portal.open-bio.org> wrote:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2854
>>
>>           Summary: Selection of protein alphabet is hardcoded in
>>                    ProteinTools class
>>           Product: BioJava
>>           Version: live (CVS source)
>>          Platform: All
>>        OS/Version: All
>>            Status: NEW
>>          Severity: normal
>>          Priority: P2
>>         Component: seq
>>        AssignedTo: biojava-dev at biojava.org
>>        ReportedBy: mdharsee at ocbn.ca
>>
>>
>> In our application we are calling createProtein() in class
>> org.biojava.bio.seq.ProteinTools to generate SymbolList objects to encapsulate
>> peptide sequences that are composed of the 20 common amino acid symbols, as
>> well as the 'X' ambiguity symbol.
>>
>> However, createProtein() forces the selection of the PROTEIN-TERM alphabet from
>> AlphabetManager.xml, through the call to 'getTAlphabet()' as copied below:
>>
>>  public static SymbolList createProtein(String theProtein)
>>          throws IllegalSymbolException
>>  {
>>    SymbolTokenization p = null;
>>    try {
>>      p = getTAlphabet().getTokenization("token");
>>    } catch (BioException e) {
>>      throw new BioError("Something has gone badly wrong with Protein", e);
>>    }
>>    return new SimpleSymbolList(p, theProtein);
>>  }
>>
>> This selection should be made based on the symbol content of the input
>> sequence(s) rather than being hardcoded. Only if the input data contains the
>> symbol 'TER' (terminus) or some ambiguity symbol that covers the PROTEIN-TERM
>> alphabet should the PROTEIN-TERM alphabet be selected. Otherwise the simpler
>> PROTEIN alphabet should be selected.
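
The selection rule described above could be sketched roughly as follows.
This is plain Java with no BioJava dependency. The class AlphabetChooser
and the method chooseAlphabetName are hypothetical, and the sketch assumes
that the terminus symbol is written as '*' in single-letter sequences,
which should be checked against AlphabetManager.xml. Ambiguity symbols
that cover PROTEIN-TERM are not handled here.

```java
// Hypothetical helper, not part of BioJava: picks an alphabet name from
// the content of the input sequence instead of hardcoding PROTEIN-TERM.
final class AlphabetChooser {
    // Assumption: '*' is the single-letter token for the terminus symbol (TER).
    private static final char TERMINUS_TOKEN = '*';

    // Returns "PROTEIN-TERM" only when the sequence contains the terminus
    // token; otherwise returns the simpler "PROTEIN".
    static String chooseAlphabetName(String sequence) {
        return sequence.indexOf(TERMINUS_TOKEN) >= 0 ? "PROTEIN-TERM" : "PROTEIN";
    }

    public static void main(String[] args) {
        System.out.println(chooseAlphabetName("MKVLAX"));
        System.out.println(chooseAlphabetName("MKVLA*"));
    }
}
```

In BioJava the returned name would presumably be used to look up the
alphabet before asking it for its "token" tokenization, in place of the
fixed getTAlphabet() call.
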
>>
>> On a related note, the PROTEIN alphabet defined in AlphabetManager.xml consists
>> of 22 residues and includes the less commonly found 'SEC' (selenocysteine, U)
>> and 'PYL' (pyrrolysine, O). However, many applications only require the
>> common 20-symbol alphabet that excludes the latter two residues. It would be
>> useful to include a new alphabet in AlphabetManager.xml that defines the
>> simpler 20-symbol set of common amino acids. Perhaps this point should be a
>> feature request?
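
Until such an alphabet exists, an application could check for the
20-symbol set itself. The following is a minimal sketch under that
assumption, in plain Java with no BioJava dependency; the class
CommonAminoAcids and the method isCommonOnly are hypothetical names.

```java
// Hypothetical helper, not part of BioJava: tests whether a peptide uses
// only the 20 common amino acid letters, optionally allowing 'X'.
final class CommonAminoAcids {
    // One-letter codes of the 20 common amino acids (U and O are excluded).
    private static final String COMMON_20 = "ACDEFGHIKLMNPQRSTVWY";

    static boolean isCommonOnly(String sequence, boolean allowX) {
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (COMMON_20.indexOf(c) >= 0) {
                continue; // one of the 20 common residues
            }
            if (allowX && c == 'X') {
                continue; // ambiguity symbol explicitly permitted
            }
            return false; // anything else, e.g. U, O or '*'
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCommonOnly("MKVLAX", true));
        System.out.println(isCommonOnly("MKUL", true));
    }
}
```
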
>>
>> Cheers,
>> Moyez
>>
>>
>> --
>> Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
>> ------- You are receiving this mail because: -------
>> You are the assignee for the bug, or are watching the assignee.
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev