[Biojava-l] three-letter Protein alphabet names

Richard Holland richard.holland at ebi.ac.uk
Tue Aug 1 08:19:36 UTC 2006


I'm not sure, but it should simply be a matter of defining an alphabet
in which each symbol is named by its three-letter code. You can then use
that alphabet to tokenize the input string three characters at a time.
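
To make the idea concrete, here is a rough sketch in plain Java of what
tokenizing by three-letter code amounts to. It does not use any BioJava
classes: the class ThreeLetterTokenizer and its methods are hypothetical
names made up for illustration, and it assumes the input contains only the
20 standard residues, optionally separated by whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; NOT part of the BioJava API.
final class ThreeLetterTokenizer {
    // The 20 standard amino acids: three-letter codes and, in the same
    // order, their one-letter equivalents.
    private static final String THREE =
        "AlaArgAsnAspCysGlnGluGlyHisIleLeuLysMetPheProSerThrTrpTyrVal";
    private static final String ONE = "ARNDCQEGHILKMFPSTWYV";

    private static final Map<String, Character> CODES = new HashMap<>();
    static {
        for (int i = 0; i < ONE.length(); i++) {
            CODES.put(THREE.substring(3 * i, 3 * i + 3), ONE.charAt(i));
        }
    }

    // Split the input into three-letter codes, ignoring whitespace and case.
    static List<String> tokenize(String input) {
        String compact = input.replaceAll("\\s+", "");
        if (compact.length() % 3 != 0) {
            throw new IllegalArgumentException(
                "Length is not a multiple of 3: " + compact.length());
        }
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < compact.length(); i += 3) {
            String raw = compact.substring(i, i + 3);
            // Normalise to the "Ala" capitalisation used as the map key.
            String code = raw.substring(0, 1).toUpperCase(Locale.ROOT)
                        + raw.substring(1).toLowerCase(Locale.ROOT);
            if (!CODES.containsKey(code)) {
                throw new IllegalArgumentException(
                    "Unknown three-letter code: " + raw);
            }
            tokens.add(code);
        }
        return tokens;
    }

    // Convert a three-letter-coded sequence to its one-letter form.
    static String toOneLetter(String input) {
        StringBuilder sb = new StringBuilder();
        for (String code : tokenize(input)) {
            sb.append(CODES.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toOneLetter("Met Ala Gly Trp")); // prints MAGW
    }
}
```

Inside BioJava the lookup table would presumably be replaced by the symbols
of the protein alphabet itself, so treat the above as a picture of the
parsing step only, not as the way the library does it.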

Mark will know more about this than me. Mark - comments?

cheers,
Richard


On Tue, 2006-08-01 at 17:41 +1000, Neil Bacon wrote:
> Hi,
> I'm looking at extending biojava sequence io to read sequences from 
> patents (initially current US data formats, later perhaps older formats 
> and other jurisdictions).
> Anyone done this already or interested?
> 
> Protein data uses 3-letter codes. I found an old posting about 3-letter 
> codes:
> 
> [Biojava-dev] Protein alphabet names
> http://lists.open-bio.org/pipermail/biojava-dev/2002-October/000143.html
> 
> >   - Add an additional tokenization (probably called "three-letter"
> >     unless someone comes up with a better suggestion) for people
> >     who actually want 3-letter codes.
> 
> Did this happen (I can't find it)?
> I'll try extending WordTokenization to do this unless someone has 
> already done it or can advise me better (I'm new here and advice would 
> be very welcome).
> 
> Cheers,
>     Neil Bacon
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416



