[Biojava-l] DNA letters are lowercase???

hz5@njit.edu hz5@njit.edu
Sun, 15 Sep 2002 00:55:39 -0400 (EDT)

In most cases, a DNA sequence can be written in either lowercase or uppercase, 
and mixing the two forms causes no problems. I do know of one tool, however, in 
which lowercase and UPPERCASE have different meanings, and that tool is used by 
NCBI during the genome annotation process.

The tool is called RepeatMasker (Smit, AFA & Green, P, RepeatMasker at 
http://ftp.genome.washington.edu/RM/RepeatMasker.html). During the NCBI contig 
assembly process, after sequence contamination is removed, the sequences are 
repeat-masked using RepeatMasker, and the repeated sequences are converted to 
lowercase. RepeatMasker is a very useful tool, and converting repeat sequences 
to lowercase while the rest remain UPPERCASE is one of the display options it 
provides. More sequences output by this program may become available for 
downstream analysis, so BioJava might want to take this into account.
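
For downstream code, the lowercase (repeat-masked) regions described above can 
be recovered from a mixed-case sequence with a simple scan. The following is a 
minimal sketch in plain Java, not BioJava API: the class name SoftMaskRuns and 
the method lowercaseRuns are hypothetical names chosen for illustration, and it 
assumes the sequence is held as an ordinary String.

```java
import java.util.ArrayList;
import java.util.List;

public class SoftMaskRuns {

    /**
     * Returns one {start, end} pair per run of lowercase letters,
     * using 0-based indices with an exclusive end.
     */
    static List<int[]> lowercaseRuns(String seq) {
        List<int[]> runs = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a lowercase run"
        for (int i = 0; i < seq.length(); i++) {
            boolean lower = Character.isLowerCase(seq.charAt(i));
            if (lower && start < 0) {
                start = i;                      // a run begins here
            } else if (!lower && start >= 0) {
                runs.add(new int[] {start, i}); // the run ended at i
                start = -1;
            }
        }
        if (start >= 0) {
            runs.add(new int[] {start, seq.length()}); // run reaches the end
        }
        return runs;
    }

    public static void main(String[] args) {
        String seq = "ACGTacgtacGGTTaa";
        for (int[] r : lowercaseRuns(seq)) {
            System.out.println(r[0] + "-" + r[1] + " " + seq.substring(r[0], r[1]));
        }
        // prints:
        // 4-10 acgtac
        // 14-16 aa
    }
}
```

The intervals could then be attached to the sequence as features, so the case 
information survives even if the letters themselves are normalised.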

I suggest that all DNA sequences use UPPERCASE letters, and that lowercase 
letters in a sequence be treated as special features that can be defined by 
different users or preprocessing programs.
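
One way to picture this suggestion, and the idea in the quoted message below of 
pairing each base with a boolean, is to store the bases in uppercase and keep 
the original case as a separate flag per position. This is only a sketch under 
that assumption: CaseFlags, split and restore are hypothetical names, and 
nothing here is taken from the BioJava API.

```java
import java.util.Locale;

public class CaseFlags {
    final String bases;    // the sequence with every letter upper-cased
    final boolean[] upper; // upper[i] is true where the input letter was uppercase

    CaseFlags(String bases, boolean[] upper) {
        this.bases = bases;
        this.upper = upper;
    }

    /** Splits a mixed-case sequence into uppercase bases plus per-position case flags. */
    static CaseFlags split(String seq) {
        boolean[] upper = new boolean[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            upper[i] = Character.isUpperCase(seq.charAt(i));
        }
        return new CaseFlags(seq.toUpperCase(Locale.ROOT), upper);
    }

    /** Rebuilds the mixed-case string from the bases and the flags. */
    String restore() {
        StringBuilder sb = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            char c = bases.charAt(i);
            sb.append(upper[i] ? c : Character.toLowerCase(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CaseFlags cf = split("ACgtNa");
        System.out.println(cf.bases);     // ACGTNA
        System.out.println(cf.restore()); // ACgtNa
    }
}
```

What a false flag means (a masked repeat, an uncertain base call, or something 
else) is left to whichever user or program produced the sequence.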


Quoting Matthew Pocock <matthew_pocock@yahoo.co.uk>:

> Ryan Golhar wrote:
> > Can anyone tell me why the letters for DNA (a,c,t,g) are lowercase in
> > DNATools?
> Hi Ryan,
> The static methods used to retrieve the bases are in lower case. The 
> AtomicSymbol instances returned can be spat out as lower or upper case
> depending on the SymbolTokenization you use.
> (Someone who knows): does the default tokenization for DNA use upper or
> lower case? I don't care either way.
> Ryan: To maintain the upper/lower case info in chromatogram files we 
> would need to do a little trickery. If you send a file (mixed case) and
> a couple of use-cases, we can probably sort this out quickly enough. If
> the case is important to you (e.g. you need to know where the uncertain
> calls are), we can do this, and if you want to discard this information
> then we can also do that trivially. I'm thinking thoughts like alignment
> of DNA against booleans (or 0/1) where A,1 would be A and supported 
> (upper case), and T,0 would be T and not well supported (lower case).
> Has this already been done?
> Matthew
> > 
> > Some chromatogram files contain a mix of A,C,T,G and some lowercase
> > letters for peaks that could not be absolutely determined.
> > 
> > Regardless, DNA is always represented with uppercase letters...
> > 
> > If there is no argument against it, can this be changed to upper case
> > letters instead?
> > 
> > Ryan
> > 
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l@biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l
> > 

Haibo Zhang, PhD student
Computational Biology, NJIT & Rutgers University
Center for Applied Genomics, PHRI