[BioSQL-l] Recording "nucleotide" in the sequence table?

Sat May 16 16:48:40 UTC 2009

On May 16, 2009, at 7:53 AM, Peter wrote:

> In a recent bug report (Bug 2829) it was pointed out that we
> (Biopython) don't attempt to record nucleotide alphabets in BioSQL
> (i.e. a sequence which could be DNA or RNA but we don't know which),
> they just get "unknown" as their biosequence.alphabet entry.

I'm assuming that you do know that it's not protein, right? I.e.,
assigning the alphabet "unknown" isn't exactly right.

> Is there any precedent in BioPerl, BioJava or BioRuby for how to
> handle this?  If not, I'd like to introduce and agree on "nucleotide"
> for this situation.

So which letters (symbols) does the "nucleotide" alphabet contain?

Getting back to Mark's question, how do you know that it's either DNA
or RNA but not protein? Is the problem that the user can't tell you
whether it's DNA or RNA but they know it's not protein, or is it that
the user doesn't say anything and all you have is the symbols of the
sequence, which are a, c, g, and t only?

In BioPerl we'll guess the alphabet if the user doesn't say what it  
is, and at present if what we're seeing are the symbols a, c, g, and t  
only, then the guess is dna. If we're seeing u rather than t, we guess  
it's rna. An "unknown" alphabet would be for the user to expressly  
choose.
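
As a rough illustration of that guessing rule, here is a simplified
Python sketch. It is not BioPerl's actual code; the function name
guess_alphabet and the choice to return None when no guess can be made
are made up for this example, and it only covers the two cases described
above.

```python
def guess_alphabet(sequence):
    """Guess "dna" or "rna" from the symbols alone, else return None.

    Assumption: a sequence with neither t nor u (e.g. "acg") is
    guessed as dna here, although it could just as well be rna.
    """
    symbols = set(sequence.lower())
    if not symbols:
        return None  # empty sequence: nothing to guess from
    if symbols <= set("acgt"):
        return "dna"  # only a, c, g, and t seen
    if symbols <= set("acgu"):
        return "rna"  # u seen rather than t
    return None  # anything else: no guess in this sketch


print(guess_alphabet("ACGTTGCA"))
print(guess_alphabet("acguugca"))
```

Note that a sequence with neither t nor u is exactly the case where the
symbols can't distinguish dna from rna, so the sketch just defaults to
dna there.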

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================