[BioSQL-l] Recording "nucleotide" in the sequence table?

Peter biopython at maubp.freeserve.co.uk
Sat May 16 20:25:21 UTC 2009


Hilmar wrote:
>  I'm assuming that you do know that it's not protein, right?
>  I.e., assigning alphabet "unknown" isn't exactly right.

Yes, if the sequence is using the generic nucleotide alphabet this
means it is NOT protein, and could be DNA or RNA.  So yes,
downgrading a "nucleotide" alphabet to just "unknown" when
storing it in BioSQL (as we do now) is losing information - hence
me starting this thread.

> > Is there any precedent in BioPerl, BioJava or BioRuby for how to
> > handle this?  If not, I'd like to introduce and agree on "nucleotide"
> > for this situation.
>
>  So which letters (symbols) does the "nucleotide" alphabet contain?

Potentially anything - although I would expect the standard (ambiguous)
letters used in RNA or DNA, plus perhaps gap symbols.

> Getting back to Mark's question, how do you know that it's either dna or
> rna but not protein?

We know because the user (or parser) has explicitly used the generic
nucleotide alphabet; this means it is not protein, and is either
DNA or RNA. From the point of view of loading the sequence into BioSQL,
we don't know or care where the sequence came from - we just get
given the data with a declared alphabet.

> Is the problem that the user can't tell you whether it's dna or
> rna but they know it's not protein, or is it that the user doesn't
> say anything and all you have is the symbols of the sequence,
> which are a, c, g, and t only.

In the situation I'm talking about, either the user has explicitly
picked the alphabet, or perhaps one of our parsers has done so.
This would be because the user doesn't know, or the file format
doesn't specify this information.  This is admittedly a corner
case - generally there will be either T or U entries in the
sequence so DNA or RNA can be deduced unambiguously.
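
As a rough sketch of that deduction (plain Python, not code taken from
Biopython or BioPerl; the function name deduce_nucleotide_type is made up
for illustration), only a T or a U settles the question, and anything else
stays undetermined:

```python
def deduce_nucleotide_type(sequence):
    """Return "dna", "rna", or None when the letters cannot settle it.

    Hypothetical helper for illustration only. It assumes the caller
    already knows the sequence is nucleotide rather than protein.
    """
    letters = set(sequence.upper())
    has_t = "T" in letters
    has_u = "U" in letters
    if has_t and not has_u:
        return "dna"
    if has_u and not has_t:
        return "rna"
    # Neither T nor U (or, oddly, both): DNA versus RNA is undecidable.
    return None


print(deduce_nucleotide_type("GATTACA"))   # dna
print(deduce_nucleotide_type("GAUUACA"))   # rna
print(deduce_nucleotide_type("GCGCGCGA"))  # None
```

The None case is the corner case where "nucleotide" would be the honest
thing to record.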

> In BioPerl we'll guess the alphabet if the user doesn't say what it is, and
> at present if what we're seeing are the symbols a, c, g, and t only, then
> the guess is dna. If we're seeing u rather than t, we guess it's rna. An
> "unknown" alphabet would be for the user to expressly choose.

What would BioPerl do with the nucleotide sequence GCGCGCGA?
Presumably you guess, thus record either "dna" or "rna" in BioSQL,
so the issue of wanting to record "nucleotide" never arises.

In Python "guessing" is discouraged.  If we have a nucleotide sequence
like GCGCGCGA, this could be DNA or RNA - you can't tell.  Our
nucleotide alphabet covers this situation, although another strong
reason for having it is as a common base class for the RNA and
DNA alphabets.
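
To make the base class idea concrete, here is a minimal sketch. The class
names only mirror the idea and are not a copy of Biopython's alphabet
classes; the function biosql_alphabet_name is hypothetical, and the
"nucleotide" string is the value proposed in this thread, not an agreed
BioSQL term:

```python
class Alphabet:
    """Generic alphabet: nothing is known about the sequence type."""


class ProteinAlphabet(Alphabet):
    """Declared to be protein."""


class NucleotideAlphabet(Alphabet):
    """Declared to be DNA or RNA, without saying which."""


class DNAAlphabet(NucleotideAlphabet):
    """Declared to be DNA."""


class RNAAlphabet(NucleotideAlphabet):
    """Declared to be RNA."""


def biosql_alphabet_name(alphabet):
    """Map a declared alphabet to the string stored with the sequence.

    Hypothetical helper: the most specific classes are tested first, and
    nothing is inferred from the sequence letters themselves.
    """
    if isinstance(alphabet, DNAAlphabet):
        return "dna"
    if isinstance(alphabet, RNAAlphabet):
        return "rna"
    if isinstance(alphabet, ProteinAlphabet):
        return "protein"
    if isinstance(alphabet, NucleotideAlphabet):
        return "nucleotide"  # the value proposed in this thread
    return "unknown"


print(biosql_alphabet_name(NucleotideAlphabet()))  # nucleotide
print(biosql_alphabet_name(DNAAlphabet()))         # dna
print(biosql_alphabet_name(Alphabet()))            # unknown
```

Without the "nucleotide" branch, the generic nucleotide case falls through
to "unknown", which is the loss of information described above.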

On 5/16/09, Mark Schreiber <markjschreiber at gmail.com> wrote:
> I don't think you can do this with certainty. If you don't know the source
> alphabet then an amino acid sequence could look like dna if it is only
> using acgt and some of the ambiguity codes.
>
> If it is a long sequence it will become increasingly unlikely it is amino
> acid but never certain.

The Python answer is: don't guess. If you read in a FASTA file with
Biopython it will by default be given a generic alphabet, unless you
explicitly specify otherwise (and in BioSQL the alphabet will be
stored as "unknown").  i.e. the onus is on the user to be explicit.
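
A minimal sketch of that "be explicit" behaviour, using a toy FASTA reader
written for this example (read_fasta and its alphabet argument are
hypothetical, and this is not Biopython's actual parser or API):

```python
import io


def read_fasta(handle, alphabet="unknown"):
    """Yield (identifier, sequence, alphabet) tuples from a FASTA handle.

    Toy parser for illustration only. The alphabet is whatever the caller
    declares; it is never inferred from the sequence letters.
    """
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks), alphabet
            identifier = line[1:].strip()
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks), alphabet


fasta_text = ">example\nGCGC\nGCGA\n"

# No alphabet declared: the record stays generic.
print(list(read_fasta(io.StringIO(fasta_text))))
# [('example', 'GCGCGCGA', 'unknown')]

# The user states what the data is.
print(list(read_fasta(io.StringIO(fasta_text), alphabet="dna")))
# [('example', 'GCGCGCGA', 'dna')]
```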

Peter


