[Biojava-l] opening unknown fasta file

Thomas Down thomas at derkholm.net
Fri Nov 12 11:26:05 EST 2004


On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber at group.novartis.com wrote:
> 
> Basically there is absolutely no failsafe way to know if a fasta file is 
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide 
> which contains only A, C, G and T, although it becomes very unlikely with 
> longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
that appear in DNA sequences.  Ns are everywhere, but many of the other
ambiguities appear from time to time, too.
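
To make the overlap concrete, here is a minimal sketch that checks which
IUPAC nucleotide symbols are also legal one-letter protein symbols.  The
class and method names are hypothetical (this is not BioJava API), and the
protein alphabet is assumed to include the ambiguity codes B, Z and X.

```java
import java.util.Set;
import java.util.TreeSet;

final class AlphabetOverlap {
    // IUPAC nucleotide codes: the four bases plus the 11 ambiguity symbols.
    static final String DNA_SYMBOLS = "ACGT" + "RYSWKMBDHVN";

    // The 20 standard amino acids plus the ambiguity codes B, Z and X
    // (an assumption about which protein alphabet is in use).
    static final String PROTEIN_SYMBOLS = "ACDEFGHIKLMNPQRSTVWY" + "BZX";

    /** Returns the DNA symbols that are also legal protein symbols. */
    static Set<Character> sharedSymbols() {
        Set<Character> shared = new TreeSet<>();
        for (char c : DNA_SYMBOLS.toCharArray()) {
            if (PROTEIN_SYMBOLS.indexOf(c) >= 0) {
                shared.add(c);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // Prints "15 of 15 DNA symbols are also protein symbols".
        System.out.println(sharedSymbols().size() + " of " + DNA_SYMBOLS.length()
                + " DNA symbols are also protein symbols");
    }
}
```

Under these assumptions every DNA symbol is also a valid protein symbol, so
a simple "does it parse as DNA?" test cannot rule out protein.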

If we were *really* serious about alphabet-guessing (which scares me, to be
honest), one option would be to calculate histograms of character frequencies
in EMBL and Swissprot, and look for the closest match.  I believe that
Internet Explorer takes this approach when it hits a web page without an
explicitly-specified character encoding -- it apparently works pretty well...
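
A rough sketch of the histogram idea follows.  Everything in it is an
assumption for illustration: the class and method names are hypothetical
(not BioJava API), the distance measure is plain squared Euclidean distance,
and the two reference strings are toy placeholders standing in for
histograms that would really be computed from EMBL and Swissprot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AlphabetGuesser {
    /** Relative frequency of each letter A-Z in the text; other characters are ignored. */
    static double[] histogram(String text) {
        double[] freq = new double[26];
        int total = 0;
        for (char raw : text.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Squared Euclidean distance between two histograms (an assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the label of the reference histogram closest to the query. */
    static String guess(String query, Map<String, double[]> references) {
        double[] q = histogram(query);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : references.entrySet()) {
            double d = distance(q, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Toy placeholder references; real ones would be built from EMBL and Swissprot. */
    static Map<String, double[]> toyReferences() {
        Map<String, double[]> refs = new LinkedHashMap<>();
        refs.put("DNA", histogram("ACGTACGTACGTNNNNACGTRYKM"));
        refs.put("PROTEIN", histogram("ACDEFGHIKLMNPQRSTVWY"));
        return refs;
    }

    public static void main(String[] args) {
        Map<String, double[]> refs = toyReferences();
        System.out.println(guess("ACGTNNNNACGTTTGACA", refs));
        System.out.println(guess("MSTNPKPQRKTKRNTNRRPQDVKFPGG", refs));
    }
}
```

This is still only a guess: a short peptide made up of A, C, G and T would
land nearest the DNA histogram, so a caller would probably want to treat
the result as a hint rather than a decision.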

Does anyone feel this serious?

       Thomas.