[Biojava-l] opening unknown fasta file

mark.schreiber at group.novartis.com mark.schreiber at group.novartis.com
Sun Nov 21 21:39:58 EST 2004

One way to do this would be to create a Unicode alphabet (or ASCII 
alphabet) and read the file into a Sequence of that Alphabet, create a 
Distribution, compare it to the DNA/RNA/Protein distributions using 
DistributionTools and then convert it to the correct Alphabet.
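
A rough sketch of that idea in plain Java, deliberately avoiding any 
BioJava calls so as not to misstate the API. The class AlphabetGuesser 
and its methods are hypothetical names, the distance measure (sum of 
squared differences) is just one simple choice, and the reference 
distributions are supplied by the caller. The text is assumed to be in 
memory already, so the file only has to be read once.

```java
import java.util.Map;

/** Hypothetical helper (not part of BioJava): guess an alphabet from letter usage. */
final class AlphabetGuesser {

    /** Concatenate the residue lines of FASTA text already held in memory. */
    static String residuesOf(String fastaText) {
        StringBuilder sb = new StringBuilder();
        for (String line : fastaText.split("\\R")) {
            if (!line.startsWith(">")) {      // skip description lines
                sb.append(line.trim());
            }
        }
        return sb.toString();
    }

    /** Relative frequency of each letter A-Z, ignoring case and non-letters. */
    static double[] letterFrequencies(CharSequence residues) {
        double[] freq = new double[26];
        int total = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Sum of squared differences between two frequency vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Name of the reference distribution closest to the observed letter usage. */
    static String guess(CharSequence residues, Map<String, double[]> references) {
        double[] observed = letterFrequencies(residues);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : references.entrySet()) {
            double d = distance(observed, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In practice the reference vectors would be built by running 
letterFrequencies over a large sample of known DNA, RNA and protein 
sequences. The result is only ever a best guess, and short sequences 
will be the least reliable.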

Even more ambitious would be to read the whole file to a text buffer and 
guess the format and alphabet based on the usage of characters.

Does anyone feel inspired to do something like this? We are always getting 
emails from students looking for short projects. How about that one? My 
basic minimal requirement would be that the file should not be read twice. 
I/O is expensive; memory is cheap.

- Mark

Thomas Down <thomas at derkholm.net>
Sent by: biojava-l-bounces at portal.open-bio.org
11/13/2004 12:26 AM

        To:     Mark Schreiber/GP/Novartis at PH
        cc:     biojava-list <biojava-l at biojava.org>
        Subject:        Re: [Biojava-l] opening unknown fasta file

On Fri, Nov 12, 2004 at 10:01:13AM +0800, 
mark.schreiber at group.novartis.com wrote:
> Basically there is absolutely no failsafe way to know if a fasta file is 
> DNA or Protein (or RNA). It's perfectly reasonable to have a short 
> protein which contains only a, c, g and t, although it becomes very 
> unlikely with longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
that appear in DNA sequences.  Ns are everywhere, but many of the other
ambiguities appear from time to time, too.
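
To make the overlap concrete, here is a small plain-Java sketch (no 
BioJava calls; the class and method names are made up for illustration). 
It assumes the 15 IUPAC DNA symbols and the 20 standard one-letter amino 
acid codes, and ignores RNA's U, gaps and other punctuation.

```java
/** Hypothetical illustration of how much the DNA and protein letter sets overlap. */
final class IupacOverlap {

    /** A, C, G, T plus the 11 IUPAC ambiguity symbols. */
    static final String DNA_SYMBOLS = "ACGTRYSWKMBDHVN";

    /** The 20 standard amino acid one-letter codes. */
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    /** Amino acid letters that are never valid DNA symbols. */
    static String proteinOnlyLetters() {
        StringBuilder sb = new StringBuilder();
        for (char c : AMINO_ACIDS.toCharArray()) {
            if (DNA_SYMBOLS.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** True if the residues contain a letter outside the DNA symbol set. */
    static boolean cannotBeDna(CharSequence residues) {
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (Character.isLetter(c) && DNA_SYMBOLS.indexOf(c) < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(proteinOnlyLetters());     // prints EFILPQ
        System.out.println(cannotBeDna("GATTACA"));   // prints false
    }
}
```

Only six amino acid letters (E, F, I, L, P, Q) rule out DNA under these 
assumptions. A false result from cannotBeDna does not prove the sequence 
is DNA; it only means the letters alone cannot settle the question.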

If we were *really* serious about alphabet-guessing (which scares me, to 
be honest), one option would be to calculate histograms of character 
usage in EMBL and Swissprot, and look for the closest match.  I believe that
Internet Explorer takes this approach when it hits a web page without an
explicitly-specified character encoding -- it apparently works pretty 
well.

Does anyone feel this serious?
