[Biojava-l] opening unknown fasta file

mark.schreiber at group.novartis.com mark.schreiber at group.novartis.com
Thu Nov 11 21:01:13 EST 2004

Hi Koen -

There was a method in SeqIOTools that can (mostly) guess the alphabet of a 
file, but it is deprecated because there is no standard file-naming 
convention.  ClustalW guesses by pre-reading the file and looking for 
symbols that occur in protein but not in DNA. They claim its accuracy at 
guessing is in the high 90s (percent), but I'm not sure how they 
calculate that number.
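
A rough sketch of that kind of pre-reading heuristic, in plain Java with no 
BioJava dependency. This is not ClustalW's actual algorithm and not a 
BioJava API: the class name FastaAlphabetGuesser, the method guess, the 
maxResidues cut-off and the returned labels are all hypothetical. It 
assumes that any of the letters E, F, I, L, P or Q (amino acid codes that 
are not IUPAC nucleotide codes) means protein, and it treats a sequence 
with U but no T as RNA.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FastaAlphabetGuesser {
    // Amino acid one-letter codes that are not IUPAC nucleotide codes (assumption).
    private static final String PROTEIN_ONLY = "EFILPQ";

    /** Best guess only: returns "PROTEIN", "RNA" or "DNA" after scanning up to maxResidues symbols. */
    static String guess(List<String> fastaLines, int maxResidues) {
        boolean sawU = false;
        boolean sawT = false;
        int seen = 0;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                continue; // skip FASTA header lines
            }
            for (char c : line.toUpperCase(Locale.ROOT).toCharArray()) {
                if (seen >= maxResidues) {
                    break; // stop pre-reading once enough symbols were inspected
                }
                if (PROTEIN_ONLY.indexOf(c) >= 0) {
                    return "PROTEIN"; // symbol that cannot be a nucleotide code
                }
                if (c == 'U') sawU = true;
                if (c == 'T') sawT = true;
                seen++;
            }
        }
        // Nothing protein-specific was seen; a short peptide can still be misclassified here.
        return (sawU && !sawT) ? "RNA" : "DNA";
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(">example", "MKVLAAGIVGLLLAQ");
        System.out.println(guess(lines, 1000));
    }
}
```

A peptide made only of A, C, G and T would come back as DNA, so the result 
is a default to be confirmed by the user rather than a definitive answer.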

Basically there is no failsafe way to know if a fasta file is 
DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide 
which contains only A, C, G and T, although it becomes very unlikely with 
longer sequences. If you have control over the files you could adopt some 
naming specification (I use .fna for fasta DNA or .faa for fasta amino 
acid). An alternative is to allow the specification of format and alphabet 
in the arguments to the program.
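
A minimal sketch of combining the two suggestions, again in plain Java. 
The class AlphabetChooser and its choose method are hypothetical names. 
An alphabet given explicitly on the command line wins; otherwise the 
.fna/.faa extension decides; otherwise the method returns null so the 
caller can ask the user. The returned strings are assumed to be names 
that AlphabetManager.alphabetForName accepts.

```java
import java.util.Locale;

class AlphabetChooser {
    /**
     * Picks an alphabet name for a fasta file.
     * explicitAlphabet may be null or empty when the user gave no argument.
     */
    static String choose(String fileName, String explicitAlphabet) {
        if (explicitAlphabet != null && !explicitAlphabet.isEmpty()) {
            return explicitAlphabet.toUpperCase(Locale.ROOT); // user's choice overrides the file name
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fna")) {
            return "DNA";     // fasta nucleic acid, by the naming convention above
        }
        if (lower.endsWith(".faa")) {
            return "PROTEIN"; // fasta amino acid, by the naming convention above
        }
        return null;          // unknown extension: do not guess
    }

    public static void main(String[] args) {
        String explicit = args.length > 1 ? args[1] : null;
        String file = args.length > 0 ? args[0] : "example.faa";
        System.out.println(choose(file, explicit));
    }
}
```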

- Mark

Koen van der Drift <kvddrift at earthlink.net>
Sent by: biojava-l-bounces at portal.open-bio.org
11/12/2004 06:21 AM

        To:     biojava-list <biojava-l at biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] opening unknown fasta file


The BioJava tutorial ("BioJava in Anger") suggests the following code to 
open a fasta file:


    // get the appropriate Alphabet
    Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

    // get a SequenceDB of all sequences in the file
    SequenceDB db = SeqIOTools.readFasta(is, alpha);

But what should I do when I don't know if the fasta file contains a 
protein or dna sequence?


- Koen.
