[Biojava-l] opening unknown fasta file
mark.schreiber at group.novartis.com
Thu Nov 11 21:01:13 EST 2004
Hi Koen -
There was a method in SeqIOTools that can (mostly) guess the alphabet of a
file, but it is deprecated because there is no standard convention for file
naming. ClustalW guesses by pre-reading the file and looking for
symbols that are found in protein but don't occur in DNA. They claim its
accuracy at guessing is in the high 90s (percent), but I'm not sure how they
calculate that number.
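A minimal sketch of that kind of pre-read, in plain Java with no BioJava
dependency. This is not ClustalW's actual algorithm and not the deprecated
SeqIOTools method. The class name AlphabetGuesser, the method guess and the
choice of "EFILPQ" as protein-only letters (one-letter amino acid codes that
are not IUPAC nucleotide codes) are assumptions made for illustration:

```java
import java.util.Locale;

final class AlphabetGuesser {
    // Assumption: these amino acid codes never appear in nucleotide FASTA,
    // even when IUPAC ambiguity codes (R, Y, S, W, K, M, B, D, H, V, N) are used.
    private static final String PROTEIN_ONLY = "EFILPQ";

    /** Best guess only: returns "PROTEIN", "RNA" or "DNA" for the given FASTA text. */
    static String guess(String fastaText) {
        boolean sawT = false;
        boolean sawU = false;
        for (String line : fastaText.split("\r?\n")) {
            if (line.startsWith(">")) {
                continue; // description line, not sequence data
            }
            for (char c : line.toUpperCase(Locale.ROOT).toCharArray()) {
                if (PROTEIN_ONLY.indexOf(c) >= 0) {
                    return "PROTEIN"; // one protein-only symbol settles it
                }
                if (c == 'T') sawT = true;
                if (c == 'U') sawU = true;
            }
        }
        // No protein-only symbol seen: call it RNA only if U occurs without T.
        return (sawU && !sawT) ? "RNA" : "DNA";
    }

    public static void main(String[] args) {
        System.out.println(guess(">pep\nMKVLAAGIV\n")); // prints PROTEIN
        System.out.println(guess(">dna\nacgtacgt\n"));  // prints DNA
        System.out.println(guess(">short\nACGT\n"));    // prints DNA, even if it is really a peptide
    }
}
```

Empty input also comes back as "DNA", so treat the result as a hint rather
than an answer.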
Basically there is no failsafe way to know whether a fasta file is
DNA or protein (or RNA). It's perfectly reasonable to have a short peptide
that contains only A, C, G and T, although it becomes very unlikely with
longer sequences. If you have control over the files you could adopt some
naming convention (I use .fna for fasta DNA and .faa for fasta amino
acid). An alternative is to allow the format and alphabet to be specified
in the arguments to the program.
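Both options can be combined, roughly like this. It is a sketch only: the
class AlphabetChooser and the method alphabetName are made-up names, and it
assumes "DNA" and "PROTEIN" are the names you would hand to
AlphabetManager.alphabetForName, in the same way the tutorial code passes
args[1]:

```java
import java.util.Locale;

final class AlphabetChooser {
    /**
     * Picks an alphabet name for a fasta file. An explicit override (for
     * example from the command line) wins; otherwise the .fna/.faa
     * extension convention is used.
     */
    static String alphabetName(String fileName, String override) {
        if (override != null && !override.isEmpty()) {
            return override;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fna")) {
            return "DNA";     // fasta nucleic acid
        }
        if (lower.endsWith(".faa")) {
            return "PROTEIN"; // fasta amino acid
        }
        // Refuse to guess: make the caller say what the file contains.
        throw new IllegalArgumentException(
            "Cannot infer alphabet from '" + fileName + "'; specify it explicitly");
    }

    public static void main(String[] args) {
        System.out.println(alphabetName("genes.fna", null));      // prints DNA
        System.out.println(alphabetName("proteome.faa", null));   // prints PROTEIN
        System.out.println(alphabetName("mystery.fasta", "RNA")); // prints RNA
    }
}
```

Throwing on an unknown extension is deliberate here; silently defaulting to
DNA would hide exactly the ambiguity described above.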
- Mark
Koen van der Drift <kvddrift at earthlink.net>
Sent by: biojava-l-bounces at portal.open-bio.org
11/12/2004 06:21 AM
To: biojava-list <biojava-l at biojava.org>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] opening unknown fasta file
Hi,
The BioJava tutorial ("BioJava in Anger") suggests the following code to open a
fasta file:
[snip]
// get the appropriate Alphabet
Alphabet alpha = AlphabetManager.alphabetForName(args[1]);
// get a SequenceDB of all sequences in the file
SequenceDB db = SeqIOTools.readFasta(is, alpha);
But what should I do when I don't know if the fasta file contains a
protein or dna sequence?
thanks,
- Koen.