[Biojava-l] opening unknown fasta file

j vermont jvermont at hotmail.com
Sat Nov 13 00:11:29 EST 2004


IMO this should be addressed from a design standpoint, in the APIs
themselves. If you are *aware* of the nature of the file you're dealing
with, the APIs should let you differentiate the cases programmatically,
either via a Factory design pattern or through subclassing. It would be far
more efficient to design a general, architectural solution than a
'parsing' or algorithmic one, which (I'm guessing) would only work on a
case-by-case basis. Not to mention the legitimate observation someone made
about 'alphabet guessing.'
Obviously take my input for what it's worth: I'm a programmer by trade with
an interest in genetics, so I lean towards (and understand better) the comp
science aspects of these discussions. I hope my humble suggestions are at
least somewhat helpful. Based on my understanding of what is being discussed
in this thread, however, you should be able to solve this particular
scenario programmatically (not algorithmically). I could look further into
an API/design-based or pattern-based solution when I get a chance, if
anyone thinks it worthwhile.
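
To make the Factory idea concrete, here is a minimal sketch in plain Java.
Every name in it (SequenceKind, FastaReaderFactory, FastaReader,
FastaRecord) is hypothetical and made up for illustration; it is not
BioJava's API. It assumes the caller already knows what kind of sequence
the file holds, and the symbol sets are simplified (no gap characters, no
rare amino acid codes).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: the kinds of sequence a caller can declare up front. */
enum SequenceKind {
    // A, C, G, T plus the 11 IUPAC ambiguity symbols.
    DNA("ACGTRYKMSWBDHVN"),
    RNA("ACGURYKMSWBDHVN"),
    // 20 standard amino acids plus B, Z, X and '*' (a simplification).
    PROTEIN("ACDEFGHIKLMNPQRSTVWYBZX*");

    private final String symbols;

    SequenceKind(String symbols) {
        this.symbols = symbols;
    }

    boolean allows(char c) {
        return symbols.indexOf(Character.toUpperCase(c)) >= 0;
    }
}

/** Hypothetical value class for one FASTA entry. */
final class FastaRecord {
    final String id;
    final SequenceKind kind;
    final String residues;

    FastaRecord(String id, SequenceKind kind, String residues) {
        this.id = id;
        this.kind = kind;
        this.residues = residues;
    }
}

/** The factory: the caller states the kind, so nothing is guessed. */
final class FastaReaderFactory {
    private FastaReaderFactory() {}

    static FastaReader forKind(SequenceKind kind) {
        return new FastaReader(kind);
    }
}

final class FastaReader {
    private final SequenceKind kind;

    FastaReader(SequenceKind kind) {
        this.kind = kind;
    }

    List<FastaRecord> read(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        List<FastaRecord> records = new ArrayList<>();
        String id = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if any.
                if (id != null) {
                    records.add(new FastaRecord(id, kind, seq.toString()));
                }
                id = line.substring(1).trim();
                seq.setLength(0);
            } else {
                if (id == null) {
                    throw new IllegalArgumentException("sequence data before first header");
                }
                // Reject symbols that are not valid for the declared kind.
                for (char c : line.toCharArray()) {
                    if (!kind.allows(c)) {
                        throw new IllegalArgumentException(
                                "symbol '" + c + "' is not valid for " + kind);
                    }
                }
                seq.append(line.toUpperCase());
            }
        }
        if (id != null) {
            records.add(new FastaRecord(id, kind, seq.toString()));
        }
        return records;
    }

    /** Convenience wrapper for in-memory text. */
    List<FastaRecord> readString(String text) {
        try {
            return read(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, FastaReaderFactory.forKind(SequenceKind.PROTEIN) will
happily read a record whose residues are just ACGT as a peptide, because
the caller said so, while forKind(SequenceKind.DNA) rejects a line
containing, say, L or E. The ambiguous case never has to be guessed.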

just my thoughts,

jess vermont
chicago

Universes of virtually unlimited complexity can be created in the form of 
computer programs. (Joseph Weizenbaum)




>From: Thomas Down <thomas at derkholm.net>
>To: mark.schreiber at group.novartis.com
>CC: biojava-list <biojava-l at biojava.org>
>Subject: Re: [Biojava-l] opening unknown fasta file
>Date: Fri, 12 Nov 2004 16:26:05 +0000
>
>On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber at group.novartis.com 
>wrote:
> >
> > Basically there is absolutely no failsafe way to know if a fasta file is
> > DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> > which contains only a, c, g and t, although it becomes very unlikely with
> > longer sequences.
>
>The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
>that appear in DNA sequences.  Ns are everywhere, but many of the other
>ambiguities appear from time to time, too.
>
>If we were *really* serious about alphabet-guessing (which scares me, to be
>honest), one option would be to calculate histograms of character
>frequencies in EMBL and Swissprot, and look for the closest match.  I
>believe that Internet Explorer takes this approach when it hits a web page
>without an explicitly-specified character encoding -- it apparently works
>pretty well...
>
>Does anyone feel this serious?
>
>        Thomas.
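
For reference, the histogram idea quoted above could be sketched roughly as
follows. This is only an illustration: AlphabetGuesser and its methods are
hypothetical names, squared Euclidean distance is just one possible choice
of "closest match", and the reference profiles have to be supplied by the
caller (the sketch ships no EMBL or Swissprot frequencies). As noted in
the thread, it cannot be failsafe, since a short peptide made only of A, C,
G and T will look like DNA.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: guess an alphabet by nearest letter-frequency profile. */
final class AlphabetGuesser {
    private final Map<String, double[]> profiles = new LinkedHashMap<>();

    /** Relative frequency of each letter A-Z; all other characters are ignored. */
    static double[] histogram(String residues) {
        double[] freq = new double[26];
        int total = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Registers a reference profile built from caller-supplied residues. */
    void addProfile(String label, String referenceResidues) {
        profiles.put(label, histogram(referenceResidues));
    }

    /** Returns the label of the profile closest to the query's histogram. */
    String guess(String residues) {
        if (profiles.isEmpty()) {
            throw new IllegalStateException("no reference profiles registered");
        }
        double[] query = histogram(residues);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : profiles.entrySet()) {
            double[] reference = entry.getValue();
            double distance = 0.0;
            for (int i = 0; i < query.length; i++) {
                double diff = query[i] - reference[i];
                distance += diff * diff; // squared Euclidean distance
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In practice one would call addProfile once per database with a large
sample of real sequence, then call guess on each unknown entry. The
quality of the guess depends entirely on those reference samples.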
