UnivAln 1.004 Beta

Steven E. Brenner brenner@akamail.com
Tue, 18 Mar 1997 10:34:52 +0900 (JST)


> The problem w/ comma-separated is that according to our current
> specs, comma is a legal component of an ID; we only carp on whitespace.
> In other words, ``Mus,musculus'' is a legal ID.
> Since non-whitespace is also a legal component of filenames on many systems 
> I believe, I'd like to keep the convention.

I thought IDs had to be in '\s'; if not, maybe they should be.  Further,
whitespace is a legal component of filenames on most filesystems.  (It is
on Unix, Macintosh, and Windows, for example.)

An array seems to me to be the "right" way to do this.  But I thought
we were talking about numeration rather than identifiers anyway.

> > an array of strings is probably even better still, as that's presumably
> > what you use inside the routines that deal with these things.
> 
> Arrays of integers are interpreted as index lists; since names may be
> integers as well, and Perl doesn't really distinguish integers and strings,
> how do you want to do this ?
> (Of course, the system under discussion can allow {string=>\$sting_of_names}
> as a parameter for seqs().)

I don't follow -- probably because I haven't spent enough time studying
UnivAln.


> > The numbering in the code still seems pretty poorly documented/determined.
> 
> Pls be more specific..

You sent an email saying that UnivAln supported arbitrary numbering
schemes.  I saw no documentation (even in comments) about this anywhere.
There was lots of code passing around 'numbering' without ever
saying what it was supposed to be.


> > I agree that a hash permits many options.  But that potentially
> > just indicates lack of clear thinking and good design.  A tenet of OO
> > design is that you shouldn't have redundant interfaces; they raise the
> > learning curve (because there are more options to learn) and make the code
> > less efficient and more error-prone.
> 
> Since ARRAY, CODE and scalar are already taken as the possible type of the
> first real parameter of seq(), HASH seems ideal.

I'm not saying that using a HASH is bad (though I would tend to argue that
this means that we should reconsider the parameters to the seq()
function).  What I am saying is that allowing multiple ways of specifying
the same data via a hash is generally bad design.
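
For reference, telling apart the parameter kinds mentioned above comes
down to Perl's ref().  A minimal sketch, assuming nothing about the real
seq() code; the sub name selector_kind is hypothetical and not part of
UnivAln:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of UnivAln: report what kind of value
# was passed as the first real parameter.
sub selector_kind {
    my ($arg) = @_;
    my $type = ref $arg;
    return 'scalar' if $type eq '';    # plain string or number
    return $type if $type eq 'ARRAY' || $type eq 'CODE' || $type eq 'HASH';
    die "unsupported parameter type: $type\n";
}

my $names = 'a b';
print selector_kind([0, 2]), "\n";                 # ARRAY
print selector_kind(sub { 1 }), "\n";              # CODE
print selector_kind({ string => \$names }), "\n";  # HASH
print selector_kind('Mus,musculus'), "\n";         # scalar
```

Each kind then needs exactly one documented meaning; the concern is with
a HASH that accepts several keys for the same data.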


> > I note that you're still using %FormUnivAln and %TypeUnivAln rather than
> > the arrays @UnivAlnType and @UnivAlnForm.  These should be arrays, not
> > hashes.
> 
> You mean, @UnivAlnType = ('Unknown','Dna','Rna','Amino','OtherSeq') and 
> @UnivAlnForm = ('unknown','raw','fasta','nexus') ? On second thoughts,
> I must admit I fail to remember the advantages, but can clearly see
> the disadvantages; given ``fasta'', how do you find out what the corresponding
> number is ? It's my feeling that this is a costly change on which I'll spend 
> hours, _or_ I just misunderstand.

The idea was that you would have

@UnivAlnType = ('unknown','dna','rna','amino','other'); #note lower case
foreach $i (0..$#UnivAlnType) {
  $UnivAlnType{$UnivAlnType[$i]} = $i;
}


This way we can index from number to string with @UnivAlnType and from
string to number with %UnivAlnType.
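
For the ``fasta'' question in particular, a minimal sketch using the
format list quoted earlier.  This is only an illustration of the idea,
not code from UnivAln, and it builds the hash with a hash slice instead
of a loop:

```perl
use strict;
use warnings;

# Number -> string: the array (list taken from the quoted message).
my @UnivAlnForm = ('unknown', 'raw', 'fasta', 'nexus');

# String -> number: a hash built from the array, so the two
# cannot drift apart.
my %UnivAlnForm;
@UnivAlnForm{@UnivAlnForm} = (0 .. $#UnivAlnForm);

my $number = $UnivAlnForm{'fasta'};    # 2
my $name   = $UnivAlnForm[$number];    # 'fasta'

print "fasta -> $number\n";
print "$number -> $name\n";
```

The array stays the single place where the list is written down; the
hash is always derived from it.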

The problem is that you have replaced @UnivAlnType with %TypeUnivAln...
and you're using a number as the key to a hash.  This is
inefficient. But worse, it can lead to problems because $foo = " 1" would
give the right result in $UnivAlnType[$foo] but not in
$TypeUnivAln{$foo}.
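
A minimal sketch of that difference.  The hash name %ByNumber is made up
for the example and stands in for any hash keyed by number:

```perl
use strict;
use warnings;

my @UnivAlnType = ('unknown', 'dna', 'rna', 'amino', 'other');

# A hash keyed by number; %ByNumber is a made-up name for this sketch.
my %ByNumber = map { $_ => $UnivAlnType[$_] } 0 .. $#UnivAlnType;

my $foo = " 1";    # a number with stray leading whitespace

# An array subscript is evaluated numerically, so " 1" is used as 1.
my $from_array = $UnivAlnType[$foo];

# A hash key is the string itself; " 1" and "1" are different keys.
my $from_hash = $ByNumber{$foo};

print "array: $from_array\n";
print "hash:  ", (defined $from_hash ? $from_hash : 'undef'), "\n";
```

The array lookup gives 'dna'; the hash lookup finds no key " 1" and
gives undef.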

To restate, to go from a string to a number use a hash
            to go from a number to a string use an array

Steve