New Bio::Seq and Bio::Seq::Parse (.025 BETA)

Georg Fuellen fuellen@dali.Mathematik.Uni-Bielefeld.DE
Tue, 18 Mar 1997 11:28:07 +0000 (GMT)


Hi,

I've collected everything into one long message...
Read the first 100 lines, get a coffee, read the next 100 lines, etc :-)

Steve Br. wrote,
> > >   A few other nits from a _very_ cursory look-through
> > > 
> > > @SeqForm appears never to be created
> > > 
> > > I would change [@%]SeqForm to [@%]SeqFmt, or even [@%]seq_fmt (to be
> > > consistent with the rest of the naming). 
> > 
> > I think then we should have seq_ffmt.
> > Then again, doesn't SeqForm hint at the fact that these variables are 
> > very special ?
> 
> Good Point; these are supposed to be constants, after all.  Then how about
> SeqFmt? 

Well, then we should leave it as is, i.e. SeqForm, and save us the hassle 
of changing names. IMHO. Steve, Chris, do you want a change? 
[ ] yes 
[ ] no
If we change, SeqFfmt looks more consistent to me than SeqFmt.

> > > There's no 'valid' field to indicate whether or not the object is indeed
> > > valid for any operation.  For example, if setseq is used to set an invalid
> > > sequence.  
> > 
> > What if we don't allow this to happen ?
> > If we keep the object valid all the time ?
> 
> That means that we have to 'croak' on any error rather than carp.  

Why not just refuse to make changes that invalidate the object, AND carp?
As you note yourself, 'croak' should be avoided.

> > > Functions which can return an invalid result (such as parse_bad) should
> > > return undef rather
> > 
> > You mean, rather than 0 ? I thought zero and the null string ("") 
> > are interpreted as false, and returning 0 or "" seems the standard 
> > convention, no ?
> 
> No.  undef, 0, and "" are all 'false' in Perl.  However undef is
> qualitatively different in that it returns 'false' to the defined()
> function.  The others don't.  undef is "more false" than the others. 
> Therefore failures are always supposed to be indicated as undef 
> (except from syscall() and system() )

I gather that severe failures should return ``undef'', and 
ordinary ones should return 0/"". That would also be consistent w/ the 
code samples I saw, which return 0 on failure quite often like UNIX does.
Alternatively, I could gather from your statement that you're feeling
strongly about returning undef everywhere, and that may be much easier to
maintain - less uncertainty about return values. Pls reply - per default, I'll
keep things as they are. (running out of time, I tend to be conservative:)
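To make the difference concrete, here is a minimal sketch of the "undef
everywhere" convention. The function seq_length() is hypothetical (it is not
part of Seq.pm or Parse.pm); the point is only that undef keeps a failure
distinguishable from a legitimate result of 0 or "".

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sequence length on success and
# undef on failure, so a legitimate result of 0 stays distinguishable.
sub seq_length {
    my ($seq) = @_;
    return undef unless defined $seq && $seq !~ /\s/;   # failure
    return length $seq;                                 # may be 0
}

for my $input ('ACGT', '', 'AC GT') {
    my $len = seq_length($input);
    if (defined $len) {
        print "'$input': length $len\n";
    } else {
        print "'$input': failed\n";
    }
}
```

With a 0-on-failure convention the empty sequence and the failure would look
the same to the caller; with undef, a defined() test separates them.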
 
> > The problem w/ comma-separated is that according to our current
> > specs, comma is a legal component of an ID; we only carp on whitespace.
> > In other words, ``Mus,musculus'' is a legal ID.
> > Since non-whitespace is also a legal component of filenames on many systems 
> > I believe, I'd like to keep the convention.
> 
> I thought ID's had to be in '\s'; if not, maybe they should be.  Further,

Do you mean ``\S'', i.e. everything but space and ``\t\n\r\f''  ??

> whitespace is a legal component of most filesystems.  (It is on Unix,
> Macintosh, and Windows, for example). 

Space (`` '') may be OK, but newline (``\n'') certainly not ?!

> An array seems to me to be "right" way to do this, I think.  But I thought
> we were talking about numeration (rather than identifiers anyway).

Misunderstanding. In the current implementation, identifiers ARE USED
to support arbitrary numbering, I've merged both concepts into one !!!

> > > an array of strings is probably even better still, as that's presumably
> > > what you use inside the routines that deal with these things.
> > 
> > Arrays of integers are interpreted as index lists; since names may be
> > integers as well, and Perl doesn't really distinguish integers and strings,
> > how do you want to do this ?
> > (Of course, the system under discussion can allow {string=>\$string_of_names}
> > as a parameter for seqs().)
> 
> I don't follow -- probably because I haven't spent enough time studying
> UnivAln
>
> > > The numbering in the code still seems pretty poorly documented/determined.
> > 
> > Pls be more specific..
> 
> You sent an email saying that UnivAln supported arbitrary numbering
> schemes.  I saw no documentation (even in comments) about this anywhere.
> There was lots of code passing around 'numbering,' without ever
> saying what it was supposed to be.
 
Misunderstanding, see above. Arbitrary numbering == arbitrary identifiers
(aka ids, aka names) for the __columns__ !!!

> > > I agree that a hash permits many options.  But that potentially
> > > just indicates lack of clear thinking and good design.  A tenet of OO
> > > design is that you shouldn't have redundant interfaces; they raise the
> > > learning curve (because there are more options to learn) and make the code
> > > less efficient and more error-prone.
> > 
> > Since ARRAY, CODE and scalar are already taken as the possible type of the
> > first real parameter of seq(), HASH seems ideal.
> 
> I'm not saying that using a HASH is bad (though I would tend to argue that
> this means that we should reconsider the parameters to the seq()
> function).  What I am saying is that allowing multiple ways of specifying
> the same data via a hash is generally bad design.

In this specific case it looks like the ideal design...  IMO !!
* It's not more redundant than allowing different named parameters for a 
function, ``-seqs'',``-file'', etc, VERSUS ``ids=>'',``descs=>'',``string=>''.
* Learning curve: In the spirit of Perl, what you don't know won't hurt 
you much; it's all about added convenience.
* less efficient: I had to add one check ``if (ref($...) eq 'HASH')''
and if it's a hash, I will need to add a switch that takes care of
the different possible key values. In general, I value
programmer+maintainer+user efficiency more than space+time efficiency, and
I believe that you usually cannot predict where the space+time
bottlenecks are - the better approach is to only make space+time efficiency 
a big deal for double loops and square/cubic data structures, AND
benchmark the code in real applications for everything else.
E.g. I don't worry about the row/column ids since they're a linear
(not square/cubic) data structure, but I do worry most about
the square array of characters that represents the alignment, and that's
why I feel that using a PDL structure for this would imply savings in
time+space that are several orders of magnitude higher than anything else I could
do; and such a change (_and_ giving the user the option to use the
regular array of array of characters if s/he needs to) should be easy
in the OO world... once PDL is stable.
* more error-prone: Adding convenience features like access by name to
rows and columns makes the code more error-prone, naturally. But you
get big benefits, among them support for arbitrary numbering schemes.
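The extra check I mention is roughly the following. This is only a sketch:
describe_selector() is a made-up name, and it returns a label where the real
seqs() would go and fetch rows. The keys ``ids'', ``descs'' and ``string''
are the ones named above.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical selector: reports how its first parameter would be
# interpreted. Real code would fetch rows instead of returning a label.
sub describe_selector {
    my ($sel) = @_;
    my $type = ref $sel;
    return 'index list'      if $type eq 'ARRAY';
    return 'filter function' if $type eq 'CODE';
    if ($type eq 'HASH') {
        # The one extra switch: which key did the caller use?
        return 'row ids'          if exists $sel->{ids};
        return 'row descriptions' if exists $sel->{descs};
        return 'string of names'  if exists $sel->{string};
        carp "unknown selector key(s): @{[ sort keys %$sel ]}";
        return undef;
    }
    return 'single index';    # plain scalar
}

print describe_selector([0, 2]), "\n";
print describe_selector({ ids => ['Mus,musculus'] }), "\n";
```

Because the hash is tagged by key, a name that happens to look like an
integer is never confused with an index.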

> > > I note that you're still using %FormUnivAln and %TypeUnivAln rather than
> > > the arrays @UnivAlnType and @UnivAlnForm.  These should be arrays, not
> > > hashes.
> > 
> > You mean, @UnivAlnType = ('Unknown','Dna','Rna','Amino','OtherSeq') and 
> > @UnivAlnForm = ('unknown','raw','fasta','nexus') ? On second thoughts,
> > I must admit I fail to remember the advantages, but can clearly see
> > the disadvantages; given ``fasta'', how do you find out what the corresponding
> > number is ? It's my feeling that this is a costly change on which I'll spend 
> > hours, _or_ I just misunderstand.
> 
> The idea was that you would have
> 
> @UnivAlnType = ('unknown','dna','rna','amino','other'); #note lower case

Lower case? Please... I'm really running out of time! Such a change consumes
a lot of time, updating docu, test scripts, my own research code, etc !!
(Also, let's not completely forget the beta testers I mailed personally.)
[ ] yes, I really think lower case is much better in this case as well
[ ] let's keep things the way they are

> foreach $i (0..$#UnivAlnType) {
>   $UnivAlnType{$UnivAlnType[$i]} = $i;
> }
> 
> This way we can index from number to string with @UnivAlnType and from
> string to number with %UnivAlnType.
> 
> The problem is that you have replaced @UnivAlnType with %TypeUnivAln... 
> and you're putting a number as the parameter to a hash.  This is
> inefficient. But worse, it can lead to problems because $foo = " 1" would
> give the right results in $UnivAlnType[$foo] but not in
> $TypeUnivAln{$foo}
> 
> To restate, to go from a string to a number use a  hash
>             to go from a number to a string use an array

Good point. Now I think I understand; will change this asap.
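For the record, a small self-contained sketch of the array-plus-hash scheme,
including the " 1" problem. It uses the lower-case list from the quoted
proposal purely for illustration (the case question is still open), and
%by_number is a stand-in I made up for a hash keyed by number.

```perl
use strict;
use warnings;

my @UnivAlnType = ('unknown', 'dna', 'rna', 'amino', 'other');
my %UnivAlnType;
$UnivAlnType{ $UnivAlnType[$_] } = $_ for 0 .. $#UnivAlnType;

# Stand-in for a hash keyed by number (what should be avoided).
my %by_number = map { $_ => $UnivAlnType[$_] } 0 .. $#UnivAlnType;

my $foo = " 1";    # numerically 1, but a different string from "1"

# Array subscripts are evaluated numerically, so this finds 'dna'.
print "array: $UnivAlnType[$foo]\n";

# Hash keys are compared as strings, so " 1" does not match "1".
print "hash:  ", (exists $by_number{$foo} ? 'found' : 'missing'), "\n";

# String to number goes through the hash.
print "amino is type number $UnivAlnType{amino}\n";
```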

> > o Site-specific configuration issues.
> > Right now, Seq.pm does not have to be edited by users but Parse.pm and the
> > test scripts do. I'm going to hit the POD docs for MakeMaker, etc. and try
> > to figure out how to set up a system where users edit a ".config" file or
> > somesuch and the resulting info is used to automatically tweak Parse.pm and
> > Seq.pm during the 'make' process. Again, any help/suggestions on this would
> > be appreciated.
> 
> Again, I'm not sure of the right thing to do here; I haven't worked with
> MakeMaker much before.
> 
> Probably the right thing to do is to have a real make, which runs a
> program which spits out a Parse.pm.  (i.e., there's no Parse.pm in the
> distribution, but it is the output of a ParseMaker Perl script which
> queries users for file locations, etc.)  One place to possibly look for
> guidance are things like PGPLOT which require external programs and
> libraries.

PGPLOT has C _and_ Fortran, I think we'll spend a long time figuring
out what's going on there. I hope there's a better example somewhere,
maybe Chris should post to c.l.p.m ?!

> If you are really pressed, I think it would be ok to simply set the
> default to be for $OK to be false and force people to edit things (before
> installation) to set them right.
> 
> > o Proposed validity markers
> >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> >   - A marker to specify valid/invalid biosequence object
> > Are these permutations of the same idea or two different things? I'm also
> > not sure about how to implement.
> 
> Yes.  These are the same thing.  Basically, there should be a 'valid'
> flag, and the code should carp() or croak() on any operation if the valid
> flag is not set.

I really like the ``always valid'' approach, the more I think about the issue.
Just refuse to have any invalid object ever created. But be very restrictive
w/ the definition of ``invalid''; in a lot of cases carp() is enough,
and after the carp() the user has to expect warnings (like ``use of
uninitialized value'') and possibly fatal dies for certain operations.
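A minimal sketch of what I mean by ``always valid'', using a made-up class
My::Seq (not the real Seq.pm) and an assumed, deliberately simple rule for
what counts as a legal sequence string:

```perl
use strict;
use warnings;
use Carp;

package My::Seq;    # hypothetical class, not Bio::Seq itself

sub new {
    my ($class) = @_;
    return bless { seq => '', id => 'No_Id_Given' }, $class;
}

# Refuse a change that would invalidate the object: carp, leave the
# old value in place, and return undef so the caller can notice.
sub setseq {
    my ($self, $seq) = @_;
    unless (defined $seq && $seq =~ /\A[A-Za-z*-]*\z/) {
        Carp::carp("setseq: invalid sequence, object left unchanged");
        return undef;
    }
    $self->{seq} = $seq;
    return 1;
}

package main;

my $s = My::Seq->new;
$s->setseq('ACGT');
$s->setseq('AC GT');    # carps; the stored sequence is still 'ACGT'
print $s->{seq}, "\n";
```

No 'valid' field is needed here, since no method ever leaves the object in an
invalid state, and the script keeps running after the warning.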

> Alternatively (as mentioned in the previous mail), croak()-ing on any
> failure would always ensure that the object is valid.  It would
> potentially cause programs to die often.

And that's not good. Perl itself usually makes the best of a situation;
the spirit is to prefer warnings to ``die''. E.g. if I use ``=='' on string
values, Perl will warn (under -w), but not die.

> > o Default constructor ID
> > Steve commented that the default constructor ID should be changed from
> > "No_Id_Given" to "No_Id" plus a unique number. Assigning a number is easy
> > enough but how would you keep track of "unique" numbers assigned? Is there
> > a way to save state or remember these numbers each time new() is called? I
> > think I see the potential problems that objects with the same 'ID' field
> > could cause but I'm unsure how a 'unique' naming process would work.
> 
> in the package have a package global something like
> 
> my $UniqNum = 1234;
> 
> and also have a function something like
> 
> sub uniq_num {
>   return $UniqNum++;
> }

Hm. What about ids that we inherit from somewhere ? E.g. from a file ?
On a parallel machine, this won't work either I think. What about other
distributed computation; CORBA may offer solutions, but it's another
big can of worms although I feel that we'll have to open it at some time -
does anyone know more about CORBA ? (I've just heard rumors! :)
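For the single-process case, one possible sketch: keep the counter private and
also remember every ID seen so far, so that generated IDs skip anything
inherited from a file. The names register_id() and default_id() are made up
for illustration, and this does nothing for the parallel or distributed case.

```perl
use strict;
use warnings;

{
    my $uniq_num = 1;    # private to this block, shared by both subs
    my %seen;            # every ID handed out or registered so far

    # Record an ID that came from elsewhere, e.g. read from a file.
    sub register_id {
        my ($id) = @_;
        $seen{$id}++;
        return $id;
    }

    # Generate a default ID, skipping any that are already taken.
    sub default_id {
        my $id;
        do { $id = 'No_Id' . $uniq_num++ } while $seen{$id};
        $seen{$id} = 1;
        return $id;
    }
}

register_id('No_Id2');       # inherited from a file, say
print default_id(), "\n";    # No_Id1
print default_id(), "\n";    # No_Id3 (No_Id2 is already taken)
```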

>... If possible, I would like to permit both cases as people sometimes use
> them to mean different things.  We may want to add upcase() and downcase()  
> [or something like that maybe toupper() and tolower()].

to_upper and to_lower ?

> > > o Proposed validity markers
> > >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> > >   - A marker to specify valid/invalid biosequence object
> > > Are these permutations of the same idea or two different things? I'm also
> > 
> > They are both ways of defining what ``valid'' is. ..
> > For me a valid object conforms to some requirements, like (for UnivAln), 
> > that $self{type} is correct (especially that it reflects the fact that the 
> > alignment is just a sequence bag, i.e.  the rows are of different length), 
> > $self{id} has no whitespace, $self{desc} conforms to $self{descffmt}, 
> > $self{row_ids}, etc, have the correct size.
> > This is something I don't have time for right now, but it's needed eventually.
> 
> You don't want to have to check all those various things every time you do
> an operation.  It would be much simpler to have a $valid flag which is set
> or cleared after every operation which changes internal variables which
> could affect validity.

As above, my current thinking is that ``always keep the object valid''
is the cleanest approach. What are the downsides ? 

Steve Ch wrote,
> >o Proposed validity markers
> >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> >   - A marker to specify valid/invalid biosequence object
> > Are these permutations of the same idea or two different things? I'm also
> > not sure about how to implement.
> 
> I think this (and Steve B's recent comments on this issue) opens up an 
> issue that could use some discussion: how to best handle errors and 
> exceptions in Perl objects. I've created some modules that I use to help 
> manage errors. See the "More advanced object" example at: 
> 
> http://genome-www.stanford.edu/~sac/perlOOP/examples/
> 
> This is my attempt to manage the wide variety of errors and 
> exceptions that can occur in complex objects. The primary motivation 
> for this work is to allow objects to handle error conditions without 
> killing the script by calling die or croak. The code is at an early 
> stage of development (it hasn't received much independent critiquing),
> but it may inspire some useful ideas.

My first q. on this line is - doesn't Perl 5.004 (just out as a late beta)
have much more support for exception handling than the current one ?
At least, it offers class UNIVERSAL, whose can() method gives you a way to
check what methods a given object is capable of. If it's not in 5.004, are there
plans for 5.005 ? More generally, sophisticated exception handling is a 
complex subject - we at least need the independent critiquing of someone 
who has experience with it. IF exception handling were trivial+easy,
I suppose Perl would offer this already, no ? I remember that it was
one of the things added last to C++ a few years back, and ppl weren't
really happy with it.
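For comparison, the basic mechanism that Perl 5 does already have is block
eval plus die: the eval traps the die and leaves the message in $@. A minimal
sketch, with risky_parse() as a made-up example routine; it says nothing
about how the modules at the URL above are built.

```perl
use strict;
use warnings;

# Hypothetical operation that signals a failure by calling die.
sub risky_parse {
    my ($text) = @_;
    die "empty input\n" unless length $text;
    return uc $text;
}

for my $text ('acgt', '') {
    # The die inside eval is trapped; the script keeps running.
    my $result = eval { risky_parse($text) };
    if ($@) {
        chomp(my $err = $@);
        warn "parse failed ($err), continuing\n";
        next;
    }
    print "parsed: $result\n";
}
```

This covers "don't kill the script", but not the harder design questions
(error classes, propagation through nested objects) raised above.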

best wishes,
georg