Notes Seq.pm Aln.pm

Georg Fuellen fuellen@dali.mathematik.uni-bielefeld.de
Mon, 17 Feb 1997 12:29:15 +0000 (GMT)


Steve:

Thanks for the extensive feedback !

> [Note, these were copy-and-pasted from a wordprocessor, as I wrote the
> note at home.  Some charaters, like apostrophes, didn't copy right.  I
> hope no information is lost from this.]
> 
> Wow!  I cant believe how far this code has come since the drafts I read on
> airplanes in December.  Both the Bio::Seq and Bio::Aln modules were really
> a joy to read though.  The code was typically clear, well-documented, and
> effectively implemented.  I cant thank you guys enough for doing such a
> great job.
> 
> I spent most of my time on Bio::Seq, and will address that first.  As I
> said, the code looks really great.  I found  small number of bugs and have
> a couple of (hopefully) small design issues.  Once they are resolved, I
> think that the code can be packaged and sent to CPAN as a public beta.
> 
> Most of the comments are written on paper and I dont have time to
> transcribe them.  Who should I mail them to (and should I send a photocopy
> to anyone else)?  Please give me a postal address.

For Bio::Seq, pls mail your comments to Chris (I don't know his snail address).
For Bio::Aln, use Georg Fuellen, Univ. Bielefeld, Techfak - AG PI,
Postfach 100131, D-33501 Bielefeld, Germany
Of course, it may turn out to be useful if you mail all your comments
about both modules to both of us.

> The main nits were:
>   [ \t] seem to be sometimes used as the opposite of \S when \s would be a

IMHO, the first line of Fasta has to match /^>[ \t]*(\S*)[ \t]*(.*)$/
and if we allow \s instead of [ \t], the first line of the actual sequence
could end up in the $head, no?! At least, I doubt that the regexp 
I developed for multiline Fasta would still work, i.e.
$ent =~ /^>[ \t]*\S*[ \t]*.*?\n(?:\n|.)*?(?=\n>|\Z)/mg
Test it with
>
ABCD
>
DEFG
HIJK

> better option.
>   Some of the package globals use hashes rather than the arrays that they
> should

Here you need to be more specific (email or snailmail)

>   There needs to be a validly-check bit somewhere in the code; is this
> object valid or could it be bezerk?

This is not needed if there's never a need to allow the construction
of invalid objects, or permit accessors without an underscore to make an 
object invalid.  But probably there is such a need ?! So how about 
$seq->{'valid'} ? (Or, Steve Ch's stuff may be handy ?!)

>   The alphabet checking idea and implementation both seem to have problems

Yup. Your advice needed :)

>   No mechanism of undefining values was provided

You mean, accessors which turn an internal hash value to 'undef' ?

>   Several functions return strings rather than Bio::Seq objects

Which function should return Bio::Seq, except copy() ?

>   The layout functions should be autoloaded.  This can wait for later
> releases, though.
>   The numbering is limited to numerics which increment uniformly
>   The parsing code seems to be a bit schizophrenic about strings versus
> filenames

Yeah, see below.

> All of these issues can be dealt with pretty neatly, I think, except the
> last one.  The actual code for doing the parsing was quite nice.  However,
> there was the weird and poorly documented weird behavior regarding passing
> around both $ent and $filename.  This seemed like duplication and a good
> way for code to break in the future.  Moreover, right now the code is
> quite inefficient in the way it reads in the entry (doing intrinsic
> parsing) and then later tries to do the "real" parsing.  In addition to
> causing a potentially problematic time delay, it can also use enormous
> amount of memory if large files (e.g., a genome of 4.5Mb) were to be read
> in.
> 
> I realized that this was the legacy of the way I had originally designed
> the code and that you had simply tried to work-around my approach in a
> general way.  Moreover, this half of one, half the other 
> I think that now is the time to reconsider this approach and, perhaps, to
> redesign (without, hopefully, too much re-coding) the whole file input.
> 
> On reconsideration, I think that it would be best for the object to take
> an entirely file-based view of parsing data formats.  That is, all parsing
> will rely on the data coming from a file.  If the data isnt in a file, it
> will be written to a temporary file.  The only exceptions (at least
> initially) would be raw and fasta, which would be parsed by entirely
> different functions, separate from the ordinary parsing functions.
> 
> I think that while redesigning the code to approach things this way, we
> should also address the issue of parsing (and writing) multiple sequences.
> This requires some sort of Bio::Seq::Set object which simply contains a
> set of sequences, probably keeping them in order (so it isnt really a set,
> but a list).  Both Bio::Seq::Set and Bio::Seq could draw on the same
  ^^^^^^^^^^^

Let's use ``Bio::Seq::List''! ``Set'' implies that each element _must be 
unique_. {1,1} is exactly the same as {1} in terms of set notation. 
Otherwise, it's a bag, or multiset ({1,1} != {1} in terms of bags.). 

> parsing routines, perhaps in Bio::Seq::Parse.   (This also opens a can of
> worms, because some files with indices are expensive to open and close, so
> it is worth trying to keep them open all the time if they will be
> repeatedly accessed.  I have code to deal with such files, but that could
> always be folded in later without affecting the interface to Bio::Seq.
> 
> In order to expedite matters, I suggest that Bio::Seq::Set be given an
> entirely minimal interface initially, probably just the ability 1) to
> produce a list of all included Bio::Seqs, NOT necessarily in order and 2)
> to return a Bio::Seq with a requested id.
> 
> What do people think of this idea?

Personally, I agree. If you really hate ``Bio::Seq::List''/``Bio::Seq::Bag''
I'd even get along with ``Bio::Seq::Set'' :-) 

> -=-=-
> 
> Bio::Seq::Aln was also very enjoyable to read, though I was able to spend
> less time going through it in detail.  It seems simple yet powerful and
> should be a very useful tool.   It does, however, focus very strongly on
> the aspects of the alignment object which Georg required for his nucleic
> acid alignments and essentially omits entirely those features which were
> crucial for typical protein alignments.
> 
> Some of the missing features are:
>   It has no support for non-numeric and non-contiguous numbering

Right. I won't have time to add this, so we'd need someone to take
care of numbering issues in both Bio::Seq and Bio::Aln.

>   There is no linkage between the original sequence and the alignments,
> especially for numeration.  (For example, there is no way to retrieve the
> column corresponding to the 5th residue of protein foo_bar)

Right, though %names and _select_flat can be used for this, or something like 
what Bio::Rel was intended to do (heh, not that again ;-)

>   It is inconvenient and inefficent at performing typical protein
> alignment operations, such as extracting a column as a string

Hm? What about a simple wrapper around _select_flat, or a degenerated
version of it ? 

>   No mechanism for dealing with structural alignments
> 
> Moreover, it seems that the internal design of the code makes
> implementation of these features extremely difficult.  Lest there be any

I'm really not sure. Do you mean that storing the alignment as an
array of array references is already problematic ?
If no, what is problematic ?
I think array of array references is the most efficient if you need
rapid access to the data (using strings or Bio::Seq's would slow this
down considerably). Unless, of course, you move towards PerlDL / C plug-ins.

> doubt, I should make it clear that I dont blame Georg for this; I think
> that it would have been hard, if not impossible, to reconcile the
> different requirements.  That he only implemented the features he really
> requires is also entirely reasonable and understandable; that he
> implemented so much in the first place is totally commendable.
> 
> However, I have to ask if a module whose basic design effectively
> precludes frequently-required operation should be permanently enshrined as
> the default alignment object.  I dont think so, but I also think that
> theres an easy solution.  Instead of calling the module Bio::Seq::Aln, why
> dont we call it Bio::Seq::GFAln (to credit Georg) or Bio::Seq::NucAln (to
> reflect its main use for nucleic acid alignments, even though it can be
> used for proteins).  Then someone else might write a module more focusing
> on protein needs called Bio::Seq::ProtAln.  Eventually, it may be possible
> to merge the objects to form a Bio::Seq::Aln.  Alternatively, one of them
> might grow dominant and its name could be changed to Bio::Seq::Aln (while
> old code would go on unaffected by the name change).
                       ^^^^^^^^^^^^I doubt that;)

Bio::NucAln would be ok, but _first_ let's check whether Bio::Aln
is really bad for Protein data. I doubt this very much !

> What do people think?  In terms of code-change, it should just mean
> changing the Makefile.PL and the package line in the module.
> 
> As noted, I didnt spend as much time combing through the code of
> Bio::Seq::Aln.  All of my noted are on paper that Im mailing to Georg.
 
Looking forward !

> -=-=-
> 
> A general note for both pieces of code is that we should assume that
> people will use the bioperl routines just like theyd use anything else
> from CPAN.  That is, they will download it and install it in the default
> way by just saying perl Makefile.PL; make all; make test; make install.
> Users should never be expected to edit the files.  Additional
> documentation about use lib is ok for people who might not have access to
> the system site_perl directory, but the INSTALLDIR notes are likely to be
> confusing and unhelpful. 
> 
> Further, since people should not ever edit the modules, code which relies
> on external items (e.g., PGPLOT; ReadSeq) should be in a separate module
> file.  My feeling is that there should be a separate ReadSeq module which
> does nothing but provide a wrapper around the ReadSeq routines.

ok; I can give this make stuff another try, although I find it's an
unpleasant aspect of Perl, contrary to the original intent of just
having .pm files, and everything works ok and is simple...
http://world.std.com/~swmcd/steven/Perl/module_mechanics.html
also reflects that resentment, and the offerings -re-
``Building Related Perl Modules'' and
``Testing a module from a Perl program'' made me feel uneasy.
But as the page's author says, ``You get used to it. Really.'' Oh well.
Maybe you've got better recommandations than the ones on that page ?

> Steve
> 
> P.S.  The Bio::Seq module on Georgs site seems to be corrupted; I couldnt

Which one ??
Actually, both are NOT on my site; Chris is offering 2 tar files, and for one 
of them, Netscape may remove the .gz without unzipping the file...
Or ? Chris, pls check !

> untar, uncompress, or gunzip it.  The one from Chriss site worked fine.
> 
> P.P.S.  PLEASE, as soon as possible, send me the postal addresses to which
> my written comments should be sent.

See above..
best wishes,
georg

>