Notes Seq.pm Aln.pm
Georg Fuellen
fuellen@dali.mathematik.uni-bielefeld.de
Mon, 17 Feb 1997 12:29:15 +0000 (GMT)
Steve:
Thanks for the extensive feedback !
> [Note, these were copy-and-pasted from a wordprocessor, as I wrote the
> note at home. Some charaters, like apostrophes, didn't copy right. I
> hope no information is lost from this.]
>
> Wow! I cant believe how far this code has come since the drafts I read on
> airplanes in December. Both the Bio::Seq and Bio::Aln modules were really
> a joy to read though. The code was typically clear, well-documented, and
> effectively implemented. I cant thank you guys enough for doing such a
> great job.
>
> I spent most of my time on Bio::Seq, and will address that first. As I
> said, the code looks really great. I found small number of bugs and have
> a couple of (hopefully) small design issues. Once they are resolved, I
> think that the code can be packaged and sent to CPAN as a public beta.
>
> Most of the comments are written on paper and I dont have time to
> transcribe them. Who should I mail them to (and should I send a photocopy
> to anyone else)? Please give me a postal address.
For Bio::Seq, pls mail your comments to Chris (I don't know his snail address).
For Bio::Aln, use Georg Fuellen, Univ. Bielefeld, Techfak - AG PI,
Postfach 100131, D-33501 Bielefeld, Germany
Of course, it may turn out to be useful if you mail all your comments
about both modules to both of us.
> The main nits were:
> [ \t] seem to be sometimes used as the opposite of \S when \s would be a
IMHO, the first line of Fasta has to match /^>[ \t]*(\S*)[ \t]*(.*)$/
and if we allow \s instead of [ \t], the first line of the actual sequence
could end up in the $head, no?! At least, I doubt that the regexp
I developed for multiline Fasta would still work, i.e.
$ent =~ /^>[ \t]*\S*[ \t]*.*?\n(?:\n|.)*?(?=\n>|\Z)/mg
Test it with
>
ABCD
>
DEFG
HIJK
> better option.
> Some of the package globals use hashes rather than the arrays that they
> should
Here you need to be more specific (email or snailmail)
> There needs to be a validly-check bit somewhere in the code; is this
> object valid or could it be bezerk?
This is not needed if there's never a need to allow the construction
of invalid objects, or permit accessors without an underscore to make an
object invalid. But probably there is such a need ?! So how about
$seq->{'valid'} ? (Or, Steve Ch's stuff may be handy ?!)
> The alphabet checking idea and implementation both seem to have problems
Yup. Your advice needed :)
> No mechanism of undefining values was provided
You mean, accessors which turn an internal hash value to 'undef' ?
> Several functions return strings rather than Bio::Seq objects
Which function should return Bio::Seq, except copy() ?
> The layout functions should be autoloaded. This can wait for later
> releases, though.
> The numbering is limited to numerics which increment uniformly
> The parsing code seems to be a bit schizophrenic about strings versus
> filenames
Yeah, see below.
> All of these issues can be dealt with pretty neatly, I think, except the
> last one. The actual code for doing the parsing was quite nice. However,
> there was the weird and poorly documented weird behavior regarding passing
> around both $ent and $filename. This seemed like duplication and a good
> way for code to break in the future. Moreover, right now the code is
> quite inefficient in the way it reads in the entry (doing intrinsic
> parsing) and then later tries to do the "real" parsing. In addition to
> causing a potentially problematic time delay, it can also use enormous
> amount of memory if large files (e.g., a genome of 4.5Mb) were to be read
> in.
>
> I realized that this was the legacy of the way I had originally designed
> the code and that you had simply tried to work-around my approach in a
> general way. Moreover, this half of one, half the other
> I think that now is the time to reconsider this approach and, perhaps, to
> redesign (without, hopefully, too much re-coding) the whole file input.
>
> On reconsideration, I think that it would be best for the object to take
> an entirely file-based view of parsing data formats. That is, all parsing
> will rely on the data coming from a file. If the data isnt in a file, it
> will be written to a temporary file. The only exceptions (at least
> initially) would be raw and fasta, which would be parsed by entirely
> different functions, separate from the ordinary parsing functions.
>
> I think that while redesigning the code to approach things this way, we
> should also address the issue of parsing (and writing) multiple sequences.
> This requires some sort of Bio::Seq::Set object which simply contains a
> set of sequences, probably keeping them in order (so it isnt really a set,
> but a list). Both Bio::Seq::Set and Bio::Seq could draw on the same
^^^^^^^^^^^
Let's use ``Bio::Seq::List''! ``Set'' implies that each element _must be
unique_. {1,1} is exactly the same as {1} in terms of set notation.
Otherwise, it's a bag, or multiset ({1,1} != {1} in terms of bags.).
> parsing routines, perhaps in Bio::Seq::Parse. (This also opens a can of
> worms, because some files with indices are expensive to open and close, so
> it is worth trying to keep them open all the time if they will be
> repeatedly accessed. I have code to deal with such files, but that could
> always be folded in later without affecting the interface to Bio::Seq.
>
> In order to expedite matters, I suggest that Bio::Seq::Set be given an
> entirely minimal interface initially, probably just the ability 1) to
> produce a list of all included Bio::Seqs, NOT necessarily in order and 2)
> to return a Bio::Seq with a requested id.
>
> What do people think of this idea?
Personally, I agree. If you really hate ``Bio::Seq::List''/``Bio::Seq::Bag''
I'd even get along with ``Bio::Seq::Set'' :-)
> -=-=-
>
> Bio::Seq::Aln was also very enjoyable to read, though I was able to spend
> less time going through it in detail. It seems simple yet powerful and
> should be a very useful tool. It does, however, focus very strongly on
> the aspects of the alignment object which Georg required for his nucleic
> acid alignments and essentially omits entirely those features which were
> crucial for typical protein alignments.
>
> Some of the missing features are:
> It has no support for non-numeric and non-contiguous numbering
Right. I won't have time to add this, so we'd need someone to take
care of numbering issues in both Bio::Seq and Bio::Aln.
> There is no linkage between the original sequence and the alignments,
> especially for numeration. (For example, there is no way to retrieve the
> column corresponding to the 5th residue of protein foo_bar)
Right, though %names and _select_flat can be used for this, or something like
what Bio::Rel was intended to do (heh, not that again ;-)
> It is inconvenient and inefficent at performing typical protein
> alignment operations, such as extracting a column as a string
Hm? What about a simple wrapper around _select_flat, or a degenerated
version of it ?
> No mechanism for dealing with structural alignments
>
> Moreover, it seems that the internal design of the code makes
> implementation of these features extremely difficult. Lest there be any
I'm really not sure. Do you mean that storing the alignment as an
array of array references is already problematic ?
If no, what is problematic ?
I think array of array references is the most efficient if you need
rapid access to the data (using strings or Bio::Seq's would slow this
down considerably). Unless, of course, you move towards PerlDL / C plug-ins.
> doubt, I should make it clear that I dont blame Georg for this; I think
> that it would have been hard, if not impossible, to reconcile the
> different requirements. That he only implemented the features he really
> requires is also entirely reasonable and understandable; that he
> implemented so much in the first place is totally commendable.
>
> However, I have to ask if a module whose basic design effectively
> precludes frequently-required operation should be permanently enshrined as
> the default alignment object. I dont think so, but I also think that
> theres an easy solution. Instead of calling the module Bio::Seq::Aln, why
> dont we call it Bio::Seq::GFAln (to credit Georg) or Bio::Seq::NucAln (to
> reflect its main use for nucleic acid alignments, even though it can be
> used for proteins). Then someone else might write a module more focusing
> on protein needs called Bio::Seq::ProtAln. Eventually, it may be possible
> to merge the objects to form a Bio::Seq::Aln. Alternatively, one of them
> might grow dominant and its name could be changed to Bio::Seq::Aln (while
> old code would go on unaffected by the name change).
^^^^^^^^^^^^I doubt that;)
Bio::NucAln would be ok, but _first_ let's check whether Bio::Aln
is really bad for Protein data. I doubt this very much !
> What do people think? In terms of code-change, it should just mean
> changing the Makefile.PL and the package line in the module.
>
> As noted, I didnt spend as much time combing through the code of
> Bio::Seq::Aln. All of my noted are on paper that Im mailing to Georg.
Looking forward !
> -=-=-
>
> A general note for both pieces of code is that we should assume that
> people will use the bioperl routines just like theyd use anything else
> from CPAN. That is, they will download it and install it in the default
> way by just saying perl Makefile.PL; make all; make test; make install.
> Users should never be expected to edit the files. Additional
> documentation about use lib is ok for people who might not have access to
> the system site_perl directory, but the INSTALLDIR notes are likely to be
> confusing and unhelpful.
>
> Further, since people should not ever edit the modules, code which relies
> on external items (e.g., PGPLOT; ReadSeq) should be in a separate module
> file. My feeling is that there should be a separate ReadSeq module which
> does nothing but provide a wrapper around the ReadSeq routines.
ok; I can give this make stuff another try, although I find it's an
unpleasant aspect of Perl, contrary to the original intent of just
having .pm files, and everything works ok and is simple...
http://world.std.com/~swmcd/steven/Perl/module_mechanics.html
also reflects that resentment, and the offerings -re-
``Building Related Perl Modules'' and
``Testing a module from a Perl program'' made me feel uneasy.
But as the page's author says, ``You get used to it. Really.'' Oh well.
Maybe you've got better recommandations than the ones on that page ?
> Steve
>
> P.S. The Bio::Seq module on Georgs site seems to be corrupted; I couldnt
Which one ??
Actually, both are NOT on my site; Chris is offering 2 tar files, and for one
of them, Netscape may remove the .gz without unzipping the file...
Or ? Chris, pls check !
> untar, uncompress, or gunzip it. The one from Chriss site worked fine.
>
> P.P.S. PLEASE, as soon as possible, send me the postal addresses to which
> my written comments should be sent.
See above..
best wishes,
georg
>