[Bioperl-l] Philosophy, BioPerl Object Creation [was Query Unigene title from input a ACC number]

Wed Mar 26 11:03:37 EST 2003

On Tue, 25 Mar 2003, Jason Stajich wrote:

> On Tue, 25 Mar 2003, Jamie Hatfield (AGCoL) wrote:
>
> > The blessed hash was actually something I was planning on trying, I just
> > wasn't sure if that was a "sanctioned" method of speeding up my code.  I
> > haven't been around to hear all the discussion re:
> > speed/flexibility/solid object model, so I didn't realize this topic was
> > becoming a dead horse.  :-)
> >
> it's still alive - don't worry.  We want to fix perf problems, but there
> hasn't been a good solution suggested yet.
>
> > Another area I was curious about:  my fpc module ISA MapIO, so when
> > reading in newlines, it uses the _readline function.  Are there any
> > plans to buffer this or do we assume that the os/hardware does a good
> > enough job as is?  Also, What was the motivation for abstracting this
> > away?  I mean, I assume you're saying that there is a significant
> > performance hit in perl when calling methods (more so, I assume, than
> > other programming languages).
> >
>
> There is a significant performance hit calling 'new' with the way we have
> implemented it because it walks up the chained constructor hierarchy and
> perl doesn't seem to want to cache that walk (actually I'm really not sure
> what is going on in the guts there, others have taken a look and can
> report).

One thing I dislike about the bioperl 'new' system
is that if you mistype the name of a parameter it
just gets silently ignored.  Perl/Tk has a similar
system, but passes a reference to a hash of named
parameters (instead of flattening and
reconstructing the hash at each level).  Each
level deletes the parameters it has taken, and the
root object can throw an exception if there are
any left in the hash.  Implementing this now in
bioperl would, however, be horrific!  We can wait
for the Perl6 rewrite, which has named parameters
built into the language.
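
A minimal sketch of that scheme, for illustration only.  The
classes My::Root, My::Seq and My::RichSeq and the _init method
are hypothetical names; none of this is actual BioPerl or
Perl/Tk code.  Each level deletes the parameters it understands
from a shared hash reference, and the root constructor complains
about whatever is left:

```perl
use strict;
use warnings;

# Hypothetical classes for illustration; these are not BioPerl modules.
package My::Root;

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    # Pass ONE hash reference up the chain rather than re-flattening it.
    $self->_init(\%args);
    # Anything still in the hash was claimed by no level: probably a typo.
    if (my @unknown = sort keys %args) {
        die "Unknown parameter(s): @unknown\n";
    }
    return $self;
}

sub _init { }    # the root itself takes no parameters

package My::Seq;
our @ISA = ('My::Root');

sub _init {
    my ($self, $args) = @_;
    $self->{id} = delete $args->{-id};    # claim our own parameter
    $self->SUPER::_init($args);           # let the parent claim its own
}

package My::RichSeq;
our @ISA = ('My::Seq');

sub _init {
    my ($self, $args) = @_;
    $self->{species} = delete $args->{-species};
    $self->SUPER::_init($args);
}

package main;

my $ok = My::RichSeq->new(-id => 'X1', -species => 'human');

# The misspelt '-speceis' is reported instead of being silently dropped.
my $bad = eval { My::RichSeq->new(-id => 'X2', -speceis => 'mouse') };
my $err = $@;
print "Error: $err" unless defined $bad;
```

Note that the constructor is still inherited from the root, so
this sketch does nothing for the speed of 'new'; it only shows
how mistyped parameter names could be caught.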

> But you're only calling new once to instantiate the parser so I
> don't think you have to worry about things there.
>
> I am assuming perl is doing just fine buffering things.
>
> > More than half of the time spent reading in a fpc file has ended up in
> > the _readline method, but it really doesn't take that long to read the
> > file in if you do it yourself with open, <>, close and such.  I'm just
> > trying to find a good way to keep within the object model, but still
> > make this a useable object.
> >

I think the point of the _readline method is that
the parser can put a line back onto the stack if
it decides it has read too far.
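
Roughly like this, as a sketch under stated assumptions:
My::LineReader is a hypothetical stand-in and not the real
bioperl IO class, though I have borrowed the _readline and
_pushback method names from this thread.  The parser reads one
line too many, notices it belongs to the next record, and hands
it back:

```perl
use strict;
use warnings;

# Hypothetical minimal reader, not the actual BioPerl IO class.
package My::LineReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, pushed => [] }, $class;
}

# Return a pushed-back line if there is one, otherwise read from the handle.
sub _readline {
    my ($self) = @_;
    return pop @{ $self->{pushed} } if @{ $self->{pushed} };
    return scalar readline($self->{fh});
}

# Put a line back so that the next _readline call returns it again.
sub _pushback {
    my ($self, $line) = @_;
    push @{ $self->{pushed} }, $line if defined $line;
}

package main;

# Made-up two-record input, read through an in-memory filehandle.
my $data = ">rec1\nAAA\nCCC\n>rec2\nGGG\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $reader = My::LineReader->new($fh);

my $header = $reader->_readline;    # header of the first record
my @body;
while (defined(my $line = $reader->_readline)) {
    if ($line =~ /^>/) {            # read too far: this starts the next record
        $reader->_pushback($line);
        last;
    }
    push @body, $line;
}
my $next = $reader->_readline;      # the pushed-back header comes out again
print "Body lines: ", scalar(@body), "; next record starts with: $next";
```

A plain <$fh> loop cannot do this without the parser keeping
its own copy of the extra line somewhere.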

> I don't think there is a serious performance hit with the IO class, but
> it would be good to try and quantify it.  There are several reasons to use
> the class we have set up, not the least being that we support a transparent
> -fh or -file to specify either a filehandle or a filename.  Using <> means you
> assume you are always being given a file.  Also, we allow a method called
> _pushback so that you can realize you've read too far when doing parsing
> and push back onto the stack the last line (or lines) you have read.  It
> also allows us to unify access to data streams across the project so that
> all the parser modules behave in a common way.  This is by far the main
> reason for having a common module for IO access.
>
> > I really am not trying to be argumentative/critical.  Just trying to
> > make it good and make it fast.
>
> No it's good to question these things, we get stuck in our ways sometimes
> I think. Usually because we feel we've solved that problem and want to
> move on, but sometimes more appropriate or creative solutions should be
> applied to 'old' problems.
>
> My feeling is people can criticize the project all they want, just be
> willing to step to the plate and lead an effort to improve things (and
> still maintain our attempt at quality stable releases).  If you just stand
> from afar and criticize you aren't really helping since most of us already
> have quite full plates and Bioperl is mostly just a free-time sort of thing.
>
> >
> > Is there a developer paper/primer that I should read that has a lot of
> > this discussion in it?
> >
> Our OO design stuff mainly comes from Damian Conway's principles in his
> book and some early frameworks that Steve Chervitz helped establish and
> then some lighter weight stuff that Ewan and others helped pioneer.
>
> How I wish there was a real document about all of this - the mailing list
> is a wealth of these things, but no one really has collected various
> edicts or positions on things into the appropriate documents.  The wiki
> was supposed to be the place for this but really has not worked out how
> one might have hoped.
>
> Other texts are emerging slowly from discussions that have been had
> off-line, and some attempts at framing the future directions of the project
> look like they'll be finished in the coming weeks.  RFCs
> will be posted as soon as the core devs have agreed on what we envision
> should be part of the development efforts over the next 6-9 months.
>
>
> In case you are wondering, nothing too magical about the core devs - we're
> just folks who have agreed to invest a fair amount of our time in making
> sure the development effort is coordinated, releases code that meets a
> certain threshold of testing, only breaks backwards compatibility when it
> is appropriate, has at least a minimum of documentation, and tries to
> establish a coherent design philosophy.
>
> Anyone and everyone is always welcome to post their own RFC about project
> directions or improvements to the toolkit and lead an effort.
>
> Hope that helps some.
>
> -jason
>
> > Thanks for your help and advice.
> >
> > > -----Original Message-----
> > > From: Jason Stajich [mailto:jason at cgt.mc.duke.edu]
> > >
> > > On Tue, 25 Mar 2003, Jamie Hatfield (AGCoL) wrote:
> > >
> > > > > Maybe it's just me, but I've never been too pleased with BioPerl's
> > > > > ability to handle large amounts of data like these unigene clusters.
> > > > > You all might remember I recently proposed a FPC module for reading in
> > > > > FPC data files.  Well, that is still in progress, but it is DOG slow,
> > > > > and the only reason I can seem to make out of it is that object creation
> > > > > is a bear.
> > > > >
> > > > > I would really like some input myself, from the BioPerl experts about
> > > > > what I can do to speed up the creation of say . . . 100k objects?  :-)
> > > >
> > > > You have to take a different approach then.  We've gone back and forth on
> > > > this a lot wrt speed and flexibility and a solid object model.
> > > > Apparently Perl doesn't make it easy to have all three.
> > > >
> > > > You can get around some of the problems if, instead of building things with
> > > > new, you bless a hash and then call some methods to push the data in.
> > > > This prevents the walk-up-the-tree for inheritance that happens on every
> > > > new() call, which is the main bottleneck.  We do this with features and
> > > > locations in the genbank parser right now to get a modest performance
> > > > gain.  It is still an area that we are trying to rethink and improve.
> > > >
> > > > I think we also want to move more into the realm of event-based parsing,
> > > > which would allow you to attach a listener which would only catch certain
> > > > events and perhaps wouldn't need to actually create objects for certain
> > > > quick and dirty tasks.  But the framework for this needs to be laid pretty
> > > > explicitly to make it really work.
> > > >
> > > > I believe Ensembl hit this perf problem and went with a simpler object
> > > > initialization scheme to buy them the performance they needed.  It means
> > > > that you have to code up more things when you inherit from an object
> > > > (and have to remember to update all child classes whenever a parent
> > > > class changes) but you get some performance increase.
> > >
> > > -jason
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> >
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
James G.R. Gilbert            The Wellcome Trust Sanger Institute
Fax: +44 (0)1223 494919              Wellcome Trust Genome Campus
Tel: +44 (0)1223 494906              Hinxton, Cambridge, CB10 1SA