[Bioperl-pipeline] Requirements for pipeline installation

Elia Stupka elia@fugu-sg.org
Thu, 29 Aug 2002 01:34:24 +0800 (SGT)


> Hey now! I don't want to be the bone of contention! :)

He he no worries, contentions are good. ;)

> But, since you have experience with ensembl, we will go with that.  

That is what I meant: it is easiest right now to go with what we know
works. In the future we actually want most things to work off GFF,
bioperl-db and GMOD, but there is no point in suggesting that when it is
not yet up and running.

> sure I have this right: sequences are stored in the ensembl database,
> they are retrieved and run through the pipeline, and then the entry is
> updated with whatever analysis was carried out...

Absolutely correct. The update happens live. Don't forget that the
pipeline is not a database for storing data; all its tables are purely
for logic, workflow, job management, etc. Input and output always go
to/from external files or databases. So as soon as a job is finished, it
updates the ensembl database.
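
To make that flow concrete, here is a rough Perl sketch of the
fetch-run-store cycle. The package and method names are placeholders I
made up for illustration, not the actual pipeline API; the example
pipeline contains the real modules.

    use strict;

    # Hypothetical sketch: My::EnsemblAdaptor and My::BlastRunnable are
    # placeholder names, not real bioperl-pipeline modules.
    my $db  = My::EnsemblAdaptor->new(-host => 'localhost', -dbname => 'core');
    my $seq = $db->fetch_sequence('contig_id');    # input from the external db

    my $runnable = My::BlastRunnable->new(-query => $seq);
    my @features = $runnable->run;                 # output parsed into objects

    $db->store(@features);                         # results go straight back
    # The pipeline's own tables only ever track job/workflow state.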

> Since I am woefully ignorant, i need to learn how i go from blast results
> to an object with features to entering this object into a database, perhaps

Actually the example pipeline we provide already does this for you. The
fact is that the results from the pipeline are already objects; we
never do anything with flatfiles. As soon as a blast run produces
output, that output is parsed into an object and the output file is
deleted. That object then gets stored to the specified database with the
specified method. In certain cases the object gets lightly converted to a
"friendlier" object for the database it is going to. For example, if you
run blast using the bioperl run module, it will produce bioperl HSP
objects; those are then converted to ensembl FeaturePair objects, which
are stored in the database.
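
Just to show that parsing step in isolation, bioperl's Bio::SearchIO is
what turns a raw blast report into HSP objects. This is a standalone
sketch of the idea (the file name is made up, and the actual FeaturePair
conversion lives in the pipeline's output modules):

    use strict;
    use Bio::SearchIO;

    # Parse a raw blast report into bioperl objects.
    my $searchio = Bio::SearchIO->new(-format => 'blast',
                                      -file   => 'blast.out');
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                # Each $hsp is where the pipeline's output module would
                # build an ensembl FeaturePair and store it via an adaptor.
                printf "%s\t%s\t%d-%d\t%s\n",
                       $result->query_name, $hit->name,
                       $hsp->start('query'), $hsp->end('query'),
                       $hsp->evalue;
            }
        }
    }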

The best thing is to start playing with the example pipeline so you can
see it actually working...

> Oh, we are not using LSF. Right now we are writing our own queuing
> system, but we are also checking out PBS. LSF is right out! (just too
> expensive for us).

Andy, did you get a quote in the last month or two? Their pricing has
dropped dramatically lately because they are losing market share, so you
should double-check; I've heard of some ridiculous deals. We do have a
module for PBS, but we've found PBS not too reliable once the cluster
size is more than a few nodes. If you write your own queuing system, it
would be very easy to write a module to make the pipeline interact with
it, and we can guide you; the batch layer is really pluggable, so it's
just one module you would need to write. By the way, Jim Kent has also
been writing one of his own, called parasol, for higher throughput than
the standard ones; you might want to check with him... how large is the
cluster you are going to run on?
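
To give you an idea of how small that module is, here is a hypothetical
skeleton. The package name, the method names and the myqueue-* commands
are all made up; check the existing LSF and PBS modules for the real
interface you would need to implement.

    package Bio::Pipeline::BatchSubmission::MyQueue;   # hypothetical name
    use strict;

    sub new {
        my ($class, %args) = @_;
        return bless { queue => $args{-queue} || 'normal' }, $class;
    }

    # Hand the pipeline's runner command to your own queuing system
    # and remember the job id it returns.
    sub submit {
        my ($self, $command) = @_;
        my $job_id = `myqueue-submit -q $self->{queue} '$command'`;
        chomp $job_id;
        return $job_id;
    }

    # Let the pipeline poll a job's state (e.g. PENDING/RUNNING/DONE).
    sub status {
        my ($self, $job_id) = @_;
        my $state = `myqueue-status $job_id`;
        chomp $state;
        return $state;
    }

    1;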

Elia

********************************
* http://www.fugu-sg.org/~elia *
* tel:    +65 6874 1467        *
* mobile: +65 9030 7613        *
* fax:    +65 6779 1117        *
********************************