From elia at tll.org.sg Tue Jul 1 14:42:20 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Tue Jul 1 01:42:04 2003
Subject: [Bioperl-pipeline] Re: BioPipe
In-Reply-To: <52805F78-AAB8-11D7-928A-000A957702FE@tll.org.sg>
Message-ID:

> If the thread problem cannot be solved in perl, you cannot make the
> pipeline manager server program accept multiple connections either.

But you can still use 'fork' and multiple processes, right? We used to
use it 4 years ago without a problem in Ensembl....

> My suggestion is to let the pipeline managers run as processes, each
> with one database.

I doubt it will work sensibly in the long term; I think it makes no
sense to populate the servers with tons of MySQL databases to store
multiple pipelines. And as I mentioned, I think we ought to have
pipeline ids on the jobs, files, tables, etc.

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007

From elia at tll.org.sg Tue Jul 1 16:11:29 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Tue Jul 1 03:11:12 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <34801.128.192.15.158.1056998159.squirrel@sgx3.bmb.uga.edu>
Message-ID: <3CC9F63C-AB93-11D7-8BB0-000A95767E46@tll.org.sg>

Hi Jeremy,

we are currently having an internal discussion about this; we are
actually trying to work towards a new multi-pipeline system, where one
database could contain multiple pipelines. Also, files relating to
jobs would have pipeline ids, etc., and finally the web manager would
track multiple pipelines. This is at the discussion stage at the
moment, though Juguang and Aaron over here seem set to work on it
soon.

> One other note: with our setup, reading/writing from/to an nfs
> directory during a blast analysis is very io bound.

Absolutely. To achieve the best performance you need:

1-Blast database local to the node, with the best possible read speed
(in our case with 2 mirrored local hard disks)

2-Write STDOUT and STDERR to the local node, read the results from
there and finally store them in the database (no need to copy anything
anywhere)

The only current caveat with point 2 is that if a job fails, the error
file stays there, and there is no simple way to track which node a job
is running on. We are about to change the database schema and the code
to make sure we keep track of the node id that a job is running on
after it is submitted.

> then copied back to the nfs mounted directory the analysis was
> started in

If you are using a database (e.g. BioSQL or Ensembl) to store your
blast results, you don't even need this last step; you just parse the
file locally and then write the results back to the db.

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007
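As a rough illustration of the local-I/O approach in point 2, the
sketch below runs BLAST with its output and error files on the local
node, parses the report there with bioperl's Bio::SearchIO, and only
then touches the central database. The paths, BLAST parameters and the
store_hit() stub are hypothetical placeholders, not BioPipe code.

    #!/usr/bin/perl -w
    # Sketch: run BLAST with all output on the local node, parse it there,
    # and write only the parsed results back to a central database.
    # Paths, BLAST parameters and store_hit() are placeholders.
    use strict;
    use Bio::SearchIO;

    my $query   = shift @ARGV;                 # query sequence file (local copy)
    my $db      = '/data/blastdb/est_human';   # BLAST database on the local disk
    my $outfile = "/tmp/blast.$$.out";         # per-process file on the local node

    # Run BLAST, keeping STDOUT and STDERR on the local node.
    system("blastall -p blastn -d $db -i $query -o $outfile 2>/tmp/blast.$$.err") == 0
        or die "blastall failed: $?";

    # Parse the report locally with Bio::SearchIO.
    my $searchio = Bio::SearchIO->new(-format => 'blast', -file => $outfile);
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            # store_hit() stands in for whatever writes to BioSQL/Ensembl.
            store_hit($result->query_name, $hit->name, $hit->significance);
        }
    }

    # Nothing needs to be copied back over NFS; clean up the local files.
    unlink $outfile, "/tmp/blast.$$.err";

    # Placeholder for the real database write (e.g. via BioSQL or Ensembl).
    sub store_hit {
        my ($query, $hit_name, $evalue) = @_;
        print "$query\t$hit_name\t$evalue\n";
    }

In the pipeline itself the program input and output are handled by the
bioperl-run wrappers rather than a raw system() call, as Shawn
describes below; the sketch only shows why nothing needs copying back
over NFS when the results go straight into a database.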
From jeremyp at sgx3.bmb.uga.edu Tue Jul 1 13:21:55 2003
From: jeremyp at sgx3.bmb.uga.edu (jeremyp@sgx3.bmb.uga.edu)
Date: Tue Jul 1 12:21:32 2003
Subject: [Bioperl-pipeline] multiple pipelines
Message-ID: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>

Hi,

> ...finally the web manager would track multiple pipelines. This is at
> the discussion stage at the moment, though Juguang and Aaron over
> here seem set to work on it soon.

Yes, these ideas seem great. I would personally put in a vote for at
least having pure CGI as an option (as opposed to only having a
Java-based client, for example).

But, for now, how safe is it to run two pipelines at once? Especially,
has anyone done any workarounds to allow the PipelineManager to write
to different tmp directories? (If not, I will do something very simple
to keep the execution scripts separate.)

> Absolutely. To achieve the best performance you need:
>
> 1-Blast database local to the node, with the best possible read speed
> (in our case with 2 mirrored local hard disks)

I don't know if you have any numbers or not, but I wonder what the
approximate percent speed gain is from doing this... any idea? That is
obviously a very aggressive setup... the type of setup I would expect
on a heavily used/publicly accessible resource.

> 2-Write STDOUT and STDERR to the local node, read the results from
> there and finally store them in the database (no need to copy
> anything anywhere)
>
> The only current caveat with point 2 is that if a job fails, the
> error file stays there...

So, is doing this included in the current code? I didn't notice
this... or is it not there due to the problem you mentioned?

Actually, initially, I was doing basically this. I set NFSTMP_DIR to
/tmp, which is local on each machine. But I had to stop doing that
when the pipeline started making subdirectories in NFSTMP_DIR. I think
the pbs software was automatically copying (scp) the output to /tmp on
the master node... I'm not exactly sure how that was working, though.

Jeremy

From shawnh at fugu-sg.org Wed Jul 2 02:51:21 2003
From: shawnh at fugu-sg.org (Shawn Hoon)
Date: Tue Jul 1 13:50:39 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>
Message-ID: <4C699380-AC27-11D7-A41A-000A95783436@fugu-sg.org>

> But, for now, how safe is it to run two pipelines at once?
> Especially, has anyone done any workarounds to allow the
> PipelineManager to write to different tmp directories? (If not, I
> will do something very simple to keep the execution scripts
> separate.)

Not a problem running two pipelines if you have 2 pipeline databases
and run PipelineManager twice. Temp files written to the same tmp
directories should not pose a problem, as the pipeline database id is
part of their names, so no conflicts should occur.

>> Absolutely. To achieve the best performance you need:
>>
>> 1-Blast database local to the node, with the best possible read
>> speed (in our case with 2 mirrored local hard disks)
>
> I don't know if you have any numbers or not, but I wonder what the
> approximate percent speed gain is from doing this... any idea? That
> is obviously a very aggressive setup... the type of setup I would
> expect on a heavily used/publicly accessible resource.

Something for Chen Peng to answer...

>> 2-Write STDOUT and STDERR to the local node, read the results from
>> there and finally store them in the database (no need to copy
>> anything anywhere)
>>
>> The only current caveat with point 2 is that if a job fails, the
>> error file stays there...
>
> So, is doing this included in the current code? I didn't notice
> this... or is it not there due to the problem you mentioned?

The current way I do things is to have STDOUT and STDERR written to
NFSTMP_DIR. These are the pipeline log files, which are more
convenient for me to access, and I haven't had massive problems so
far. By default, the data input and output of programs are handled by
bioperl-run, and these should be written on the local node. Jobs are
run locally, and the wrapper modules write their files to the local
temp directory, as should be the case: usually /tmp (on the local
node) or whatever your tempdir environment variable is set to.

Results are parsed locally and the objects written to the database. If
you are writing to files, then the files should get copied to some NFS
mounted result directory. This location is usually set in the runnable
parameters.

> Actually, initially, I was doing basically this. I set NFSTMP_DIR to
> /tmp, which is local on each machine. But I had to stop doing that
> when the pipeline started making subdirectories in NFSTMP_DIR. I
> think the pbs software was automatically copying (scp) the output to
> /tmp on the master node... I'm not exactly sure how that was
> working, though.

We don't have PBS installed, so I'm not sure about this. But the
subdirectories were created so that the multitude of log files do not
pile up in a single directory, which would make them hard to access.
Splitting the files among subdirectories lessens the load in that
sense.

shawn
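To make the naming scheme Shawn mentions concrete, here is a minimal
sketch of how per-job log paths could embed the pipeline database name
and job id, with per-database subdirectories, so that two pipelines
can share NFSTMP_DIR without colliding. The layout and names are
illustrative only, not the actual BioPipe code.

    #!/usr/bin/perl -w
    # Sketch: build a per-job log path under NFSTMP_DIR that embeds the
    # pipeline database name and job id, so concurrent pipelines never
    # collide. Illustrative only, not the actual BioPipe scheme.
    use strict;
    use File::Path qw(mkpath);
    use File::Spec;

    sub job_log_path {
        my ($dbname, $job_id) = @_;
        my $root = $ENV{NFSTMP_DIR} || '/tmp';

        # One subdirectory per pipeline database, plus a bucket per 100
        # jobs, so thousands of log files do not pile up in one directory.
        my $bucket = int($job_id / 100);
        my $dir    = File::Spec->catdir($root, $dbname, $bucket);
        mkpath($dir) unless -d $dir;

        return File::Spec->catfile($dir, "$dbname.$job_id");
    }

    # Two pipelines sharing the same NFSTMP_DIR get distinct paths:
    print job_log_path('pipeline_human', 42), ".out\n";
    print job_log_path('pipeline_fugu',  42), ".out\n";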
From elia at tll.org.sg Wed Jul 2 13:29:21 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Wed Jul 2 00:29:01 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>
Message-ID:

> Yes, these ideas seem great. I would personally put in a vote for at
> least having pure CGI as an option (as opposed to only having a
> Java-based client, for example).

You hit home ground with this, since we are a strong Perl shop. We are
thinking of going SOAP/XML for the actual protocol to ship data
between client and server, and then we can implement CGI/Perl
shell/applet clients on top of that.

> But, for now, how safe is it to run two pipelines at once?

See Shawn's answer; they will not get mixed up, no worries. We
currently do it quite often.

> I don't know if you have any numbers or not, but I wonder what the
> approximate percent speed gain is from doing this... any idea? That
> is obviously a very aggressive setup... the type of setup I would
> expect on a heavily used/publicly accessible resource.

I can't give you numbers off the top of my head, but basically you are
killing BLAST if you read the database from a remote NFS-mounted
location.

About the aggressive setup... it is high-performance, no doubt, but
not particularly expensive. You can buy pretty cheap processors, buy
EIDE disks rather than SCSI, and achieve this "aggressive" setup,
while newbies often dish out money on 3 GHz Xeons and then put in a
single, tiny SCSI hard disk. Ensembl, for example, runs on blades with
800 MHz Celerons and mirrored EIDE disks: cheap, with fantastic
performance.

Of course, distributing and mirroring databases only makes sense if
you have at least a few nodes, though one could argue that you should
do it as soon as you have more than 2 processors accessing that data.
The other option is a dedicated SAN or NAS, though it will never beat
local access (which is still the cheapest solution).

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007
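As a rough idea of what the SOAP/XML layer Elia mentions could look
like, here is a SOAP::Lite sketch of a thin CGI endpoint plus a Perl
client. The PipelineService name, the URL and the submit_job() method
are hypothetical, not an existing BioPipe interface; a CGI front end,
Perl shell or applet could all talk to the same endpoint.

    #!/usr/bin/perl -w
    # pipeline-soap.cgi -- hypothetical SOAP endpoint for the pipeline manager
    use strict;
    use SOAP::Transport::HTTP;

    # The service class; a real version would talk to the pipeline database.
    package PipelineService;

    sub submit_job {
        my ($class, $pipeline_id, $analysis) = @_;
        # Placeholder: a real implementation would create a job for this
        # pipeline id in the corresponding pipeline database.
        return "queued $analysis for pipeline $pipeline_id";
    }

    package main;

    # Hand incoming SOAP-over-HTTP requests to the PipelineService package.
    SOAP::Transport::HTTP::CGI
        ->dispatch_to('PipelineService')
        ->handle;

    #!/usr/bin/perl -w
    # client.pl -- hypothetical Perl client for the endpoint above
    # (separate file); a CGI front end or shell wrapper could do the same.
    use strict;
    use SOAP::Lite;

    my $som = SOAP::Lite
        ->uri('urn:PipelineService')
        ->proxy('http://pipeline.example.org/cgi-bin/pipeline-soap.cgi')
        ->submit_job('pipeline_human', 'blast_vs_est');

    die $som->faultstring, "\n" if $som->fault;
    print $som->result, "\n";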