From elia at tll.org.sg Tue Jul 1 14:42:20 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Tue Jul 1 01:42:04 2003
Subject: [Bioperl-pipeline] Re: BioPipe
In-Reply-To: <52805F78-AAB8-11D7-928A-000A957702FE@tll.org.sg>
Message-ID:

> If the thread problem cannot be solved in perl, you cannot make the
> pipeline manager server program accept multiple connections either.

But you can still use 'fork' and multiple processes, right? We used to
use it 4 years ago without a problem in Ensembl....

> My suggestion is to let the pipeline managers run as processes, each
> with one database.

I doubt it will work sensibly in the long term; I think it makes no
sense to populate the servers with tons of MySQL databases to store
multiple pipelines. And as I mentioned, I think we ought to have
pipeline ids on the jobs, files, tables, etc.

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007

From elia at tll.org.sg Tue Jul 1 16:11:29 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Tue Jul 1 03:11:12 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <34801.128.192.15.158.1056998159.squirrel@sgx3.bmb.uga.edu>
Message-ID: <3CC9F63C-AB93-11D7-8BB0-000A95767E46@tll.org.sg>

Hi Jeremy,

we are currently having an internal discussion about this; we are
actually trying to work towards a new multi-pipeline system, where one
database could contain multiple pipelines. Also, files relating to
jobs would have pipeline ids, etc., and finally the web manager would
track multiple pipelines. This is at the discussion stage at the
moment, though Juguang and Aaron over here seem set to work on it
soon.

> One other note: with our setup, reading/writing from/to an nfs
> directory during a blast analysis is very io bound.

Absolutely. To achieve the best performance you need:

1-Blast database local to the node, with the best possible read speed
(in our case with 2 mirrored local hard disks)

2-Write STDOUT and STDERR to the local node, read the results from
there and finally store them in the database (no need to copy anything
anywhere)

The only current caveat with point 2 is that if a job fails, the error
file stays there, and there is no simple way to track which node a job
is running on. We are about to change the database schema and the code
to make sure we keep track of the node id that a job is running on
after it is submitted.

> then copied back to the nfs mounted directory the analysis was
> started in

If you are using a database (e.g. BioSQL or Ensembl) to store your
blast results, you don't even need this last step; you just parse the
file locally and then write the results back to the db.

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007
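As a rough illustration of the local-I/O approach in point 2, the
sketch below runs BLAST with its output and error files on the local
node, parses the report there with bioperl's Bio::SearchIO, and only
then touches the central database. The paths, BLAST parameters and the
store_hit() stub are hypothetical placeholders, not BioPipe code.

    #!/usr/bin/perl -w
    # Sketch: run BLAST with all output on the local node, parse it there,
    # and write only the parsed results back to a central database.
    # Paths, BLAST parameters and store_hit() are placeholders.
    use strict;
    use Bio::SearchIO;

    my $query   = shift @ARGV;                 # query sequence file (local copy)
    my $db      = '/data/blastdb/est_human';   # BLAST database on the local disk
    my $outfile = "/tmp/blast.$$.out";         # per-process file on the local node

    # Run BLAST, keeping STDOUT and STDERR on the local node.
    system("blastall -p blastn -d $db -i $query -o $outfile 2>/tmp/blast.$$.err") == 0
        or die "blastall failed: $?";

    # Parse the report locally with Bio::SearchIO.
    my $searchio = Bio::SearchIO->new(-format => 'blast', -file => $outfile);
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            # store_hit() stands in for whatever writes to BioSQL/Ensembl.
            store_hit($result->query_name, $hit->name, $hit->significance);
        }
    }

    # Nothing needs to be copied back over NFS; clean up the local files.
    unlink $outfile, "/tmp/blast.$$.err";

    # Placeholder for the real database write (e.g. via BioSQL or Ensembl).
    sub store_hit {
        my ($query, $hit_name, $evalue) = @_;
        print "$query\t$hit_name\t$evalue\n";
    }

In the pipeline itself the program input and output are handled by the
bioperl-run wrappers rather than a raw system() call, as Shawn
describes below; the sketch only shows why nothing needs copying back
over NFS when the results go straight into a database.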
From jeremyp at sgx3.bmb.uga.edu Tue Jul 1 13:21:55 2003
From: jeremyp at sgx3.bmb.uga.edu (jeremyp@sgx3.bmb.uga.edu)
Date: Tue Jul 1 12:21:32 2003
Subject: [Bioperl-pipeline] multiple pipelines
Message-ID: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>

Hi,

> ...finally the web manager would track multiple pipelines. This is at
> the discussion stage at the moment, though Juguang and Aaron over
> here seem set to work on it soon.

Yes, these ideas seem great. I would personally put in a vote for at
least having pure CGI as an option (as opposed to only having a
Java-based client, for example).

But, for now, how safe is it to run two pipelines at once? Especially,
has anyone done any workarounds to allow the PipelineManager to write
to different tmp directories? (If not, I will do something very simple
to keep the execution scripts separate.)

> Absolutely. To achieve the best performance you need:
>
> 1-Blast database local to the node, with the best possible read speed
> (in our case with 2 mirrored local hard disks)

I don't know if you have any numbers or not, but I wonder what the
approximate percent speed gain is from doing this... any idea? That is
obviously a very aggressive setup... the type of setup I would expect
on a heavily used/publicly accessible resource.

> 2-Write STDOUT and STDERR to the local node, read the results from
> there and finally store them in the database (no need to copy
> anything anywhere)
>
> The only current caveat with point 2 is that if a job fails, the
> error file stays there...

So, is doing this included in the current code? I didn't notice
this... or is it not there due to the problem you mentioned?

Actually, initially, I was doing basically this. I set NFSTMP_DIR to
/tmp, which is local on each machine. But I had to stop doing that
when the pipeline started making subdirectories in NFSTMP_DIR. I think
the pbs software was automatically copying (scp) the output to /tmp on
the master node... I'm not exactly sure how that was working, though.

Jeremy

From shawnh at fugu-sg.org Wed Jul 2 02:51:21 2003
From: shawnh at fugu-sg.org (Shawn Hoon)
Date: Tue Jul 1 13:50:39 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>
Message-ID: <4C699380-AC27-11D7-A41A-000A95783436@fugu-sg.org>

> But, for now, how safe is it to run two pipelines at once?
> Especially, has anyone done any workarounds to allow the
> PipelineManager to write to different tmp directories? (If not, I
> will do something very simple to keep the execution scripts
> separate.)

Not a problem running two pipelines if you have 2 pipeline databases
and run PipelineManager twice. Temp files written to the same tmp
directories should not pose a problem, as the pipeline database id is
part of their names, so no conflicts should occur.

>> Absolutely. To achieve the best performance you need:
>>
>> 1-Blast database local to the node, with the best possible read
>> speed (in our case with 2 mirrored local hard disks)
>
> I don't know if you have any numbers or not, but I wonder what the
> approximate percent speed gain is from doing this... any idea? That
> is obviously a very aggressive setup... the type of setup I would
> expect on a heavily used/publicly accessible resource.

Something for Chen Peng to answer...

>> 2-Write STDOUT and STDERR to the local node, read the results from
>> there and finally store them in the database (no need to copy
>> anything anywhere)
>>
>> The only current caveat with point 2 is that if a job fails, the
>> error file stays there...
>
> So, is doing this included in the current code? I didn't notice
> this... or is it not there due to the problem you mentioned?

The current way I do things is to have STDOUT and STDERR written to
NFSTMP_DIR. These are the pipeline log files, which are more
convenient for me to access, and I haven't had massive problems so
far. By default, the data input and output of programs are handled by
bioperl-run, and these should be written on the local node. Jobs are
run locally, and the wrapper modules write their files to the local
temp directory, as should be the case: usually /tmp (on the local
node) or whatever your tempdir environment variable is set to.

Results are parsed locally and the objects written to the database. If
you are writing to files, then the files should get copied to some NFS
mounted result directory. This location is usually set in the runnable
parameters.

> Actually, initially, I was doing basically this. I set NFSTMP_DIR to
> /tmp, which is local on each machine. But I had to stop doing that
> when the pipeline started making subdirectories in NFSTMP_DIR. I
> think the pbs software was automatically copying (scp) the output to
> /tmp on the master node... I'm not exactly sure how that was
> working, though.

We don't have PBS installed, so I'm not sure about this. But the
subdirectories were created so that the multitude of log files do not
pile up in a single directory, which would make them hard to access.
Splitting the files among subdirectories lessens the load in that
sense.

shawn
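To make the naming scheme Shawn mentions concrete, here is a minimal
sketch of how per-job log paths could embed the pipeline database name
and job id, with per-database subdirectories, so that two pipelines
can share NFSTMP_DIR without colliding. The layout and names are
illustrative only, not the actual BioPipe code.

    #!/usr/bin/perl -w
    # Sketch: build a per-job log path under NFSTMP_DIR that embeds the
    # pipeline database name and job id, so concurrent pipelines never
    # collide. Illustrative only, not the actual BioPipe scheme.
    use strict;
    use File::Path qw(mkpath);
    use File::Spec;

    sub job_log_path {
        my ($dbname, $job_id) = @_;
        my $root = $ENV{NFSTMP_DIR} || '/tmp';

        # One subdirectory per pipeline database, plus a bucket per 100
        # jobs, so thousands of log files do not pile up in one directory.
        my $bucket = int($job_id / 100);
        my $dir    = File::Spec->catdir($root, $dbname, $bucket);
        mkpath($dir) unless -d $dir;

        return File::Spec->catfile($dir, "$dbname.$job_id");
    }

    # Two pipelines sharing the same NFSTMP_DIR get distinct paths:
    print job_log_path('pipeline_human', 42), ".out\n";
    print job_log_path('pipeline_fugu',  42), ".out\n";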
From elia at tll.org.sg Wed Jul 2 13:29:21 2003
From: elia at tll.org.sg (Elia Stupka)
Date: Wed Jul 2 00:29:01 2003
Subject: [Bioperl-pipeline] multiple pipelines
In-Reply-To: <33517.128.192.15.158.1057076515.squirrel@sgx3.bmb.uga.edu>
Message-ID:

> Yes, these ideas seem great. I would personally put in a vote for at
> least having pure CGI as an option (as opposed to only having a
> Java-based client, for example).

You hit home ground with this, since we are a strong Perl shop. We are
thinking of going SOAP/XML for the actual protocol to ship data
between client and server, and then we can implement CGI/Perl
shell/applet clients on top of that.

> But, for now, how safe is it to run two pipelines at once?

See Shawn's answer; they will not get mixed up, no worries. We
currently do it quite often.

> I don't know if you have any numbers or not, but I wonder what the
> approximate percent speed gain is from doing this... any idea? That
> is obviously a very aggressive setup... the type of setup I would
> expect on a heavily used/publicly accessible resource.

I can't give you numbers off the top of my head, but basically you are
killing BLAST if you read the database from a remote NFS-mounted
location.

About the aggressive setup... it is high-performance, no doubt, but
not particularly expensive. You can buy pretty cheap processors, buy
EIDE disks rather than SCSI, and achieve this "aggressive" setup,
while newbies often dish out money on 3 GHz Xeons and then put in a
single, tiny SCSI hard disk. Ensembl, for example, runs on blades with
800 MHz Celerons and mirrored EIDE disks: cheap, with fantastic
performance.

Of course, distributing and mirroring databases only makes sense if
you have at least a few nodes, though one could argue that you should
do it as soon as you have more than 2 processors accessing that data.
The other option is a dedicated SAN or NAS, though it will never beat
local access (which is still the cheapest solution).

Elia

---
Bioinformatics Program Manager
Temasek Life Sciences Laboratory
1, Research Link
Singapore 117604
Tel. +65 6874 4945
Fax. +65 6872 7007
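As a rough idea of what the SOAP/XML layer Elia mentions could look
like, here is a SOAP::Lite sketch of a thin CGI endpoint plus a Perl
client. The PipelineService name, the URL and the submit_job() method
are hypothetical, not an existing BioPipe interface; a CGI front end,
Perl shell or applet could all talk to the same endpoint.

    #!/usr/bin/perl -w
    # pipeline-soap.cgi -- hypothetical SOAP endpoint for the pipeline manager
    use strict;
    use SOAP::Transport::HTTP;

    # The service class; a real version would talk to the pipeline database.
    package PipelineService;

    sub submit_job {
        my ($class, $pipeline_id, $analysis) = @_;
        # Placeholder: a real implementation would create a job for this
        # pipeline id in the corresponding pipeline database.
        return "queued $analysis for pipeline $pipeline_id";
    }

    package main;

    # Hand incoming SOAP-over-HTTP requests to the PipelineService package.
    SOAP::Transport::HTTP::CGI
        ->dispatch_to('PipelineService')
        ->handle;

    #!/usr/bin/perl -w
    # client.pl -- hypothetical Perl client for the endpoint above
    # (separate file); a CGI front end or shell wrapper could do the same.
    use strict;
    use SOAP::Lite;

    my $som = SOAP::Lite
        ->uri('urn:PipelineService')
        ->proxy('http://pipeline.example.org/cgi-bin/pipeline-soap.cgi')
        ->submit_job('pipeline_human', 'blast_vs_est');

    die $som->faultstring, "\n" if $som->fault;
    print $som->result, "\n";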