[Bioperl-pipeline] xml dir housekeeping

Juguang Xiao juguang at fugu-sg.org
Thu Jan 30 11:19:38 EST 2003


> > On writing pipeline tests, pipeline tests should  be file based, 
> > meaning it doesn't assume the availability of biosql or ensembl or 
> > other schema. Also no hardcoding.  Within the xml, one can add the 
> > various adaptors and stuff but commented out for testing purposes.
> 
> Juguang, if any of this is not clear please let us know, so that we 
> make sure any of your XMLs are up to speed.

Hi Shawn, Elia and Kiran,

Sure, I agree with the idea of a dev directory for the XML. However, I am thinking about pipeline tests that use EnsEMBL. As you know, the converter code so far is designed to convert objects between BioPerl and EnsEMBL, so that the pipeline can make full use of large data sets rather than flat files. The question this raises is whether we need to test a converter instance by running a pipeline OR without the pipeline. In the case of the EnsEMBL-series converters, the test does not make much sense unless the results are actually stored in a db. Right?
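For the "without pipeline" case, the simplest thing I can think of is a small throwaway script that runs the converter against a test db and then looks at the target table directly. The sketch below only shows the checking part; the db name, the passwordless root login and the repeat_feature table are just examples, so use whatever table your converter actually writes to.

############

#!/usr/bin/perl -w
use strict;
use DBI;

my $dbname = shift || 'juguang_ens_test';    # throwaway test database (example name)
my $dbh = DBI->connect("DBI:mysql:database=$dbname;host=localhost",
                       'root', '', { RaiseError => 1 });

# ... run the converter here and let it store its output into $dbname ...

# Counting rows in the table the converter writes to tells us whether the
# store really happened; without a db there is nothing to check against.
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM repeat_feature');
print "repeat_feature rows after conversion: $count\n";
$dbh->disconnect;

#############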

I have an idea about running the pipeline against a db. It may be a lazy man's idea :) but it comes from my experience developing converters. Each time, BEFORE running a test that stores the converter's results into EnsEMBL, I usually do the following steps:

1) Make sure that we have EnsEMBL in the environment. Of course, we in the fugu team have it on the pulse machine; others can make use of kaka.sanger.ac.uk.
2) Prepare a set of data for testing. I wrote a shell script to create a new test database and copy a small set of data from the EnsEMBL db:

############

#!/bin/sh
# Takes the name of the test database to (re)create as its only argument.

# Dump the schema only (-d), then a small slice of data (-t = data only):
# contigs and dna with dna_id < 1000, and the full analysis table.
mysqldump -u root -d homo_sapiens_core_9_30 > ens_core_9_30.sql
mysqldump -u root -t -w 'dna_id<1000' homo_sapiens_core_9_30 contig > ens_homo_core_9_30.contig.sql
mysqldump -u root -t -w 'dna_id<1000' homo_sapiens_core_9_30 dna > ens_homo_core_9_30.dna.sql
mysqldump -u root -t homo_sapiens_core_9_30 analysis > ens_homo_core_9_30.analysis.sql

# Re-create the test database and load the schema, the data slice,
# and our own analysis records from my_analysis.sql.
mysqladmin -u root drop $1
mysqladmin -u root create $1

mysql -u root $1 < ens_core_9_30.sql
mysql -u root $1 < ens_homo_core_9_30.contig.sql
mysql -u root $1 < ens_homo_core_9_30.dna.sql
mysql -u root $1 < ens_homo_core_9_30.analysis.sql
mysql -u root $1 < my_analysis.sql

rm ens_core_9_30.sql
rm ens_homo_core_9_30.contig.sql
rm ens_homo_core_9_30.dna.sql
rm ens_homo_core_9_30.analysis.sql

#############

3) Add analysis records into the test db. The EnsEMBL analysis table is populated for the EnsEMBL compute environment, so its entries may not be directly usable on our server; what my_analysis.sql above loads are records for the analyses we actually use.
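Just to make step 3 concrete, here is a minimal sketch of creating one such record with DBI. The database name, the passwordless root login and the column list are all assumptions (the real analysis table of that release has more columns), so treat it as illustration only.

############

#!/usr/bin/perl -w
use strict;
use DBI;

# Name of the freshly created test database (example name).
my $dbname = shift || 'juguang_ens_test';
my $dbh = DBI->connect("DBI:mysql:database=$dbname;host=localhost",
                       'root', '', { RaiseError => 1 });

# Register an analysis that we actually run on our own server.  Only a few
# columns of the EnsEMBL analysis table are filled in here; adjust to the
# real schema and to the programs installed locally.
$dbh->do(q{
    INSERT INTO analysis (logic_name, program, module)
    VALUES ('RepeatMask', 'RepeatMasker', 'RepeatMasker')
});

$dbh->disconnect;

#############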

With these prerequisites in place, I can test my converter instance on the pipeline. I am wondering whether we could write an analysis that does the above preparation work from within the pipeline itself.
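Just to illustrate what I mean, here is a rough sketch of such a preparation step as a plain Perl module. DataSetSetup is a made-up name, it does not follow the real runnable interface of the pipeline, and it simply shells out to the same mysqldump/mysql commands as the script above.

############

package DataSetSetup;    # hypothetical name, not an existing pipeline module
use strict;

sub new {
    my ($class, %args) = @_;
    return bless { src  => $args{-source_db},     # e.g. homo_sapiens_core_9_30
                   test => $args{-test_db},       # test database to (re)create
                   user => $args{-user} || 'root' }, $class;
}

# Re-create the test database and load the schema plus a small data slice,
# mirroring the shell script shown earlier.
sub run {
    my ($self) = @_;
    my ($src, $test, $user) = @{$self}{qw(src test user)};

    system("mysqldump -u $user -d $src > schema.sql") == 0
        or die "schema dump failed";
    system("mysqldump -u $user -t -w 'dna_id<1000' $src contig dna > data.sql") == 0
        or die "data dump failed";

    system("mysqladmin -u $user -f drop $test");
    system("mysqladmin -u $user create $test") == 0 or die "create failed";
    system("mysql -u $user $test < schema.sql") == 0 or die "schema load failed";
    system("mysql -u $user $test < data.sql") == 0   or die "data load failed";

    unlink 'schema.sql', 'data.sql';
    return 1;
}

1;

#############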

I find it quite hard to define the internal and external parts of the pipeline at this point. I think the internal framework of the pipeline is done. However, to meet the need to run special analyses or handle different data sources, we developed the converter subsystem, the dumper for flat files, and the input_creates for handling multiple or special inputs such as genewise with its 2 inputs. Even the runnable instances are, as I see it, an external part of the pipeline. The pipeline framework itself is finished, and now we are trying to build more and more pipeline instances for our own use or for demos. Please correct me if I am wrong.

Hence, developing a module for preparing the data set seems a reasonable requirement. I think more individual requirements of this kind will come to the pipeline as more people start using it.
