fastA and BLAST under EMBOSS, the DB problem

Guy Bottu gbottu at ben.vub.ac.be
Mon Oct 27 16:23:11 UTC 2003


from : BEN

	Dear colleagues,
	
Recently the discussion about integrating BLAST and fastA under EMBOSS has 
flared up again. At the BEN site we have BLAST, fastA and some other programs 
installed under an EMBOSS "wrapper" program. In their present form these 
programs are however not readily installable elsewhere. When a year ago we 
dropped GCG I rushed to replace the lost functionality for our users and did 
not consider portability. I think it would be a good idea to distribute our 
BLAST and fastA wrappers as an "Embassadir". Therefore I would like to 
co-operate with the development team of EMBOSS as well as with the developers 
of GUIs to see what is necessary.

The most serious problem is databank access. The fastA programs need a databank 
in fastA format. If we choose the option of editing the original code there
must be a way to retrieve the sequences and feed them serially to the
algorithm. If we however choose the option of making a "wrapper" (like
emma), as I did, the question is how to provide the sequences. Saving
the complete EMBL databank in /tmp/xxx each time you want to search it
is evidently not convenient. Therefore my idea was to access directly
already installed databanks in fastA format.
In the current version of our wrappers the names of the available
databanks are written in the ACD files. This is of course not convenient
at all. A way of getting around this hurdle is, I think, the following :
the local manager must anyway have the "native" fastA installed and
properly configured. The available databanks must be listed in a configuration
file. The name of the file is pointed at with the environment variable
FASTLIBS or is put on the command line as -l xxx . The file contains
lines of type :
EMBL (general nucleic acid databank)$1+em+@/dbfb/fastalibs/embl.lib
SwissProt (general protein databank)$0+sw+/dbfb/swissprot/sw
So each line contains three things : a text to be displayed by the
interface, a short name (em) to be put on the command line, and the actual
location of the databank (single file or "library").
Evidently EMBOSS could parse this file. Maybe a new ACD object of type
"fastalib" could be implemented. Then the ACD files would need to
contain just :
fastalib: fastalib [
  required: "@....
  type: "@....nucleotide.....protein.... 
]
The local manager would have to put in emboss.defaults a line of type
FASTADB [ file: configfile_xxx ]
or EMBOSS could just read the environment variable FASTLIBS.
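
To illustrate the parsing step, here is a minimal Python sketch that splits
FASTLIBS lines of the type shown above into their parts. It is not taken from
the existing wrappers. The names FastaLib, parse_fastlibs_line and
read_fastlibs are hypothetical. Reading the digit after the "$" as 0 = protein
and 1 = nucleic acid, and a leading "@" as marking a "library" file rather
than a single sequence file, are assumptions based on the two example lines.

```python
import os
from typing import List, NamedTuple, Optional


class FastaLib(NamedTuple):
    """One databank entry from a FASTLIBS file (hypothetical structure)."""
    description: str     # text to be displayed by the interface
    is_nucleotide: bool  # assumption: flag "1" = nucleic acid, "0" = protein
    short_name: str      # name to be put on the command line, e.g. "em"
    path: str            # location of the databank
    is_library: bool     # assumption: a leading "@" marks a "library" file


def parse_fastlibs_line(line: str) -> Optional[FastaLib]:
    """Parse a line like 'Description$1+em+@/path'; return None if malformed."""
    description, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 2 or rest[0] not in "01":
        return None
    type_flag, rest = rest[0], rest[1:]
    if rest.startswith("+"):
        # multi-letter short name enclosed in '+' signs: +em+
        short_name, sep2, location = rest[1:].partition("+")
        if not sep2:
            return None
    else:
        # assumption: otherwise the short name is a single character
        short_name, location = rest[0], rest[1:]
    is_library = location.startswith("@")
    return FastaLib(description, type_flag == "1", short_name,
                    location.lstrip("@"), is_library)


def read_fastlibs(filename: Optional[str] = None) -> List[FastaLib]:
    """Read the file given explicitly or via the FASTLIBS environment variable."""
    filename = filename or os.environ["FASTLIBS"]
    with open(filename) as handle:
        entries = (parse_fastlibs_line(line) for line in handle)
        return [entry for entry in entries if entry is not None]
```

A wrapper could call read_fastlibs() once at start-up and offer the
description fields as menu choices, passing the matching short name to the
fastA program.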

For BLAST you need databanks in BLAST format, so editing the original code
will not circumvent the problem. A similar approach is however possible,
provided all the databanks are put in a single directory (pointed at by the 
environment variable BLASTDB). A problem is how to obtain a long text with the databank description 
to be displayed in the interface. If all the databanks have a *.nal or *.pal 
file the TITLE attribute could be used, although letting EMBOSS parse all those 
files each time you start the program is maybe not a good idea.
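
As a rough sketch of the TITLE idea, the Python below scans one directory for
*.nal and *.pal files and collects their titles. The function names are
hypothetical. It assumes each alias file carries its description on a line
starting with the keyword TITLE, and it falls back to the file's base name
when no such line is found.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_alias_title(alias_file: Path) -> Optional[str]:
    """Return the text after the TITLE keyword, or None if there is no TITLE line."""
    with alias_file.open(errors="replace") as handle:
        for line in handle:
            parts = line.strip().split(None, 1)
            if parts and parts[0] == "TITLE":
                return parts[1] if len(parts) > 1 else ""
    return None


def list_blast_databanks(blastdb_dir: Optional[str] = None) -> Dict[str, str]:
    """Map each alias file name in the directory to a displayable description."""
    # Fall back to the BLASTDB environment variable, then the current directory.
    directory = Path(blastdb_dir or os.environ.get("BLASTDB", "."))
    titles = {}
    for alias in sorted(directory.iterdir()):
        if alias.suffix in (".nal", ".pal"):
            title = read_alias_title(alias)
            # Use the base name when the TITLE line is missing or empty.
            titles[alias.name] = title if title else alias.stem
    return titles
```

Since this reads every alias file on each call, it has exactly the start-up
cost mentioned above.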

What do you think about it ?

	Regards,
	Guy Bottu



