Pre first draft of admin guide. (fwd)
David Martin
david.martin at biotek.uio.no
Wed Aug 2 14:50:24 UTC 2000
And the file is here as an attachment.
..d
---------------------------------------------------------------------
* Dr. David Martin Biotechnology Centre of Oslo *
* Node Manager Gaustadalleen 21 *
* The Norwegian EMBNet Node P.O. box 1125 Blindern *
* tel +47 22 95 87 56 N-0317 Oslo *
* fax +47 22 69 41 30 Norway *
---------------------------------------------------------------------
---------- Forwarded message ----------
Date: Wed, 2 Aug 2000 15:49:18 +0100
From: David Martin <damartin at ulrik.uio.no>
Reply-To: admin at embnet.uio.no
To: emboss-dev at embnet.org
Subject: Pre first draft of admin guide.
OK it is in raw text form. I'll mark it up for LaTeX soon but here it is
for your delectation and delight.
The major sticking points at the moment are Database Indexing, especially
DBIBLAST but there are unresolved issues with DBIFLAT and FASTA files and
DBIGCG (because it loops until armageddon in the form of SIGDIEDIEDIE) so
I haven't been able to test it properly.
Comments are welcome. I'm hoping it can be pretty much a recipe book for
EMBOSS setup.
With a bit of standardising of macros, it should be possible to dump out
the program docs as LaTeX and incorporate those too. I'll look at marking
up the quick guide, and then with Val's tutorial and Thon's ACD guide we
are approaching a reasonable manual for EMBOSS.
Maybe I should create a small EMBOSS logo in LaTeX like
    EMB    that would slot into the text at about the right height.
    OSS
..d
-------------- next part --------------
The EMBOSS Administrators Guide
What is EMBOSS?
Where do I get it?
Installation
Configuration
Databases
Database access
Indexing and configuring flatfile databases
Indexing and configuring GCG format databases
Indexing and configuring BLAST databases
Configuring EMBOSS to use SRS for database lookup.
Indexing and configuring other databases
Other data
Logging
What is EMBOSS?
EMBOSS is a freely available suite of bioinformatics applications and libraries. It can be downloaded via the internet, copied, customised, and passed on under the terms of the various General Public Licenses. EMBOSS has been developed in response to the need for a powerful, adaptable suite of software that can interface readily with many different situations and meet the needs of professional bioinformaticists, particularly those needing high throughput and/or scriptable capabilities.
EMBOSS has primarily been developed by those responsible for the public extensions to the GCG package. Whilst EMBOSS duplicates much of EGCG it includes far better database interaction and has the benefit of freely accessible source code so novel applications can be developed rapidly and at minimal cost.
EMBOSS is currently only available for Unix/Linux systems, but it has been known to compile and run on Windows NT. This document will only consider the UNIX version and will assume the reader has some familiarity with UNIX system administration.
Where do I get it?
EMBOSS is available for download via FTP from the primary site at the UK EMBnet node: ftp.uk.embnet.org/pub/EMBOSS/
This directory contains the EMBOSS package and several associated packages (collectively known as EMBASSY) that are distributed with EMBOSS. Download these to a suitable location. Documentation is available at http://www.uk.embnet.org/Software/EMBOSS
Installation
Unpacking
You will have downloaded the EMBOSS and EMBASSY packages to a suitable directory. For this example we will assume you have downloaded them to /packages, so you should now have the following files (or similar), and possibly more EMBASSY packages.
EMBOSS-1.0.0.tar.gz
PHYLIP-3.573c.tar.gz
MSE-0.0.4.tar.gz
TOPO-0.1.tar.gz
First unpack the EMBOSS distribution
gunzip EMBOSS-1.0.0.tar.gz
tar xf EMBOSS-1.0.0.tar
This will create a new directory, EMBOSS-1.0.0
Enter the EMBOSS directory
cd EMBOSS-1.0.0
create a directory for the EMBASSY packages
mkdir embassy
Now copy the EMBASSY packages to the EMBASSY directory
cp ../MSE-0.0.4.tar.gz ../PHYLIP-3.573c.tar.gz ../TOPO-0.1.tar.gz embassy
Go into the EMBASSY directory and unpack those packages.
cd embassy
gunzip MSE-0.0.4.tar.gz
tar xf MSE-0.0.4.tar
And so on for each EMBASSY package.
Go back up one directory to the main EMBOSS package directory and prepare to start compilation.
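The unpacking steps above can also be scripted. The following is only a minimal Python sketch, not part of EMBOSS; the function name unpack_into is invented for illustration. It unpacks a .tar.gz archive in one step, roughly the equivalent of gunzip followed by tar xf, except that the .tar.gz file is left in place.

```python
import tarfile
from pathlib import Path


def unpack_into(archive, dest):
    """Unpack a .tar.gz archive into the directory dest.

    dest is created if it does not exist (like 'mkdir embassy').
    Returns the sorted names of the entries now present in dest.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

With the layout assumed in this example, one would call unpack_into("/packages/MSE-0.0.4.tar.gz", "/packages/EMBOSS-1.0.0/embassy") and repeat for each EMBASSY archive.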
Compilation.
Building EMBOSS is easy. It follows the usual GNU style of configure, make, make install. We'll take these steps one at a time.
Configuration
To accept the default configuration, just type ./configure and let EMBOSS get on with it. You may want to make some changes to the configuration parameters according to your local policy. This section will not cover all the possibilities, just some of the more common ones. The configuration script will attempt to find the necessary components in your system to determine how to build EMBOSS successfully. It typically expects the GNU C compiler (gcc) and several standard libraries that should already be part of your Unix/Linux system. Most modern Linux distributions should work straight out of the box.
Installation directory.
You need to have write permission on the directory in which you eventually wish to install EMBOSS. You may also wish to put it somewhere other than the standard location of /usr/local/emboss.
This is controlled by the --prefix argument. In my case I have all my applications owned by a non-privileged user and installed under /site/prog
./configure --prefix=/site/prog/emboss
will install EMBOSS under /site/prog/emboss. The binaries will be in /site/prog/emboss/bin with shared libraries in /site/prog/emboss/lib. Data will be in /site/prog/emboss/data, and the configuration files (ACD files) for the applications will be under /site/prog/emboss/share in directories corresponding to the package name.
The individual directories for installation can be modified with other configuration commands but this is usually not necessary. Run ./configure --help to get more information on the directories that can be changed and other configuration options.
Run ./configure with the options you wish to use. This may take a short while during which various messages will scroll up the screen.
Depending on your system you may need to explicitly configure the graphics. Please see the section 'Configuring EMBOSS graphics' below.
./configure --prefix=/site/prog/emboss --with-pngdriver=/site/lib
All should be well with this and configure should exit with a message like this:
creating ./config.status
creating plplot/Makefile
creating plplot/lib/Makefile
creating nucleus/Makefile
creating ajax/Makefile
creating emboss/Makefile
creating emboss/acd/Makefile
creating test/Makefile
creating test/data/Makefile
creating test/embl/Makefile
creating test/pir/Makefile
creating test/swiss/Makefile
creating test/swnew/Makefile
creating test/wormpep/Makefile
creating emboss/data/Makefile
creating emboss/data/CODONS/Makefile
creating emboss/data/REBASE/Makefile
creating emboss/data/PRINTS/Makefile
creating emboss/data/PROSITE/Makefile
creating Makefile
Configuration is now complete.
Configuring EMBOSS graphics.
The PLPLOT library can produce output to many devices but requires certain libraries that are NOT distributed with EMBOSS.
To get X-windows based output you must have X installed, otherwise PLPLOT will not build the required driver. You may need to specify the location of your X-windows library with the configuration options:
--x-includes=DIR (X include files are in DIR)
--x-libraries=DIR (X library files are in DIR)
To explicitly configure PLPLOT without X-windows, use --without-x.
To get PLPLOT to produce PNG images you will need to have the z, png and gd libraries installed. In particular, gd version >= 1.6.3 must be used.
If for some reason you do not have the required libraries and your system support group will not update them (in particular gd, as the older versions support GIF, which is NOT supported in later versions), then install the latest versions of all three (z, gd, png) to a new directory and add this new directory to your configure line for EMBOSS, i.e.
./configure --with-pngdriver=my_dir
where the z, gd and png libraries were each installed using ./configure --prefix=my_dir
You can explicitly tell EMBOSS to not include PNG support with --without-pngdriver
How to tell if ./configure has found PNG.
Watch for something like the following when running ./configure:
checking if png driver is wanted... yes
checking for inflateEnd in -lz... (cached) yes
checking for png_destroy_read_struct in -lpng... (cached) yes
checking for gdImageCreateFromPng in -lgd... (cached) yes
This means that the configuration script has located the PNG libraries on your system. If you see a message indicating that ./configure could not find the libraries or that the version of gd was too old then you should install the latest versions of the libraries yourself and rerun configure with the correct --with-pngdriver value.
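If you keep the output of ./configure in a file, the check can be automated. The sketch below is an illustration only: the function name is made up, and it matches on the wording of the sample messages shown above, which may differ between versions.

```python
def png_libraries_found(configure_output):
    """Return True if configure reported the z, png and gd checks as 'yes'.

    The check names are taken from the sample output in this guide; treat
    them as an assumption about what your version of configure prints.
    """
    found = {
        "inflateEnd in -lz": False,
        "png_destroy_read_struct in -lpng": False,
        "gdImageCreateFromPng in -lgd": False,
    }
    for line in configure_output.splitlines():
        for check in found:
            if check in line and line.rstrip().endswith("yes"):
                found[check] = True
    return all(found.values())
```

A False result means at least one of the three libraries was not reported as found, in which case the --with-pngdriver value is the first thing to look at.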
Building EMBOSS
Building EMBOSS is a matter of typing 'make' and going to find something else to do for the next ten minutes to half an hour depending on the speed of your system. EMBOSS will first build the shared libraries (PL_PLOT, AJAX, and NUCLEUS) and then build the applications.
You will see plenty of warnings complaining about libraries not being used to resolve any symbols. These can be safely ignored.
If all goes according to plan you should have built EMBOSS successfully. If not you will have to try to work out why the build failed. If you can't work it out yourself, send an email describing the problem to emboss-bug at sanger.ac.uk with a copy of the config.status and config.cache files from your EMBOSS directory. (These will tell the developers what state your system was in when compilation failed.)
I am assuming that compilation was successful. You now have to check that you have the correct access permissions for the directory in which you wish to install EMBOSS and type 'make install'. After a few minutes and many pagefuls of messages, EMBOSS should be installed where you specified.
Tidying up the environment.
You will now need to make a few adjustments to your environment to ensure that EMBOSS runs smoothly.
EMBOSS looks for certain environment variables to determine where the libraries and data are found. These instructions assume you installed EMBOSS in /site/prog/emboss. Adjust them to suit your installation.
Insert the following lines at the end of /etc/cshrc (or ~/.cshrc for a personal installation)
setenv EMBOSS_DATA /site/prog/emboss/data
setenv PLPLOT_LIB /site/prog/emboss/lib
set path=( /site/prog/emboss/bin ${path} )
Or for bash/ksh/sh users, insert the following at the end of /etc/profile or ~/.bashrc
EMBOSS_DATA=/site/prog/emboss/data
PLPLOT_LIB=/site/prog/emboss/lib
PATH=/site/prog/emboss/bin:$PATH
export EMBOSS_DATA PLPLOT_LIB PATH
EMBOSS should now be ready for use.
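A quick way to confirm the environment is to compare it against what the installation expects. This is a small sketch rather than an EMBOSS tool; the function name is invented, and it assumes the binaries live in the bin directory under the installation prefix, as described in the installation directory section.

```python
import os


def check_emboss_environment(prefix, environ=None):
    """Return a list of problems found in the EMBOSS-related environment.

    prefix is the --prefix value used at configure time. environ defaults
    to the current process environment.
    """
    environ = os.environ if environ is None else environ
    problems = []
    expected = {
        "EMBOSS_DATA": f"{prefix}/data",
        "PLPLOT_LIB": f"{prefix}/lib",
    }
    for name, value in expected.items():
        if environ.get(name) != value:
            problems.append(f"{name} should be {value}")
    # The EMBOSS programs must be reachable through PATH.
    path_dirs = environ.get("PATH", "").split(os.pathsep)
    if f"{prefix}/bin" not in path_dirs:
        problems.append(f"PATH should include {prefix}/bin")
    return problems
```

An empty list means the three settings match; anything else names the variable that needs attention.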
You can test this by trying the program 'wossname'
wossname -auto |more
This should give a long list of programs that are available. Press space to page down through the list. This is just the EMBOSS programs and doesn't include any of the EMBASSY programs.
Installing EMBASSY
As well as the base libraries and standard EMBOSS distribution, various extra packages (EMBASSY) are distributed with EMBOSS.
To install an EMBASSY package, go to the directory into which it was unpacked. For example, to install PHYLIP (which was unpacked into /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c earlier):
cd /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c
./configure --prefix=/site/prog/emboss
make
make install
NB. You MUST use the same arguments for configure that you used for the installation of the main EMBOSS package.
Repeat as necessary for the other EMBASSY packages.
You should now find that running wossname as before lists the EMBASSY programs.
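Because every EMBASSY package must be configured with the same arguments as EMBOSS itself, it can help to define those arguments once and reuse them. The following sketch only builds the list of commands; the function names are hypothetical and nothing is executed.

```python
def configure_command(prefix, pngdriver=None):
    """Build the ./configure argument list shared by EMBOSS and EMBASSY."""
    args = ["./configure", f"--prefix={prefix}"]
    if pngdriver is not None:
        args.append(f"--with-pngdriver={pngdriver}")
    return args


def build_plan(source_dirs, prefix, pngdriver=None):
    """Return (directory, command) steps for each source directory.

    Each directory gets configure, make and make install, in that order,
    and every configure step uses identical arguments.
    """
    configure = configure_command(prefix, pngdriver)
    plan = []
    for directory in source_dirs:
        for command in (configure, ["make"], ["make", "install"]):
            plan.append((directory, command))
    return plan
```

Each step could then be run with subprocess.run(command, cwd=directory, check=True), stopping at the first failure.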
Configuration
EMBOSS can be configured to match your requirements. EMBOSS looks for a configuration file in several places. Firstly it looks in /site/prog/emboss/share/EMBOSS for a file 'emboss.default'. It then looks in your home directory for the file '.embossrc' and finally in the current directory for '.embossrc'. In each case definitions will override those previously defined.
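The search order and the override rule can be pictured with a few lines of Python. This is an illustration of the behaviour described above, not EMBOSS code; the function names are invented and the definitions are modelled as plain dictionaries.

```python
import os


def config_search_order(prefix, home, cwd):
    """Configuration files in the order they are described as being read."""
    return [
        os.path.join(prefix, "share", "EMBOSS", "emboss.default"),
        os.path.join(home, ".embossrc"),
        os.path.join(cwd, ".embossrc"),
    ]


def merge_definitions(per_file_definitions):
    """Merge definitions file by file; later files override earlier ones."""
    merged = {}
    for definitions in per_file_definitions:
        merged.update(definitions)
    return merged
```

So a value set in emboss.default stays in force unless a .embossrc read later defines the same name.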
Several aspects of EMBOSS can be defined. These are:
EMBOSS environment variables
EMBOSS databases
Default behaviour of EMBOSS programs
As databases are by far the most complex of these they will be covered in a separate section.
EMBOSS environment variables
These are set with an 'env' or a 'set' declaration. 'env' and 'set' are interchangeable.
The most important environment variable is the location of the acd files that describe each program.
set emboss_acdroot /site/prog/emboss/share/EMBOSS/acd
Environment variables are useful for easing the maintenance of your emboss.default. For example you may want to specify the location of your databases as an environment variable. Then if you move the databases you only have to update one line in the configuration file.
set emboss_database_dir /data/databases/flatfiles
This would then be referred to as
$emboss_database_dir/embl
for the directory /data/databases/flatfiles/embl
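To make the substitution concrete, here is a small sketch that mimics it. It is not the EMBOSS parser: the function names are made up, and it assumes one 'set' or 'env' declaration per line with the value running to the end of the line.

```python
import re


def parse_set_lines(text):
    """Collect 'set name value' and 'env name value' lines into a dict."""
    variables = {}
    for line in text.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0] in ("set", "env"):
            variables[fields[1]] = fields[2].strip()
    return variables


def expand(value, variables):
    """Replace $name references with their values.

    Names that have not been defined are left untouched.
    """
    return re.sub(
        r"\$(\w+)",
        lambda match: variables.get(match.group(1), match.group(0)),
        value,
    )
```

Moving the databases then means changing the single 'set' line; every path written in terms of the variable follows automatically.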
Databases
Database access
EMBOSS offers three methods for accessing databases:
All: EMBOSS returns all the sequences in the database in no particular order
Query: EMBOSS retrieves a set of sequences corresponding to a wildcard query.
Single: EMBOSS retrieves a single sequence indexed by ID or accession number.
Each database definition can configure one or many of these methods for database access.
Typically EMBOSS uses the 'emblcd' system of database indexing. This comes in three variants depending on the original format of your database. The emblcd method assumes that you have both ID and accession number in each record. If you do not have both ID and accession number you will have to use an alternative method. Please see the 'other databases' section below.
General Database configuration.
Each database is configured using a DB declaration.
The generalised form is
DB databasename [
Configuration options
]
The configuration options are tag/value pairs and must contain at least a description of the access method (using method: or one or more of methodsingle:, methodquery: and methodall:) and a description of the format the sequences will be returned in (using format:).
In addition to these tags there will be other tags that are needed for particular methods and other tags that are optional.
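A definition can be checked for the two required pieces before it goes into a configuration file. The sketch below is a simplified stand-in for the real parser, with invented function names. It assumes one tag per line and will be confused by a ']' inside a value.

```python
import re


def parse_db_blocks(text):
    """Parse 'DB name [ tag: value ... ]' blocks into {name: {tag: value}}.

    Lines starting with '#' are treated as comments. Surrounding quotes
    are removed from values.
    """
    databases = {}
    for name, body in re.findall(r"DB\s+(\S+)\s*\[(.*?)\]", text, re.S):
        tags = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            tag, _, value = line.partition(":")
            tags[tag.strip()] = value.strip().strip("\"'")
        databases[name] = tags
    return databases


def missing_required(tags):
    """Return which of the required tags a definition lacks."""
    missing = []
    methods = ("method", "methodsingle", "methodquery", "methodall")
    if not any(m in tags for m in methods):
        missing.append("method")
    if "format" not in tags:
        missing.append("format")
    return missing
```

An empty list from missing_required means the definition names at least one access method and a format; it says nothing about whether the method-specific tags are right.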
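The values accepted by method: are listed below. The scope letters show which access methods each one supports: a (all), q (query) and s (single).
DIRECT (scope: a). Returns all the database entries, one after the other. It assumes no indexing.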
DB mydb [
#required parameters
method: direct
format: fasta
dir: $emboss_db_dir/mydb
file: *.dat
#optional parameters
type: N
release: 63.0
comment: "My own database with no indices"
exclude: "est*.dat"
]
SRS (scope: a q s). Returns entries from a local installation of SRS, using the -e switch to getz to return entries in the original format.
DB mydb [
#required parameters
method: srs
format: embl
app: getz
#optional parameters
dbalias: embl
type: N
comment: 'My srs indexed database'
release: '63.0'
]
SRSFASTA (scope: a q s). As SRS but returns the sequences in FASTA format.
URL (scope: s). Uses a defined web server to retrieve a specific entry. EMBOSS may fail if the HTML causes complications.
DB mydb [
# required parameters
method: url
format: genbank
url: "http://www.infobiogen.fr/srs5bin/cgi-bin/wgetz?-e+[genbank-id:%s]"
#optional parameters
type: N
comment: "Genbank by ID from InfoBiogen"
]
The %s in the URL string indicates where EMBOSS will insert the identifier portion of the USA.
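The substitution amounts to the following. This is a sketch only: the function names are invented, and it assumes the simple 'database:identifier' form of a USA.

```python
def identifier_from_usa(usa):
    """Return the identifier part of a 'database:identifier' USA."""
    _, _, identifier = usa.partition(":")
    return identifier


def entry_url(url_template, entry_id):
    """Insert the entry identifier at the %s placeholder of a url: value."""
    if "%s" not in url_template:
        raise ValueError("url template has no %s placeholder")
    return url_template.replace("%s", entry_id, 1)
```

Plain string replacement is used rather than Python's % formatting so that any other % characters in the URL are left alone.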
EMBLCD (scope: a q s). Uses EMBLCD indices created with DBIFLAT to access EMBL format databases in the original format. Additional tags: directory: files:
DB mydb [
method: emblcd
format: embl
dir: $emboss_db_dir/embl
file: *.dat
#optional parameters
type: N
release: 63.0
comment: "my comment"
exclude: est*.dat
indexdir: $emboss_db_dir/indices
]
GCG (scope: a q s). Uses EMBLCD indices created with DBIGCG to access databases in GCG format. As for EMBLCD but format: gcg and method: gcg
BLAST (scope: a q s). Uses EMBLCD indices created with DBIBLAST to access databases in BLAST format. As for EMBLCD but format: blast and method: blast
EXTERNAL (scope: a q s). Uses an external application to retrieve sequences, returning them on STDOUT. The ID is passed as an argument to the application, either replacing %s in the command string (if present) or as an additional argument (if there is no %s).
DB mydb [
#required parameters
method: app
format: fasta
app: "getfromdb thisfastadb"
#optional parameters
type: P
comment: "my own protein database with a custom retrieval program"
]
APP (scope: a q s). Same as EXTERNAL.
NBRF (scope: a q s).
For a method: declaration, EMBOSS will use that method for those access methods supported by the method.
If you wish to specify which access method should be handled by which method, then the methodsingle:, methodquery: and methodall: declarations should be used instead of method:
DB mydb [
methodsingle: app
format: fasta
app: "customapp myproteindb"
methodall: direct
dir: $emboss_db_dir/myproteindb
file: myproteindb.dat
type: P
comment: "single and all access for myproteindb"
]
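The way a definition maps onto the three access methods can be sketched as a lookup. The scope table is copied from the method list above. The rule that a specific tag takes precedence over a generic method: tag in the same definition is my assumption, not something confirmed here, and the function name is invented.

```python
# Access methods supported by each method value (a = all, q = query,
# s = single), as listed in this guide.
SCOPE = {
    "direct": "a", "srs": "aqs", "srsfasta": "aqs", "url": "s",
    "emblcd": "aqs", "gcg": "aqs", "blast": "aqs",
    "external": "aqs", "app": "aqs", "nbrf": "aqs",
}


def resolve_access(tags):
    """Map each access method (a, q, s) to the method serving it, or None."""
    specific = {"a": "methodall", "q": "methodquery", "s": "methodsingle"}
    resolved = {}
    for access, tag in specific.items():
        method = tags.get(tag)
        if method is None:
            # Fall back to the generic tag if it supports this access method.
            generic = tags.get("method")
            if generic is not None and access in SCOPE.get(generic, ""):
                method = generic
        resolved[access] = method
    return resolved
```

For the example above this gives direct for 'all', app for 'single' and nothing for 'query', which matches the comment in the definition.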
Indexing and configuring flatfile databases
Flatfile databases are those released by EMBL, Swissprot and so on. The EMBOSS program DBIFLAT is used to generate emblcd indices that can be used for all types of database access. DBIFLAT can process databases in EMBL, SWISSPROT and GENBANK format. Pseudo EMBL format databases which do not have unique ID and AC entries will cause DBIFLAT to do mysterious things and should be avoided.
DBIFLAT requires the databases to be uncompressed. This example will not probe the deeper secrets of DBIFLAT (for which the reader is referred to the documentation, or failing that the source code) but will show a typical installation for a common database.
We assume EMBOSS has been installed and works. This can be tested with the command wossname -auto which should list all the programs available.
In this example we will index and configure the EMBL database for use with EMBOSS.
First download and unpack the EMBL database. This will require a considerable amount of disk space.
cd to the directory in which you have unpacked EMBL. This should look something like this when you run ls:
est_fun.dat
est_hum1.dat
est_hum10.dat
.
.
.
syn.dat
unc.dat
vrl.dat
vrt.dat
Run DBIFLAT to create the emblcd indices.
% dbiflat
Index a flat file database
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GB : Genbank, DDBJ
FASTA : FASTA format
Entry format [SWISS]: EMBL
Database name: embl
Database directory [.]:
Wildcard database filename [*.dat]:
Release number [0.0]: 63.0
Index date [00/00/00]: 31/07/00
DBIFLAT should happily chug away for some considerable time (up to a few hours depending on the speed of your machine) and will generate (eventually) the following index files:
acnum.hit
acnum.trg
division.lkp
Now we create an entry in the EMBOSS configuration files to access the database. It is probably a good idea to try new database definitions in your local configuration file first.
Put the following entry in your .embossrc
set emboss_db_dir /path_to_databases
DB embl [
type: N
method: emblcd
format: embl
dir: $emboss_db_dir/embl
file: "*.dat"
release: "63.0"
comment: "EMBL release 63.0"
]
Save .embossrc and try showdb. You should see a line that looks like:
embl N OK OK OK EMBL release 63.0
Fine tuning the installation:
It is probably a good idea to set up subsections of the database so that end users can search just the regions they wish to search.
Files can be included with the declaration files: or excluded with the declaration exclude:
In order to just take the EST files try the following:
DB emblest [
type: N
method: emblcd
format: embl
dir: $emboss_db_dir/embl
file: "est*.dat"
release: "63.0"
comment: "EMBL release 63.0"
]
Files can also be given as a space separated list. For example to set up a database of all mammalian sequences (except genomes) try the following:
DB emblallmam [
type: N
method: emblcd
format: embl
dir: $emboss_db_dir/embl
file: "rod*.dat hum*.dat mam*.dat"
release: "63.0"
comment: "EMBL release 63.0"
]
It can be quite tedious to set up a long list of sequences to search. In many cases you can use the exclude function to make things easier.
DB emblnoest [
type: N
method: emblcd
format: embl
dir: $emboss_db_dir/embl
file: "*.dat"
exclude: "est*.dat"
release: "63.0"
comment: "EMBL release 63.0"
]
This configures the emblnoest database to contain all of EMBL except the ESTs.
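Before committing to a definition it can be useful to see which files a given pair of wildcard lists would select. The sketch below uses ordinary shell-style wildcard matching from the Python standard library. Whether EMBOSS matches in exactly the same way is an assumption, and the function name is invented.

```python
from fnmatch import fnmatchcase


def select_files(filenames, file_patterns, exclude_patterns=""):
    """Apply space separated include and exclude wildcard lists.

    A file is kept if it matches at least one include pattern and no
    exclude pattern. The input order is preserved.
    """
    include = file_patterns.split()
    exclude = exclude_patterns.split()
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, p) for p in include)
        and not any(fnmatchcase(name, p) for p in exclude)
    ]
```

Passing the output of os.listdir() for the database directory as filenames gives a quick preview of each subset.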
Indexing and configuring GCG format databases
EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats. EMBOSS creates EMBLCD-like indices for the GCG format databases using the program DBIGCG. This runs in much the same way as DBIFLAT. You will need the GCG format .seq and .header files in order to create an indexed database.
cd to the GCG database directory containing your data and run DBIGCG:
% dbigcg
Index a GCG formatted database
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GB : Genbank, DDBJ
PIR : NBRF
Entry format [EMBL]:
Database name: embl
Database directory [.]:
Wildcard database filename [*.seq]:
Release number [0.0]: 63.0
Index date [00/00/00]: 31/07/00
The program will chug along for a while and will then generate the emblcd index files for the GCG format database.
The following entry should be put in your .embossrc
DB gcgembl [
type: N
method: gcg
format: embl
dir: $emboss_db_dir/embl
file: "*.seq"
release: "63.0"
comment: "EMBL release 63.0"
]
SHOWDB should show your newly configured database.
You can configure subsets of the databases in the same way as for the original format databases.
Indexing and configuring BLAST databases
Here be dragons
Configuring EMBOSS to use SRS for database lookup.
Here be lions
Indexing and configuring other databases
Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems.
As long as a program is available that can be called noninteractively and returns the specified sequence on standard output, EMBOSS can interface with it.
Use method: app or external (the two are equivalent) and app: "program command".
The ID given in the USA will be appended to the command used to run the program. It is probably best to specify the methods available using the method subsets, methodall:, methodquery: and methodsingle: rather than using the generic method: tag.
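How the ID ends up on the command line can be sketched as follows. It follows the rule given for the EXTERNAL method: the ID replaces %s if the command contains one and is appended otherwise. The function name is invented, and getfromdb is the made-up retrieval program from the earlier example.

```python
import shlex


def external_command(app, entry_id):
    """Build the argument list for an app: command and an entry ID."""
    args = shlex.split(app)
    if any("%s" in arg for arg in args):
        # A placeholder is present: substitute the ID in place.
        return [arg.replace("%s", entry_id) for arg in args]
    # No placeholder: pass the ID as an additional argument.
    return args + [entry_id]
```

The list can be handed to subprocess.run(args, capture_output=True, text=True), whose stdout should then hold the sequence in the declared format.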
Other data
EMBOSS can be integrated with some common biological databases. These are described in this section.
REBASE
Rebase is the restriction enzyme database maintained by New England Biolabs. It is needed for programs such as remap and restrict.
The latest version of Rebase can be obtained by anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/rebase. EMBOSS needs the 'withrefm' file. The data is extracted for EMBOSS with the program rebaseextract.
If you installed EMBOSS with the --prefix option you may need to create the REBASE directory under the EMBOSS data directory (/site/prog/emboss/data in this example). This directory only needs creating once.
% mkdir /site/prog/emboss/data/REBASE
% rebaseextract
Extract data from REBASE
Full pathname of WITHREFM: /data/rebase/withrefm.008
Rebase is now installed and ready to use.
TRANSFAC
Transfac is the transcription factor binding site database. It is available by anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/transfac/transfac32.tar.Z
Unpacking the distribution reveals a file called site.dat. This is the one EMBOSS needs.
Run TFEXTRACT to extract the data from TRANSFAC.
% tfextract
Extract data from TRANSFAC
Full pathname of transfac SITE.DAT: /databases/transfac/site.dat
tfscan can now access the TF database.
PROSITE
Prosite is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program.
PROSITE can be obtained via anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/prosite.
You may need to create a PROSITE subdirectory under data in the EMBOSS installation directory.
Then run prosextract to build the EMBOSS Prosite database.
% prosextract
Builds the PROSITE motif database for patmatmotifs to search
Enter name of prosite directory: /data/prosite
PROSITE is now integrated into your EMBOSS installation.
PRINTS
Prints is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan.
PRINTS can be obtained via anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/prints. The database is made available as compressed files which should be uncompressed using gzip before integrating them into EMBOSS.
PRINTS is integrated with EMBOSS using the command printsextract
% printsextract
Extract data from PRINTS
Input file: /data/prints/prints27_0.dat
The PRINTS database is now integrated with EMBOSS.
Miscellaneous data files
Other data files should be kept in the data directory under the main EMBOSS installation. Individual users' personal data files can be kept in the current working directory, a subdirectory .embossdata of the current directory, their home directory or a subdirectory .embossdata of their home directory. EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system wide data directory, /site/prog/emboss/data in this example.
Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
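The search order described above can be reproduced in a few lines, which is handy when working out which copy of a file is being picked up. This is a sketch with invented function names; embossdata remains the authoritative way to see the real search path.

```python
import os


def data_search_path(cwd, home, system_data):
    """Directories searched for a data file, in the order described."""
    return [
        cwd,
        os.path.join(cwd, ".embossdata"),
        home,
        os.path.join(home, ".embossdata"),
        system_data,
    ]


def find_data_file(filename, search_path):
    """Return the first existing match along search_path, or None."""
    for directory in search_path:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If the path returned is not the file you expected, a stray personal copy earlier in the search order is the likely culprit.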
Logging
Many system administrators may wish to make use of the logging facilities of EMBOSS. Setting the variable emboss_logfile in emboss.default or .embossrc allows the system to keep a log of which programs are used when and by whom.
set emboss_logfile /site/log/emboss.log
The log file structure is very simple. Three tab separated fields are stored: program name, user name, and the date and time.
prettyplot joeuser Wed Aug 02 14:29:13 2000
The file set in emboss_logfile should be world writable.
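Because the format is so simple, usage summaries are easy to produce. The following sketch assumes the three tab separated fields described above and the date layout of the sample line; the function names are invented.

```python
from collections import Counter
from datetime import datetime


def parse_log_line(line):
    """Split one log line into (program, user, timestamp)."""
    program, user, stamp = line.rstrip("\n").split("\t")
    # Date layout assumed from the sample line, e.g. Wed Aug 02 14:29:13 2000
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return program, user, when


def usage_by_program(lines):
    """Count how many times each program appears in the log."""
    return Counter(parse_log_line(line)[0] for line in lines if line.strip())
```

Opening the log file and passing the file object to usage_by_program gives a count per program; counting per user works the same way with index 1.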
These settings can be overridden in a user's .embossrc file by redefining emboss_logfile. For example, to prevent my system usage being logged I can put the following entry in my .embossrc file.
set emboss_logfile /dev/null
This behaviour may change in the future to prevent users redefining some system settings.