Pre first draft of admin guide. (fwd)

Wed Aug 2 14:50:24 UTC 2000

And the file is here as an attachment.

..d

---------------------------------------------------------------------
*  Dr. David Martin                  Biotechnology Centre of Oslo   *
*  Node Manager                      Gaustadalleen 21               *
*  The Norwegian EMBNet Node         P.O. box 1125 Blindern         *
*  tel +47 22 95 87 56               N-0317 Oslo                    *
*  fax +47 22 69 41 30               Norway                         * 
---------------------------------------------------------------------

---------- Forwarded message ----------
Date: Wed, 2 Aug 2000 15:49:18 +0100
From: David Martin <damartin at ulrik.uio.no>
Reply-To: admin at embnet.uio.no
To: emboss-dev at embnet.org
Subject: Pre first draft of admin guide.

OK it is in raw text form. I'll mark it up for LaTeX soon but here it is
for your delectation and delight.

The major sticking points at the moment are Database Indexing, especially
DBIBLAST but there are unresolved issues with DBIFLAT and FASTA files and
DBIGCG (because it loops until armageddon in the form of SIGDIEDIEDIE) so
I haven't been able to test it properly.

Comments are welcome. I'm hoping it can be pretty much a recipe book for
EMBOSS setup.

With a bit of standardising of macros, it should be possible to dump out
the program docs as LaTeX and incorporate those too. I'll look at marking
up the quick guide, and then with Val's tutorial and Thon's ACD guide we
are approaching a reasonable manual for EMBOSS.

Maybe I should create a small EMBOSS logo in LaTeX like 

EMB
OSS

that would slot into the text at about the right height.

..d

-------------- next part --------------
The EMBOSS Administrators Guide

What is EMBOSS?

Where do I get it?

Installation

Configuration

Databases

	Database access

	Indexing and configuring flatfile databases

	Indexing and configuring GCG format databases

	Indexing and configuring BLAST databases

	Configuring EMBOSS to use SRS for database lookup.

	Indexing and configuring other databases

Other data

Logging

What is EMBOSS?

EMBOSS is a freely available suite of bioinformatics applications and libraries. It can be downloaded via the internet, copied, customised, and passed on under the terms of the various General Public Licenses. EMBOSS has been developed in response to the need for a powerful, adaptable suite of software that can interface readily with many different situations and meet the needs of professional bioinformaticists, particularly those needing high throughput and/or scriptable capabilities.

EMBOSS has primarily been developed by those responsible for the public extensions to the GCG package. Whilst EMBOSS duplicates much of EGCG it includes far better database interaction and has the benefit of freely accessible source code so novel applications can be developed rapidly and at minimal cost.

EMBOSS is currently only available for Unix/Linux systems, but it has been known to compile and run on Windows NT. This document will only consider the UNIX version and will assume the reader has some familiarity with UNIX system administration.

Where do I get it?

EMBOSS is available for download via FTP from the primary site at the UK EMBnet node: ftp.uk.embnet.org/pub/EMBOSS/

This directory contains the EMBOSS package and several associated packages (collectively known as EMBASSY) that are distributed with EMBOSS. Download these to a suitable location. Documentation is available at http://www.uk.embnet.org/Software/EMBOSS

Installation

Unpacking

You will have downloaded the EMBOSS and EMBASSY packages to a suitable directory. For this example we will assume you have downloaded them to /packages so you should now have the following files (or similar) and maybe more packages in EMBASSY. 

EMBOSS-1.0.0.tar.gz

PHYLIP-3.573c.tar.gz

MSE-0.0.4.tar.gz

TOPO-0.1.tar.gz

First unpack the EMBOSS distribution

gunzip EMBOSS-1.0.0.tar.gz

tar xf EMBOSS-1.0.0.tar

This will create a new directory, EMBOSS-1.0.0

Enter the EMBOSS directory

cd EMBOSS-1.0.0

Create a directory for the EMBASSY packages

mkdir embassy

Now copy the EMBASSY packages from /packages into the embassy directory

cp ../MSE-0.0.4.tar.gz ../PHYLIP-3.573c.tar.gz ../TOPO-0.1.tar.gz embassy

Go into the EMBASSY directory and unpack those packages.

cd embassy

gunzip MSE-0.0.4.tar.gz

tar xf MSE-0.0.4.tar

and so on for each EMBASSY package.
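The "and so on" step can be scripted. What follows is only a minimal sketch, assuming every EMBASSY archive in the directory ends in .tar.gz. The function name unpack_all is a hypothetical helper, not something shipped with EMBOSS.

```shell
#!/bin/sh
# unpack_all DIR: unpack every .tar.gz archive found in DIR, in place.
# Hypothetical helper for illustration; not part of the EMBOSS distribution.
unpack_all() {
    dir=$1
    for archive in "$dir"/*.tar.gz; do
        # Skip the unexpanded pattern when no archives are present.
        [ -f "$archive" ] || continue
        # gunzip -c writes to stdout, so the original archive is left untouched.
        gunzip -c "$archive" | (cd "$dir" && tar xf -)
    done
}
```

From the EMBOSS-1.0.0 directory this would be called as: unpack_all embassy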

Go back up one directory to the main EMBOSS package directory and prepare to start compilation.

Compilation.

Building EMBOSS is easy. It follows the usual GNU style of configure, make, make install. We'll take these steps one at a time.

Configuration

To accept the default configuration, just type ./configure and let EMBOSS get on with it. You may want to make some changes to the configuration parameters according to your local policy. This section will not cover all the possibilities, just some of the more common ones. The configuration script will attempt to find the necessary components on your system to determine how to build EMBOSS successfully. It typically expects the GNU C compiler (gcc) and several standard libraries that should already be part of your Unix/Linux system. Most modern Linux distributions should work straight out of the box.

Installation directory.

You need to have write permission on the directory in which you eventually wish to install EMBOSS. You may also wish to put it somewhere other than the standard location of /usr/local/emboss.

This is controlled by the --prefix argument. In my case I have all my applications owned by a non-privileged user and installed under /site/prog

./configure --prefix=/site/prog/emboss

will install EMBOSS under /site/prog/emboss. The binaries will be in /site/prog/emboss/bin with shared libraries in /site/prog/emboss/lib. Data will be in /site/prog/emboss/data, and the configuration files (ACD files) for the applications will be under /site/prog/emboss/share in directories corresponding to the package name.

The individual directories for installation can be modified with other configuration commands but this is usually not necessary. Run ./configure --help to get more information on the directories that can be changed and other configuration options.

Run ./configure with the options you wish to use. This may take a short while during which various messages will scroll up the screen.

Depending on your system you may need to explicitly configure the graphics. Please see the section 'Configuring EMBOSS graphics' below.

./configure --prefix=/site/prog/emboss --with-pngdriver=/site/lib

All should be well with this and configure should exit with a message like this:

creating ./config.status

creating plplot/Makefile

creating plplot/lib/Makefile

creating nucleus/Makefile

creating ajax/Makefile

creating emboss/Makefile

creating emboss/acd/Makefile

creating test/Makefile

creating test/data/Makefile

creating test/embl/Makefile

creating test/pir/Makefile

creating test/swiss/Makefile

creating test/swnew/Makefile

creating test/wormpep/Makefile

creating emboss/data/Makefile

creating emboss/data/CODONS/Makefile

creating emboss/data/REBASE/Makefile

creating emboss/data/PRINTS/Makefile

creating emboss/data/PROSITE/Makefile

creating Makefile

Configuration is now complete.

Configuring EMBOSS graphics.

The PLPLOT library can produce output to many devices but requires certain libraries that are NOT distributed with EMBOSS.

To get X-windows based output you must have X installed, otherwise PLPLOT will not build the required driver. You may need to specify the location of your X-windows library with the configuration options:

  --x-includes=DIR        (X include files are in DIR)

  --x-libraries=DIR       (X library files are in DIR)

To explicitly configure PLPLOT without X-windows, use --without-x.

To get PLPLOT to produce PNG images you will need to have the z, png and gd libraries installed. In particular, gd version 1.6.3 or later must be used.

If for some reason you do not have the required libraries and your system support group will not update them (gd in particular, as the older versions support GIF, which is NOT supported in later versions), then install the latest versions of all three (z, gd, png) to a new directory and add this new directory to your configure line for EMBOSS.

i.e. ./configure --with-pngdriver=my_dir

where the z, gd and png libraries were each installed using ./configure --prefix=my_dir 

You can explicitly tell EMBOSS to not include PNG support with --without-pngdriver

How to tell if ./configure has found PNG.

Watch for something like the following when running ./configure:

checking if png driver is wanted... yes

checking for inflateEnd in -lz... (cached) yes

checking for png_destroy_read_struct in -lpng... (cached) yes

checking for gdImageCreateFromPng in -lgd... (cached) yes

This means that the configuration script has located the PNG libraries on your system. If you see a message indicating that ./configure could not find the libraries or that the version of gd was too old then you should install the latest versions of the libraries yourself and rerun configure with the correct --with-pngdriver value.
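The configure output scrolls past quickly, so it can help to save it and pick out the PNG checks afterwards. This is a rough sketch, assuming the output was saved to a file of your choosing. The function name png_checks and the file name configure.out used below are illustrative only.

```shell
#!/bin/sh
# png_checks FILE: print the PNG-related "checking ..." lines from saved
# configure output. FILE is whatever name the output was saved under.
# Hypothetical helper for illustration.
png_checks() {
    grep -E -e 'png driver|-lz|-lpng|-lgd' "$1"
}
```

For example, run ./configure --prefix=/site/prog/emboss 2>&1 | tee configure.out and then png_checks configure.out. All four checks shown above should end in "yes".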

Building EMBOSS

Building EMBOSS is a matter of typing 'make' and going to find something else to do for the next ten minutes to half an hour, depending on the speed of your system. EMBOSS will first build the shared libraries (PLPLOT, AJAX, and NUCLEUS) and then build the applications.

You will see plenty of warnings complaining about libraries not being used to resolve any symbols. These can be safely ignored.

If all goes according to plan you should have built EMBOSS successfully. If not, you will have to try to work out why the build failed. If you can't work it out yourself, send an email describing the problem to emboss-bug at sanger.ac.uk with a copy of the config.status and config.cache files from your EMBOSS directory. (These will tell the developers what state your system was in when compilation failed.)

I am assuming that compilation was successful. You now have to check that you have the correct access permissions for the directory in which you wish to install EMBOSS and type 'make install'. After a few minutes and many pagefuls of messages, EMBOSS should be installed where you specified.

Tidying up the environment.

You will now need to make a few adjustments to your environment to ensure that EMBOSS runs smoothly.

EMBOSS looks for certain environment variables to determine where the libraries and data are found. These instructions assume you installed EMBOSS in /site/prog/emboss. Adjust them to suit your installation.

Insert the following lines at the end of /etc/cshrc (or ~/.cshrc for a personal installation)

setenv EMBOSS_DATA /site/prog/emboss/data

setenv PLPLOT_LIB /site/prog/emboss/lib

set path=( /site/prog/emboss/bin ${path} )

Or for bash/ksh/sh users, insert the following at the end of /etc/profile or ~/.bashrc

EMBOSS_DATA=/site/prog/emboss/data

PLPLOT_LIB=/site/prog/emboss/lib

PATH=/site/prog/emboss/bin:$PATH

export EMBOSS_DATA PLPLOT_LIB PATH
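Before trying the programs, the settings can be given a quick sanity check. The sketch below is for sh-compatible shells and assumes only the two variables named above and that wossname should be reachable through PATH. The function name check_emboss_env is hypothetical.

```shell
#!/bin/sh
# check_emboss_env: report whether EMBOSS_DATA and PLPLOT_LIB are set to
# existing directories and whether wossname can be found on PATH.
# Returns non-zero if any check fails. Hypothetical helper for illustration.
check_emboss_env() {
    status=0
    for var in EMBOSS_DATA PLPLOT_LIB; do
        # Look up the value of the variable whose name is held in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$var points to a missing directory: $value"
            status=1
        fi
    done
    if ! command -v wossname >/dev/null 2>&1; then
        echo "wossname not found on PATH"
        status=1
    fi
    return $status
}
```

Running check_emboss_env in a fresh login shell prints nothing when everything is in place.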

EMBOSS should now be ready for use.

You can test this by trying the program 'wossname'

wossname -auto |more

This should give a long list of programs that are available. Press space to page down through the list. This is just the EMBOSS programs and doesn't include any of the EMBASSY programs.

Installing EMBASSY

As well as the base libraries and standard EMBOSS distribution, various extra packages (EMBASSY) are distributed with EMBOSS.

To install an EMBASSY package, go to the directory into which it was unpacked. For example, to install PHYLIP (which was unpacked into /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c earlier):

cd  /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c

./configure --prefix=/site/prog/emboss

make

make install

NB. You MUST use the same arguments for configure that you used for the installation of the main EMBOSS package.

Repeat as necessary for the other EMBASSY packages.

You should now find that running wossname as before lists the EMBASSY programs.

Configuration

EMBOSS can be configured to match your requirements. EMBOSS looks for a configuration file in several places. Firstly it looks in /site/prog/emboss/share/EMBOSS for a file 'emboss.default'. It then looks in your home directory for the file '.embossrc' and finally in the current directory for '.embossrc'. In each case definitions will override those previously defined.
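When a setting does not behave as expected, it helps to see which of these files are actually present. The following sketch simply lists the existing files in the order described above. It assumes the installation prefix is passed as an argument, and the function name list_emboss_config is hypothetical.

```shell
#!/bin/sh
# list_emboss_config PREFIX: print, in the order EMBOSS reads them, the
# configuration files described above that actually exist.
# Definitions in later files override earlier ones.
# Hypothetical helper for illustration.
list_emboss_config() {
    prefix=$1
    for f in "$prefix/share/EMBOSS/emboss.default" \
             "$HOME/.embossrc" \
             "./.embossrc"; do
        [ -f "$f" ] && echo "$f"
    done
    return 0
}
```

For the installation used in this guide: list_emboss_config /site/prog/emboss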

Several aspects of EMBOSS can be defined. These are:

EMBOSS environment variables

EMBOSS databases

Default behaviour of EMBOSS programs

As databases are by far the most complex of these they will be covered in a separate section.

EMBOSS environment variables

These are set with an 'env' or a 'set' declaration. 'env' and 'set' are interchangeable.

The most important environment variable is the location of the acd files that describe each program. 

set emboss_acdroot /site/prog/emboss/share/EMBOSS/acd

Environment variables are useful for easing the maintenance of your emboss.default. For example you may want to specify the location of your databases as an environment variable. Then if you move the databases you only have to update one line in the configuration file.

set emboss_db_dir /data/databases/flatfiles

This would then be referred to as

$emboss_db_dir/embl 

for the directory /data/databases/flatfiles/embl

Databases

	Database access

EMBOSS offers three methods for accessing databases:

       All: EMBOSS returns all the sequences in the database in no particular order

       Query: EMBOSS retrieves a set of sequences corresponding to a wildcard query.

       Single: EMBOSS retrieves a single sequence indexed by ID or accession number.

Each database definition can configure one or many of these methods for database access.

Typically EMBOSS uses the 'emblcd' system of database indexing. This comes in three variants depending on the original format of your database. The emblcd method assumes that you have both ID and accession number in each record. If you do not have both ID and accession number you will have to use an alternative method. Please see the 'other databases' section below.

General Database configuration.

Each database is configured using a DB declaration.

The generalised form is 

DB databasename [

Configuration options

]

The configuration options are tag/value pairs and must contain at least a description of the access method (using method: or one or more of methodsingle:, methodquery: and methodall:) and a description of the format the sequences will be returned in (using format:).

In addition to these tags there will be other tags that are needed for particular methods and other tags that are optional.

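The available values for method: are listed below. The scope gives the access types each method supports: a (all), q (query) and s (single).

DIRECT (scope: a): Returns all the database entries, one after the other. It assumes no indexing.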

DB mydb [ 

#required parameters

   method: direct

   format: fasta

   dir: $emboss_db_dir/mydb

   file: *.dat   

#optional parameters

   type: N

   release: 63.0

   comment: "My own database with no indices"

   exclude: "est*.dat"

]

SRS (scope: a q s): Returns entries from a local installation of SRS, using the -e switch to getz to return entries in the original format.

DB mydb [

#required parameters

   method: srs

   format: embl

   app: getz

#optional parameters

   dbalias: embl

   type: N

   comment: 'My srs indexed database'

   release: '63.0'

]

SRSFASTA (scope: a q s): As SRS but returns the sequences in FASTA format.

URL (scope: s): Uses a defined web server to retrieve a specific entry. EMBOSS may fail if the HTML causes complications.

DB mydb [

# required parameters

    method: url

    format: genbank

    url: "http://www.infobiogen.fr/srs5bin/cgi-bin/wgetz?-e+[genbank-id:%s]"

#optional parameters

    type: N

    comment: "Genbank by ID from InfoBiogen"

]

The %s in the URL string indicates where EMBOSS will insert the identifier portion of the USA.

EMBLCD (scope: a q s): Uses EMBLCD indices created with DBIFLAT to access EMBL format databases in the original format. Required tags: dir: and file:

DB mydb [

   method: emblcd

   format: embl

   dir: $emboss_db_dir/embl

   file: *.dat

#optional parameters

   type: N

   release: 63.0

   comment: "my comment"

   exclude: est*.dat

   indexdir: $emboss_db_dir/indices

]

GCG (scope: a q s): Uses EMBLCD indices created with DBIGCG to access databases in GCG format. As for EMBLCD but format: gcg and method: gcg

BLAST (scope: a q s): Uses EMBLCD indices created with DBIBLAST to access databases in BLAST format. As for EMBLCD but format: blast and method: blast

EXTERNAL (scope: a q s): Uses an external application to retrieve sequences, returning them on STDOUT. The ID is passed as an argument to the application, either replacing %s in the command string (if present) or as an additional argument (if there is no %s).

DB mydb [

#required parameters

    method: app

    format: fasta

    app: "getfromdb thisfastadb"

#optional parameters

    type: P

    comment: "my own protein database with a custom retrieval program"

]

APP (scope: a q s): Same as EXTERNAL.

NBRF (scope: a q s):

With a method: declaration, EMBOSS uses that method for every access type the method supports.

If you wish to specify which access type should be handled by which method, use the methodsingle:, methodquery: and methodall: declarations instead of method:

DB mydb [

methodsingle: app

format: fasta

app: "customapp myproteindb"

methodall: direct

dir: $emboss_db_dir/myproteindb

file: myproteindb.dat

type: P

comment: "single and all access for myproteindb"

]

	Indexing and configuring flatfile databases

Flatfile databases are those released by EMBL, Swissprot and so on. The EMBOSS program DBIFLAT is used to generate emblcd indices that can be used for all types of database access. DBIFLAT can process databases in EMBL, SWISSPROT and GENBANK format. Pseudo EMBL format databases which do not have unique ID and AC entries will cause DBIFLAT to do mysterious things and should be avoided.

DBIFLAT requires the databases to be uncompressed. This example will not probe the deeper secrets of DBIFLAT (for which the reader is referred to the documentation, or failing that the source code) but will show a typical installation for a common database.

We assume EMBOSS has been installed and works. This can be tested with the command wossname -auto which should list all the programs available.

In this example we will index and configure the EMBL database for use with EMBOSS.

First download and unpack the EMBL database. This will require a considerable amount of disk space.

cd to the directory in which you have unpacked EMBL. This should look something like this when you run ls:

est_fun.dat

est_hum1.dat

est_hum10.dat

.

.

.

syn.dat

unc.dat

vrl.dat

vrt.dat

Run DBIFLAT to create the emblcd indices.

% dbiflat

Index a flat file database

      EMBL : EMBL

     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew

        GB : Genbank, DDBJ

     FASTA : FASTA format

Entry format [SWISS]: EMBL   

Database name: embl

Database directory [.]: 

Wildcard database filename [*.dat]: 

Release number [0.0]: 63.0

Index date [00/00/00]: 31/07/00

DBIFLAT should happily chug away for some considerable time (up to a few hours depending on the speed of your machine) and will generate (eventually) the following index files:

acnum.hit

acnum.trg

division.lkp

entrynam.idx

Now we create an entry in the EMBOSS configuration files to access the database. It is probably a good idea to try new database definitions in your local configuration file first.

Put the following entry in your .embossrc

set emboss_db_dir /path_to_databases

DB embl [

   type: N

   method: emblcd

   format: embl

   dir: $emboss_db_dir/embl

   file: "*.dat"

   release: "63.0"

   comment: "EMBL release 63.0"

]

Save .embossrc and try showdb. You should see a line that looks like:

embl          N    OK  OK  OK  EMBL release 63.0

Fine tuning the installation:

It is probably a good idea to set up subsections of the database so that end users can search just the regions they wish to search.

Files can be included with the declaration file: or excluded with the declaration exclude:

In order to just take the EST files try the following:

DB emblest [

   type: N

   method: emblcd

   format: embl

   dir: $emboss_db_dir/embl

   file: "est*.dat"

   release: "63.0"

   comment: "EMBL release 63.0"

]

Files can also be given as a space separated list. For example, to set up a database of all mammalian sequences (except genomes) try the following:

DB emblallmam [

   type: N

   method: emblcd

   format: embl

   dir: $emboss_db_dir/embl

   file: "rod*.dat hum*.dat mam*.dat"

   release: "63.0"

   comment: "EMBL release 63.0"

]

It can be quite tedious to set up a long list of sequences to search. In many cases you can use the exclude function to make things easier.

DB emblnoest [

   type: N

   method: emblcd

   format: embl

   dir: $emboss_db_dir/embl

   file: "*.dat"

   exclude: "est*.dat"

   release: "63.0"

   comment: "EMBL release 63.0"

]

This configures the emblnoest database to contain all of EMBL except the ESTs.

	Indexing and configuring GCG format databases

EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats. EMBOSS creates EMBLCD like indices for the GCG format databases using the program DBIGCG. This runs in much the same way as DBIFLAT. You will need the GCG format .seq and .header files in order to create an indexed database.

cd to the GCG database directory containing your data and run DBIGCG

Index a GCG formatted database

      EMBL : EMBL

     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew

        GB : Genbank, DDBJ

       PIR : NBRF

Entry format [EMBL]: 

Database name: embl

Database directory [.]: 

Wildcard database filename [*.seq]: 

Release number [0.0]: 63.0

Index date [00/00/00]: 31/07/00

The program will chug along for a while and will then generate the emblcd index files for the GCG format database.

The following entry should be put in your .embossrc

DB gcgembl [

   type: N

   method: gcg

   format: embl

   dir: $emboss_db_dir/embl

   file: "*.seq"

   release: "63.0"

   comment: "EMBL release 63.0"

]

SHOWDB should show your newly configured database.

You can configure subsets of the databases in the same way as for the original format databases.

	Indexing and configuring BLAST databases

Here be dragons

	Configuring EMBOSS to use SRS for database lookup.

Here be lions

	Indexing and configuring other databases

Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems.

As long as a program is available that can be called noninteractively and returns the specified sequence on standard output, EMBOSS can interface with it.

Use method: app or external (the two are equivalent) and app: "program command".

The ID given in the USA will be appended to the command used to run the program. It is probably best to specify the methods available using the method subsets, methodall:, methodquery: and methodsingle: rather than using the generic method: tag.
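As an illustration of the kind of program that app: can call, here is a minimal sketch of a retrieval helper. It assumes the local database is a single FASTA file whose header lines begin with ">ID". The function name get_fasta_entry is hypothetical and stands in for whatever your own system provides.

```shell
#!/bin/sh
# get_fasta_entry DBFILE ID: print the FASTA record whose header begins
# with ">ID" to standard output. Hypothetical retrieval helper, shown only
# to illustrate the kind of program that app: can call.
get_fasta_entry() {
    awk -v id="$2" '
        /^>/ { hit = ($1 == ">" id) }   # new record: does the ID match?
        hit                             # print lines of the matching record
    ' "$1"
}
```

Wrapped in an executable script that passes its own arguments on to the function, this takes the database file first and the ID last, which matches the way EMBOSS appends the ID to the command given in app:.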

Other data

EMBOSS can be integrated with some common biological databases. These are described in this section.

      REBASE

      Rebase is the restriction enzyme database maintained by New England Biolabs. It is needed for programs such as remap and restrict.

The latest version of Rebase can be obtained by anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/rebase. EMBOSS needs the 'withrefm' file. The data is extracted for EMBOSS with the program rebaseextract.

If you installed EMBOSS with the --prefix option you may need to create the REBASE directory under the EMBOSS data directory (/site/prog/emboss/data in this example). This directory only needs creating once.

% mkdir /site/prog/emboss/data/REBASE

% rebaseextract

Extract data from REBASE

Full pathname of WITHREFM: /data/rebase/withrefm.008

Rebase is now installed and ready to use.

      TRANSFAC

Transfac is the transcription factor binding site database. It is available by anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/transfac/transfac32.tar.Z

Unpacking the distribution reveals a file called site.dat. This is the one EMBOSS needs.

Run TFEXTRACT to extract the data from TRANSFAC.

% tfextract

Extract data from TRANSFAC

Full pathname of transfac SITE.DAT: /databases/transfac/site.dat

tfscan can now access the TF database.

      PROSITE

Prosite is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program.

PROSITE can be obtained via anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/prosite. 

You may need to create a PROSITE subdirectory under data in the EMBOSS installation directory. 

Then run prosextract to build the EMBOSS Prosite database.

Builds the PROSITE motif database for patmatmotifs to search

Enter name of prosite directory: /data/prosite

PROSITE is now integrated into your EMBOSS installation.

      PRINTS

Prints is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan.

PRINTS can be obtained via anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/prints. The database is made available as compressed files which should be uncompressed using gzip before integrating them into EMBOSS

PRINTS is integrated with EMBOSS using the command printsextract

% printsextract

Extract data from PRINTS

Input file: /data/prints/prints27_0.dat

The PRINTS database is now integrated with EMBOSS.

Miscellaneous data files

Other data files should be kept in the data directory under the main EMBOSS installation. Individual users' personal data files can be kept in the current working directory, a subdirectory .embossdata of the current directory, their home directory or a subdirectory .embossdata of their home directory. EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system wide data directory, /site/prog/emboss/data in this example.

Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
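The search order described above can be sketched in shell as follows. This is only an approximation for working out which copy of a file would win. The function name find_emboss_data is hypothetical, and the embossdata program remains the authoritative way to see the real search path.

```shell
#!/bin/sh
# find_emboss_data NAME SYSTEM_DATA_DIR: print the first existing copy of
# NAME, following the search order described above, and stop there.
# Returns non-zero if no copy is found. Hypothetical helper for illustration.
find_emboss_data() {
    name=$1
    sysdir=$2
    for dir in . ./.embossdata "$HOME" "$HOME/.embossdata" "$sysdir"; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

For the installation used in this guide the second argument would be /site/prog/emboss/data.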

Logging

Many system administrators may wish to make use of the logging facilities of EMBOSS. Setting the variable emboss_logfile in emboss.default or .embossrc allows the system to keep a log of which programs are used when and by whom.

set emboss_logfile /site/log/emboss.log

The log file structure is very simple. Three tab separated fields are stored: program name, user name, and the date and time.

prettyplot      joeuser        Wed Aug 02 14:29:13 2000
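Because the fields are tab separated, the log is easy to summarise with standard tools. The sketch below counts how often each program was run. The function name log_usage is hypothetical.

```shell
#!/bin/sh
# log_usage LOGFILE: count how many times each program appears in an
# EMBOSS log made of the three tab-separated fields described above.
# Output is "count program", most used first. Hypothetical helper.
log_usage() {
    awk -F '\t' '{ count[$1]++ } END { for (p in count) print count[p], p }' "$1" |
        sort -k1,1nr -k2,2
}
```

For the log file configured above: log_usage /site/log/emboss.log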

The file set in emboss_logfile should be world writable. 

These settings can be overridden in a user's .embossrc file by redefining emboss_logfile. For example, to prevent my system usage being logged I can put the following entry in my .embossrc file.

set emboss_logfile /dev/null

This behaviour may change in the future to prevent users redefining some system settings.