[EMBOSS] EMBOSS 3.0.0 released

Alan Bleasby ableasby at hgmp.mrc.ac.uk
Thu Jul 14 23:43:30 UTC 2005


EMBOSS 3.0.0 is now available for download from:

   ftp://emboss.open-bio.org/pub/EMBOSS/

   and, until the 27th July, from:
   ftp://ftp.rfcgr.mrc.ac.uk/pub/EMBOSS/

The following text details some of the changes from the previous
release.

Alan




EMBOSS main package:

New database indexing programs dbxflat, dbxfasta and dbxgcg. A
dbxblast program will be added if we can extract data from the new
BLAST formatdb output. These programs allow indexing of files
larger than 2 GB.
N.B.: Indexes will be created faster if they are written through a
      different disc controller than that used to read the database
      being indexed. If that is not possible then reading from and
      writing to different hard drives on the same controller is
      recommended. Note that each index can be created independently
      of the others e.g. you can create keyword and description
      indexes after you've created the ID and ACC indexes.

To support these programs, the emboss.default and .embossrc files can
include "resource" definitions. See the documentation of these
programs for more information. "resource" definitions are intended to
define anything other than environment variables and databases.

In the emboss.default and .embossrc files the same name can be used
for variables, databases, and resources (we now store them in separate
tables). In previous versions a single table was used and name clashes
could occur. This becomes an issue with the increasing use of resource
definitions.
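
As a rough illustration of why separate tables remove the clash, here is
a minimal Python sketch. It is not the EMBOSS C implementation: the
class, method names and example values are hypothetical. It only shows
that a single shared table lets a later definition overwrite an earlier
one of the same name, while one table per kind keeps them apart.

```python
# Hypothetical illustration only; not the EMBOSS implementation.

# Old behaviour (as described above): one table for everything,
# so a later definition replaces an earlier one with the same name.
single_table = {}
single_table["embl"] = ("database", "flatfile at /data/embl")
single_table["embl"] = ("resource", "index settings")  # replaces the database entry


# New behaviour: one table per kind of definition.
class Definitions:
    KINDS = ("variable", "database", "resource")

    def __init__(self):
        self.tables = {kind: {} for kind in self.KINDS}

    def define(self, kind, name, value):
        self.tables[kind][name] = value

    def lookup(self, kind, name):
        # Returns None when the name is not defined for this kind.
        return self.tables[kind].get(name)


defs = Definitions()
defs.define("database", "embl", "flatfile at /data/embl")
defs.define("resource", "embl", "index settings")
```

With the per-kind tables, looking up "embl" as a database and as a
resource returns two different values, whereas the single table only
remembers the last definition.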

Sequence sets in ACD have a new attribute "aligned" that reports
whether the sequences are aligned (reading a multiple alignment in for
visualisation) or not (reading a set of sequences into memory for
further processing - perhaps for alignment).

Sequence formats have been reviewed. "experiment" format is that used
by the Staden package. "staden" and "gcg" formats now parse out
comments from anywhere in the sequence. "nexus" and "nexusnon" formats
now correctly report protein sequence datatypes. "nbrf" or "pir"
format data can now be read from an SRSWWW server (for technical
reasons, SRS servers are unable to exactly reproduce NBRF/PIR
format). "clustal" output no longer writes in blocks of 10.  "Phylip3"
output is now renamed "phylipnon" for compatibility with other
non-interleaved output format names. The "phylip3" name remains valid
for back-compatibility. The header record for phylipnon format has
been changed to that accepted by phylip 3.6 (no YF on the header line,
number of sequences specified). Sequence format information on the web
has been updated to reflect these changes.
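
For readers unfamiliar with the layout, the sketch below writes a
generic non-interleaved (sequential) PHYLIP file in Python: a header
line giving the number of sequences and the alignment length with no
option letters, followed by one record per sequence with the name padded
to 10 characters. This follows the general PHYLIP convention and is an
assumption about the layout, not a byte-for-byte copy of what EMBOSS
writes; the function name is hypothetical.

```python
def write_phylip_sequential(records):
    """Return a non-interleaved PHYLIP string for (name, sequence) pairs."""
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("aligned sequences must all have the same length")
    # Header: number of sequences and alignment length, no option letters.
    lines = [f" {len(records)} {length}"]
    for name, seq in records:
        # PHYLIP names occupy a fixed 10-character field.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


example = write_phylip_sequential([("seqA", "ACGT-ACGT"), ("seqB", "ACGTTACGT")])
```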

Codon usage tables can be read in these formats (-format qualifier):
  "emboss",    "EMBOSS codon usage file",
        "All numbers read, #comments for extras"
  "cut",       "EMBOSS codon usage file",
        "Same as EMBOSS, output default format is 'cut'"
  "gcg",       "GCG codon usage file",
        "All numbers read, #comments for extras"
  "cutg",      "CUTG codon usage file",
        "All numbers (cutgaa) read or fraction calculated, extras added"
  "cutgaa",    "CUTG codon usage file with aminoacids",
        "Cutg with all numbers"
  "spsum",     "CUTG species summary file",
        "Number only, species and CDSs in header"
  "cherry",    "Mike Cherry codonusage database file",
        "GCG format with species and CDSs in header"
  "transterm", "TransTerm database file",
        "GCG format with no extras"
  "codehop",   "FHCRC codehop program codon usage file",
        "Freq only, extras at end"
  "staden",    "Staden package codon usage file with percentages",
        "Freq or number only, no extras"
  "numstaden", "Staden package codon usage file with numbers",
        "Number only, no extras. Can be read as 'staden'"

Any of these formats should be readable by default. Some files are
"readable" in more than one format (staden and numstaden for example
can both be read as "staden"). The extra names are used so we can
reuse them as output format names.

For output of codon usage tables, the same formats are available
(-oformat qualifier).
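
The format notes above distinguish "number", "fraction" and "freq"
values. As a hedged illustration of how such values are commonly
related, the Python sketch below counts codons in a coding sequence and
derives, for each codon, the raw number, the fraction among codons for
the same amino acid, and the frequency per thousand codons. These
definitions are the usual convention and are an assumption here; the
announcement does not define them, and the function name is
hypothetical. Only the standard genetic code is handled.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return {codon: (amino_acid, fraction, per_thousand, number)}."""
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Codons containing ambiguity codes are skipped.
    counts = Counter(c for c in codons if c in STANDARD_CODE)
    total = sum(counts.values())
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[STANDARD_CODE[codon]] += n
    table = {}
    for codon, n in counts.items():
        aa = STANDARD_CODE[codon]
        table[codon] = (aa, n / per_aa[aa], 1000.0 * n / total, n)
    return table


usage = codon_usage("ATGGCTGCTGCCTAA")
```

A format that stores only numbers can therefore be turned into one that
needs fractions or frequencies, but not always the other way round,
which is presumably why some formats are listed as "fraction
calculated".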

A new application codcopy (not named codret, to avoid confusion with
the existing EMBOSS program coderet) will convert from one format to
another in the same way as seqret converts sequence formats.

Coderet reports the number of CDS, mRNA and translation sequences.

Correction to sequence numbering for reversed nucleotide sequences in
alignments. Correction to sequence alignment functions returning
slightly suboptimal alignments.

The entrails program reports codon usage formats. Description of
report format entrails output improved. Entrails is built by "make
check" and is provided so that developers of wrappers can obtain all
EMBOSS internal details needed, for example all ACD datatypes and
input/output format names and descriptions.

Sequence types are explicitly set in cons, sixpack and backtranseq as
some output formats failed to recognise them as protein.

EMBASSY packages:

MYEMBOSS is a new EMBASSY package for developing your own code.

Installation requires recent versions of GNU packages autoconf,
automake and libtool.

To install, you must first build the configure and make files with
these commands:

   aclocal -I m4
   autoconf
   automake -a

When you add your own programs, do so by adding source files in
myemboss/source and ACD files in myemboss/emboss_acd and add these
filenames to the Makefile.am files in each directory. There are
"myseq" and "mytest" examples provided to guide you.

There is no need to modify configure or Makefile files - these will be
automatically updated.

To allow MYEMBOSS to be installed by one user, and linked to an EMBOSS
installation maintained for the site by someone else, new variables
are added to locate the ACD files for any EMBASSY package. If myemboss
is not installed in the same place as EMBOSS, define
EMBOSS_MYEMBOSSROOT as the location of the myemboss installed ACD
files or the myemboss/emboss_acd source directory. This requires that
EMBASSY programs call the embInitP function with the name of the
package ("myemboss"). For ACD utilities such as acdvalid or acdc to
work, as these use the EMBOSS embInit call, another variable
EMBOSS_ACDUTILROOT must be defined, pointing to the same directory.
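
The lookup rule described above can be sketched in Python as follows.
This is a hypothetical illustration, not the EMBOSS code: the function
name, the fallback to a site-wide ACD directory and the example paths
are all assumptions. Only the variable name EMBOSS_MYEMBOSSROOT comes
from the text above.

```python
import os


def acd_directory(package, default_acd_dir, environ=None):
    """Pick the directory to search for a package's ACD files.

    Hypothetical sketch: uses EMBOSS_<PACKAGE>ROOT when it is set,
    otherwise falls back to the site-wide ACD directory.
    """
    env = os.environ if environ is None else environ
    return env.get(f"EMBOSS_{package.upper()}ROOT", default_acd_dir)


site_acd = "/usr/local/share/EMBOSS/acd"  # example path, an assumption
env = {"EMBOSS_MYEMBOSSROOT": "/home/user/myemboss/emboss_acd"}
```

With the variable set, acd_directory("myemboss", site_acd, env) returns
the user's own directory; with an empty environment it returns the site
directory.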

The PHYLIP EMBASSY package is a beta release port of PHYLIP 3.6b. We welcome comments on
the EMBOSS interface to the programs. Program names are prefixed by
'f' to avoid clashes with the old PHYLIP EMBASSY package. We still
need to work on adding new tree input and output formats, and updating
the code to PHYLIP 3.63 (December 2004). We are also considering
splitting more of the programs to simplify the ACD interface. In this
release seqboot and treedist are already split. seqboot is split by
input type into seqboot, restboot, discboot and freqboot. Treedist is
split by the number of input files into treedist and
treedistpair. Acdvalid objects to the dependencies in other programs,
for example the method used by fdnadist.

The DOMAINATRIX package of earlier releases has been extended and
replaced by 5 EMBASSY packages described below (32 applications in
total).  These tools were developed as part of a research project and
are distinct from other EMBOSS apps in being intended mostly for
computational biologists rather than biologist end-users.

STRUCTURE

The STRUCTURE package is used for parsing the PDB database and
generating secondary databases of coordinate and derived data.  The
tools have the following scope: (i) For parsing PDB files and writing
clean coordinate files (CCF files) that "clean-up" many PDB
inconsistencies.  For example, residue numbers give the correct index
into the biological sequence.  (ii) To generate CCF files for whole
PDB files or individual domains from the SCOP and CATH databases.
(iii) To augment CCF files with residue solvent accessibility and
secondary structure data.  (iv) To generate contact files (CON files)
of intra-chain and inter-chain residue-residue contact data. (v) To
generate CON files of residue-ligand contact data. (vi) Miscellaneous
file handling, e.g. dictionary of heterogen groups.

DOMAINATRIX

The DOMAINATRIX package is used for handling the SCOP and CATH
databases of protein domain classification, the parsable files of
which can be inconvenient, e.g. for comparative studies, extending and
processing.  The tools have the following scope: (i) For parsing raw
SCOP and CATH parsable files and writing domain classification files
(DCF files) with a single, simple and extensible format. (ii) To add
sequence records to a DCF file. (iii) To remove low resolution
domains.  (iv) To flexibly calculate and remove redundancy.  (v)
Primitive tools for secondary structure element mapping to domains in
a DCF file.

DOMALIGN

The DOMALIGN package is used for generating alignments for families of
domains, especially across large datasets, e.g. the whole of SCOP.
The tools have the following scope: (i) For identifying representative
structures for different nodes in the SCOP and CATH hierarchies.  (ii)
For generating annotated, structure-based sequence alignments for
these nodes.  (iii) For extending these domain alignment files (DAF
files) with sequences of unknown structure. (iv) All-versus-all global
sequence alignment.

DOMSEARCH 

The DOMSEARCH package is used for deriving extended sequence families,
especially from large structural datasets such as the whole of SCOP.
The tools have the following scope: (i) To generate domain hits files
(DHF files) of sequence relatives to an alignment or other
sequences. (ii) To remove fragmentary sequences from a DHF file.
(iii) To flexibly calculate and remove redundancy.  (iv) To remove
hits of ambiguous classification and collate sequences into
families.

SIGNATURE

The SIGNATURE package is used for generating, scanning and evaluating
sparse signatures and other predictive elements for protein sequence
characterisation.  The tools have the following scope: (i) To generate
sparse signatures for protein families from alignments and residue
contact data.  (ii) Generate other types of discriminator (e.g. HMMs)
from alignments. (iii) Generate ligand-binding signatures from
residue-ligand contacts.  (iv) Generate domain hits files (DHF files)
and ligand hits files (LHF files) of hits (sequences) from signature
scans. (v) Interpretation and display of signature performance by
using ROC analysis.
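
For readers new to ROC analysis, the Python sketch below shows the
general idea under simple assumptions: hits from a scan are ranked by
score, each hit is labelled as a true or false family member, and the
curve plots the true-positive rate against the false-positive rate as
the score cutoff is lowered. It is a generic textbook construction, not
the method used by the SIGNATURE applications; the function names and
example scores are hypothetical, and tied scores are not treated
specially.

```python
def roc_points(scored_hits):
    """scored_hits: list of (score, is_true_member). Returns [(fpr, tpr), ...]."""
    positives = sum(1 for _, is_true in scored_hits if is_true)
    negatives = len(scored_hits) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false hit")
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Walk down the ranked list, best score first.
    for _, is_true in sorted(scored_hits, key=lambda h: h[0], reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


hits = [(9.1, True), (8.4, True), (7.0, False), (6.5, True), (3.2, False)]
curve = roc_points(hits)
```

An area close to 1 means true members are ranked above false ones; an
area near 0.5 means the ranking is no better than chance.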


Where data, files, etc. are mentioned above or in the application
documentation, data structures and functions for manipulating them are
usually provided in the AJAX and NUCLEUS C programming libraries.  For
example, there are objects for handling protein atoms, residues,
chains, for SCOP and CATH domains and so on.


