[EMBOSS] EMBOSS 6.4.0 released

Fri Jul 15 08:54:26 UTC 2011

EMBOSS Release 6.4.0

This release is now available on our OBF ftp server.

UNIX version:
   ftp://emboss.open-bio.org/pub/EMBOSS/

mEMBOSS (MS Windows version):
   ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe

It includes major extensions to the type and number of data resources
available to EMBOSS users.

In addition, three books are published by Cambridge University Press:

EMBOSS User's Guide: Practical Bioinformatics
http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB

EMBOSS Developer's Guide: Bioinformatics Programming
http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB

EMBOSS Administrator's Guide: Bioinformatics Software Management
http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB

They are comprehensive and definitive guides to administering,
developing and using EMBOSS. We hope they will prove useful to the
EMBOSS community and to anyone providing training courses covering
EMBOSS.

In addition to these publications we have a new website.

http://emboss.open-bio.org

Updates for the new features in 6.4.0 will be made available soon on
the new EMBOSS website, with tutorials to be developed on the EBI
e-Learning Portal.

Contents:

1.0 New in 6.4.0
1.1 Server definitions
1.2 Access methods
1.3 emboss.standard file
1.4 New data types
1.5 New query language
1.6 Hash tables and lists
1.7 Cross-references
1.8 URL generation
1.9 Database index compression
1.10 Database indexing applications
1.11 Generating server cache files
1.12 Server and database attributes
1.13 HTTP redirection
1.14 EMBOSS version number
1.15 ACD list 'select all'
2.0 EDAM Ontology
2.1 EDAM in ACD files
2.2 EDAM applications
3.0 DRCAT Data Resource Catalogue
4.0 NCBI Taxonomy
5.0 Maintenance
6.0 Installation notes
6.1 UNIX
6.1.1 MySQL
6.1.2 PostgreSQL
6.1.3 axis2c
6.1.4 Other optional library software
6.1.5 eprimer3 and eprimer32
6.2 mEMBOSS
7.0 New EMBASSY applications
8.0 Future development

1.0 New in 6.4.0

1.1 Server definitions

Servers can be defined, in a similar style to a database definition,
but covering all databases available from a single server. The server
definition names a cache file describing each database, its format
and its query fields. Cache files for a core set of public servers are
included in the release.

1.2 Access methods

New access methods are provided, including Ensembl, BioMart, DAS, SOAP
web services (EBI wsdbfetch and ebeye), REST web services (EBI
dbfetch), and GMOD/CHADO. Ensembl access uses code contributed by
Michael Schuster in the Ensembl team at EBI. This code is updated
after each Ensembl API release. Some of these access methods were
available but only partly implemented in the previous release. They
now support standard server and database definitions and are open for
further development.

Data access methods have been restructured to use "text" access for
any method which seeks a position in a file and then opens it for
reading. This includes reading from a URL and returning a pointer to
the start of the output. A few datatype-specific access methods remain,
for example reading sequence data from a PIR/NBRF/GCG format database,
or from the NCBI taxonomy files, or access to database systems via SQL
or DAS.

1.3 emboss.standard file

Previous releases depended on a user defining databases in their
emboss.default file. Release 6.4.0 provides a new emboss.standard
file defining the core servers and databases, and standard resource
settings for database indexing. The local emboss.default file is only
needed for local database definitions and settings.

The configuration files emboss.standard, emboss.default and
~/.embossrc resolve variable references (e.g. in directory names)
during parsing. Extensions to the syntax of these files include ALIAS
to give secondary names to a database. IF, IFDEF, ELSE and ENDIF
directives allow conditional inclusion of sections of the file
dependent on variable settings. Special variables EMBOSS_AXIS2,
EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically
created for this purpose.
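
As a rough illustration of resolving variable references while a
configuration value is parsed, the Python sketch below expands $name
references from a table of variables. The $name spelling, the variable
name and the paths are assumptions made for the example; this is not
the EMBOSS parser.

```python
from string import Template

def resolve(value, variables):
    """Expand $name references in a configuration value.

    Unknown names are left untouched rather than raising an error.
    """
    return Template(value).safe_substitute(variables)

# Hypothetical variable settings, for illustration only.
variables = {"emboss_data": "/opt/emboss/share/EMBOSS/data"}

print(resolve("$emboss_data/TAXONOMY", variables))
print(resolve("$undefined_dir/index", variables))
```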

New variable EMBOSS_STANDARD is automatically defined to be the
share/EMBOSS install directory (or the emboss source code directory if
the package is not installed). This is by default where the
emboss.standard files and server cache files are expected to be
found. The value is reported by "embossversion -full".

1.4 New data types

New data types are available as inputs and outputs of
applications. Each has a simple definition including qualifiers
-iformat for input format and -oformat for output format. The maxreads
attribute defines whether the application expects to read a single
entry (maxreads: 1) or loop over multiple entries (the default). This
is simpler than the sequence and seqall definitions used for sequence
data, which are widely used and will remain unchanged.

* text and outtext: the text of an entry for which EMBOSS has (to
   date) no specialised parser

* obo and oboout: terms in an OBO ontology. Six ontologies are
   included in the release as source and index files (EDAM, GO, SO, RO,
   PW, ECO). We plan to add more and welcome suggestions for inclusion.

* resource and resourceout: entries in the Data Resource Catalogue

* taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and
   included in the release

* url and outurl: a database name from the Data Resource Catalogue, and
   an identifier, converted into a URL which can be pasted into a browser
   to cover cases where the URL does not return simple text or HTML data.

* for future extension, assembly and variation datatypes are defined
   for development and use in a later release.

1.5 New query language

All data types use a common query language. The existing "USA"
(uniform sequence address) syntax is still valid for sequence data,
but is also now used for features, obo terms, data resources, taxons
and plain text data.

In response to comments from our Scientific Advisory Board, we have
extended the query language to cover multiple identifiers, multiple
fields, and operators to combine elements of the query.

* id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id,
   accession, etc.)  in a database

* or operator: dbname-{id:h* | des:hemoglobin} searches for all
   entries with identifiers starting with 'h' plus any others that
   include the word 'hemoglobin' in their descriptions.

* not operator: dbname-{id:h* ! des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions.

* and operator: dbname-{id:h* & des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that also include the
   word 'hemoglobin' in their descriptions.

* eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions, and all those starting with
   another character that do include the word 'hemoglobin' in their
   description. That is, entries matching exactly one of the two
   terms: the or (|) result with the and (&) result removed.
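
The four operators behave like set operations on the results of each
query term. The Python sketch below shows this with two toy result
sets; the identifiers are made up for the example and the code only
illustrates the semantics, not how EMBOSS evaluates a query.

```python
# Toy result sets for the two query terms (hypothetical identifiers).
id_h = {"h1", "h2", "h3"}            # entries matching id:h*
des_hemoglobin = {"h2", "h3", "x9"}  # entries matching des:hemoglobin

print("or  (|):", sorted(id_h | des_hemoglobin))
print("and (&):", sorted(id_h & des_hemoglobin))
print("not (!):", sorted(id_h - des_hemoglobin))
print("eor (^):", sorted(id_h ^ des_hemoglobin))
```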

Query operators are not supported by all access methods. Where an
operator is invalid an error message gives the list of valid
operators. For example, the query syntax for SRS (srs, srswww access)
does not include the exclusive-or (^) operator but supports the
others as these are standard elements in SRS queries.

The query language only allows a single database name in the
query. This allows EMBOSS to combine query results for a single query
expression. To query multiple databases a list file input with one
database query on each line can be used.

Indexed strings containing non-alphabetic characters including white
space are simplified by converting a run of such characters to a
single underscore. The same transformation is applied to a query
string for the dbx (emboss) access method. This is especially useful
for brackets and other characters in data resource names in DRCAT.
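
A minimal Python sketch of this kind of simplification follows. It
assumes that letters and digits are kept and that everything else
counts as a separator; the exact character classes used by the
indexing code may differ, and the resource name is only an example.

```python
import re

def simplify_key(text):
    """Replace each run of characters outside A-Z, a-z and 0-9 with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text)

# The same function would be applied to indexed strings and to query strings.
print(simplify_key("Protein Data Bank (Europe)"))
```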

We hope that the extended query language and the index file
compression will increase the use of locally indexed data in EMBOSS
installations, and welcome feedback on further developments of the
query language and indexing.

1.6 Hash tables and lists

The new query language is supported by extensions to tables and lists
in the libraries. Tables can now be automatically resized. Merge
operations on two tables combine their contents using the same
operations (or, and, not, eor) as the query language. By resizing the
tables first this operation can be made highly efficient. Destructors
can be defined for list data and for table keys and data to
automatically clean up after use. Tables with string keys can use C
char* or string object queries in all cases.
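
To show what merging two tables with these operations means, here is
a small Python sketch using dictionaries in place of the library hash
tables. The function name and the rule that the left-hand value wins
for shared keys are assumptions for the example, not the library API.

```python
def merge_tables(left, right, op):
    """Merge two dict-based tables by key.

    op is one of "or", "and", "not", "eor". Where a key is present in
    both tables, the value from `left` is kept.
    """
    if op == "or":
        keys = left.keys() | right.keys()
    elif op == "and":
        keys = left.keys() & right.keys()
    elif op == "not":
        keys = left.keys() - right.keys()
    elif op == "eor":
        keys = left.keys() ^ right.keys()
    else:
        raise ValueError(f"unknown merge operation: {op}")
    return {k: left[k] if k in left else right[k] for k in keys}

left = {"a": 1, "b": 2}
right = {"b": 20, "c": 30}
for op in ("or", "and", "not", "eor"):
    print(op, sorted(merge_tables(left, right, op).items()))
```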

Lists and tables can now be reference counted, avoiding unnecessary
copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are
collected by extended parsers. New application seqxref reports the
cross-references. New application seqxrefget creates a script to
retrieve cross-referenced data as the original entries, using entret
for sequence data, feattext for feature data, ontotext for ontology
terms, textget for text and urlget for data where "HTML" is the only
available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more
identifiers. Where data is from a UniProt/SwissProt or
EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original
cross-reference is used to select from several possible identifier
terms in EDAM in order to choose the correct query.
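
The idea of turning a query URL plus identifiers into pasteable URLs
can be sketched in Python as below. The template, the %s placeholder
and the identifiers are hypothetical; the sketch does not reproduce
how urlget reads DRCAT or selects the query.

```python
from urllib.parse import quote

def build_urls(template, identifiers):
    """Fill a query URL template with each identifier, percent-encoding it."""
    return [template.replace("%s", quote(ident, safe="")) for ident in identifiers]

# Hypothetical template and identifiers, for illustration only.
template = "https://example.org/entry/%s"
for url in build_urls(template, ["ABC123", "X 42"]):
    print(url)
```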

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed
automatically. These files, especially for secondary text indexes such
as description, taxonomy or keyword, could be very sparse. Up to 95%
space savings were achieved in some cases. The indexes are still
updatable by code which uncompresses, updates, and recompresses
on-the-fly using a copy of the index.
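
To give a feel for why sparse index pages compress so well, the Python
sketch below compresses a mostly empty page with zlib. Both zlib and
the page size are stand-ins chosen for the example; this note does not
say which compression scheme or page layout the dbx indexes use.

```python
import zlib

PAGE_SIZE = 2048  # hypothetical page size in bytes

# A mostly empty index page: a few short keys followed by zero padding.
keys = b"hemoglobin\0myoglobin\0"
page = keys + bytes(PAGE_SIZE - len(keys))

compressed = zlib.compress(page)
saving = 1 - len(compressed) / len(page)
print(f"{len(page)} bytes -> {len(compressed)} bytes ({saving:.0%} saved)")

# The page is recovered exactly, so it can be updated and recompressed.
assert zlib.decompress(compressed) == page
```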

1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax
(NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new
data resources provided as standard. Users can install new releases of
the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in
6.3.1 as a special addition for one user to test and is now fully
supported.

New applications dbxreport and dbxstat report on the overall and
detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one
included in the emboss.standard file. Users can continue to define
their own resource files. Indexing "resource" definitions can now
specify the maximum length of any field, and the page size and cache
size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example,
the DAS registry or Ensembl) to update the server cache file with a
current set of database definitions. When run by the system
administrator these can update the site-wide cache file, but they can
also be run by an individual user to create a user-specific set of
databases. The cache files are time stamped. EMBOSS uses the most
recent system or user file.
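
Choosing the most recent of a system and a user cache file could look
like the Python sketch below. It assumes the time stamp is the file
modification time and uses made-up paths; the real cache files may
record their time stamp differently.

```python
import os

def newest_cache(paths):
    """Return the existing path with the latest modification time, or None."""
    existing = [p for p in paths if os.path.exists(p)]
    if not existing:
        return None
    return max(existing, key=os.path.getmtime)

# Hypothetical system-wide and per-user cache file locations.
candidates = [
    "/hypothetical/share/EMBOSS/server.example",
    os.path.expanduser("~/.hypothetical/server.example"),
]
print(newest_cache(candidates))
```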

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the
attributes of a single named server. We expect to extend these
applications once we have feedback on the most useful information they
should report. New application dbtell similarly reports on the
attributes of a single named database.

Database (and server) definitions can use an attribute more than once
if it is defined as "multiple". These include a new "field:" attribute
which gives the name and description of a query field. A list of
"field:" attributes supersedes the old "fields:" attribute which listed
all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS
sequence" fields to any name. "id" and "acc" are assumed to be the
names of identifier and accession fields. The "hasaccession" attribute
is set automatically for databases where no "acc" field is found,
avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects
and automatically replaces the results with the output from the
redirected URL. Where redirected URLs were found in standard database
definitions (e.g. the EBI's dbfetch service) these have been replaced
by the current URL. We have also seen redirects from case-sensitive
servers which redirect a lower case accession number to one in upper
case in the same URL.
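
The redirect handling described here can be sketched in Python as
follows. The fetch callable, the fake server and the URLs are
hypothetical and stand in for a real HTTP client; this is not the
EMBOSS HTTP code.

```python
from urllib.parse import urljoin

def fetch_following_redirects(fetch, url, max_hops=5):
    """Fetch url, following 3xx Location headers up to max_hops times.

    `fetch` is any callable returning (status, headers, body).
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        if status in (301, 302, 303, 307, 308) and "Location" in headers:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, headers["Location"])
            continue
        return url, body
    raise RuntimeError("too many redirects")

# A fake server that redirects a lower-case accession to upper case.
def fake_fetch(url):
    if url.endswith("/x12345"):
        return 302, {"Location": "/entry/X12345"}, ""
    return 200, {}, "entry text for " + url

print(fetch_following_redirects(fake_fetch, "http://example.org/entry/x12345"))
```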

1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit
is only there so that the Windows port (mEMBOSS) shows the same
version number for QA testing. In mEMBOSS the final digit is the build
number. QA tests for mEMBOSS now use the same test definition and
qatest script as on Linux. mEMBOSS file handling and reporting has
been adapted to support POSIX and Windows style paths.
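
A script that needs to compare release numbers across platforms might
split the four digits as in this Python sketch. The function name is
made up for the example.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

release = parse_version("6.4.0.0")
print(release[:3])  # the release number shared by all platforms
print(release[3])   # the final digit, the build number in mEMBOSS
```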

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for
"select all" if the "minimum" attribute allows all terms to be
selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed
by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for
applications and data), operations (algorithms), formats, identifiers
and data (semantic descriptions of data content). EDAM terms are used
throughout this release: to annotate all ACD files at the application,
input, parameter and output levels; to annotate data resources and
their web queries in the Data Resource Catalogue; and to annotate
database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id
and the human-readable name. The EMBOSS application groups have been
extended to match the EDAM topic annotations, with some applications
moving to different or new groups. EDAM has been used to validate
these groups by comparing the topics hierarchy with the group
designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications
edamname and edamdef.

EDAM and other ontologies are supported by new applications (ontoget,
ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot,
ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all
matching terms and their descendants, and compare to: applications
(wosstopic, wossoperation, wossinput, wossoutput, wossdata); data
resources (drfindresource, drfindid, drfindformat, drfinddata); and
related EDAM terms (edamhasinput, edamhasoutput, edamisid,
edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT
started as a description of databases found as cross-references in
UniProt/SwissProt, extended by adding databases found as
cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids
Research, ELIXIR, and other sources. Any database in DRCAT can be used
by name from an EMBOSS application, returning sequence, feature, or
text if a suitable data format is defined for any query, or creating a
URL which can be pasted into a browser where the results are, for
example, a graphical display using javascript which EMBOSS cannot
interpret. We aim to further extend and improve DRCAT in future
releases.

4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the
release. New applications retrieve single nodes and their ancestors
and descendants (taxget, taxgetup, taxgetdown, taxgetspecies,
taxgetrank).
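
Retrieving the ancestors and descendants of a node amounts to walking
a parent table, as in the Python sketch below. The node ids form a toy
tree invented for the example, not real NCBI taxonomy identifiers, and
the functions are not the taxget applications themselves.

```python
# Toy taxonomy: child node id -> parent node id (hypothetical ids).
parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 4}

def ancestors(node):
    """Return the ancestors of node, nearest first, up to the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def descendants(node):
    """Return all nodes below node, in sorted order."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        children = [c for c, p in parent.items() if p == current]
        found.extend(children)
        stack.extend(children)
    return sorted(found)

print(ancestors(6))
print(descendants(2))
```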

5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with
another utility. The name is also in keeping with the EMBOSS naming of
other protein analysis applications.

Sequence and features formats have been reviewed and updated,
especially GFF3, GenPept, SAM, BAM and treecon. GFF3 output now more closely
follows the official standard, including the escaping of special
characters in the tag/value final column. GFF3 ID and Parent tags are
supported.
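
For readers unfamiliar with the GFF3 escaping rule, the Python sketch
below percent-encodes the characters that the GFF3 specification
reserves in the final column (semicolon, equals, ampersand, comma,
percent, tab, newline, carriage return and other control characters).
It illustrates the rule only and is not the EMBOSS output code.

```python
def escape_gff3_value(value):
    """Percent-encode characters that are reserved in a GFF3 attribute value."""
    reserved = set(";=&,%\t\n\r")
    out = []
    for ch in value:
        if ch in reserved or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)

print("Note=" + escape_gff3_value("kinase; ATP-binding, 100% match"))
```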

Features with exons are now stored as a list of exon subfeatures.
This change allows easier sorting of features by location, keeping
groups of features together, and has simplified the generation of
several feature output formats.
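
The benefit for sorting can be seen in a small Python sketch where
each feature carries its exons as a list. The class, field names and
coordinates are invented for the example and do not mirror the
EMBOSS feature structures.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    start: int
    end: int
    exons: list = field(default_factory=list)  # list of (start, end) pairs

features = [
    Feature("geneB", 500, 900, exons=[(500, 600), (800, 900)]),
    Feature("geneA", 100, 400, exons=[(100, 150), (300, 400)]),
]

# Sorting the parent features by location keeps each group of exons together.
for feat in sorted(features, key=lambda f: (f.start, f.end)):
    print(feat.name, feat.exons)
```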

Graphical output for more than one input sequence has been corrected
and enhanced.

The lindna application has been adjusted to correctly relocate
overlapping text and to generate a clean sequence ruler for any range
of positions. New report formats allow reported hits (-rformat draw)
and restriction sites (-rformat restrict) to be plotted by lindna. We
expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version)
when an inverted repeat maximum score was close to the edge of the
search window. This was seen only at low threshold scores. Searches
with low threshold scores can be expected to yield slightly different
choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed
to be "true" if missing. Previous releases defined them as "false"
internally, but fortunately no parsers seem to have used the internal
default value.

Database indexes created by the dbx programs now include a count of
unique and total keys. The text index files also report the type as
"Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on
the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB
compared with the last major release. This is largely due to
pre-supplied data and index files for ontology/taxonomy/etc. A
typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated
packages which may be installed prior to configuration that
will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be
automatically configured if the (MySQL-supplied) 'mysql_config'
application is found in the PATH and if the associated development
files (compiler headers etc) are also installed. As an example, for
Linux systems, both things will be done by installing the mysql-devel
(RPM distributions) or mysql-dev (Debian-based distributions) package.
If your MySQL installation is in some arbitrary location then you can
specify it using the --with-mysql= configuration switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar
considerations apply to those described for MySQL above.
Auto-detection is based on the presence of 'pg_config' in the PATH,
the dev[el] files must be installed, and the --with-postgresql
configuration switch can be used for arbitrary locations.
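
Before running configure, an administrator could check whether the two
helper programs are visible in the PATH with a short Python script
such as the sketch below. The function name is made up; only the
program names 'mysql_config' and 'pg_config' come from the notes above.

```python
import shutil

def find_config_tools(names=("mysql_config", "pg_config")):
    """Map each helper program name to its location in the PATH, or None."""
    return {name: shutil.which(name) for name in names}

for name, path in find_config_tools().items():
    print(f"{name}: {path or 'not found in PATH'}")
```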

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for
retrieval from SOAP servers:

  http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux
users may find themselves having to install from
source (and may need to do an 'autoreconf -fi' prior to
configuration to fix a subsequent compilation error on some
systems).

Auto-detection (by EMBOSS) of this package is based on the
presence of a pkgconfig file that axis2c installs. It is
advised that you install pkgconfig if not already installed
(it usually is pre-installed on Linux systems). EMBOSS has a
--with-axis2c= configure switch if you install axis2c into
a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf
aka libharu) follows considerations given in previous releases and
should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs
significantly from the 1.x.x series. Unfortunately the executable has
the same name in both releases (primer3_core).  EMBOSS 6.4.0
provides two wrappers for these releases: eprimer3 is for the 1.x.x
version and requires the primer3 executable to be called
'primer3_core' (this has always been the case); eprimer32 is for the
2.x.x version and requires the primer3 executable to be called
'primer32_core'.

This may involve some minor symlinking and/or directory/PATH
reorganisation by administrators.
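
One way to do the symlinking is sketched in Python below. The function
name and any paths passed to it are hypothetical; only the link name
'primer32_core' comes from the notes above.

```python
import os

def link_primer32(primer3_v2_path, bin_dir):
    """Create bin_dir/primer32_core as a symlink to a Primer3 2.x executable.

    An existing file or link of that name is left untouched.
    """
    link = os.path.join(bin_dir, "primer32_core")
    if not os.path.lexists(link):
        os.symlink(primer3_v2_path, link)
    return link
```

The function would be called with the location of the 2.x.x
primer3_core executable and a directory that is in the PATH.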

6.2 mEMBOSS

A typical installation executable is approximately 70MB and results
in an installation size of approximately 570MB.

MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of
the mEMBOSS installation.

The QA test suite has been extended to automatically find and test
both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above),
the old SRS database definitions have been removed. You can now access
databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such
retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications

We have provided a wrapper package for the recently released clustal
omega software which must, of course, also be installed.  We will add
new releases of MIRA and VIENNA at a later date, when the new versions
of the original packages are released and integrated.

8.0 Future development

EMBOSS is fully funded until the end of December. We have an ambitious
schedule of further developments planned for this period. There will
be a further release of EMBOSS at the end of the year.

We welcome any and all suggestions from our user and developer
communities for immediate needs and future directions.

At the end of this year the EMBOSS team will be leaving EBI. Peter
Rice's maximum 9-year tenure is coming to an end. We do not yet know
where we will be from January and are open to suggestions for ways to
host and/or to fund further EMBOSS development and for potentially
useful partnerships and collaborations to continue the advances we
have made.

We can most certainly guarantee that we will continue to maintain the
existing code base and the latest releases.

Alan