From ajb at ebi.ac.uk Fri Jul 15 04:52:07 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 15 Jul 2011 09:52:07 +0100 (BST)
Subject: [emboss-announce] EMBOSS 6.4.0 released
Message-ID: <59971.82.26.12.214.1310719927.squirrel@imap04.ebi.ac.uk>

EMBOSS Release 6.4.0

This release is now available on our OBF ftp server.

UNIX version:
ftp://emboss.open-bio.org/pub/EMBOSS/

mEMBOSS (MS Windows version):
ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe

It includes major extensions to the type and number of data resources available to EMBOSS users.

In addition, three books are published by Cambridge University Press:

EMBOSS User's Guide: Practical Bioinformatics
http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB

EMBOSS Developer's Guide: Bioinformatics Programming
http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB

EMBOSS Administrator's Guide: Bioinformatics Software Management
http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB

They are comprehensive and definitive guides to administering, developing and using EMBOSS. We hope they will prove useful to the EMBOSS community and to anyone providing training courses covering EMBOSS.

In addition to these publications we have a new website.
http://emboss.open-bio.org

Updates for the new features in 6.4.0 will be made available soon on the new EMBOSS website, with tutorials to be developed on the EBI e-Learning Portal.

Contents:

1.0 New in 6.4.0
    1.1 Server definitions
    1.2 Access methods
    1.3 emboss.standard file
    1.4 new data types
    1.5 new query language
    1.6 Hash tables and lists
    1.7 Cross-references
    1.8 URL generation
    1.9 Database index compression
    1.10 Database indexing applications
    1.11 Generating server cache files
    1.12 Server and database attributes
    1.13 HTTP redirection
    1.14 EMBOSS version number
    1.15 ACD list 'select all'
2.0 EDAM Ontology
    2.1 EDAM in ACD files
    2.2 EDAM applications
3.0 DRCAT Data Resource Catalogue
4.0 NCBI Taxonomy
5.0 Maintenance
6.0 Installation Notes
    6.1 UNIX
        6.1.1 MySQL
        6.1.2 PostgreSQL
        6.1.3 axis2c
        6.1.4 Other optional library software
        6.1.5 eprimer3 and eprimer32
    6.2 mEMBOSS
7.0 New EMBASSY applications
8.0 Future

1.0 New in 6.4.0

1.1 Server definitions

Servers can be defined, in a similar style to a database definition, but covering all databases available from a single server. The server definition names a cache file describing each database, its format and its query fields. Cache files for a core set of public servers are included in the release.

1.2 Access methods

New access methods are provided, including Ensembl, BioMart, DAS, SOAP web services (EBI wsdbfetch and ebeye), REST web services (EBI dbfetch), and GMOD/CHADO.

Ensembl access uses code contributed by Michael Schuster in the Ensembl team at EBI. This code is updated after each Ensembl API release.

Some of these access methods were available but only partly implemented in the previous release. They now support standard server and database definitions and are open for further development.

Data access methods have been restructured to use "text" access for any method which seeks a position in a file and then opens it for reading. This includes reading from a URL and returning a pointer to the start of the output.
A few datatype-specific access methods remain, for example reading sequence data from a PIR/NBRF/GCG format database, or from the NCBI taxonomy files, or access to database systems via SQL or DAS.

1.3 emboss.standard file

Previous releases depended on a user defining databases in their emboss.default file. Release 6.4.0 provides a new emboss.standard file defining the core servers and databases, and standard resource settings for database indexing. The local emboss.default file is only needed for local database definitions and settings.

The configuration files emboss.standard, emboss.default and ~/.embossrc resolve variable references (e.g. in directory names) during parsing.

Extensions to the syntax of these files include ALIAS to give secondary names to a database. IF, IFDEF, ELSE and ENDIF directives allow conditional inclusion of sections of the file dependent on variable settings. Special variables EMBOSS_AXIS2, EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically created for this purpose.

New variable EMBOSS_STANDARD is automatically defined to be the share/EMBOSS install directory (or the emboss source code directory if the package is not installed). This is by default where the emboss.standard files and server cache files are expected to be found. The value is reported by "embossversion -full".

1.4 new data types

New data types are available as inputs and outputs of applications. Each has a simple definition including qualifiers -iformat for input format and -oformat for output format. The maxreads attribute defines whether the application expects to read a single entry (maxreads: 1) or loop over multiple entries (the default). This is simpler than the sequence and seqall definitions for sequence which are widely used and will remain unchanged.

* text and outtext: the text of an entry for which EMBOSS has (to date) no specialised parser

* obo and oboout: terms in an OBO ontology.
  Six ontologies are included in the release as source and index files (EDAM, GO, SO, RO, PW, ECO). We plan to add more and welcome suggestions for inclusion.

* resource and resourceout: entries in the Data Resource Catalogue

* taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and included in the release

* url and outurl: a database name from the Data Resource Catalogue, and an identifier, converted into a URL which can be pasted into a browser to cover cases where the URL does not return simple text or HTML data.

* for future extension, assembly and variation datatypes are defined for development and use in a later release.

1.5 New query language

All data types use a common query language. The existing "USA" (uniform sequence address) syntax is still valid for sequence data, but is also now used for features, obo terms, data resources, taxons and plain text data.

In response to comments from our Scientific Advisory Board, we have extended the query language to cover multiple identifiers, multiple fields, and operators to combine elements of the query.

* id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id, accession, etc.) in a database

* or operator: dbname-{id:h* | des:hemoglobin} searches for all entries with identifiers starting with 'h' plus any others that include the word 'hemoglobin' in their descriptions.

* not operator: dbname-{id:h* ! des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions.

* and operator: dbname-{id:h* & des:hemoglobin} searches for all entries with identifiers starting with 'h' that also include the word 'hemoglobin' in their descriptions.

* eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions, and all those starting with another character that do include the word 'hemoglobin' in their description.
  This is the opposite of the and (&) operator.

Query operators are not supported by all access methods. Where an operator is invalid an error message gives the list of valid operators. For example, the query syntax for SRS (srs, srswww access) does not include the exclusive-or (^) operator but supports the others as these are standard elements in SRS queries.

The query language only allows a single database name in the query. This allows EMBOSS to combine query results for a single query expression. To query multiple databases a list file input with one database query on each line can be used.

Indexed strings containing non-alphabetic characters including white space are simplified by converting a run of such characters to a single underscore. The same transformation is applied to a query string for the dbx (emboss) access method. This is especially useful for brackets and other characters in data resource names in DRCAT.

We hope that the extended query language and the index file compression will increase the use of locally indexed data in EMBOSS installations, and welcome feedback on further developments of the query language and indexing.

1.6 Hash tables and lists

The new query language is supported by extensions to tables and lists in the libraries.

Tables can now be automatically resized. Merge operations on two tables combine their contents using the same operations (or, and, not, eor) as the query language. By resizing the tables first this operation can be made highly efficient.

Destructors can be defined for list data and for table keys and data to automatically clean up after use.

Tables with string keys can use C char* or string object queries in all cases.

Lists and tables can now be reference counted, avoiding unnecessary copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are collected by extended parsers. New application seqxref reports the cross-references.
New application seqxrefget creates a script to retrieve cross-referenced data as the original entries, using entret for sequence data, feattext for feature data, ontotext for ontology terms, textget for text and urlget for data where "HTML" is the only available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more identifiers. Where data is from a UniProt/SwissProt or EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original cross-reference is used to select from several possible identifier terms in EDAM in order to choose the correct query.

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed automatically. These files, especially for secondary text indexes such as description, taxonomy or keyword, could be very sparse. Up to 95% space savings were achieved in some cases. The indexes are still updatable by code which uncompresses, updates, and recompresses on-the-fly using a copy of the index.

1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax (NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new data resources provided as standard. Users can install new releases of the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in 6.3.1 as a special addition for one user to test and is now fully supported.

New applications dbxreport and dbxstat report on the overall and detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one included in the emboss.standard file. Users can continue to define their own resource files.

Indexing "resource" definitions can now specify the maximum length of any field, and the page size and cache size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example, the DAS registry or Ensembl) to update the server cache file with a current set of database definitions.

When run by the system administrator these can update the site-wide cache file, but they can also be run by an individual user to create a user-specific set of databases. The cache files are time stamped. EMBOSS uses the most recent system or user file.

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the attributes of a single named server. We expect to extend these applications once we have feedback on the most useful information they should report.

New application dbtell similarly reports on the attributes of a single named database.

Database (and server) definitions can use an attribute more than once if it is defined as "multiple". These include a new "field:" attribute which gives the name and description of a query field. A list of "field:" attributes supersedes the old "fields:" attribute which listed all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS sequence" fields to any name. "id" and "acc" are assumed to be the names of identifier and accession fields.

The "hasaccession" attribute is set automatically for databases where no "acc" field is found, avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects and automatically replaces the results with the output from the redirected URL.

Where redirected URLs were found in standard database definitions (e.g. the EBI's dbfetch service) these have been replaced by the current URL. We have also seen redirects from case-sensitive servers which redirect a lower case accession number to one in upper case in the same URL.

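The redirect check described in 1.13 can be sketched roughly as follows. This is a minimal Python illustration only, not the EMBOSS C code: the function name redirect_target, the set of status codes treated as redirects, and the example.org URL are all assumptions made for the example.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption;
# the release notes do not say which codes EMBOSS checks).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, raw_header):
    """Return the URL to fetch instead, or None if the header is not a redirect."""
    lines = raw_header.replace("\r\n", "\n").split("\n")

    # The status line looks like: HTTP/1.1 301 Moved Permanently
    parts = lines[0].split(None, 2)
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    if int(parts[1]) not in REDIRECT_CODES:
        return None

    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            # Location may be relative, so resolve it against the request URL
            return urljoin(request_url, value.strip())
    return None


# Hypothetical case-sensitive server that redirects a lower case
# accession number to the upper case form in the same URL.
header = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: /fetch?id=P12345\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
print(redirect_target("http://example.org/fetch?id=p12345", header))
```

A caller would repeat the request against the returned URL and use that output in place of the original response, ideally with a limit on the number of hops so that a redirect loop cannot run forever.
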
1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit is only there so that the Windows port (mEMBOSS) shows the same version number for QA testing. In mEMBOSS the final digit is the build number.

QA tests for mEMBOSS now use the same test definition and qatest script as on Linux. mEMBOSS file handling and reporting has been adapted to support POSIX and Windows style paths.

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for "select all" if the "minimum" attribute allows all terms to be selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed by Jon Ison in the EMBOSS team.

EDAM describes terms for topics (for applications and data), operations (algorithms), formats, identifiers and data (semantic descriptions of data content).

EDAM terms are used throughout this release: to annotate all ACD files at the application, input, parameter and output levels; to annotate data resources and their web queries in the Data Resource Catalogue; and to annotate database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id and the human-readable name.

The EMBOSS application groups have been extended to match the EDAM topic annotations, with some applications moving to different or new groups. EDAM has been used to validate these groups by comparing the topics hierarchy with the group designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications edamname and edamdef.
EDAM and other ontologies are supported by new applications (ontoget, ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot, ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all matching terms and their descendants, and compare to: applications (wosstopic, wossoperation, wossinput, wossoutput, wossdata); data resources (drfindresource, drfindid, drfindformat, drfinddata); and related EDAM terms (edamhasinput, edamhasoutput, edamisid, edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT started as a description of databases found as cross-references in UniProt/SwissProt, extended by adding databases found as cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids Research, ELIXIR, and other sources.

Any database in DRCAT can be used by name from an EMBOSS application, returning sequence, feature, or text if a suitable data format is defined for any query, or creating a URL which can be pasted into a browser where the results are, for example, a graphical display using javascript which EMBOSS cannot interpret.

We aim to further extend and improve DRCAT in future releases.

4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the release. New applications retrieve single nodes and their ancestors and descendants (taxget, taxgetup, taxgetdown, taxgetspecies, taxgetrank).

5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with another utility. The name is also in keeping with the EMBOSS naming of other protein analysis applications.

Sequence and features formats have been reviewed and updated, especially GFF3, GenPept, SAM, BAM and treecon.

GFF3 output now more closely follows the official standard, including the escaping of special characters in the tag/value final column. GFF3 ID and Parent tags are supported.
Features with exons are now stored as a list of exon subfeatures. This change allows easier sorting of features by location, keeping groups of features together, and has simplified the generation of several feature output formats.

Graphical output for more than one input sequence has been corrected and enhanced.

The lindna application has been adjusted to correctly relocate overlapping text and to generate a clean sequence ruler for any range of positions. New report formats allow reported hits (-rformat draw) and restriction sites (-rformat restrict) to be plotted by lindna. We expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version) when an inverted repeat maximum score was close to the edge of the search window. This was seen only at low threshold scores. Searches with low threshold scores can be expected to yield slightly different choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed to be "true" if missing. Previous releases defined them as "false" internally, but fortunately no parsers seem to have used the internal default value.

Database indexes created by the dbx programs now include a count of unique and total keys. The text index files also report the type as "Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB compared with the last major release. This is largely due to pre-supplied data and index files for ontology/taxonomy/etc. A typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated packages which may be installed prior to configuration that will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code.
It will be automatically configured if the (MySQL-supplied) 'mysql_config' application is found in the PATH and if the associated development files (compiler headers etc) are also installed. As an example, for Linux systems, both things will be done by installing the mysql-devel (RPM distributions) or mysql-dev (Debian-based distributions) package. If your MySQL installation is in some arbitrary location then you can specify it using the --with-mysql= configuration switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar considerations apply to those described for MySQL above. Auto-detection is based on the presence in the PATH of 'pg_config'. The dev[el] files must be installed, and the --with-postgresql configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for retrieval from SOAP servers:
http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux users may find themselves having to install from source (and may need to do an 'autoreconf -fi' prior to configuration to fix a subsequent compilation error on some systems). Auto-detection (by EMBOSS) of this package is based on the presence of a pkgconfig file that axis2c installs. It is advised that you install pkgconfig if not already installed (it usually is pre-installed on Linux systems). EMBOSS has a --with-axis2c= configure switch if you install axis2c into a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf aka libharu) follows the considerations given in previous releases and should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs significantly from the 1.x.x series. Unfortunately the executable is called the same for both releases (primer3_core).
EMBOSS 6.4.0 provides two wrappers for these releases: eprimer3 is for the 1.x.x version and requires the primer3 executable to be called 'primer3_core' (this has always been the case); eprimer32 is for the 2.x.x version and requires the primer3 executable to be called 'primer32_core'. This may involve some minor symlinking and/or directory/PATH reorganisation by administrators.

6.2 mEMBOSS

A typical installation executable is approximately 70MB and results in an installation size of approximately 570MB. MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of the mEMBOSS installation.

The QA test suite has been extended to automatically find and test both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above), the old SRS database definitions have been removed. You can now access databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications

We have provided a wrapper package for the recently released clustal omega software which must, of course, also be installed.

We will add new releases of MIRA and VIENNA at a later date, when the new versions of the original packages are released and integrated.

8.0 Future development

EMBOSS is fully funded until the end of December. We have an ambitious schedule of further developments planned for this period. There will be a further release of EMBOSS at the end of the year. We welcome any and all suggestions from our user and developer communities for immediate needs and future directions.

At the end of this year the EMBOSS team will be leaving EBI. Peter Rice's maximum 9 year tenure is coming to an end.
We do not yet know where we will be from January and are open to suggestions for ways to host and/or to fund further EMBOSS development and for potentially useful partnerships and collaborations to continue the advances we have made. We can most certainly guarantee that we will continue to maintain the existing code base and the latest releases. Alan From ajb at ebi.ac.uk Tue Jul 26 11:24:35 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST) Subject: [emboss-announce] mEMBOSS 6.4.0.1 available Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk> This is a bugfix release for the MS Windows version of EMBOSS, primarily to fix a problem printing very long ('long long') integers. Though most users would be unlikely to hit this problem an uninstall/reinstall is nevertheless recommended. The release also contains a few minor bugfixes, notably making visible some potentially hidden SOAP server definitions. It is available from the usual place: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe Alan From ajb at ebi.ac.uk Fri Jul 15 08:52:07 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Fri, 15 Jul 2011 09:52:07 +0100 (BST) Subject: [emboss-announce] EMBOSS 6.4.0 released Message-ID: <59971.82.26.12.214.1310719927.squirrel@imap04.ebi.ac.uk> EMBOSS Release 6.4.0 This release is now available on our OBF ftp server. UNIX version: ftp://emboss.open-bio.org/pub/EMBOSS/ mEMBOSS (MS Windows version): ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe It includes major extensions to the type and number of data resources available to EMBOSS users. 
In addition, three books are published by Cambridge University Press: EMBOSS User's Guide: Practical Bioinformatics http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB EMBOSS Developer's Guide: Bioinformatics Programming http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB EMBOSS Administrator's Guide: Bioinformatics Software Management http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB They are comprehensive and definitive guides to administering, developing and using EMBOSS. We hope they will prove useful to the EMBOSS community and to anyone providing training courses covering EMBOSS. In addition to these publications we have a new website. http://emboss.open-bio.org Updates for the new features in 6.4.0 will be made available soon on the new EMBOSS website, with tutorials to be developed on the EBI e-Learning Portal. Contents: 1.0 New in 6.4.0 1.1 Server definitions 1.2 Access methods 1.3 emboss.standard file 1.4 new data types 1.5 new query language 1.6 Hash tables and lists 1.7 Cross-references 1.8 URL generation 1.9 Database index compression 1.10 Database indexing applications 1.11 Generating server cache files 1.12 Server and database attributes 1.13 HTTP redirection 1.14 EMBOSS version number 1.15 ACD list 'select all' 2.0 EDAM Ontology 2.1 EDAM in ACD files 2.2 EDAM applications 3.0 DRCAT Data Resource Catalogue 4.0 NCBI Taxonomy 5.0 Maintenance 6.0 Installation Notes 6.1 UNIX 6.1.1 MySQL 6.1.2 PostgreSQL 6.1.3 axis2c 6.1.4 Other optional library software 6.1.5 eprimer3 and eprimer32 6.2 mEMBOSS 7.0 New EMBASSY applications 8.0 Future 1.0 New in 6.4.0 1.1 Server definitions Servers can be defined, in a similar style to a database definition, but covering all databases available from a single server. The server definition names a cache file describing each database, its format and its query fields. Cache files for a core set of public servers are included in the release. 
1.2 Access methods New access methods are provided, including Ensembl, BioMart, DAS, SOAP web services (EBI wsdbfetch and ebeye), REST web services (EBI dbfetch), and GMOD/CHADO. Ensembl access uses code contributed by Michael Schuster in the Ensembl team at EBI. This code is updated after each Ensembl API release. Some of these access methods were available but only partly implemented in the previous release. They now support standard server and database definitions and are open for further development. Data access methods have been restructured to use "text" access for any method which seeks a position in a file and then opens it for reading. This includes reading from a URL and returning a pointer to the start of the output. A few datatype-specific access methods remain, for example reading sequence data from a PIR/NBRF/GCG format database, or from the NCBI taxonomy files, or access to database systems via SQL or DAS. 1.3 emboss.standard file Previous releases depended on a user defining databases in their emboss.defaults file. Release 6.4.0 provides a new emboss.standard file defining the core servers and databases, and standard resource settings for database indexing. The local emboss.default file is only needed for local database definitions and settings. The configuration files emboss.standard, emboss.default and ~/.embossrc resolve variable references (e.g. in directory names) during parsing. Extensions to the syntax of these files include ALIAS to give secondary names to a database. IF, IFDEF, ELSE and ENDIF directives allow conditional inclusion of sections of the file dependent on variable settings. Special variables EMBOSS_AXIS2, EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically created for this purpose. New variable EMBOSS_STANDARD is automatically defined to be the share/EMBOSS install directory (or the emboss source code directory if the package is not installed). 
This is by default where the emboss.standard files and server cache files are expected to be found. The value is reported by "embossversion -full" 1.4 new data types New data types are available as inputs and outputs or applications. Each has a simple definition including qualifiers -iformat for input format and -oformat for output format. The maxreads attribute defines whether the application expects to read a single entry (maxreads: 1) or loop over multiple entries (the default). This is simpler than the sequence and seqall definitions for sequence which are widely used and will remain unchanged. * text and outtext: the text of an entry for which EMBOSS has (to date) no specialised parser * obo and oboout: terms in an OBO ontology. Six ontologies are included in the release as source and index files (EDAM, GO, SO, RO, PW, ECO). We plan to add more and welcome suggestions for inclusion. * resource and resourceout: entries in the Data Resource Catalogue * taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and included in the release * url and outurl: a database name from the Data Resource Catalogue, and an identifier, converted into a URL which can be pasted into a browser to cover cases where the URL does not return simple text or HTML data. * for future extension, assembly and variation datatypes are defined for development and use in a later release. 1.5 New query language All data types use a common query language. The existing "USA" (uniform sequence address) syntax is still valid for sequence data, but is also now used for features, obo terms, data resources, taxons and plain text data. In response to comments from our Scientific Advisory Board, we have extended the query language to cover multiple identifiers, multiple fields, and operators to combine elements of the query. * id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id, accession, etc.) 
in a database * or operator: dbname-{id:h* | des:hemoglobin} searches for all entries with identifiers starting with 'h' plus any others that include the word 'hemoglobin' in their descriptions. * not operator: dbname-{id:h* ! des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions. * and operator: dbname-{id:h* & des:hemoglobin} searches for all entries with identifiers starting with 'h' that also include the word 'hemoglobin' in their descriptions. * eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions, and all those starting with another character that do include the word 'hemoglobin' in their description. This is the opposite of the and (&) operator. Query operators are not supported by all access methods. Where an operator is invalid an error message gives the list of valid operators. For example, the query syntax for SRS (srs, srswww access) does not include the exclusive-or (^) operator but supports the others as these are standard elements in SRS queries. The query language only allows a single database name in the query. This allows EMBOSS to combine query results for a single query expression. To query multiple databases a list file input with one database query on each line can be used. Indexed strings containing non-alphabetic characters including white space are simplified by converting a run of such characters to a single underscore. The same transformation is applied to a query string for the dbx (emboss) access method. This is especially useful for brackets and other characters in data resource names in DRCAT. We hope that the extended query language and the index file compression will increase the use of locally indexed data in EMBOSS installations, and welcome feedback on further developments of the query language and indexing. 
1.6 Hash table and lists The new query language is supported by extensions to tables and lists in the libraries. Tables can now be automatically resized. Merge operations on two tables combine their contents using the same operations (or, and, not, eor) as the query language. By resizing the tables first this operation can be made highly efficient. Destructors can be defined for list data and for table keys and data to automatically clean up after use. Tables with string keys can use C char* or string object queries in all cases. Lists and tables can now be reference counted, avoiding unnecessary copying especially in the Ensembl API code. 1.7 Cross-references Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are collected by extended parsers. New application seqxref reports the cross-references. New application seqxrefget creates a script to retrieve cross-referenced data as the original entries, using entret for sequence data, feattext for feature data, ontotext for ontology terms, textget for text and urlget for data where "HTML" is the only available format. 1.8 URL generation New application urlget returns a query URL from DRCAT with one or mode identifiers. Where data is from a UniProt/SwissProt or EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original cross-reference is used to select from several possible identifier terms in EDAM in order to choose the correct query. 1.9 Database index compression Indexes created by dbxflat or dbxfasta are now, by default, compressed automatically. These files, especially for secondary text indexes such as description, taxonomy or keyword, could be very sparse. Up to 95% space savings were achieved in some cases. The indexes are still updatable by code which uncompresses, updates, and recompresses on-the-fly using a copy of the index. 
1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax (NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new data resources provided as standard. Users can install new releases of the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in 6.3.1 as a special addition for one user to test and is now fully supported.

New applications dbxreport and dbxstat report on the overall and detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one included in the emboss.standard file. Users can continue to define their own resource files. Indexing "resource" definitions can now specify the maximum length of any field, and the page size and cache size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example, the DAS registry or Ensembl) to update the server cache file with a current set of database definitions. When run by the system administrator these can update the site-wide cache file, but they can also be run by an individual user to create a user-specific set of databases. The cache files are time stamped. EMBOSS uses the most recent system or user file.

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the attributes of a single named server. We expect to extend these applications once we have feedback on the most useful information they should report. New application dbtell similarly reports on the attributes of a single named database.

Database (and server) definitions can use an attribute more than once if it is defined as "multiple". These include a new "field:" attribute which gives the name and description of a query field.
A list of "field:" attributes supersedes the old "fields:" attribute which listed all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS sequence" fields to any name. "id" and "acc" are assumed to be the names of identifier and accession fields. The "hasaccession" attribute is set automatically for databases where no "acc" field is found, avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects and automatically replaces the results with the output from the redirected URL. Where redirected URLs were found in standard database definitions (e.g. the EBI's dbfetch service) these have been replaced by the current URL. We have also seen redirects from case-sensitive servers which redirect a lower case accession number to one in upper case in the same URL.

1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit is only there so that the Windows port (mEMBOSS) shows the same version number for QA testing. In mEMBOSS the final digit is the build number. QA tests for mEMBOSS now use the same test definition and qatest script as on Linux. mEMBOSS file handling and reporting have been adapted to support POSIX and Windows style paths.

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for "select all" if the "minimum" attribute allows all terms to be selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for applications and data), operations (algorithms), formats, identifiers and data (semantic descriptions of data content).
EDAM terms are used throughout this release: to annotate all ACD files at the application, input, parameter and output levels; to annotate data resources and their web queries in the Data Resource Catalogue; and to annotate database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id and the human-readable name. The EMBOSS application groups have been extended to match the EDAM topic annotations, with some applications moving to different or new groups. EDAM has been used to validate these groups by comparing the topics hierarchy with the group designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications edamname and edamdef. EDAM and other ontologies are supported by new applications (ontoget, ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot, ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all matching terms and their descendants, and compare to: applications (wosstopic, wossoperation, wossinput, wossoutput, wossdata); data resources (drfindresource, drfindid, drfindformat, drfinddata); and related EDAM terms (edamhasinput, edamhasoutput, edamisid, edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT started as a description of databases found as cross-references in UniProt/SwissProt, extended by adding databases found as cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids Research, ELIXIR, and other sources.

Any database in DRCAT can be used by name from an EMBOSS application, returning sequence, feature, or text if a suitable data format is defined for any query, or creating a URL which can be pasted into a browser where the results are, for example, a graphical display using JavaScript which EMBOSS cannot interpret.

We aim to further extend and improve DRCAT in future releases.
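Creating a query URL for a resource amounts to substituting an identifier into a URL pattern held for that resource. The Python sketch below shows the idea under stated assumptions: the resource name "ExampleDB", the example.org address and the "%s" placeholder convention are all hypothetical and are not taken from DRCAT itself.

```python
from urllib.parse import quote

# Hypothetical catalogue: resource name -> query URL pattern with a %s placeholder.
QUERY_PATTERNS = {
    "ExampleDB": "http://www.example.org/entry?id=%s",
}


def build_query_urls(resource, identifiers):
    """Return one query URL per identifier for the named resource."""
    pattern = QUERY_PATTERNS[resource]
    # quote() escapes characters that are not safe inside a URL.
    return [pattern % quote(identifier, safe="") for identifier in identifiers]


for url in build_query_urls("ExampleDB", ["P12345", "opsd human"]):
    print(url)
```

Escaping each identifier before substitution keeps the resulting URL valid when an identifier contains spaces or other reserved characters.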
4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the release. New applications retrieve single nodes and their ancestors and descendants (taxget, taxgetup, taxgetdown, taxgetspecies, taxgetrank).

5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with another utility. The name is also in keeping with the EMBOSS naming of other protein analysis applications.

Sequence and feature formats have been reviewed and updated, especially GFF3, GenPept, SAM, BAM and treecon.

GFF3 output now more closely follows the official standard, including the escaping of special characters in the tag/value final column. GFF3 ID and Parent tags are supported.

Features with exons are now stored as a list of exon subfeatures. This change allows easier sorting of features by location, keeping groups of features together, and has simplified the generation of several feature output formats.

Graphical output for more than one input sequence has been corrected and enhanced.

The lindna application has been adjusted to correctly relocate overlapping text and to generate a clean sequence ruler for any range of positions. New report formats allow reported hits (-rformat draw) and restriction sites (-rformat restrict) to be plotted by lindna. We expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version) when an inverted repeat maximum score was close to the edge of the search window. This was seen only at low threshold scores. Searches with low threshold scores can be expected to yield slightly different choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed to be "true" if missing. Previous releases defined them as "false" internally, but fortunately no parsers seem to have used the internal default value.

Database indexes created by the dbx programs now include a count of unique and total keys.
The text index files also report the type as "Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has increased by approximately 60MB compared with the last major release. This is largely due to pre-supplied data and index files for ontology/taxonomy/etc. A typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated packages which may be installed prior to configuration that will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be automatically configured if the (MySQL-supplied) 'mysql_config' application is found in the PATH and if the associated development files (compiler headers etc) are also installed. As an example, for Linux systems, both are provided by installing the mysql-devel (RPM distributions) or mysql-dev (Debian-based distributions) package. If your MySQL installation is in some arbitrary location then you can specify it using the --with-mysql= configuration switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar considerations apply to those described for MySQL above: auto-detection is based on the presence in the PATH of 'pg_config', the dev[el] files must be installed, and the --with-postgresql configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for retrieval from SOAP servers:
http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux users may find themselves having to install from source (and may need to do an 'autoreconf -fi' prior to configuration to fix a subsequent compilation error on some systems). Auto-detection (by EMBOSS) of this package is based on the presence of a pkgconfig file that axis2c installs.
You are advised to install pkgconfig if it is not already installed (it usually is pre-installed on Linux systems). EMBOSS has a --with_axis2c= configure switch for use if you install axis2c into a location other than (typically) /usr or /usr/local.

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf aka libharu) follows considerations given in previous releases and should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs significantly from the 1.x.x series. Unfortunately the executable has the same name in both releases (primer3_core). EMBOSS 6.4.0 provides two wrappers for these releases: eprimer3 is for the 1.x.x version and requires the primer3 executable to be called 'primer3_core' (this has always been the case); eprimer32 is for the 2.x.x version and requires the primer3 executable to be called primer32_core. This may involve some minor symlinking and/or directory/PATH reorganisation by administrators.

6.2 mEMBOSS

A typical installation executable is approximately 70MB and results in an installation size of approximately 570MB. MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of the mEMBOSS installation. The QA test suite has been extended to automatically find and test both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above), the old SRS database definitions have been removed. You can now access databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such retrieval is much faster than with the previously supplied SRS definitions.

7.0 New EMBASSY applications

We have provided a wrapper package for the recently released Clustal Omega software which must, of course, also be installed.
We will add new releases of MIRA and VIENNA at a later date, when the new versions of the original packages are released and integrated.

8.0 Future development

EMBOSS is fully funded until the end of December. We have an ambitious schedule of further developments planned for this period. There will be a further release of EMBOSS at the end of the year. We welcome any and all suggestions from our user and developer communities for immediate needs and future directions.

At the end of this year the EMBOSS team will be leaving EBI. Peter Rice's maximum 9-year tenure is coming to an end. We do not yet know where we will be from January and are open to suggestions for ways to host and/or to fund further EMBOSS development and for potentially useful partnerships and collaborations to continue the advances we have made. We can most certainly guarantee that we will continue to maintain the existing code base and the latest releases.

Alan

From ajb at ebi.ac.uk Tue Jul 26 15:24:35 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST)
Subject: [emboss-announce] mEMBOSS 6.4.0.1 available
Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk>

This is a bugfix release for the MS Windows version of EMBOSS, primarily to fix a problem printing very long ('long long') integers. Though most users would be unlikely to hit this problem, an uninstall/reinstall is nevertheless recommended. The release also contains a few minor bugfixes, notably making visible some potentially hidden SOAP server definitions.

It is available from the usual place:
ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe

Alan