From ableasby at hgmp.mrc.ac.uk Wed Jul 13 10:38:06 2005
From: ableasby at hgmp.mrc.ac.uk (Alan Bleasby)
Date: Wed, 13 Jul 2005 15:38:06 +0100 (BST)
Subject: [emboss-announce] New email lists ready
Message-ID: <200507131438.j6DEc6n0027708@bromine.hgmp.mrc.ac.uk>

The new email addresses for the EMBOSS lists are now set up and
ready (excluding any teething problems). They are:

emboss at emboss.open-bio.org
emboss-dev at emboss.open-bio.org
emboss-bug at emboss.open-bio.org
emboss-submit at emboss.open-bio.org

You can access the archives, subscribe/unsubscribe and alter
the way email is sent to you (e.g. digests) by visiting:

http://emboss.open-bio.org/mailman/listinfo/emboss
http://emboss.open-bio.org/mailman/listinfo/emboss-dev
http://emboss.open-bio.org/mailman/listinfo/emboss-announce
http://emboss.open-bio.org/mailman/listinfo/emboss-bug

The new FTP server is at:

ftp://emboss.open-bio.org/pub/EMBOSS

Alan

From ableasby at hgmp.mrc.ac.uk Thu Jul 14 19:44:05 2005
From: ableasby at hgmp.mrc.ac.uk (Alan Bleasby)
Date: Fri, 15 Jul 2005 00:44:05 +0100 (BST)
Subject: [emboss-announce] EMBOSS 3.0.0 released
Message-ID: <200507142344.j6ENi5Sd002353@bromine.hgmp.mrc.ac.uk>

EMBOSS 3.0.0 is now available for download from:

ftp://emboss.open-bio.org/pub/EMBOSS/

and, until the 27th July, from:

ftp://ftp.rfcgr.mrc.ac.uk/pub/EMBOSS/

The following text details some of the changes from the previous
release.

Alan

EMBOSS main package:

New database indexing programs dbxflat, dbxfasta and dbxgcg. A
dbxblast program will be added if we can extract data from the new
BLAST formatdb output. These programs allow indexing of files larger
than 2Gb.

N.B.: Indexes will be created faster if they are written through a
different disc controller than that used to read the database being
indexed. If that is not possible then reading from and writing to
different hard drives on the same controller is recommended.
Note that each index can be created independently of the others e.g.
you can create keyword and description indexes after you've created
the ID and ACC indexes.

To support these programs, the emboss.default and .embossrc files can
include "resource" definitions. See the documentation of these
programs for more information. "resource" definitions are intended to
define anything other than environment variables and databases.

In the emboss.default and .embossrc files the same name can be used
for variables, databases, and resources (we now store them in
separate tables). In previous versions a single table was used and
name clashes could occur. This becomes an issue with the increasing
use of resource definitions.

Sequence sets in ACD have a new attribute "aligned" that reports
whether the sequences are aligned (reading a multiple alignment in
for visualisation) or not (reading a set of sequences into memory for
further processing - perhaps for alignment).

Sequence formats have been reviewed. "experiment" format is that used
by the Staden package. "staden" and "gcg" formats now parse out
comments from anywhere in the sequence. "nexus" and "nexusnon"
formats now correctly report protein sequence datatypes. "nbrf" or
"pir" format data can now be read from an SRSWWW server (for
technical reasons, SRS servers are unable to exactly reproduce
NBRF/PIR format). "clustal" output no longer writes in blocks of 10.
"Phylip3" output is now renamed "phylipnon" for compatibility with
other non-interleaved output format names. The "phylip3" name remains
valid for back-compatibility. The header record for phylipnon format
has been changed to that accepted by phylip 3.6 (no YF on the header
line, number of sequences specified). Sequence format information on
the web has been updated to reflect these changes.
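
The move from one shared table to separate tables for variables,
databases and resources, mentioned earlier in these notes, can be
pictured with a small sketch. This is illustrative Python only:
EMBOSS itself is written in C, the class and method names below are
hypothetical, and treating a clash as an error is an assumption made
for the illustration (the notes only say that clashes "could occur").

```python
# Illustrative sketch only; not EMBOSS code. All names are hypothetical.


class SingleTable:
    """Earlier behaviour as described: one table shared by every kind of name."""

    def __init__(self):
        self._entries = {}

    def define(self, kind, name, value):
        # Assumption for illustration: a reused name is reported as an error.
        if name in self._entries:
            existing_kind = self._entries[name][0]
            raise KeyError(f"name clash: {name!r} already defined as a {existing_kind}")
        self._entries[name] = (kind, value)


class SeparateTables:
    """3.0.0 behaviour as described: one table per kind of definition."""

    KINDS = ("variable", "database", "resource")

    def __init__(self):
        self._tables = {kind: {} for kind in self.KINDS}

    def define(self, kind, name, value):
        # Each kind has its own namespace, so names only clash within a kind.
        self._tables[kind][name] = value

    def lookup(self, kind, name):
        return self._tables[kind][name]
```

With separate tables, a database and a resource can both be called
"embl" and each is still found by asking for the right kind; with the
single table the second definition would collide with the first.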

Codon usage tables can be read in these formats (-format qualifier):

"emboss", "EMBOSS codon usage file",
    "All numbers read, #comments for extras"
"cut", "EMBOSS codon usage file",
    "Same as EMBOSS, output default format is 'cut'"
"gcg", "GCG codon usage file",
    "All numbers read, #comments for extras"
"cutg", "CUTG codon usage file",
    "All numbers (cutgaa) read or fraction calculated, extras added"
"cutgaa", "CUTG codon usage file with aminoacids",
    "Cutg with all numbers"
"spsum", "CUTG species summary file",
    "Number only, species and CDSs in header"
"cherry", "Mike Cherry codonusage database file",
    "GCG format with species and CDSs in header"
"transterm", "TransTerm database file",
    "GCG format with no extras"
"codehop", "FHCRC codehop program codon usage file",
    "Freq only, extras at end"
"staden", "Staden package codon usage file with percentages",
    "Freq or number only, no extras"
"numstaden", "Staden package codon usage file with numbers",
    "Number only, no extras. Can be read as 'staden'"

Any of these formats should be readable by default. Some files are
"readable" in more than one format (staden and numstaden for example
can both be read as "staden"). The extra names are used so we can
reuse them as output format names.

For output of codon usage tables, the same formats are available
(-oformat qualifier).

A new application codcopy (not codret because coderet is already an
EMBOSS program name) will convert from one format to another in the
same way as seqret converts sequence formats.

Coderet reports the number of CDS, mRNA and translation sequences.

Correction to sequence numbering for reversed nucleotide sequences in
alignments.

Correction to sequence alignment functions returning slightly
suboptimal alignments.

The entrails program reports codon usage formats. Description of
report format entrails output improved.
Entrails is built by "make check" and is provided so that developers
of wrappers can obtain all EMBOSS internal details needed, for
example all ACD datatypes and input/output format names and
descriptions.

Sequence types are explicitly set in cons, sixpack and backtranseq as
some output formats failed to recognise them as protein.

EMBASSY packages:

MYEMBOSS is a new EMBASSY package for developing your own code.
Installation requires recent versions of GNU packages autoconf,
automake and libtool. To install, you must first build the configure
and make files with these commands:

    aclocal -I m4
    autoconf
    automake -a

When you add your own programs, do so by adding source files in
myemboss/source and ACD files in myemboss/emboss_acd and add these
filenames to the Makefile.am files in each directory. There are
"myseq" and "mytest" examples provided to guide you. There is no need
to modify configure or Makefile files - these will be automatically
updated.

To allow MYEMBOSS to be installed by one user, and linked to an
EMBOSS installation maintained for the site by someone else, new
variables are added to locate the ACD files for any EMBASSY package.
If myemboss is not installed in the same place as EMBOSS, define
EMBOSS_MYEMBOSSROOT as the location of the myemboss installed ACD
files or the myemboss/emboss_acd source directory. This requires that
EMBASSY programs call the embInitP function with the name of the
package ("myemboss"). For ACD utilities such as acdvalid or acdc to
work, as these use the EMBOSS embInit call, another variable
EMBOSS_ACDUTILROOT must be defined, pointing to the same directory.

PHYLIP is a beta release port of PHYLIP 3.6b. We welcome comments on
the EMBOSS interface to the programs. Program names are prefixed by
'f' to avoid clashes with the old PHYLIP EMBASSY package. We still
need to work on adding new tree input and output formats, and
updating the code to PHYLIP 3.63 (December 2004).
We are also considering splitting more of the programs to simplify
the ACD interface. In this release seqboot and treedist are already
split. seqboot is split by input type into seqboot, restboot,
discboot and freqboot. Treedist is split by the number of input files
into treedist and treedistpair. Acdvalid objects to the dependencies
in other programs, for example the method used by fdnadist.

The DOMAINATRIX package of earlier releases has been extended and
replaced by 5 EMBASSY packages described below (32 applications in
total). These tools were developed as part of a research project and
are distinct from other EMBOSS apps in being intended mostly for
computational biologists rather than biologist end-users.

STRUCTURE

The STRUCTURE package is used for parsing the PDB database and
generating secondary databases of coordinate and derived data. The
tools have the following scope:

(i)   For parsing PDB files and writing clean coordinate files (CCF
      files) that "clean-up" many PDB inconsistencies. For example,
      residue numbers give the correct index into the biological
      sequence.
(ii)  To generate CCF files for whole PDB files or individual domains
      from the SCOP and CATH databases.
(iii) To augment CCF files with residue solvent accessibility and
      secondary structure data.
(iv)  To generate contact files (CON files) of intra-chain and
      inter-chain residue-residue contact data.
(v)   To generate CON files of residue-ligand contact data.
(vi)  Miscellaneous file handling, e.g. dictionary of heterogen
      groups.

DOMAINATRIX

The DOMAINATRIX package is used for handling the SCOP and CATH
databases of protein domain classification, the parsable files of
which can be inconvenient, e.g. for comparative studies, extending
and processing. The tools have the following scope:

(i)   For parsing raw SCOP and CATH parsable files and writing domain
      classification files (DCF files) with a single, simple and
      extensible format.
(ii)  To add sequence records to a DCF file.
(iii) To remove low resolution domains.
(iv)  To flexibly calculate and remove redundancy.
(v)   Primitive tools for secondary structure element mapping to
      domains in a DCF file.

DOMALIGN

The DOMALIGN package is used for generating alignments for families
of domains, especially across large datasets, e.g. the whole of SCOP.
The tools have the following scope:

(i)   For identifying representative structures for different nodes
      in the SCOP and CATH hierarchies.
(ii)  For generating annotated, structure-based sequence alignments
      for these nodes.
(iii) For extending these domain alignment files (DAF files) with
      sequences of unknown structure.
(iv)  All-versus-all global sequence alignment.

DOMSEARCH

The DOMSEARCH package is used for deriving extended sequence
families, especially from large structural datasets such as the whole
of SCOP. The tools have the following scope:

(i)   To generate domain hits files (DHF files) of sequence relatives
      to an alignment or other sequences.
(ii)  To remove fragmentary sequences from a DHF file.
(iii) To flexibly calculate and remove redundancy.
(iv)  To remove hits of ambiguous classification and collate
      sequences into families.

SIGNATURE

The SIGNATURE package is used for generating, scanning and evaluating
sparse signatures and other predictive elements for protein sequence
characterisation. The tools have the following scope:

(i)   To generate sparse signatures for protein families from
      alignments and residue contact data.
(ii)  Generate other types of discriminator (e.g. HMMs) from
      alignments.
(iii) Generate ligand-binding signatures from residue-ligand
      contacts.
(iv)  Generate domain hits files (DHF files) and ligand hits files
      (LHF files) of hits (sequences) from signature scans.
(v)   Interpretation and display of signature performance by using
      ROC analysis.
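
For readers unfamiliar with the ROC analysis mentioned under
SIGNATURE, the general idea can be sketched as follows. This is a
generic textbook illustration in Python and not the SIGNATURE
package's implementation: the function names and the input layout (a
list of score/true-member pairs) are assumptions, and hits with tied
scores are simply stepped through one at a time.

```python
# Generic ROC sketch; not taken from EMBOSS. Names and input layout are assumed.


def roc_points(scored_hits):
    """Return (false_positive_rate, true_positive_rate) points for a ranked scan.

    scored_hits is a list of (score, is_true_member) pairs, where a higher
    score means a stronger hit.
    """
    positives = sum(1 for _, is_true in scored_hits if is_true)
    negatives = len(scored_hits) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true hit and one false hit")

    points = [(0.0, 0.0)]
    true_seen = false_seen = 0
    # Walk down the hit list from the best score to the worst.
    for _, is_true in sorted(scored_hits, key=lambda hit: hit[0], reverse=True):
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
        points.append((false_seen / negatives, true_seen / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A scan that ranks every true family member above every false hit
gives an area of 1, and the area falls as false hits move up the
ranking.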

Where data, files etc are mentioned above or in the application
documentation, data structures and functions for manipulating such
are usually provided in the AJAX and NUCLEUS C programming libraries.
For example, there are objects for handling protein atoms, residues,
chains, for SCOP and CATH domains and so on.

From pmr at ebi.ac.uk Fri Jul 22 11:00:01 2005
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 22 Jul 2005 16:00:01 +0100
Subject: [emboss-announce] EMBOSS in August
Message-ID: <42E109F1.9070604@ebi.ac.uk>

We know it is close to the end of July, and we have not said what is
happening to the EMBOSS team. We do have a solution, but it is not
yet officially confirmed.

The Rosalind Franklin Centre for Genomic Research will close at the
end of next week.

The EMBOSS project will move to the European Bioinformatics Institute
from August 1st. Development and support will continue as before.

The EMBOSS homepage will remain at http://emboss.sourceforge.net/

The FTP server (to download EMBOSS releases and updates) has moved to
ftp://emboss.open-bio.org/pub/EMBOSS/

The EMBOSS anonymous CVS server will remain at cvs.open-bio.org
hosted by the Open Bio Foundation, who will also continue to host the
developers' CVS server.

The EMBOSS mailing lists have been moved to the Open Bio Foundation,
so the addresses are now:

To contact the EMBOSS team:

emboss-bug at emboss.open-bio.org       Bug reports and support requests
emboss-submit at emboss.open-bio.org    Code submissions

Lists users/developers can subscribe to:

emboss at emboss.open-bio.org           Users mailing list
emboss-dev at emboss.open-bio.org       Developers mailing list
emboss-announce at emboss.open-bio.org  New release announcements list

There are obvious gaps in these details ... more news as soon as we
have confirmation.

regards,

Peter Rice, Alan Bleasby and the EMBOSS team.