[emboss-announce] EMBOSS 6.0.0 released
ajb at ebi.ac.uk
Tue Jul 15 17:52:56 UTC 2008
EMBOSS 6.0.0 is now available from:
ftp://emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.0.0.tar.gz
The associated EMBASSY packages are in the same directory. Note that,
as usual, these are specific to the main package so versions downloaded
for a previous release will not work with 6.0.0.
Changes in 6.0.0 include new applications, improvement of existing
applications, library API consistency changes, bugfixes etc. Most are
described in the relevant section of the ChangeLog which is reproduced
below.
mEMBOSS-6.0.0 is available from:
ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.0.0-setup.exe
mEMBOSS contains all the EMBOSS changes plus improvements and bugfixes
for the GUI (Jemboss). Also, this release of mEMBOSS contains the C runtime
library files; these had to be installed separately in previous
versions.
Alan
Version 6.0.0
New application aligncopy reads a set of aligned sequences and
prints a report in one of the standard alignment formats that can
accept the same number of sequences. Pairwise alignment formats
can only be used if the input has exactly two sequences.
New application aligncopypair reads a set of aligned sequences and
prints a report for each pair of aligned sequences in one of the
standard alignment formats.
New application featreport reads a sequence and a feature table,
and writes a report in any of the standard report formats.
New application featcopy reads and writes a feature table to
convert feature formats.
New applications maskambignuc and maskambigprot replace ambiguity
characters in nucleotide sequences with 'N' and in protein
sequences with 'X'.
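
As a rough illustration of what maskambignuc and maskambigprot do, the
sketch below masks ambiguity characters in Python. It is not the EMBOSS
implementation: the function names and the sets of characters treated as
unambiguous (A, C, G, T, U for nucleotides, the 20 standard amino acids
for proteins) are assumptions made for this example, and non-letter
characters such as gaps are left untouched.

```python
# Illustrative sketch only; not the EMBOSS source code.
# Assumption: these sets define which letters count as "unambiguous".
UNAMBIGUOUS_NUC = set("ACGTU")
UNAMBIGUOUS_PROT = set("ACDEFGHIKLMNPQRSTVWY")


def mask_ambiguous(seq, unambiguous, mask):
    """Replace every letter not in `unambiguous` with `mask`, keeping case."""
    out = []
    for ch in seq:
        if not ch.isalpha() or ch.upper() in unambiguous:
            out.append(ch)  # keep unambiguous letters, gaps and other symbols
        else:
            out.append(mask if ch.isupper() else mask.lower())
    return "".join(out)


def mask_ambig_nuc(seq):
    """Mask nucleotide ambiguity codes with 'N'."""
    return mask_ambiguous(seq, UNAMBIGUOUS_NUC, "N")


def mask_ambig_prot(seq):
    """Mask protein ambiguity codes with 'X'."""
    return mask_ambiguous(seq, UNAMBIGUOUS_PROT, "X")
```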
New application consambig reports an alignment consensus sequence
using ambiguity characters. The intended use cases are sequencing
reads and SNP reporting.
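
The idea behind an ambiguity consensus can be sketched as follows. This
is an assumption-laden Python example, not the consambig algorithm: it
handles DNA only, ignores gaps when choosing a code, writes '-' for
all-gap columns, and applies no weighting or thresholds. The function
name is hypothetical.

```python
# Illustrative sketch only; not the consambig source code.
# IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
BASES = set("ACGT")


def ambiguity_consensus(aligned):
    """Return one IUPAC code per alignment column."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        # Collect the distinct bases seen in this column (gaps are skipped).
        seen = {ch.upper() for ch in column if ch.upper() in BASES}
        key = "".join(sorted(seen))
        consensus.append(IUPAC_CODES[key] if key else "-")
    return "".join(consensus)
```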
New application sizeseq sorts sequences in ascending or descending
order of length. This is a port of the application seqsort from
the domsearch EMBASSY package.
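
For readers scripting around sequence sets, the ordering sizeseq
provides can be sketched in a few lines of Python. The function name and
the (name, sequence) tuple representation are assumptions for this
example. Keeping the input order for sequences of equal length is a
property of Python's sort, not a documented sizeseq behaviour.

```python
# Illustrative sketch only; not the sizeseq source code.
def sort_by_length(records, descending=False):
    """Sort (name, sequence) tuples by sequence length."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=descending)
```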
New application skipredundant uses pairwise sequence matches to
exclude sequences that are similar from an input set. This is a
modified version of the application seqnr from the domsearch
EMBASSY package.
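
The general approach of excluding similar sequences by pairwise
comparison can be sketched as a greedy filter. The example below is a
stand-in, not the skipredundant method: it scores similarity with
Python's difflib ratio rather than a sequence alignment, and the
function names and the 0.95 default threshold are assumptions.

```python
# Illustrative sketch only; not the skipredundant source code.
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT an alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()


def skip_redundant(seqs, threshold=0.95):
    """Keep a sequence only if it is below `threshold` to all kept ones."""
    kept = []
    for seq in seqs:
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept
```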
New applications provide utility functions for former GCG users:
nohtml removes HTML tags, notab replaces tabs with spaces,
nospace removes all whitespace from a file, skipspace removes
extra whitespace from a file.
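
The four text clean-up operations can be approximated in Python as shown
below. These are sketches of the described behaviour, not the EMBOSS
code. The function names, the tab stop of 8, and the choice to leave
line breaks in place are all assumptions.

```python
# Illustrative sketch only; not the EMBOSS source code.
import re


def no_html(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*>", "", text)


def no_tab(text, tabsize=8):
    """Replace tabs with spaces, assuming tab stops every `tabsize` columns."""
    return text.expandtabs(tabsize)


def no_space(text):
    """Remove spaces and tabs (line breaks are kept in this sketch)."""
    return re.sub(r"[ \t]+", "", text)


def skip_space(text):
    """Collapse each run of spaces and tabs to a single space."""
    return re.sub(r"[ \t]+", " ", text)
```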
Older EMBOSS applications can now generate a warning message
stating that they are marked as 'obsolete' with an explanation and
an indication of alternative programs in EMBOSS or in an EMBASSY
package. This warning can be turned off by defining environment
variable EMBOSS_WARNOBSOLETE with a value of "N" or by defining
the same variable in the emboss.defaults or ~/.embossrc files. We
will begin to mark applications as 'obsolete' in future releases.
A new EMBASSY package "myembossdemo" contains the demonstration
applications demoalign, demofeatures, demolist, demoreport,
demosequence, demostring, demostringnew and demotable that
illustrate how to use EMBOSS data types in your own
applications. The myembossdemo package allows novice developers to
try simple EMBOSS programming. The myemboss package is available
for adding your own applications. The demo applications are no
longer distributed with the main EMBOSS package. They were not
installed and were only built with the "make check" option.
Application short descriptions have been revised. The maximum
length of application one line descriptions is increased from 60
to 70 characters. The descriptions are easier to write. Output
from wossname can now be 90 characters wide. Interfaces that use
the description in menus may need to allow some extra space.
Function names in ajfile.c have been standardised. Old names are
still accepted but are marked as "deprecated" and will generate
warnings with the gcc compiler (see ajstr below). Other compilers
will see no difference. New source files ajfiledata.c and
ajfileio.c have been added. The buffered file data structures are
renamed internally to be more consistent (AjPFileBuff to AjPFilebuff).
notseq was unable to search for IDs containing '|' characters.
It uses string matching (not regular expressions), and these
characters are valid in NCBI-style FASTA files if read with the
"pearson" format, which accepts the whole ID string without parsing.
The sequence alignment code has been updated. Sequence alignments
with low gap penalties failed to allow two gaps (one in each
sequence) without a match in between. The embAlign functions are
now simplified. Scores are returned by the PathCalc functions. The
Walk functions that walk through the path and return the aligned
sequences are faster and need fewer parameters. Profile alignments
occasionally duplicated residues in the sequence around gap
positions. Fast alignments around a limited width include
additional residues at each end and require an offset rather than
separate start positions. The offset is the difference between the
two start positions used in 5.0.0 and earlier releases.
Eprimer3 citations are corrected in the help text (from the ACD
file) and in the documentation. The citation errors were traced to
the original primer3_core documentation which has now been
corrected.
Wordmatch could confuse overlapping matches. It occasionally
extended the wrong match and missed a corresponding new match.
Seqmatchall results were correct with the default output
format which reports match positions, but gave incorrect results
with some other local alignment formats that include the sequence.
Seqmatchall now stores alignments in the same way as other local
alignment applications, and the alignment internals are corrected
to ensure other applications will not have the same problem.
Emma was officially supporting clustalw 1.83. Issues with clustalw
2.0 are now resolved and this version is supported if clustalw2 is
installed. Emma executes an application called clustalw (not
clustalw2) so version 2.0 must be installed under this name or an
environment variable EMBOSS_CLUSTALW needs to be defined to point
to the executable clustalw2 file.
Sequence format "selex" allows invalid sequence data files to be
accepted as input. Selex format is still available but is no
longer included in the formats that can be automatically
detected. When reading selex format data, users need to put
"-sformat selex" on the command line, or specify "selex::" at the
front of the USA. See the HMMER (old version EMBASSY package)
documentation for examples. HMMERNEW (recommended) examples use
Stockholm format and so are unchanged.
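
For scripts that launch EMBOSS programs, the two ways of requesting
selex format can be built as argument fragments. This is a hedged
sketch: the helper name and the example filename are hypothetical, and
only the "-sformat selex" and "selex::" forms come from the text above.

```python
# Illustrative sketch only; helper and filename are hypothetical.
def selex_input_args(filename, use_prefix=True):
    """Return command-line fragments that force selex input format."""
    if use_prefix:
        # Format name at the front of the USA.
        return [f"selex::{filename}"]
    # Explicit format qualifier followed by the plain filename.
    return ["-sformat", "selex", filename]
```

For example, selex_input_args("example.slx") gives the single argument
"selex::example.slx".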
Program dbxfasta now defaults to a filename of "*.fasta".
The previous default "*.dat" is not commonly used for FASTA format
databases.
In program msbar, block mutations were 1 longer than the specified
block, and the program could crash if the block size was fixed
(minimum and maximum block sizes the same). This off-by-one error is
now corrected.
In GenBank output format, multiple line KEYWORD sections were not
formatted correctly.
ACD list and select values (the menus that appear in the user
prompt) can now have ACD variables. Although useful for local
application development these are not used in EMBOSS distributed
ACD files because the variables are difficult for web and GUI
interfaces to resolve when presenting the menu text.
List and Table internal data structures are now cached so that
creating and deleting temporary lists and tables is more efficient.
In emboss.default database definitions the filename and exclude
values can be delimited by spaces, commas or semicolons. Previous
releases used only spaces. Parsing is now consistent with the
fields definition which allowed all the above characters.
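
The delimiter rule can be pictured with a short Python sketch. It only
illustrates splitting a value on spaces, commas or semicolons. It is not
the EMBOSS parser, and the function name is hypothetical.

```python
# Illustrative sketch only; not the EMBOSS parsing code.
import re


def split_values(value):
    """Split a definition value on runs of spaces, commas or semicolons."""
    return [item for item in re.split(r"[ ,;]+", value.strip()) if item]
```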
Protein sequences with pyrrolysine ('O') had 'O' converted to a
gap because this was a gap character in early versions of
Phylip. This was patched in 5.0.0 to allow 'O' in UniProt release
13. The gap character is upper case only, so 'o' was correctly
read as pyrrolysine.
Wordfinder used the same descriptions for two pairs of qualifiers.
The descriptions are changed to make their meaning clear in
commandline help and in web interfaces.
New function ajTimeDiff returns the difference in seconds between
two time values.
Profiling tests showed that file reading and string handling can
be made faster. String handling called functions many levels
deep. Making this code inline and using macro versions improved
performance for applications (e.g. database indexing) that use
many string calls. File input requires each input line to be
copied. Using copy-by-reference (ajStrAssignRef) often makes this
more efficient. Existing macros now test for undefined strings:
MAJSTRGETLEN, MAJSTRGETPTR, MAJSTRGETRES and MAJSTRGETUSE. New
macros are added for string handling: MAJSTRDEL,
MAJSTRGETUNIQUESTR, MAJSTRCMPC and MAJSTRCMPS.
New memory management macros AJCRESIZE0 and AJRESIZE0 provide
resize functions that guarantee new memory is set to zero. The
functions must be given the original allocated size.
Using the GNU C run-time library, calls to mcheck and mprobe are
available to test for memory corruption by examining the bytes
before and after an address allocated by malloc. This can be
turned on for any application, including Unix commands, with the
environment variable MALLOC_CHECK_ which has values 0, 1, 2 or
3. 1 writes to standard error when a problem is found, 2 aborts
the program, 3 does both and 0 ignores errors. No recompilation
is needed for this simple method. EMBOSS now has a ./configure
option --enable-mprobe which enables two new
functions. ajMemProbe, passed an address from malloc (AJNEW0,
AJCNEW0, etc.) tests the bytes before and after and reports any
errors. The advantage of using ajMemProbe rather than mprobe is
that a macro MAJMEMPROBE also reports the file and line number
where it was called. To avoid large numbers of messages (when
code has problems) a limit can be set with ajMemCheckSetLimit
after which the program will exit. Note that enable-mprobe is
incompatible with using valgrind to test for memory leaks - as
mprobe and mcheck have to look at illegal bytes before and after
allocated memory blocks. Memory checking is turned on by a call to
mcheck, passing the function ajMemCheck, in ajnam.c before the
first memory allocation. If any program calls malloc before
calling embInit or embInitP this call will fail and issue a
warning (if compiled with --enable-mprobe). A special call
ajStrProbe tests any string with mprobe. Special calls ajListProbe
and ajListProbeData test lists and their contents. For more
details see http://www.gnu.org/software/libc/manual/
Protein sequences from the Staden package were read as nucleotide
because they were missing information on the ID line to identify
EMBL or SWISSPROT format. The sequences are now tested and
correctly typed.
Wordcount now accepts protein sequences as input. Previous
releases only allowed nucleotide sequences.
Wordfinder options had the same information prompt. These have
been changed from "limit" to "minimum" and "maximum" to make their
function clear.
Prompting for values from the user now includes a test for
standard input in use as an input file. If standard input is open,
the default response is accepted and a message is written to the
user. This is to avoid problems with command lines that use
"stdin" as an input and do not include -auto.
The acdpretty utility can now preserve comments in ACD files.
Comments are maintained in blocks with blank lines before and
after. Inline comments are started in column 50 unless they are
exceptionally long. Comments themselves have white space cleaned
up but otherwise are not reformatted.
A new function ajAcdGetValueDefault is added to return the default
value of an ACD qualifier. This can be combined with
ajAcdIsUserdefined in wrappers to test for values changed by the
user.
Infile qualifiers in ACD have a new attribute "trydefault" which
allows the default filename to fail. Any filename provided by the
user has to exist. This was added to support the behaviour of the
MIRA EMBASSY package. To allow an infile to fail the attribute
"nullok" also must be set to "Y".
Applications which produce an output file or graphics often
created an empty output file when the plot was selected.
The ACD files have been corrected to only create the file if it
will be written to. Applications changed are charge, dan,
freak, hmoment, iep and tcode.
Whichdb only writes to its output file if -get is false.
With -get it creates sequences. The outfile is no longer created
when whichdb is in -get mode.
String functions are corrected so that "Case" in the name always
means case-insensitive and works by converting to upper case. Some
functions were defined the wrong way, with "Case" for the
case-sensitive form.
GFF3 format is now the default feature output.
A new function ajFeatIsCds identifies protein coding nucleotide
features (CDS) using the SO identifier. A new function
ajFeattagIsNote identifies feature tags that are for the default
feature tag.
Protein features now use the new Sequence Ontology terms defined
by BioSapiens. These are not yet accepted by GFF3 validators. The
new SO identifiers are added to protein feature definitions and
used internally.
Feature format definitions (the Efeatures and Etags files)
now allow #include references to other files. This allows a
standard EMBL and Swissprot feature table definition to be
included by the internal and GFF definitions. Redefinitions are
allowed using + and - prefixes to add and remove tags for existing
feature types.
GFF3 format feature (and report) output is added.
A new application "density" has been added. This reports the
A+C+G+T and AT+GC densities of nucleic acid sequences within
an adjustable sliding window. Plots of A+C+G+T or AT+GC are
optionally produced.
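
A sliding-window density calculation of this kind can be sketched in
Python. This is not the density source code: the function name, the
window step of one base, and the definition of density as a fraction of
the window length are assumptions for this example.

```python
# Illustrative sketch only; not the density source code.
def window_densities(seq, window=100):
    """Return per-window base fractions and AT/GC fractions."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    results = []
    # Slide the window one base at a time (assumed step size).
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        counts = {base: chunk.count(base) for base in "ACGT"}
        row = {base: counts[base] / window for base in "ACGT"}
        row["AT"] = (counts["A"] + counts["T"]) / window
        row["GC"] = (counts["G"] + counts["C"]) / window
        row["start"] = start + 1  # 1-based window start position
        results.append(row)
    return results
```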
Molecular weight programs (e.g. digest, mowse) now have a
-mono switch to allow use of monoisotopic weights.
By default, average molecular weights are used.
The Eamino.dat format has changed. Molecular weight information
has been removed and put in its own Emolwt.dat file. This latter
now allows specification of average and monoisotopic weights. Values
for hydrogen and oxygen are specified as well as the amino acid weights.
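
The difference between average and monoisotopic weights can be shown
with a small Python sketch. The masses below are standard reference
values for glycine, alanine, hydrogen and oxygen typed in for this
example. They are not read from Emolwt.dat, and the function name and
the `mono` parameter are hypothetical.

```python
# Illustrative sketch only; values are not taken from Emolwt.dat.
# Residue masses in daltons for two amino acids (illustrative subset).
RESIDUE_MASS = {
    "G": {"average": 57.0519, "mono": 57.02146},
    "A": {"average": 71.0788, "mono": 71.03711},
}
HYDROGEN = {"average": 1.00794, "mono": 1.007825}
OXYGEN = {"average": 15.9994, "mono": 15.994915}


def peptide_mass(seq, mono=False):
    """Sum residue masses and add one water (2 H + O) for the termini."""
    kind = "mono" if mono else "average"
    water = 2 * HYDROGEN[kind] + OXYGEN[kind]
    return sum(RESIDUE_MASS[aa][kind] for aa in seq.upper()) + water
```

With these values the dipeptide "GA" comes to about 146.146 using
average weights and about 146.069 using monoisotopic weights.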
The library representation of amino acid property information
has been changed. The EmbPropTable global table has been
removed and replaced with EmbPPropAmino and EmbPPropMolwt objects.
Pepcoil now produces a report (replacing a text output) in "motif"
format. The default is changed to not report non coiled-coil
regions as they are hard to distinguish in this format.
The "motif" report format is extended to allow two score positions
marked with "*" and "+" and labelled internally as "pos" and
"pos2". No application uses pos2 (it was added for pepcoil, but
both score maximum positions are always the same).
A new function ajAcdIsUserdefined allows wrappers to test which
qualifiers have values changed by the user so that they can use
shorter command lines to launch the wrapped application.
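
The wrapper use case can be sketched in Python. Nothing here calls the
EMBOSS library: the data layout, the function name, and the program and
qualifier names in the usage note are hypothetical. The sketch only
shows how a "user defined" flag lets a wrapper omit qualifiers that were
left at their defaults.

```python
# Illustrative sketch only; does not call the EMBOSS library.
def build_command(program, qualifiers):
    """Build argv containing only qualifiers the user actually set.

    `qualifiers` maps a qualifier name to a dict with keys
    'value' and 'user' (True if the user supplied the value).
    """
    argv = [program]
    for name, info in qualifiers.items():
        if info["user"]:
            argv += [f"-{name}", str(info["value"])]
    return argv
```

For example, with a hypothetical program "someapp" where only "gapopen"
was set by the user, the result is ["someapp", "-gapopen", "12.0"].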
jaspscan application added. Scans sequences for transcription
factors using the JASPAR matrices.
jaspextract application added to move the JASPAR matrices into the
EMBOSS data area subdirectories.
Alignment format "trace", used to display internal data content, is
renamed to "debug" to be consistent with other formats. A "debug"
format is added for feature output.
Application documentation has been updated to remove obsolete
references to EMBL database identifiers. These are replaced with
the correct accession numbers.
Two new entries have been added to the "tembl" test EMBL database
for use in the QA tests.
Report output now checks the sequence and feature table type. If
the sequence is not a valid protein, protein-only formats (pir,
swiss) will fail with an error message. Similarly, if the sequence
is not a valid nucleotide sequence then nucleotide-only formats
(embl, genbank) will fail with an error message.
Garnier now uses the correct SwissProt and internal feature keys
for protein secondary structure. The results will appear much
better for example as a swissprot feature table. This required
rewriting of the internals by recoding the secondary structure
features with a "garnier" tag replacing the previous "helix",
"sheet", "turns" and "coil" tags. The default output is
unchanged. The results in other report formats will be changed.
Silent no longer reports the "Dir" column. This is replaced by the
new "Strand" column which reports "+" for a forward feature and
"-" for a reverse feature.
The following programs have changed default report output, with
the strand included for nucleotide sequences: equicktandem,
etandem, fuzznuc, fuzztran, recoder, restrict, silent, tcode,
twofeat. The strand column can be removed with the new commandline
associated qualifier -norstrandshow.
Reports for nucleotide sequences have confusing ways to represent
the start and end positions for features on the complementary
strand. A strand column has been added to these reports,
controlled by a new -rstrandshow qualifier and attribute. By
default the strand is shown for all nucleotide reports (see a list
of changed program outputs above). The start position is always
lower than the end position for features on the complementary
strand indicating the region that should be reversed. In past
releases the seqtable report format (fuzznuc, dreg, dan)
confusingly reversed start and end positions to indicate the
unreported strand. For all report formats (nametable, table) the
start and end positions are now consistent with nucleotide feature
formats (gff, embl, genbank).
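
Scripts that parsed the older reversed-position output may want to
convert positions to the new convention. The Python sketch below assumes
the old convention was simply "start greater than end means the
complementary strand". The function name is hypothetical.

```python
# Illustrative sketch only; assumes start > end marked the reverse strand.
def normalise_feature(start, end):
    """Return (start, end, strand) with start <= end."""
    if start <= end:
        return start, end, "+"
    return end, start, "-"
```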
Reports from dreg incorrectly reported sequences reversed with the
-sreverse qualifier.
Report headers now include the text "(Reversed)" when the input
sequence(s) are reverse complemented.
Phylogenetic trees in newick format are now parsed into internal
trees and converted back for use by Phylip. This allows us to
read other tree formats and pass them to Phylip (e.g. Nexus).
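
The parse-and-write-back round trip can be illustrated with a minimal
Python sketch. It is not the EMBOSS tree code: the nested-dict tree
representation and function names are assumptions, branch lengths are
kept as text, and quoted labels, comments and whitespace are not
handled.

```python
# Illustrative sketch only; not the EMBOSS tree code.
def parse_newick(text):
    """Parse a Newick string into nested dicts (name, length, children)."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # skip '('
            children.append(parse_node())
            while peek() == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ')'
        name = read_until(",():;")
        length = None
        if peek() == ":":
            pos += 1  # skip ':'
            length = read_until(",();")
        return {"name": name, "length": length, "children": children}

    tree = parse_node()
    if peek() != ";":
        raise ValueError("expected ';' at end of tree")
    return tree


def to_newick(node):
    """Convert a parsed tree back to Newick text (without the final ';')."""
    out = ""
    if node["children"]:
        out = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    out += node["name"]
    if node["length"] is not None:
        out += ":" + node["length"]
    return out
```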
Some ACD data types did not allow the input to be NULL because
extra tests were carried out on the results. These are all cleaned
up and tested so that they can safely be set to nullok and missing
in local applications.
New sequence reading formats for PDB files. By default the ATOM
records are used (format "pdb"). An alternative format "pdbseq"
will read the SEQRES records which give the original sequence. The
ATOM records give the sequence determined from the structure.
Improved the help text for the -stdout and -filter options to
explain output files are written to standard output. Some users
expected graphics output (from plplot) to be controlled.