questions about ACD format

Andrew Dalke dalke at dalkescientific.com
Mon Aug 6 00:52:59 UTC 2001


[Brief summary: I'm trying to integrate Emboss with Biopython and
found that 1) not enough sequence type information is available
in the ACD file for Biopython's AlphabetStrict code to work, so
I have a proposal to fix the, 2) I have questions about how to
interpret some of the documentation, 3) there are places where
the Emboss ACD parser doesn't appear to work correctly, and 4)
general observations on the ACD format and on the implementation.]

Hello,

First off, my apologies if this is the wrong email address for
this topic.  I couldn't find any archives to scan for verification.
I am also not a member of this list, so please cc me on any
replies.

Based on the feedback I got from some people at ISMB, I've started a
Python interface to EMBOSS.  The goal is to be able to do something
like:

>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Emboss import apps
>>>
>>> seq = Seq.Seq("AATCCATCGATGCAC", IUPAC.unambiguous_dna)
>>> results = apps.revseq(sequence = seq)
>>> results["outseq"]
Emboss.EmbossSeq("GTGCATCGATGGATT", IUPAC.ambiguous_dna)
>>>

I can almost, but not quite do this, for some reasons I'll describe
shortly.  Here are the questions and problems I had in doing this,
as well as some specific feature I would like to see added, which
I feel may make it easier to integrate EMBOSS with other systems.

======

** Topic 1

As you can see in the above example, there is some automatic
conversion going on.  One is to convert the Biopython 'Seq' object
to a temporary file, so it can be used with the '-sequence'
parameter needed by revseq.  This is done by knowing how to convert
the Seq object to a 'seqall' Emboss type, including looking at the
'type' field to ensure that the input sequence is really DNA.

The conversion step requires that I do a verification of the Biopython
Seq Alphabet to the Emboss sequence 'type'.  There is a description
of the types in the syntax document, at
  http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html
but it doesn't describe:
  1) what is used as a gap character? (I assume '-')
  2) what is used for a stop character? (I assume '*')
  3) are selenocysteines encoded with a U?  (the pureprotein definition
      says it excludes "BZ or X", so I'm guessing selenocysteines aren't
      allowed - or are they encoded as X?)
  4) shouldn't there be a gapstopprotein?

** Topic 2

Another conversion is to create a temporary filename for the -outseq
parameter, based on the 'seqoutall' Emboss type.  I would like to read
the contents of this file into a Biopython Seq object, however, the
ACD description does not contain enough information for me to do that.
Instead, I can only create the tempfile and store the filename in
the "outseq" parameter.

Could a new 'type' parameter be added to 'seqoutall'?  This would
change revseq's "outseq" definition to be

seqoutall: outseq  [
  parameter: "Y"
  type: "dna"
  extension: "rev"
]

For applications like 'notseq' this would require using an operation:

seqoutall: outseq [
  param: Y
  type: "@($(sequence.protein)? protein : nucleic)"
]

The goal of this is to let researchers use EMBOSS from Python
without having to worry about an implementation detail - the
existance of a file system.

(BTW, I have heard there may be support in 3.0 for XML output
and the ability for all the output to be streamed to stdout.
I didn't find any details about this on my scan of the web
pages - what is the status and plans for this?)

** Topic 3:

The Emboss sequence data type contains the calculated attributes of
'protein', 'nucleic' and 'type'.  Is it that:
  - protein is true when the sequence type is
      'protein', 'gapprotein', 'pureprotein' or 'stopprotein'
  - nucleic is true when the sequence type is
      'dna', 'rna', 'puredna', 'purerna', 'nucleotide', 'purenucleotide'
      'gapnucleotide', 'gapdna', 'gaprna',
  - protein and nucleic are false for any other case
  ?

** Topic 4:

How do I force a sequence type?  The -sprotein and -snucleotide
command-line qualifiers are only boolean values, so there doesn't seem
to be any way to say an input is really a pureprotein.  Eg, there
could be a '-stype' qualifier, so I can do '-stype pureprotein'.

** Topic 5:

Given the existence of sequence.type, shouldn't most operations of the
form
    "@($(sequence.protein)? protein : nucleic)"
really be
    "$(sequence.type)"
?

This should allow better propogation of proper type information
through Emboss.

==

Okay, that's the sequence type related topic.  Now for some others,
first on parsing ACD files.  To get the parameter information I read
the ACD files.  There are actually two possible files to read: the
".acd" file and the file produced from the "-acdpretty" option.


** Topic 6:

Which is the prefered mechanism for getting ACD configuration
information?  There are advantages and disadvantages to either one.

  - The .acd file does not require executing a possibly arbitrary
  program to get its parameter information.  This can be a subtle
  security problem because the mechanism I'm using just does a
  system() call to see if the program exists, and has no qualms in
  running "rm-rf / && echo", which expands to the valid command
  "rm -rf / && echo -acdpretty".  By checking the acd file first,
  it eliminates that possibility, although it does require that
  the directory containing the .acd definitions be well-known.
  Is this well-known directory $EMBOSS_ACDROOT or is that a 1.x
  location?

  (The other possibility is to require that all Emboss executables and
  only Emboss executables be in a well-defined directory.  Looking at
  the standard 'configure', the is not usually the case - they get put
  into /usr/local/bin )

  - a problem with using the .acd file is that it may be out of synch
  with the actual exectuable

  - the -acdpretty option is problematical in that it writes its
  information to a file in the local directory.  My Python code cannot
  guarantee that the local directory is writeable, so I need to mkdir
  a temp directory then "cd $(tmpdir) && $(program) -acdpretty" then
  read "$(tmpdir)/$(program).acdpretty" then remove the directory.  It
  would be so much easier if -acdpretty option could write to stdout.
  (Eg, as when used as '-acdpretty -stdout')

  - the .acd file may use abbreviated names.  For example, it may have
  a qualifier as "param" instead of "parameter".  So the -acdpretty
  text is easier to parse.

I would prefer getting the ACD data directly from the executable.  Is
is possible to allows -stdout as an option to -acdpretty to make it
dump to stdout?  The other issues I can work around.

** Topic 7:

The ACD syntax definition is incomplete.  Here are some problems I ran
across.

> Comments start with "#" and continue to the end of the line.

Must the '#' be in the first character position?

The function ajacd.c:acdNoComment looks like it truncates the line at
the first '#', no matter where it is in the string, so the '#' doesn't
need to be the first character.

On the other hand, it looks like that bit of code doesn't understand
quoted strings.  Consider

% cat foo.acd
appl: foo [
        doc: "Who is #1?"
        groups: "Edit"
]

% ../acdc foo
Who is groups: "Edit
%

> Each line is parsed into tokens delimited by spaces

What is the definition of a token?  We also have that

> Parameters and qualifiers are defined by a single token followed by
> either a colon ':' (preferred) [1] or an equal sign '=' which in
> turn is followed by a second token.

This means a token cannot end in a ':' or a '='.  But it can contain a
':' outside of quotes, as in

   opt: @($(showall)?N:Y)

Or consider

% cat foo.acd
appl: foo [
        doc: A:
]

% ../acdc foo
A:
%

This means the ':' is not part of the first token in a
parameter/qualifier but is part of the second token.

Spaces aren't really the token delimiter.  The file 'wordcount.acd'
contains

  sequence: sequence [ param: Y type: dna]

so the token 'dna' is not space delimited before the ']'.  Also,
checktrans.acd uses 'min:1' which is not space delimited.


I'm trying to figure out how ajacd.c does it, but I'm getting lost in
the code.

To make thing even more confusing

% cat foo.acd
appl: f"oo [
        doc:A]B
]


% ../acdc foo
A]B
%

Also, the term 'space' in the documentation should be 'whitespace'
since it can skip '\t' characters.  Hmm, and looking at the code,
there's problems with how it skips the ':' characters.

% cat foo.acd
appl:::: foo [
        doc: "This is the doc."
]

% ../acdc foo
This is the doc.
%

And using a NUL character
% od -c foo.acd
0000000   a   p   p   l   :       f   o   o       [  \n
0000020                   d   o   c  \0   :       "   H   a   s       a
0000040       N   U   L       c   h   a   r   a   c   t   e   r   "  \n
0000060                                   S   t   r   a   n   g   e  \n
0000100   ]  \n
0000102
% ../acdc foo
Strange
%

So the parser code does not fully validate that the input data is in
the correct format.

> After the name, definitions are in mandatory square brackets, [],
> which can make a definition span multiple lines.

seqretallfeat.acd contains the following two lines

  endsection: secoutseq
  endsection: secinseq

which don't have the [].  My parse ends up special casing the
'endsection' declarations.  Would it be possible to use, say,

  endsection: secoutseq []

instead?  (Also, section and endsection are not defined anywhere in
that syntax document.)

> Tokens representing data types can be abbreviated up to the point
> where they are not ambiguous

That's a VMS-help-style shortcut.  As I recall, that has a
forward-compatibility problem.  For example, if a new data type called
'apple' is added, then 'a', 'ap', 'app', and 'appl' are no longer
unambiguous.

Has there been any consideration on how to deal with that?

> Values can be delimited (i.e. treated as one token) by any of the
> following pairs, which are stripped as the value is parsed :
>
>      '' {} () [] <>

It's not clear what a "value" means?  In this section there is

token: token [
  definition
]

But later on this the word 'attributes' is used instead of
'definition':

data_type: parameter_name

    attributes
]

and only then does it say what a value is:

> A defining attribute must have a second token representing the value
> of the attribute.

So perhaps there should be some cleanup of the definition.  (The
reason I needed to figure this out was to check that

appl: foo [
  "multiword attribute": N,
]

was indeed supposed to be illegal.)

There doesn't appear to be any way to escape a quote character inside
of a quoted token.  At least, not that I could see in the code.  So there's
no way to write something like

appl: foo [
        doc: "Remove the characters ""{}<>()'"
]

for the string
  Remove the characters "{}<>()'

Also, the doc says the valid characters are
    '' {} () [] <>
but that should include "double quotes"

And just why are there so many quote characters?

** Topic 8:

It took me a while to figure out that ajacd.c did the ACD parsing.
The file ajnam.c parses the .embosssrc and emboss.defaults which is
described in
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Usa/databases.html and is
*almost* in ACD format.  The difference is that it doesn't have an
'application' term and the 'DB' needs to be 'DB:'.

I can tweak my parser to handle the 'DB' term, but why can't those two
files really be in ACD format?  ... Although implementation-wise the
ajacd.c file uses static variables so it can only be used to parse one
file.

I noticed a couple problems with how ajname.c works.

   - It only understands a comment as a '#' in the first character
   position (while ajacd.c recognizes it anywhere).

   - The code uses "fgets(line, 512, file)" which looks like it can
   fail if the line is more than 511 characters, as with long file
   names.

(Actually, since this is a completely different implementation of the
parser, the failure conditions are different.  For example, there is a
namNoColon in ajname.c, but nothing to strip a '='.)

** Topic 9

There needs to be some clarification on the license.  When I looked at
the code I read the top-level "COPYING" file, which is the GPL.  I
have a policy not to look at GPL'ed code too closely, since I worry
that it may contaminate my ability to write equivalent non-GPL code,
like the BSDed Biopython code.  LGPL is not quite as bad, but even
then I write the non-FSF-licensed code first then if needed for
verification I look at the LGPL'ed code.

Since the top-level COPYING file is the GPL, that put me off looking
at any of the source code, even for verifying format requirements.  It
had to be pointed out to me that the ajax and nucleus codes are
covered under the LGPL.  I would not have discovered that on my own,
because it the multiple license use wasn't mentiond in the README.

In addition, I noticed the ./LICENSE is slightly different than the
current Version 2, June 1991 one from the FSF.  The FSF address is
wrong, and there are formatting changes.  I cannot tell if there are
any text changes.

I also noticed ./COPYING file is the GPL, except for a change in the
address and the exclusion of the section "How to Apply These Terms to
Your New Programs"

Shouldn't these be identical files, and match the current FSF GPL?


** Topic 10

What does the 'warnrange' attribute of an integer do?  (I've only
lightly scanned the table of data types so will likely have more
questions about the other fields in the future.)


** Topic 11

In scanning the code I noticed there is an indirection layer, which I
assume is to isolate the programmer from changes in the OS and C
library.  It isn't used everywhere.  For example, there's an
ajNamGetenv but several places call getenv directl.

I also did a scan looking for possible overflows and other security
problems.  Because of my inexperience with the indirection layer I
couldn't do an in-depth check, but I did notice that ajStrFromFloat
and ajStrFromDouble can fail on Inf, -Inf and NaN, for a couple of
reasons:

% cat inf.c
#include <stdio.h>
#include <math.h>
main() {
  /* float val = -1.0/0.0; */
  float val = strtod("-inf", NULL);
  char s[100];
  int precision = 0, ival, i;

  sprintf(s, "val == >>%.0f<<", val);
  puts(s);

  ival = abs((int) val);
  printf("ival = %d\n", ival);

  if (ival)
    i = precision + (int) log10((double)ival) + 4;
  else
    i = precision + 4;

  printf("i == %d\n", i);
}
% cc inf.c -lm
% ./a.out
val == >>-inf<<
ival = -2147483648
i == -2147483644
%

** Topic 12

Here's my first pass of the BNF for the ACD file.  There are various
things to fix, some of which are noted.  This can be used for every
file in the emoss/acd directory except qatest.acd (which contains a
syntax error that acdc doesn't catch -- the "int bint" field) and
testplot.acd (contains an '=' instead of a ':', which I don't yet
handle).

Lexer:
  colon = ":"
  open_block = "\["
  close_block = "\]"
  endsection = "endsection"
  key = "(?!endsection)[a-zA-Z0-9_]+(?=[\s:\][])"
  value = "[a-zA-Z0-9_]+(?![\s:\][])[^\s\]]* |
           [^\000-\037a-zA-Z0-9_:[\]\s][^\s\]]*"
  quoted = '"[^"]*"'   (only handles double quotes - need to fix)
  comment = "[#][^\n\r]*(\r|\r?\n)" SKIPPED
  whitespace = "\s+" SKIPPED

Parser:  (need to update the names to match the syntax doc)
  application ::= widget_list

  widget_list ::= widget |
                  widget widget_list

  widget ::= key colon key open_block arglist close_block
           | key colon key key
           | key colon key value
           | endsection colon key

  arglist ::= arg
           |  arg arglist

  arg ::= key colon key
        | key colon value

** Topic 13

One last thing.  The parameter information for the different ACD
data types is hard coded in ajacd.c.  If it was stored in an external
data file (in ACD format with well-defined fields :) then my Python
code could read that meta-information to build up its tables, rather
than me having to code it all by hand.


Hope this wasn't too much at once :)

    Andrew
    dalke at dalkescientific.com





More information about the EMBOSS mailing list