ACD file and emboss.default file syntax

James Bonfield jkb at mrc-lmb.cam.ac.uk
Thu Feb 20 12:04:39 UTC 2003


On Thu, Feb 20, 2003 at 11:34:13AM +0000, Peter Rice wrote:
> I am cleaning up the parsing of both ACD files and the emboss.default 
> files. This includes adding diagnostic messages to say what problems 
> were found and to report the line number (and filename).

Diagnostic messages are a definite help.

> At the same time, some of the syntax can be tightened. For example, ACD 
> files allowed some strange characters that were never used (parentheses 
> instead of quotes, "=" instead of ":"). These will be removed.

For what it's worth, the ACD parser in Spin (Staden Package) already copes
with these things, and also with the variations in spacing.

However, I freely admit that the parser might have been easier to develop with
the changes you propose, so they sound like a sensible way of promoting more
interfaces. I'm not 100% convinced on that though; by far the easiest way of
helping people is to provide a full and complete BNF grammar. The
documentation was not sufficiently clear, as I recall. I ended up writing my
own version of lex in Tcl and a hand-coded parser. E.g. I use regexp matching
for identifiers, and it's no harder to match regexp "[^ \t\n:=]+[ \t\n]*[:=]"
than "[^:]+:", although the latter is obviously more readable.
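
As a rough sketch of that comparison (not the Spin code itself): the two
patterns are the ones quoted above, with a "^" anchor and a capture group
added purely for illustration. match_id is a made-up helper name, and the
input strings are invented examples rather than lines from a real ACD file.

```tcl
# Hypothetical helper: return the identifier at the start of $text when it is
# followed by a separator accepted by $pattern, or "" if nothing matches.
proc match_id {pattern text} {
    if {[regexp -- $pattern $text -> id]} {
        return $id
    }
    return ""
}

# Tolerant form: optional whitespace, then ":" or "=".
set tolerant {^([^ \t\n:=]+)[ \t\n]*[:=]}
# Strict form: anything up to the first ":".
set strict {^([^:]+):}

puts [match_id $tolerant "default = 10"]   ;# default
puts [match_id $tolerant "default: 10"]    ;# default
puts [match_id $strict   "default: 10"]    ;# default
puts [match_id $strict   "default = 10"]   ;# prints an empty line: no ":" present
```

Either pattern is a one-liner for the tokeniser; the strict one simply rejects
the "=" spelling instead of accepting it.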

> There are also differences in the definitions of comments. In ACD files 
> any text after a "#" is ignored. In emboss.default comments must start 
> at the beginning of the line. This seems preferable as occasionally a 
> "#" character could be useful in a definition.

This is one change which could cause problems. Existing parsers should, in
theory, already be handling the complexities of : vs =, different quoting
syntaxes, and varying whitespace. So the changes to these will help new code
and not have any effect on existing parsers.

Changing comments, though, will make existing parsers misparse files where #
is used in a definition. However, I guess the change needs to be made for the
reason you give (# possibly being a useful character).
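
To make the difference concrete, here is a minimal sketch of the two comment
rules. Both proc names are invented for this example, the input line is a
made-up definition, and "beginning of the line" is assumed to mean the very
first character with no leading whitespace.

```tcl
# Current ACD rule: everything from the first "#" to end of line is ignored.
proc strip_anywhere {line} {
    set idx [string first "#" $line]
    if {$idx >= 0} {
        set line [string range $line 0 [expr {$idx - 1}]]
    }
    return [string trimright $line]
}

# emboss.default rule (as assumed here): only a "#" in the first column
# starts a comment; any other "#" is ordinary text.
proc strip_line_start {line} {
    if {[string index $line 0] eq "#"} {
        return ""
    }
    return [string trimright $line]
}

set line {default: "#FF0000"}
puts [strip_anywhere $line]
puts [strip_line_start $line]
```

The first call truncates the definition at the "#", leaving an unterminated
quote, while the second returns the line unchanged. A parser still using the
old rule would therefore break on exactly the files the new rule is meant to
allow.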

> 6. Should the ACD attribute names (required, information, ...) be 
> abbreviated (see question 4)?

My approach was to specify the grammar at a higher level (ID, STRING, etc.)
and then use code to match each ID against a known list of words. Literally:

                foreach word {application information default required \
                              optional expected documentation outfile \
                              parameter needed delimiter codedelimiter \
                              values selection minimum maximum dirlist} {
                    if {[string match -nocase ${id_v}* $word]} {
                        set id_v $word
                        break
                    }
                }
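
Note that the loop above takes the first word that matches, so "d" becomes
"default". If abbreviations are kept, a stricter variant could reject
ambiguous prefixes instead. The following is only a sketch of that idea, not
what Spin does; expand_abbrev is a hypothetical name, and it compares plain
prefixes rather than glob patterns.

```tcl
# Hypothetical helper: expand an abbreviated attribute name against a word
# list. Returns the full word when the abbreviation is a complete word or a
# prefix of exactly one word; raises an error otherwise.
proc expand_abbrev {abbrev words} {
    set abbrev [string tolower $abbrev]
    set hits {}
    foreach word $words {
        if {$word eq $abbrev} {
            return $word
        }
        if {[string equal -length [string length $abbrev] $abbrev $word]} {
            lappend hits $word
        }
    }
    if {[llength $hits] == 1} {
        return [lindex $hits 0]
    }
    if {[llength $hits] == 0} {
        error "unknown attribute \"$abbrev\""
    }
    error "ambiguous attribute \"$abbrev\": [join $hits {, }]"
}

set words {application information default required optional expected
           documentation outfile parameter needed delimiter codedelimiter
           values selection minimum maximum dirlist}

puts [expand_abbrev req $words]    ;# required
puts [expand_abbrev INFO $words]   ;# information
catch {expand_abbrev d $words} msg
puts $msg
```

With this word list the last call reports "d" as ambiguous between default,
documentation, delimiter and dirlist.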

Having full names, though, makes the 'lex' type part easier, as the tokenising
can break things down into more specific words (APPLICATION, INTEGER, etc.)
rather than just ID. It is possible to do this with regexps right now, but
only if you're willing to put up with ones like "var(i(a(b(l(e)?)?)?)?)?".
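
Such patterns need not be written by hand. A small sketch of generating one
from a keyword follows; prefix_regexp and its "min" parameter are invented for
this example, and the word is assumed to contain only plain letters (no regexp
metacharacters are escaped).

```tcl
# Hypothetical helper: build a regexp matching every prefix of $word that is
# at least $min characters long.
proc prefix_regexp {word {min 1}} {
    set re [string range $word 0 [expr {$min - 1}]]
    set tail ""
    foreach ch [split [string range $word $min end] ""] {
        append re "(" $ch
        append tail ")?"
    }
    return $re$tail
}

set re [prefix_regexp variable 3]
puts $re                             ;# var(i(a(b(l(e)?)?)?)?)?
puts [regexp -- "^(?:$re)\$" vari]   ;# 1
puts [regexp -- "^(?:$re)\$" varx]   ;# 0
```

That keeps the keyword table as the single source of truth, although the
generated patterns are no more readable than the hand-written one.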

Oddly I dealt with types and attributes in a slightly different way, so I
can only deal with "int" and "integer" and not "integ". I'm not sure why I did 
it that way though; sloppiness it seems.

James

-- 
James Bonfield (jkb at mrc-lmb.cam.ac.uk)   Fax: (+44) 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/


