Corrections to ACD Syntax manual
James Bonfield
jkb at mrc-lmb.cam.ac.uk
Thu Feb 8 16:05:06 UTC 2001
On Thu, Feb 08, 2001 at 03:48:03PM +0000, Peter Rice wrote:
> Should be fixed - at least backslash support should be added.
The only tricky bit is dealing with the existing backslash mechanism, which is
used for adding newline characters (e.g. see transeq.acd). (Incidentally, I
couldn't find that documented either...)
For what it's worth, my own parser (written in vanilla tcl) uses the following
regular expressions for strings:
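# Edits applied in order to each STRING token: regexp/replacement pairs.
set tlist {
    {^.(.*).$}      {\1}
    {\\[ \n\r]+}    {\\n}
    {[ \n\r]+}      { }
    {\\n}           "\n"
    {\\(.)}         {\1}
}
# Token rules: regexp, token type, and the edits for that token's value.
set rules [format {
    # ...
    {"(\\.|[^"\\])*"}       STRING {%s}
    {'(\\.|[^'\\])*'}       STRING {%s}
    {<(\\.|[^>\\])*>}       STRING {%s}
    {\{(\\.|[^\}\\])*\}}    STRING {%s}
    # ...
} $tlist $tlist $tlist $tlist]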
This is my own lex-style hack. The rules are matched one at a time in the
order they are listed. If a rule matches then the token type (STRING) is added
to the token list and the token value is edited, if appropriate, based on a
series of 'regsub' calls (listed in the $tlist variable here).
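As a rough illustration of the matching loop described above (a minimal
sketch, not the original parser), the following tries each rule in order at
the front of the remaining input and edits the matched value with regsub.
The proc name tokenize, the SKIP/WORD/PUNCT token types, the reduced rule
table and the sample input are all hypothetical, chosen only to keep the
example self-contained.

```tcl
# Hypothetical, cut-down rule table: regexp, token type, edits for the value.
set rules {
    {[ \t\r\n]+}            SKIP   {}
    {"(\\.|[^"\\])*"}       STRING {{^.(.*).$} {\1} {\\(.)} {\1}}
    {[A-Za-z][A-Za-z0-9]*}  WORD   {}
    {[:=]}                  PUNCT  {}
}

# Split text into a flat {type value type value ...} list.
proc tokenize {rules text} {
    set tokens {}
    while {$text ne ""} {
        set matched 0
        foreach {re type subs} $rules {
            # Rules are tried in listed order; the first one that matches
            # a non-empty prefix of the remaining text wins.
            if {![regexp -- "^(?:$re)" $text value] || $value eq ""} {
                continue
            }
            # Consume the matched prefix.
            set text [string range $text [string length $value] end]
            # Edit the token value with each regexp/replacement pair in turn.
            foreach {sre srep} $subs {
                regsub -all -- $sre $value $srep value
            }
            # SKIP is an assumption of this sketch: whitespace is dropped.
            if {$type ne "SKIP"} {
                lappend tokens $type $value
            }
            set matched 1
            break
        }
        if {!$matched} {
            error "no rule matches at: [string range $text 0 19]"
        }
    }
    return $tokens
}

set input {info: "Say \"hi\""}
puts [tokenize $rules $input]
```

Here the quoted string comes back as a single STRING token with its
delimiters stripped and the backslash-escaped quotes unescaped.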
My STRING definition does include backslashing already, which is of course not
strictly correct at the moment (but I don't know how often it really
matters). The substitutions (tlist) are a way to simulate the EMBOSS ACD
parser's mechanism of squashing runs of whitespace into a single space
character, and adding newlines with a single backslash.
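To make the effect of those substitutions concrete, here is a small sketch
that runs the same regexp/replacement pairs over one quoted value. The
helper name apply_subs and the sample text are made up for illustration,
and using 'regsub -all' for every pair is an assumption about how the edits
are applied.

```tcl
# The substitution pairs quoted earlier in this message.
set tlist {
    {^.(.*).$}      {\1}
    {\\[ \n\r]+}    {\\n}
    {[ \n\r]+}      { }
    {\\n}           "\n"
    {\\(.)}         {\1}
}

# Hypothetical helper: apply each regexp/replacement pair in order.
proc apply_subs {subs value} {
    foreach {re rep} $subs {
        regsub -all -- $re $value $rep value
    }
    return $value
}

set bs "\\"   ;# a single backslash, kept separate for readability

# Made-up sample: a double-quoted string spread over three lines. The second
# line ends in a lone backslash (requesting a newline) and the third line
# contains backslash-escaped quotes.
set raw "\"Translate   nucleic\n   acid sequences${bs}\n   to ${bs}\"protein${bs}\"\""

set cooked [apply_subs $tlist $raw]
puts $cooked
```

The pairs strip the delimiters, turn the backslash-plus-whitespace into a
literal \n marker, squash the remaining whitespace runs to single spaces,
convert the marker into a real newline, and finally unescape the quotes, so
the result is two lines: 'Translate nucleic acid sequences' and
'to "protein"'.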
I'm not suggesting for a minute anyone copies my code, but the regular
expressions may be handy for other people trying to parse ACD.
James
--
James Bonfield (jkb at mrc-lmb.cam.ac.uk) Tel: 01223 402499 Fax: 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/