Corrections to ACD Syntax manual

James Bonfield jkb at mrc-lmb.cam.ac.uk
Thu Feb 8 16:05:06 UTC 2001


On Thu, Feb 08, 2001 at 03:48:03PM +0000, Peter Rice wrote:
> Should be fixed - at least backslash support should be added.

The only tricky bit is dealing with the existing backslash mechanism, which is
used for adding newline characters (e.g. see transeq.acd). (Which, incidentally,
I couldn't find documented either...)

For what it's worth, my own parser (written in vanilla Tcl) uses the following
regular expressions for strings:

set tlist { 
    {^.(.*).$} {\1} 
    {\\[ \n\r]+} {\\n} 
    {[ \n\r]+} { } 
    {\\n} "\n" 
    {\\(.)} {\1} 
} 

set rules [format { 
    # ...
    {"(\\.|[^"\\])*"}           STRING          {%s} 
    {'(\\.|[^'\\])*'}           STRING          {%s} 
    {<(\\.|[^>\\])*>}           STRING          {%s} 
    {\{(\\.|[^\}\\])*\}}        STRING          {%s} 
    # ...
} $tlist $tlist $tlist $tlist]

This is my own lex-style hack. The rules are matched one at a time in the
order they are listed. If a rule matches then the token type (STRING) is added 
to the token list and the token value is edited, if appropriate, based on a
series of 'regsub' calls (listed in the $tlist variable here).
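
To make the matching order concrete, here is a minimal sketch of that kind of
loop. It is not my actual parser: the proc name tokenize, the cut-down rule
table and the SPACE/WORD/PUNCT token types are all made up for illustration,
and it assumes a Tcl recent enough to have "regexp -start".

```tcl
# Hypothetical rule table: {pattern token-type edit-list} triples, tried in order.
set rules {
    {[ \t\r\n]+}        SPACE   {}
    {"(\\.|[^"\\])*"}   STRING  {{^.(.*).$} {\1} {\\(.)} {\1}}
    {[A-Za-z0-9_]+}     WORD    {}
    {[:=]}              PUNCT   {}
}

# Hypothetical helper: split text into {type value} pairs, first matching rule wins.
proc tokenize {text rules} {
    set tokens {}
    set pos 0
    set len [string length $text]
    while {$pos < $len} {
        set matched 0
        foreach {pattern type edits} $rules {
            # \A pins the match to the -start offset, so a rule cannot skip ahead.
            if {![regexp -start $pos -- "\\A(?:$pattern)" $text value]} {
                continue
            }
            # Advance by the raw match length before the value is edited.
            incr pos [string length $value]
            # Edit the token value with a series of regsub calls.
            foreach {from to} $edits {
                regsub -all -- $from $value $to value
            }
            # Whitespace is consumed but not recorded (an assumption of this sketch).
            if {$type ne "SPACE"} {
                lappend tokens [list $type $value]
            }
            set matched 1
            break
        }
        if {!$matched} {
            error "no rule matches at offset $pos"
        }
    }
    return $tokens
}

puts [tokenize {info: "a \"b\""} $rules]
```

With this table the example line should yield a WORD, a PUNCT and a STRING
token, the last one with its quotes stripped and the backslashes removed.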

My STRING definition already handles backslash escapes, which is of course
not strictly correct at the moment, since the ACD parser does not support them
yet (but I don't know how often it really matters). The substitutions (tlist)
are a way to simulate the EMBOSS ACD parser mechanism of squashing multiple
whitespace characters into a single space character, and adding newlines with
a single backslash.
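
As a rough illustration of what the substitutions do to one token, the sketch
below applies the pairs in order with "regsub -all". The helper name edit_token
and the sample string are invented for this example, and using -all is an
assumption about how the pairs are meant to be applied.

```tcl
# Substitution list as given above: pattern/replacement pairs, applied in order.
set tlist {
    {^.(.*).$} {\1}
    {\\[ \n\r]+} {\\n}
    {[ \n\r]+} { }
    {\\n} "\n"
    {\\(.)} {\1}
}

# Hypothetical helper: run every pattern/replacement pair over the token value.
proc edit_token {value edits} {
    foreach {pattern replacement} $edits {
        regsub -all -- $pattern $value $replacement value
    }
    return $value
}

# A quoted string with a run of spaces and a backslash at the end of a line.
set raw "\"one   two \\\n   three\""
puts [edit_token $raw $tlist]
```

Step by step: the quotes are stripped, the backslash plus following whitespace
becomes a literal \n, the run of spaces collapses to one space, \n turns into a
real newline, and any remaining backslash escapes are unwrapped. The result
here should be "one two " followed by a newline and "three".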

I'm not suggesting for a minute anyone copies my code, but the regular
expressions may be handy for other people trying to parse ACD.

James

-- 
James Bonfield (jkb at mrc-lmb.cam.ac.uk)   Tel: 01223 402499   Fax: 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/





