[BioRuby] Parsing line-based formats with Ragel

Marjan Povolni marian.povolny at gmail.com
Sat Jun 2 16:58:39 UTC 2012


Well, not really. Currently my parser only creates a D struct named Record,
which contains only slices of the original string. So in basic conditions
there is only a string for the whole line (which can be a slice of a bigger
string), a Record which can also be located on the stack, and a dynamic
associative array for the attributes, which again maps string slices to
slices.

Additional string objects are only created when an escaped character has to
be replaced with the original. Otherwise all operations are simply handling
slices (which are start address+length, and when copied behave and cost
more like an int then an object), and most if not all operations are using
the stack. And this can be improved upon with an array which is not
immutable, which would make it possible to replace the escaped characters
in place. Given lines are not that long, everything is still in CPU cache.
Linking into trees could have problems with cache, but it depends on how
many lines the parser will keep in memory at every moment. 3MB is a lot of
lines.

And the current parser should be handling escaped characters ok now, except
for the ones with values over 0x1F.

I would certainly like to make a version of the Parser with Ragel. There
are a range of checks which I do manually with multiple functions, and
Ragel might be more optimized for those checks. And the result could be
much cleaner.

On Sat, Jun 2, 2012 at 6:24 PM, Pjotr Prins <pjotr.public14 at thebird.nl>wrote:

> On Sat, Jun 02, 2012 at 04:18:40PM +0200, Marjan Povolni wrote:
> > Cool, definitely something worth checking out for GFF3.
>
> One reason the state-machine is fast is because it does not create
> objects in memory (avoiding so called death by object creation ;).
> Data will be in the CPU cache, rather than main memory. Be interesting
> to see if Artem can run parsers on multi-core.
>
> With GFF3 line parsing, a really simple format, we immediately create
> a range of objects. Of course, this can happen on the stack too, so
> the speed advantage may not be that important.
>
> Still, I think especially for escape characters and character encodings
> this could be interesting for GFF3. Because that is the most
> complicated to get right.
>
> For now, we choose to assume GFF3 is plain ASCII. So, I guess we file
> this under 'enhancements'. Right?
>
> Pj.
>



More information about the BioRuby mailing list