[BioRuby] Parsing line-based formats with Ragel

Peter Cock p.j.a.cock at googlemail.com
Mon Jun 4 09:27:10 UTC 2012


On Mon, Jun 4, 2012 at 6:17 AM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
>> Have to agree, and in cases where a Bio* might run into problems
>> with Ragel (Perl or Python) we can at least look at the grammar and
>> use something for those languages that is similar in concept (e.g.
>> Marpa for Perl), or go a little more roundabout and bind to
>> C-generated ones from Ragel.
>
> Also agree. Parsing is a common theme in Bio*. A state engine would
> be a great abstraction, targetting C or D, and even the interpreted
> languages. The SAM parser would be a great proof-of-concept. I am
> also very interested to see how it will perform against samtools.
>
> The spanner in the works may be that we tend to be very sloppy
> about standards. So relaxed parsers may also be needed.

When I read Artem's post about Ragel and formal grammars for
parsing bioinformatics file formats I was intrigued, but cautious.

Biopython used to have a lot of its parsers written in Martel, a
home-grown "regular expressions on steroids" parsing framework.
One significant downside was that even minor tweaks to the format
description required a good knowledge of regular expressions
and how the Martel grammar worked. This created a significant
barrier to entry, e.g. inserting a new optional line type at a
particular point in a file format was initially quite daunting,
leaving parser maintenance in the hands of a few people.
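
To make that barrier concrete, here is a minimal sketch using only
Python's standard re module. It is not Martel's actual API; the toy
ID/DE/KW/SQ record layout and all names below are assumptions made up
for illustration. It shows how adding one optional line type means
editing the composed expression at exactly the right position:

```python
import re

# Toy flat-file record (hypothetical format, not a real standard):
# an ID line, a DE line, then a sequence line.
ID_LINE = r"ID   (?P<id>\S+)\n"
DE_LINE = r"DE   (?P<description>.*)\n"
SQ_LINE = r"SQ   (?P<sequence>[A-Za-z]+)\n"

# New optional line type. The whole group is optional, so old
# records without a KW line must still match.
KW_LINE = r"(?:KW   (?P<keywords>.*)\n)?"

# Original grammar: knows nothing about KW lines.
RECORD_V1 = re.compile(ID_LINE + DE_LINE + SQ_LINE)

# Tweaked grammar: KW is accepted only between DE and SQ.
RECORD_V2 = re.compile(ID_LINE + DE_LINE + KW_LINE + SQ_LINE)

old_record = "ID   X1\nDE   test entry\nSQ   ACGT\n"
new_record = "ID   X1\nDE   test entry\nKW   kinase\nSQ   ACGT\n"

# The old grammar rejects the new record outright.
rejected_by_v1 = RECORD_V1.fullmatch(new_record) is None

# The tweaked grammar accepts both old and new records.
match_old = RECORD_V2.fullmatch(old_record)
match_new = RECORD_V2.fullmatch(new_record)
```

Even in this tiny case the maintainer has to know that the optional
group belongs between the DE and SQ fragments and that an unmatched
named group comes back as None; a real format description has many
more such interactions.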

(The reason we ended up dropping Martel was a combination
of poor scaling with large datasets, problems with a third-party
library API change, and lack of time from the original author to
work on it. Most of our parsers are now 'pure Python'.)

It would not surprise me if over half the time spent on writing
a parser goes on dealing with corner cases and borderline invalid
inputs, and that a formal grammar may not be the best way to
deal with 'messy' data. But I would hope SAM/BAM files would
be well enough behaved to make this worth trying.
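
As a rough illustration of the strict-versus-relaxed trade-off, here
is a minimal pure-Python sketch of parsing one SAM alignment line. The
function name parse_sam_line, the strict flag and the "problems" key
are hypothetical, and only two of the many possible validity checks
are shown; it is not how samtools or any Bio* library does it:

```python
import re

# CIGAR is either '*' or one or more <length><operation> pairs.
CIGAR_RE = re.compile(r"\*|(?:[0-9]+[MIDNSHP=X])+")

# The eleven mandatory SAM alignment fields, in order.
FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
          "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")


def parse_sam_line(line, strict=True):
    """Parse one SAM alignment line into a dict (hypothetical helper).

    With strict=True any format problem raises ValueError; with
    strict=False the problems are recorded and parsing carries on.
    """
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < len(FIELDS):
        # Too few columns is unrecoverable even in relaxed mode.
        raise ValueError("expected at least 11 fields, got %d" % len(parts))

    record = dict(zip(FIELDS, parts[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["tags"] = parts[11:]  # optional fields, left unparsed here

    problems = []
    if not CIGAR_RE.fullmatch(record["cigar"]):
        problems.append("malformed CIGAR %r" % record["cigar"])
    if ("*" not in (record["seq"], record["qual"])
            and len(record["seq"]) != len(record["qual"])):
        problems.append("SEQ and QUAL lengths differ")

    if strict and problems:
        raise ValueError("; ".join(problems))
    record["problems"] = problems
    return record
```

Calling parse_sam_line(line, strict=False) on a borderline record
returns the parsed fields plus a list of what was wrong, whereas the
default strict mode refuses it; a grammar-generated parser would need
some equivalent escape hatch for 'messy' input.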

Regards,

Peter


