[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython, Biophp, even C++

Sun Mar 13 16:44:57 EST 2005

Andrew Dalke wrote:

> When I wrote my grammars I did so in strict mode, and reported
> a bunch of errors to the database providers.  The advantage
> is that wrong formats aren't accidentally parsed.  The disadvantage
> is that minor changes break the parser.
>
> I don't see any solution to this other than having someone
> track the file formats over time.
>
Sure.  If there are arbitrary and drastic changes to the file format, 
someone must be watching for them.  But one of my points was that my 
parser would likely stay valid even if NCBI changes their data 
definitions, because it's very unlikely that NCBI will change the file 
structure/format even when they change data definitions (recall that I 
said my parser doesn't care about data content?).

> I looked at the regexps.  The ones that Python doesn't
> support are \G and the compilation flags /cg .  They won't
> be in Python because the start/end positions are available
> as local variables and not as implicit globals.  It
> uses a different stylism.
>
You're right.  The /cg modifiers are exactly the ones I was talking 
about.  \G is actually supported by PCRE, so very likely in Python too 
since Python uses PCRE (please check again).  Nonetheless, without /cg, 
\G means little.  That's why I said there's gonna be a performance hit.
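
For reference, the explicit-position style mentioned in the quoted text 
can be sketched in Python roughly as below.  This is only an 
illustration under assumptions: TOKEN and tokenize are hypothetical 
names, and the token pattern is a simplified guess at ASN.1-style text, 
not the pattern used in either parser.

```python
import re

# Hypothetical token pattern (an assumption, not taken from any parser):
# braces, commas, double-quoted strings with "" as an escaped quote,
# and bare words.  Leading whitespace is skipped before each token.
TOKEN = re.compile(r'\s*(\{|\}|,|"(?:[^"]|"")*"|[^\s{},"]+)')

def tokenize(text):
    """Yield tokens by matching repeatedly at an explicit position."""
    pos = 0
    end = len(text.rstrip())
    while pos < end:
        # match(text, pos) only succeeds starting at pos, which plays
        # the role that \G with /gc plays in the Perl code.
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        yield m.group(1)
        pos = m.end()

tokens = list(tokenize('gene { locus "A ""b"" c", id 5 }'))
```

The position is carried in the local variable pos rather than in an 
implicit per-string state, so whether this costs more than the Perl 
version would have to be measured.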

> The first of these lists some tasks that can't be done
> with your approach, like being able to index all the
> records in a file by byte position.
>
Not really.  If you really want those, my parser code can be easily 
modified to record the file byte position of each token.
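
A rough sketch of what recording byte positions could look like, 
written in Python for illustration.  TOKEN_B and tokens_with_offsets 
are hypothetical names and the pattern is a simplified assumption; it 
is not code from EntrezGene.pm.

```python
import re

# Simplified, assumed token pattern compiled over bytes so that match
# offsets are byte offsets into the file rather than character offsets.
TOKEN_B = re.compile(rb'\{|\}|,|"(?:[^"]|"")*"|[^\s{},"]+')

def tokens_with_offsets(data: bytes):
    """Yield (byte_offset, token) pairs for every token found in data.

    finditer silently skips anything the pattern does not match
    (whitespace here), so this sketch does no error checking.
    """
    for m in TOKEN_B.finditer(data):
        yield m.start(), m.group()

sample = b'Entrezgene ::= {\n  id 1\n}\n'
pairs = list(tokens_with_offsets(sample))
```

An index of records could then be built by keeping only the offsets of 
the tokens that start a record.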

> Parsers can also get better performance by assuming the
> file format is correct.  Eg, your EntrezGene.pm doesn't
> detect if the file was truncated (I fed it only the first
> 1000 lines of the human genome file) while the context-free
> parsers you have will at least generate an error that
> the parentheses are unbalanced.

Yeah, my parser does not give many warnings at this stage.  I 
certainly wouldn't mind someone taking my code and adding exception 
handling.  But frankly, many parsers do not excel in this department.  
Even some XML parsers only warn when something breaks the parser.
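
As an illustration of the kind of check that could be added, here is a 
minimal Python sketch that reports truncated input by tracking brace 
nesting.  It assumes the records are delimited by curly braces and that 
strings use "" as an escaped quote; PIECE and brace_depth are 
hypothetical names, not part of any existing parser.

```python
import re

# Quoted strings are listed first so that braces inside a complete
# string are consumed as part of the string and not counted.
PIECE = re.compile(r'"(?:[^"]|"")*"|[{}]')

def brace_depth(text):
    """Return the nesting depth left open at the end of text.

    0 means the braces are balanced; a positive value suggests the
    input was cut off.  Limitation of this sketch: if the input ends
    inside an unterminated string, braces within that partial string
    are still counted.
    """
    depth = 0
    for m in PIECE.finditer(text):
        tok = m.group()
        if tok == '{':
            depth += 1
        elif tok == '}':
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched '}}' at offset {m.start()}")
    return depth

complete = brace_depth('a { b { c "}" } }')
truncated = brace_depth('a { b { c')
```

A caller could raise an error or warn whenever the returned depth is 
not zero at end of file.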

>
> One thing I note, investigating a question of Hilmar's,
> is that your tokenization of strings isn't quite complete.
> Double-quoted "strings" that contain a double quote are
> escaped ""with doubled"" double quotes.  Your tokenizer
> doesn't convert the double quotes into single ones.  My
> Martel code has the same problem.  It needed another
> layer to describe how to unescape strings and handle
> word spilling.
>
You caught me.  I was just being lazy - I noticed this a while ago, but 
decided to delay a bit since I have 4 different parsers that need to be 
modified.  Then I forgot.  (It's my fault: last night I actually 
remembered this, but I uploaded the files anyway because it's so simple 
for anybody to fix.)

I'd say you're really exaggerating when you say my tokenization of 
strings isn't complete based on this.  Not unescaping the "" escape has 
nothing to do with tokenization (it's a post-processing step after 
tokenization).  It simply takes one simple regex to fix it; no other 
layer is needed.
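
The post-processing step could look something like this in Python.  It 
is a sketch only: unescape_string_token is a hypothetical helper name, 
it assumes the token still carries its surrounding quotes, and it does 
not deal with strings wrapped across lines.

```python
import re

def unescape_string_token(token):
    """Strip the surrounding quotes and turn each "" into a single "."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("not a double-quoted string token")
    # One substitution over the string body is all the unescaping needs.
    return re.sub(r'""', '"', token[1:-1])

value = unescape_string_token('"escaped ""with doubled"" double quotes"')
```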

Thanks for your suggestions.  I think problems specific to Martel might 
not apply in this case, since the Entrez Gene file structure/format is 
really simple and is likely to stay very stable.  That's why I was 
proposing sharing this code base across languages.

Thanks,

Mingyi