[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython,
Biophp, even C++
Mingyi Liu
mingyi.liu at gpc-biotech.com
Sun Mar 13 21:44:57 EST 2005
Andrew Dalke wrote:
> Python used to use pcre but that was replaced with sre some years
> back, in part to support Unicode-based regexps.
>
I see. It doesn't matter much anyway. I do want to note that \G with /cg
is purely for parser efficiency; s/// would work just fine, except at
least an order of magnitude slower on large Entrez Gene records. So,
just as I said, porting is fine, but performance will take a hit. Then
again, any parser relying on regexes would need \G /cg for performance,
and would take the same hit when ported.
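For what it's worth, the anchored-matching idiom ports fairly directly: Python's re module can match at a fixed offset, which is roughly what \G with /cg buys in Perl. A minimal sketch, where the token pattern is illustrative and not the parser's actual grammar:

```python
import re

# Illustrative token pattern: braces, bare words, or quoted strings.
TOKEN = re.compile(r'\s*([{}]|[\w-]+|"[^"]*")')

def tokenize(text):
    """Scan tokens from a moving position, like Perl's m/\\G.../gc,
    instead of consuming the string with s/// (which copies it each time)."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN.match(text, pos)  # anchored at pos, like \G
        if not m:
            break
        tokens.append(m.group(1))
        pos = m.end()               # advance, like pos() in Perl
    return tokens

print(tokenize('gene { id 9606 }'))
# ['gene', '{', 'id', '9606', '}']
```

The key point is that `Pattern.match(text, pos)` anchors at `pos` without slicing the string, so the scan stays linear even on very large records.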
> The code I looked at took a string and there was outer
> scaffolding to identify the record locations.
>
> The actual record extraction was not part of the EntrezGene
> library so I don't see what you could modify. Perhaps add
> an "offset" field to the parse method?
>
It seems what you're looking for in a parser is a do-it-all text
processor: it parses, it indexes, and it adapts (read on for my comment
on that last one). But I said explicitly that my parser is a parser
only. Now, with that out of the way, let me address your question. Yes,
since my parser is a parser only, if you want to use it for indexing
you'd have to keep positions in outer scaffolding or custom programs,
and make simple changes like calling the pos function after token
generation to record the position of each token in the input string (a
truncated Entrez Gene record). It's all doable, but I just wouldn't put
the indexing code into a parser.
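As a sketch of what such outer scaffolding might look like (hypothetical code, not part of my parser): record the byte offsets of top-level blocks with a single depth counter, ignoring braces inside quoted strings for brevity:

```python
def index_records(text):
    """Return (start, end) offsets of each top-level { ... } block,
    tracked with one depth counter. A real indexer would also skip
    braces that occur inside quoted strings."""
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                spans.append((start, i + 1))
    return spans

print(index_records('a { b { } } c { }'))
# [(2, 11), (14, 17)]
```

The parser itself never needs to know these offsets; the scaffolding records them and hands each slice to the parser.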
> If you do get the byte positions of terms in the ASN.1
> (eg to report "syntax error at line 1234 column 56") then
> you would need to use the $` and $' fields, which perlvar
> warns is slow, so your timings would change.
Yeah, I know. If my parser tries to do more, sure it'd get slower. ;-)
> There are several layers to parsing. ...
>
> But if you define that your parse tree returns the raw text
> representation then it is complete. My question - which I
> haven't been able to resolve for Martel - is how should code
> like this, which tries to be cross-platform, handle what
> is semantically one item when it's represented as multiple
> components in the input format?
>
> Here are two examples to show how tricky that is
>
> url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti
> g=NT_009714.16&gene=A2M&lid=2&from=1979284&to=2027463"
>
> text "There is a significant genetic association of the 5 bp
> deletion
> and two novel polymorphisms in alpha-2-macroglobulin
> alpha-2-macroglobulin
> precursor with AD",
>
> In the first the "\n" should be removed while in the second
> it should be replaced with a space.
>
> It would be nice if this behavior was also the same cross-platform.
>
I think the phrase you were looking for, instead of "what is
semantically one item when it's represented as multiple components in
the input format?", is simply "context-sensitive rules". Context
sensitivity can be cross-platform, but my parser does not need to deal
with it (note that how to replace the "\n" really is the user's
preference and none of the parser's business: you might want to replace
the second one with a space, but another person might want it replaced
with "<br>"). Even if you find a better example, I could suggest you
look at my Parse::RecDescent-based parser, since Parse::RecDescent
allows context-sensitive grammars. One should also know that coding
context sensitivity in a regex is not that hard, but you do need a
well-defined set of scenarios and rules.
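To illustrate that this is a caller-side decision rather than the parser's job, here is a sketch of per-field post-processing of parsed string values. The field names and cleanup rules are my assumptions for illustration, not anything the parser imposes:

```python
def clean_value(field, value, text_sep=' '):
    """Apply the caller's own newline policy to a parsed string value.
    URLs drop the wrap entirely; free text uses whatever separator the
    caller prefers (' ' for plain text, '<br>' for HTML, etc.)."""
    if field == 'url':
        return value.replace('\n', '')
    return value.replace('\n', text_sep)

print(clean_value('url', 'http://example.org/a?b=1&\nc=2'))
# http://example.org/a?b=1&c=2
print(clean_value('text', 'five bp\ndeletion'))
# five bp deletion
print(clean_value('text', 'five bp\ndeletion', text_sep='<br>'))
# five bp<br>deletion
```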
>
> It's post tokenization and pre parse tree assembly. For this
> case it's a simple regexp search/replace but 1) how is that handled
> in a cross platform manner
My parser is regex based. Any change in the Perl parser could be
reflected in the other languages (I still prefer "language" over
"platform", since that is really the point: my parsers are already
cross-platform, supported by any platform that supports Perl). There
could be changes that are needed, like unsupported regex modifiers, but
you wouldn't expect porting across languages to require no work from
developers, right? What needs to be done should be determined case by
case; I can't think of a generic response that is a panacea for all
porting cases.
> and 2) for the general problem it's
> not as simple as a regexp.
>
Exactly. As I mentioned in the comments on my parsers, when things get
more complex, use grammar-based tools instead. Right now, for Entrez
Gene, regex works and it works best; that's why I mostly talk about
this one. But you're very welcome to check the other ones out for
completeness.
> Indeed some of the problems don't apply. But speaking solely for
> myself and not for the Biopython project I would rather use a
> validating parser that reported at least imbalanced parens,
> roughly equivalent to checking for well-formed XML.
Of course. I would point out that such checking can easily be added to
my parser, with one variable tracking depth; that's all that's needed,
since Entrez Gene only has one type of block delimiter. I'll probably
do it when I have time next week, since it's only three lines of code
or so. But then again, I am starting to realize that you would rather
use some other parser anyway.
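A sketch of that single-counter check (in Python, since the thread is about porting; a real check would also skip braces inside quoted strings):

```python
def check_balanced(text):
    """Validate brace balance with one depth counter, reporting the
    position of the first imbalance -- roughly the well-formedness
    check discussed above."""
    depth = 0
    for lineno, line in enumerate(text.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth < 0:
                    raise ValueError(
                        f"unmatched '}}' at line {lineno} column {col}")
    if depth:
        raise ValueError(f"{depth} unclosed '{{' at end of input")

check_balanced('{ a { b } }')   # passes silently
```

Because Entrez Gene's text ASN.1 has only the one block delimiter, this really is all the structural validation the format needs.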
>
> For example, in reading the ASN.1 spec at
> http://asn1.elibel.tm.fr/en/standards/index.htm#x680
> I see that ASN.1 could include a real number but the
> Homo_sapiens file doesn't have one and your parser doesn't
> handle it (it looks for [\w-]). Mmm, and there are many
> more data types in full ASN.1.
>
Mmm, you really tried hard to let me know that my parser cannot do it
all. ;-) Well, read on for my response.
> As far as I can tell, if NCBI does add a new data type that
> your code doesn't support then it's very hard to tell that
> the code is ignoring problems.
Good point. I'll add one line in the _parse function to do a catch-all
error reporting.
>
> Consider a floating point date value (not legal according to
> NCBI but legal ASN.1 .. I think - just testing the idea)
> ...
> year 2003.43,
> ...
>
> Your code converts that into
> ...
> '2003' => [
> undef
> ]
> ...
> That doesn't seem like the behavior it should do.
>
Well, your point that my parser is not a general ASN.1 parser is well
taken, especially since I never claimed it to be one. If you're looking
for an ASN.1 Perl parser, I heard on the mailing list that someone has
already made one, and it could be of help to you.
>
> BTW, looking at what you do, I don't understand why you handle
> the explicit types fields as you do. Why does
>
> tag id 9606
> turn into
> 'tag' => [
> {
> 'id' => '11'
> }
> ],
>
> As far as I can tell there's only a single data type
> there so what about omitting the list reference?
>
> 'tag' => {
> 'id' => '11'
> },
>
> But I don't know enough about ASN.1.
>
This has nothing to do with ASN.1. It is all about how uniform the data
structure can be. In fact, consider what happens when NCBI decides to
emit
{
tag id 12345,
tag str "whatever"
}
which is far more likely than the cases you considered in your earlier
criticisms; then the data structure would need to become:
'tag' => [
{
'id' => '12345',
'str' => 'whatever'
}
],
With your suggested approach, this would force the user to test what
type of reference $hash{'tag'} is before dealing with it as either a
hash or an array. With my approach, the user always knows to deal with
it as an array. This is also exactly the reason (I would guess) why
XML::Simple has the 'ForceArray' option, if you recall.
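A minimal illustration of that ForceArray-style convention (the helper name is mine, for illustration):

```python
def add_child(node, key, value):
    """Always append children into a list, so the caller never has to
    test whether a value is a single dict or a list of dicts."""
    node.setdefault(key, []).append(value)

record = {}
add_child(record, 'tag', {'id': '12345'})
add_child(record, 'tag', {'str': 'whatever'})

# Whether 'tag' appeared once or many times, the access pattern
# is identical -- no type test needed:
ids = [t.get('id') for t in record['tag']]
print(record['tag'])
# [{'id': '12345'}, {'str': 'whatever'}]
```

This is the trade-off: one extra level of indexing for single values, in exchange for never branching on the container type.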
Now, the promised response to the criticism that my parser doesn't do:
1. Indexing of the Entrez Gene file.
2. Adaptive behavior when a new format comes out.
3. (Semi-?)automatic cross-language porting.
4. Full support for ASN.1 parsing.
It's really simple, in case you didn't already know: my parser is just
an Entrez Gene parser. It is not designed to do those things. You
really went out of your way to show me that my parser doesn't do
everything, but failed to show me why it cannot be a reasonable Entrez
Gene parser, which was your main point. I also don't understand why you
dismiss my parser right away as a candidate for porting to other
languages when I could address your valid concern next week with a few
lines of code. Why? I can understand that you were possibly put off by
my perhaps naive-seeming enthusiasm about the prospect of porting this
fast parser to other languages. But I was pretty happy with the parser
I made, simply because:
1. Plenty of people say that they have a working Entrez Gene parser,
but, probably for various reasons like IP issues or project
constraints, no one has posted one yet (at least I couldn't find one
after plenty of searching). Mine is the first one I could find that's
in the public domain and in Perl.
2. My parser is short, and not written in guru style (since I'm far
from a Perl guru), so it's easy to understand.
3. It's OO, with POD and example scripts, so it's very easy to use.
4. Most importantly, it's freakishly fast without making mistakes on
the NCBI Entrez Gene downloads.
My enthusiasm is based on the belief that there is no Perl parser out
there that's better than mine overall when points 2-4 are considered.
And point 1 is just a trump card. I thought it would be helpful to the
many who want a GPL-ed Entrez Gene parser.
Nonetheless, if you just don't want to use my parser, you can simply
say so (or tell me why it doesn't work as a portable Entrez Gene
parser). Frankly, reading your emails, I was initially glad that we
were having a useful discussion on parsers, but the endless picking at
progressively absurd tasks for an Entrez Gene parser to do (being
unable to index, adapt to arbitrary changes, auto-port, or parse the
full ASN.1 specification) really changed my opinion, particularly
because I doubt anyone using any language would look for those in an
Entrez Gene parser. Again, FYI, it's only a parser, and I have
repeatedly said it's only a parser that constructs a data structure.
But I certainly welcome good suggestions, and I'll add some basic error
reporting next week. I didn't think it was needed since, again, I have
already parsed and checked the results on human, mouse, and rat, but
it's still a good idea, and thanks for the suggestion! If someday you
work out a fast parser and/or one that does it all, in either Python or
Perl, I'd like to know too. I'm always thrilled to learn useful things.
Thanks,
Mingyi
BTW, I realize that in my last email I was a bit too broad in my
criticism of the early attitude that users have to do work to use
software. I should have said that it was just some of the early
software packages that gave that impression; even though they were only
a few, the impression could loom large. If that's what threw you off, I
apologize.