[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython, Biophp, even C++

Andrew Dalke dalke at dalkescientific.com
Sun Mar 13 19:34:11 EST 2005


Mingyi Liu wrote:
> Sure. If there's arbitrary and drastic changes to file format, there 
> must be someone watching the change.  But one of my points was that 
> my parser would likely stay valid even if NCBI changes their data 
> definitions because it's very unlikely that NCBI changes their file 
> structure/format,

Ah, I was mixing two topics - using this set of regexps to parse
this file format and the general topic of using regexps portably
to parse a range of file formats.

> \G is actually supported by PCRE, so very likely in Python too since 
> Python uses PCRE (please check again).  Nonetheless, without /cg, \G 
> means little.  That's why I said there's gonna be a performance hit.

Python used to use pcre but that was replaced with sre some years
back, in part to support Unicode-based regexps.

It looks like Java's java.util.regex does support the \G
boundary matcher, according to
   http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

Personally I don't like \G because it isn't thread safe:
where it matches depends on the previous matches made with
that pattern.  I think perl solved it by making those values
thread local, but I'm not sure.
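
For what it's worth, Python's sre-based re module has no \G.  A rough
equivalent is to pass an explicit start position to a compiled
pattern's match() method, so the position lives in the caller's own
variable.  The sketch below assumes a made-up token pattern (it is not
taken from the Perl parser) and simply stops at the first character
it cannot match.

```python
import re

# Hypothetical token pattern, for illustration only: a quoted string
# (with "" as an escaped quote), a brace or comma, or a bare word.
TOKEN = re.compile(r'\s*("(?:[^"]|"")*"|[{},]|[\w-]+)')

def tokens(text):
    """Yield (offset, token) pairs, resuming each match where the last ended."""
    pos = 0
    while True:
        m = TOKEN.match(text, pos)   # anchored at pos, much like \G with /gc
        if m is None:
            break                    # end of input or an unmatched character
        yield m.start(1), m.group(1)
        pos = m.end()

result = list(tokens('geneid 1, status live'))
# result == [(0, 'geneid'), (7, '1'), (8, ','), (10, 'status'), (17, 'live')]
```

Because pos is a local variable, two threads can share the compiled
TOKEN object without interfering with each other.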

>> The first of these lists some tasks that can't be done
>> with your approach, like being able to index all the
>> records in a file by byte position.
>>
> Not really.  If you really want those, my parser code can be easily 
> modified to record the file byte position of each token.

The code I looked at took a string and there was outer
scaffolding to identify the record locations.

   my $parser = GI::Parser::EntrezGene->new();
   open(IN, "Homo_sapiens") || die "...";
   $/ = "Entrezgene ::= {";
   while(<IN>)
   {
     chomp;
     next unless /\S/;

     my $text = (/^\s*Entrezgene ::= ({.*)/si)? $1 : "{" . $_;
     my $value = $parser->parse($text, 2);
      .. do something with $value ....
   }

The actual record extraction was not part of the EntrezGene
library so I don't see what you could modify.  Perhaps add
an "offset" field to the parse method?
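
To make the byte-position idea concrete, here is a minimal Python
sketch of indexing record starts.  It assumes every record begins
with the literal text "Entrezgene ::= {" as in the loop above; the
function name is made up, the whole file is read into memory for
simplicity, and a marker that happened to occur inside a quoted
string would be counted by mistake.

```python
import io

MARKER = b"Entrezgene ::= {"

def index_records(handle):
    """Return the byte offset of every record start in a binary file handle."""
    data = handle.read()          # whole file in memory; fine for a sketch
    offsets = []
    pos = data.find(MARKER)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MARKER, pos + len(MARKER))
    return offsets

sample = io.BytesIO(
    b"Entrezgene ::= {\n  geneid 1\n}\nEntrezgene ::= {\n  geneid 2\n}\n"
)
offsets = index_records(sample)   # [0, 30]
```

With the offsets saved, a single record can later be fetched with
seek() and handed to the parser without rescanning the file.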

If you do get the byte positions of terms in the ASN.1
(e.g. to report "syntax error at line 1234 column 56") then
you would need to use the $` and $' variables, which perlvar
warns are slow, so your timings would change.

> Yeah, my parser does not give much warnings at current
> stage.  I certainly wouldn't mind someone taking my code
> and add exception handling.  But frankly many parsers do
> not excel in this department.  Even some XML parsers only
> warn when something breaks the parser.

Sadly the fun part for most people is making the parser
work correctly with correct data.  Few people like making
parsing code correctly handle incorrect data.  Hence all
the parsers which "do not excel in this department."


> You caught me.  I was just being lazy - I noticed this a while ago, 
> but decided to delay a bit since I have 4 different parsers that need 
> to be modified.
   ...
> I'd say you're really exaggerating when you said my tokenization of 
> string isn't complete based on this.

There are several layers to parsing.  One is identifying the lexical
components, which can be done with regular expressions.  The lexer
should convert these into tokens that the parser can use, which
may include things like unescaping quotes, concatenating strings, and
normalizing different numeric representations (0xa == 10 == 012
  -> the integer 10).

I don't actually know how to distinguish between these two
parts of the lexer.  One is the LHS of the pattern definition
and the other is the result of applying the RHS actions to the
matched components.  If the actions were no-ops then there
would be no difference.
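
One way to picture the split is a table of (pattern, action) pairs,
where the pattern is the LHS and the action is the RHS.  The Python
sketch below is only an illustration: the three rules are assumptions
and not the Entrez Gene grammar, and anything no rule matches is
skipped silently.

```python
import re

def to_int(s):
    # Action for INT: accept hex (0xa), C-style octal (012) and decimal (10).
    if s[:2].lower() == "0x":
        return int(s, 16)
    if len(s) > 1 and s[0] == "0":
        return int(s, 8)
    return int(s)

def unquote(s):
    # Action for STRING: strip the outer quotes, unescape doubled quotes.
    return s[1:-1].replace('""', '"')

# (name, pattern, action); the last action is the identity, i.e. a no-op.
RULES = [
    ("STRING", r'"(?:[^"]|"")*"', unquote),
    ("INT", r"0[xX][0-9a-fA-F]+|\d+", to_int),
    ("NAME", r"[A-Za-z][\w-]*", lambda s: s),
]
LEXER = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat, _ in RULES))
ACTIONS = {name: action for name, _, action in RULES}

def lex(text):
    """Return (kind, value) pairs with each rule's action already applied."""
    return [(m.lastgroup, ACTIONS[m.lastgroup](m.group()))
            for m in LEXER.finditer(text)]

result = lex('id 0xa 10 012 "say ""hi"""')
# [('NAME', 'id'), ('INT', 10), ('INT', 10), ('INT', 10), ('STRING', 'say "hi"')]
```

Replacing every action with the identity function gives back the raw
matched text, which is the case where the two parts coincide.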

Your parser though doesn't return a token stream, it returns
a parse tree, so you've already passed the step where any
sort of data conversion / normalization should take place.

But if you define that your parse tree returns the raw text
representation then it is complete.  My question - which I
haven't been able to resolve for Martel - is how should code
like this, which tries to be cross-platform, handle what
is semantically one item when it's represented as multiple
components in the input format?

Here are two examples to show how tricky that is:

      url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti
g=NT_009714.16&gene=A2M&lid=2&from=1979284&to=2027463"

       text "There is a significant genetic association of the 5 bp deletion
  and two novel polymorphisms in alpha-2-macroglobulin alpha-2-macroglobulin
  precursor with AD",

In the first the "\n" should be removed while in the second
it should be replaced with a space.

It would be nice if this behavior was also the same cross-platform.
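
As a sketch of what a shared rule might look like, the Python below
picks the replacement for a line break from a per-field table.  The
table and the function name are assumptions made for illustration;
nothing in the file itself says which fields want which treatment.

```python
import re

# Hypothetical per-field policy: what replaces a line break inside a value.
JOIN_WITH = {"url": "", "text": " "}

def join_wrapped(field, raw):
    """Collapse each line break, plus the whitespace around it, in a value."""
    sep = JOIN_WITH.get(field, " ")   # unknown fields default to a space
    # \r?\n accepts both Unix and DOS line endings.
    return re.sub(r"[ \t]*\r?\n[ \t]*", sep, raw)

url = join_wrapped("url", "evv.cgi?taxid=9606&conti\ng=NT_009714.16")
# 'evv.cgi?taxid=9606&contig=NT_009714.16'
text = join_wrapped("text", "of the 5 bp deletion\n  and two novel polymorphisms")
# 'of the 5 bp deletion and two novel polymorphisms'
```

Every port would need to agree on the same table to get the same
answer, which is the cross-platform part of the problem.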

>   Not unescaping the "" escape has nothing to do with tokenization 
> (it's a post-processing step after tokenization).  It simply take one 
> simple regex to fix it, no other layer needed.

It's post tokenization and pre parse tree assembly.  For this
case it's a simple regexp search/replace but 1) how is that handled
in a cross platform manner and 2) for the general problem it's
not as simple as a regexp.


> Thanks for your suggestions.  I think problems specific to Martel 
> might not apply in this case since Entrez Gene file structure/format 
> is really simple, and they are likely to stay very stable.  That's why 
> I was proposing sharing this code base across languages.

Indeed some of the problems don't apply.  But speaking solely for
myself and not for the Biopython project I would rather use a
validating parser that reported at least imbalanced parens,
roughly equivalent to checking for well-formed XML.
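
A minimal Python sketch of that kind of well-formedness check, for
the curly braces used in the Entrez Gene files.  The function name is
made up, and it assumes that quotes inside strings are always escaped
by doubling them.

```python
def check_braces(text):
    """Raise ValueError with line and column on the first brace imbalance."""
    stack = []                 # (line, column) of each unclosed "{"
    line, col = 1, 0
    in_string = False
    for ch in text:
        if ch == "\n":
            line, col = line + 1, 0
            continue
        col += 1
        if ch == '"':
            in_string = not in_string   # a doubled quote toggles twice
        elif in_string:
            continue                    # braces inside strings do not count
        elif ch == "{":
            stack.append((line, col))
        elif ch == "}":
            if not stack:
                raise ValueError(
                    "unexpected '}' at line %d column %d" % (line, col))
            stack.pop()
    if in_string:
        raise ValueError("unterminated string")
    if stack:
        raise ValueError(
            "unclosed '{' opened at line %d column %d" % stack[-1])
    return True
```

For example, check_braces("a {\n  b {\n}") raises ValueError with the
message "unclosed '{' opened at line 1 column 3".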

One question I have is that while I know the file format is stable,
given that it's based on ASN.1, what are the chances of new tags
being added which are still valid ASN.1 but which are not yet
present in the existing files?

For example, in reading the ASN.1 spec at
  http://asn1.elibel.tm.fr/en/standards/index.htm#x680
I see that ASN.1 could include a real number but the
Homo_sapiens file doesn't have one and your parser doesn't
handle it (it looks for [\w-]).  Mmm, and there are many
more data types in full ASN.1.

As far as I can tell, if NCBI does add a new data type that
your code doesn't support then it's very hard to tell that
the code is ignoring problems.

Consider a floating point date value (not legal according to
NCBI but legal ASN.1 ... I think - just testing the idea):

   track-info {
     geneid 1,
     status live,
     create-date std {
       year 2003.43,
       month 8,
       day 28,
       hour 20,
       minute 30,
       second 0
     },


Your code converts that into

     'track-info' => [
              {
               'geneid' => '1',
               'create-date' => [
                      {
                       'std' => [
                            {
                             'year' => [
                                    {
                                     '2003' => [
                                         undef
                                         ]
                                     }
                               ]
                             }
                           ]
                      }
                   ],
               'status' => 'live'
           }
       ]

That doesn't seem like the right behavior.  The ".43" and the
month through second fields are missing from the output, and
nothing reports a problem.
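
A sketch, in Python, of how a tokenizer could refuse such input
instead of dropping it.  The token set is an assumption modeled on
the [\w-] word pattern and is not the actual parser's; the point is
only that any character no rule matches raises an error with its
line and column.

```python
import re

# Hypothetical token set: quoted string, brace or comma, or a [\w-]+ word.
TOKEN = re.compile(r'"(?:[^"]|"")*"|[{},]|[\w-]+')
SPACE = re.compile(r"\s*")

def strict_tokens(text):
    """Tokenize text, raising ValueError at the first unmatched character."""
    out = []
    pos = SPACE.match(text, 0).end()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            line = text.count("\n", 0, pos) + 1
            col = pos - (text.rfind("\n", 0, pos) + 1) + 1
            raise ValueError("unexpected %r at line %d column %d"
                             % (text[pos], line, col))
        out.append(m.group())
        pos = SPACE.match(text, m.end()).end()
    return out
```

Given "std {\n  year 2003.43,\n}" this raises ValueError with the
message "unexpected '.' at line 2 column 12", while the same text with
a plain 2003 tokenizes normally.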


BTW, looking at what you do, I don't understand why you handle
the explicit types fields as you do.  Why does

           tag id 9606
turn into
          'tag' => [
             {
               'id' => '9606'
             }
           ],

As far as I can tell there's only a single data type
there so what about omitting the list reference?

          'tag' => {
               'id' => '9606'
             },

But I don't know enough about ASN.1.


					Andrew
					dalke at dalkescientific.com


