[Bioperl-l] Entrez Gene and bioperl-db

Fri Feb 4 18:07:36 EST 2005

Hi Peter,

On Mon, 2005-01-17 at 04:06, Peter Robinson wrote:
> Hi list,

<snip>

> 3) In the meantime I have also gotten a lex/yacc parser in C to parse
> the species-specific Gene files (which are by far the most interesting
> files in the Entrez Gene system). In principle this approach could be
> done in Perl -- straightforward but a lot of detail work. I will be
> needing this kind of thing for my work, so I will continue to work on
> this, and once it is bug-free in C I will think about ways of porting it
> to Bioperl (this might take a while). As I mentioned before on this
> list, if anybody else can do this more quickly please go ahead (but drop
> me a line); on the other hand, collaborators who like the idea of
> writing a grammar in the style of lex/yacc or ANTLR are also welcome.

I've written a script that uses Parse::RecDescent and an associated
grammar to parse the EntrezGene ASN.1 files.  Actually, I've only tested
it on the human file, but I assume it will work for the rest as well. 
According to my (admittedly shallow) understanding, Parse::RecDescent
grammars work in a fundamentally different way than yacc grammars do
(top-down recursive descent rather than bottom-up LALR parsing).
However, it is pure Perl.

The script does not create bioperl objects; it simply converts the
records into large data structures that more or less mirror the ASN.1. 
I take these and store the bits I want in a database.   It would be easy
enough to convert to bioperl objects.  However, you may not want to take
this approach as the parser itself is pretty slow (some examples
below).  My familiarity with the bioperl object model is a little rusty,
but *a lot* of instantiation would need to be done to fully encapsulate
the data represented in an EntrezGene record.  I'm guessing that the
additional time required would be considerable.
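
To make "data structures that more or less mirror the ASN.1" concrete,
here is a minimal sketch of the general idea. It is NOT the script
described above: it is written in Python rather than Perl, it is a
hand-written recursive-descent parser rather than a Parse::RecDescent
grammar, and it only handles a small subset of ASN.1 value notation
(braces, named fields, strings, integers, bare identifiers). The sample
record and all function names are invented for illustration.

```python
import re
from pprint import pprint

# Quoted strings (with "" as an escaped quote), '::=', punctuation, bare words.
TOKEN_RE = re.compile(r'"(?:[^"]|"")*"|::=|[{},]|[^\s{},"]+')
IDENT_RE = re.compile(r"[A-Za-z][\w-]*")

# Invented, heavily abbreviated fragment; not a real Entrez Gene record.
SAMPLE = '''
Entrezgene ::= {
  track-info { geneid 672, status live },
  type protein-coding,
  gene { locus "BRCA1", syn { "ALIAS1", "ALIAS2" } }
}
'''

def parse_record(text):
    """Parse 'TypeName ::= value' and return (type_name, nested_value)."""
    tokens = TOKEN_RE.findall(text)
    if len(tokens) < 3 or tokens[1] != "::=":
        raise ValueError("expected 'TypeName ::= value'")
    value, pos = parse_value(tokens, 2)
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the record")
    return tokens[0], value

def parse_value(tokens, pos):
    """Parse one value starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == "{":
        return parse_block(tokens, pos + 1)
    if tok.startswith('"'):
        return tok[1:-1].replace('""', '"'), pos + 1
    if re.fullmatch(r"-?\d+", tok):
        return int(tok), pos + 1
    return tok, pos + 1  # bare identifier, e.g. an enumerated value

def parse_block(tokens, pos):
    """Parse the inside of '{ ... }' into a dict (named fields) or a list."""
    items = []  # (name or None, value) pairs
    while tokens[pos] != "}":
        tok, nxt = tokens[pos], tokens[pos + 1]
        # An identifier followed by something other than ',' or '}' is a field name.
        if IDENT_RE.fullmatch(tok) and nxt not in (",", "}"):
            value, pos = parse_value(tokens, pos + 1)
            items.append((tok, value))
        else:
            value, pos = parse_value(tokens, pos)
            items.append((None, value))
        if tokens[pos] == ",":
            pos += 1
    pos += 1  # consume the closing '}'
    if all(name is not None for name, _ in items):
        return dict(items), pos  # an empty block also ends up here, as {}
    if all(name is None for name, _ in items):
        return [value for _, value in items], pos
    raise ValueError("block mixes named and unnamed items (not handled in this sketch)")

if __name__ == "__main__":
    type_name, record = parse_record(SAMPLE)
    print(type_name)
    pprint(record)
```

From a structure like this, pulling out "the bits I want" is plain
dictionary access, e.g. record["gene"]["locus"]. Malformed input is not
handled gracefully here; it surfaces as a ValueError or IndexError.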

The parser takes a second or two for most genes; however, this goes up
dramatically for larger records.  Here are some examples from a little
test file run on a box with fairly fast processors (2.8GHz/1MB Cache
Xeon):
  Parsing Record 1 (439656 bytes)
  Success for gene BRCA1 (LocusID 672).  Time: 2 minutes and 6 seconds.
  Parsing Record 2 (224148 bytes)
  Success for gene CFTR (LocusID 1080).  Time: 33 seconds.
  Parsing Record 3 (45261 bytes)
  Success for gene CNR1 (LocusID 1268).  Time: 1 second.
  Parsing Record 4 (570419 bytes)
  Success for gene COX2 (LocusID 4513).  Time: 5 minutes and 30 seconds.
  Parsing Record 5 (40860 bytes)
  Success for gene CYP1B1 (LocusID 1545).  Time: 1 second.
  Parsing Record 6 (42362 bytes)
  Success for gene SRY (LocusID 6736).  Time: 2 seconds.
  Parsing Record 7 (110754 bytes)
  Success for gene TRPV1 (LocusID 7442).  Time: 7 seconds.
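
One way to read these numbers is as throughput (bytes parsed per
second). The short Python sketch below just does that arithmetic on the
sizes and times listed above; the function names are made up, and since
the times are only reported to the whole second, the figures for the
1-2 second records are very coarse.

```python
import math

# (gene, record size in bytes, parse time in seconds), copied from the list above.
TIMINGS = [
    ("BRCA1", 439656, 126),   # 2 minutes and 6 seconds
    ("CFTR", 224148, 33),
    ("CNR1", 45261, 1),
    ("COX2", 570419, 330),    # 5 minutes and 30 seconds
    ("CYP1B1", 40860, 1),
    ("SRY", 42362, 2),
    ("TRPV1", 110754, 7),
]

def throughput(size_bytes, seconds):
    """Bytes parsed per second."""
    return size_bytes / seconds

def scaling_exponent(small, large):
    """Crude two-point estimate of k in: time ~ size**k."""
    (_, s1, t1), (_, s2, t2) = small, large
    return math.log(t2 / t1) / math.log(s2 / s1)

if __name__ == "__main__":
    for gene, size, secs in sorted(TIMINGS, key=lambda row: row[1]):
        rate = throughput(size, secs)
        print(f"{gene:8s} {size:7d} bytes {secs:4d} s {rate:9.0f} bytes/s")

    by_gene = {row[0]: row for row in TIMINGS}
    k = scaling_exponent(by_gene["CFTR"], by_gene["COX2"])
    print(f"CFTR -> COX2 scaling exponent: {k:.2f}")
```

For the four largest records (TRPV1, CFTR, BRCA1, COX2) throughput
falls steadily as the record grows, and the CFTR/COX2 pair gives an
exponent of roughly 2.5, i.e. parse time grows much faster than
linearly with record size. A two-point estimate like this is only a
rough indication, not a measurement of the parser's complexity.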

It may very well be possible to speed this thing up.  This was my first
foray into Parse::RecDescent land, and it was somewhat, um, painful to
get it working at all.  At this point, I'm not inclined to spend any
more time on it.  It works for my purposes.

At any rate, if you (or anyone else on the list) are interested, I'd be
happy to post the code.

-Steve
-- 
(   Stephen L. Mathias, Ph.D.  (   s m a t h i a s  (
 )  Office of Biocomputing      )  @ p o b l a n o   )
(   UNM School of Medicine     (   . h e a l t h .  (
 )                              )  u n m . e d u     )
(           http://poblano.health.unm.edu/          (