[Bioperl-l] Entrezgene parser: new parser available from sourceforge (attached too) (fwd)

Tue Apr 5 15:06:47 EDT 2005

As often happens, NCBI introduced some small, but interesting changes to
their ASN entrezgene format. Therefore Mingyi had to change the underlying
low-level parser. Anyone who uses his parser directly will have to update.
This will also delay the release of the first version of the bioperl
entrezgene parser, which I anticipated to be on Thursday. I still hope I
will commit the code on Friday.
Stefan

---------- Forwarded message ----------
Date: Tue, 05 Apr 2005 13:35:57 -0400
From: Mingyi Liu <mingyi.liu at gpc-biotech.com>
To: Stefan A Kirov <skirov at utk.edu>
Subject: new parser available from sourceforge (attached too)

Hi, Stefan,

I attached the new version to this email.  Unfortunately as expected,
the new version is much slower due to the use of lookahead regexes
(needed to accommodate the meaningless/buggy changes NCBI introduced).
About 30% slower in my test.  Still can't find a good reason why they
introduced those 3 different types of changes.  Can't be one bug.

Anyways.  Thanks for letting me know so early!

Mingyi
-------------- next part --------------
=head1 NAME

GI::Parser::EntrezGene - Regular expression-based Perl Parser for NCBI Entrez Gene.

=head1 SYNOPSIS

  use GI::Parser::EntrezGene;

  my $parser = GI::Parser::EntrezGene->new();
  open(IN, "Homo_sapiens") || die "can't open the Entrez Gene human genome ASN.1 file! -- $!\n";
  $/ = "Entrezgene ::= {";
  while(<IN>)
  {
    chomp;
    next unless /\S/;
    # parse the entry
    my $text = (/^\s*Entrezgene ::= ({.*)/si)? $1 : "{" . $_;
    my $value = $parser->parse($text, 2); # $value contains data structure for the 
                                # record being parsed. 2 indicates the recommended 
                                # trimming mode of the data structure
  }
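
The record-splitting idiom used above can be tried in isolation. The following
is a minimal sketch, assuming a tiny made-up two-record input held in a string
(the field names and values are illustrative only, not a real Entrez Gene file).
It shows how setting $/ to the record header makes each read return one record
body, and why the opening brace has to be put back afterwards.

```perl
use strict;
use warnings;

# Hypothetical miniature input: two records, each introduced by the header.
my $sample = "Entrezgene ::= {\n  geneid 1\n}\nEntrezgene ::= {\n  geneid 2\n}\n";

my @records;
{
  open(my $in, '<', \$sample) or die "can't open in-memory file: $!";
  local $/ = "Entrezgene ::= {";   # read up to (and including) each header
  while (my $chunk = <$in>) {
    chomp $chunk;                  # strips the trailing header, i.e. $/
    next unless $chunk =~ /\S/;    # the first chunk is empty: skip it
    push @records, "{" . $chunk;   # restore the brace removed by chomp
  }
  close $in;
}

print scalar(@records), " records\n";
```

Using "local $/" inside a block keeps the changed record separator from leaking
into the rest of the program.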

=head1 PREREQUISITE

GI::Parser::EntrezGene requires a utility module GI::Parser::Util, which can be
downloaded at the same location as this module 
( http://sourceforge.net/projects/egparser/ ).

=head1 INSTALLATION

Put EntrezGene.pm, Util.pm into your perl module directory (for example, if your 
Perl modules are located in /usr/lib/perl5/site_perl/5.6.1, then you should put 
EntrezGene.pm, Util.pm into /usr/lib/perl5/site_perl/5.6.1/GI/Parser directory).

=head1 DESCRIPTION

GI::Parser::EntrezGene is a regular expression-based Perl Parser for NCBI Entrez
Gene genome databases ( http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene ).  It
parses an ASN.1-formatted Entrez Gene record and returns a data structure that 
contains all data items from the gene record.

As of March 7th, 2005, the parser version 1.0 was tested on Entrez Gene human, 
mouse and rat genome annotation files (which took around 660, 520 and 195 seconds 
respectively to parse on one 2.4 GHz Intel Xeon processor).  Note that the addition
of validation and error reporting in 1.03 slows the parser down: it needs around
12 minutes (instead of the previous 11 minutes) to parse the human genome file.
V1.03 can process the "All_Data" file that contains all Entrez Gene genomes in
98 minutes.

=head1 SEE ALSO

The parse_entrez_gene_example.pl is a very important and complete demo on using
this module to extract all data items from Entrez Gene records.  Do check it out
in the package (included since an update to V1.04 release)!  In fact, this script
took me about 3-4 times more time to make for my project than the parser itself.
Note that the included example script was edited to leave out project-specific
stuff.

For details on the various parsers I generated for Entrez Gene, and on example
scripts that use/benchmark the modules, please see
http://sourceforge.net/projects/egparser/ or refer to our paper on them
(see CITATION section).

Note that GI::Parser::EntrezGene is the fastest module in the bunch.
GI::Parser::EntrezGenePRD is the slowest (by far! when parsing long records).

=head1 AUTHOR

Dr. Mingyi Liu <mingyi.liu at gpc-biotech.com>

=head1 COPYRIGHT

The GI::Parser::EntrezGene module and its related modules and scripts are copyright
(c) 2005 Mingyi Liu, GPC Biotech AG and Altana Research Institute. All rights reserved.
I created these modules while working on a collaboration project between these two
companies. Special thanks therefore go to the two companies for allowing the release
of the code into the public domain.

You may use and distribute them under the terms of the GNU General Public
License (GPL, http://www.gnu.org/copyleft/gpl.html ).

=head1 CITATION

Mingyi Liu and Andrei Grigoriev (2005) "Fast Parsers for Entrez Gene" Bioinformatics. 
Submitted (status subject to change until publication journal/issue is finalized).

=head1 OPERATING SYSTEMS SUPPORTED

Any OS that Perl runs on.

=head1 CHANGE LOG

=over

=item *
version 1.05: added support to parse the NCBI 4/5/2005 download, which
              inexplicably added a useless space before ',' on all lines, broke
              some lines into two yet condensed others (brackets) to one line.
              This unfortunately slows down my parser because I have to use
              lookahead regexes to fix the parser for this weird new format.
              I also fixed a minor mistake in error reporting function

=item *
version 1.04: added attempt at opening large file (2 GB) on Perl that does
              not support it; added 'file' option to new(); added file
              name in error reporting message; updated documentation

=item *
version 1.03: added validating capability such that anything that does not
              conform to the current NCBI Entrez Gene ASN.1 format raises
              an error and stops the program. The position of the offending
              data item is reported.

=item *
version 1.02: added input_file function that accepts filename input, and
              next_seq function that returns the next record

=item *
version 1.01: unescaped double quote escapes in double quoted strings

=item *
version 1.0: released

=back
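
To illustrate the version 1.05 and 1.01 entries: a lookahead lets the value
pattern accept an optional space before the ',' without consuming the delimiter.
The following is a minimal sketch that uses made-up input lines and a simplified
pattern, not the module's full parser; first_string is a hypothetical helper
name.

```perl
use strict;
use warnings;

# Simplified pattern: a double-quoted string (with "" as an escaped quote)
# that must be followed by optional whitespace and then ',' or '}'.
# The lookahead (?=...) checks the delimiter without consuming it.
my $quoted = qr/\G[ \t]*"((?:[^"]|"")*)"(?=\s*[,}])/;

# Hypothetical helper: return the first quoted value and the unparsed rest.
sub first_string {
  my ($line) = @_;
  return unless $line =~ /$quoted/gc;
  my $value = $1;
  my $rest  = substr($line, pos($line));  # delimiter is still unparsed
  $value =~ s/""/"/g;                     # unescape doubled quotes (v1.01)
  return ($value, $rest);
}

# Old style (no space before ',') and the 4/5/2005 style (space before ',').
my @old = first_string(qq{"BRCA1",\n});
my @new = first_string(qq{"say ""hi""" ,\n});
print "$old[0] | $new[0]\n";
```

Because the delimiter is only inspected, the code that handles ',' and '}' sees
the same input in both styles.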

=head1 METHODS

=cut

package GI::Parser::EntrezGene;

use strict;
use Carp qw(carp croak);
use GI::Parser::Util;
use vars qw ($VERSION);

$VERSION = '1.05';

=head2 new

  Parameters: maxerrstr => 20 (optional) - maximum number of characters after
                offending element, used by error reporting, default is 20
              file => $filename (optional) - name of the file to be parsed. 
                call next_seq to parse!
  Example:    my $parser = GI::Parser::EntrezGene->new();
  Function:   Instantiate a parser object
  Returns:    Object reference
  Notes:

=cut

sub new
{
  my $class = shift;
  $class = ref($class) if(ref($class));
  my $self = { maxerrstr => 20, @_ };
  bless $self, $class;
  $self->input_file($self->{file}) if($self->{file});
  return $self;
}

=head2 maxerrstr

  Parameters: $maxerrstr (optional) - maximum number of characters after 
                offending element, used by error reporting, default is 20
  Example:    $parser->maxerrstr(20);
  Function:   get/set maxerrstr.
  Returns:    maxerrstr.
  Notes:

=cut

sub maxerrstr
{
  my ($self, $value) = @_;
  $self->{maxerrstr} = $value if defined $value && $value > 0; # plain getter calls pass no value
  return $self->{maxerrstr};
}

=head2 parse

  Parameters: $string that contains Entrez Gene record,
              $trimopt (optional) that specifies how the data structure
                returned should be trimmed. 2 is recommended
              $noreset (optional) that specifies that the line number should not
                be reset
  Example:    my $value = $parser->parse($text, 2);
  Function:   Takes in a string representing Entrez Gene record, parses
                the record and returns a data structure.
  Returns:    A data structure containing all data items from the Entrez
                Gene record.
  Notes:      DEPRECATED as external function!!! Do not call this function
                directly!
              $string should not contain 'Entrezgene ::=' at beginning!
              For details on how to use the $trimopt data trimming option
                please see comment for the GI::Parser::Util::compactds
                method. An option of 2 is recommended.
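
The returned structure nests hashes and arrays. When a key repeats inside one
block, its values are collected into an array reference. The following sketch
mirrors that rule outside the parser; add_value is a hypothetical helper, not
part of this module's API, and the keys and values are illustrative only. The
sketch tests with defined() so that a first value of 0 is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: store $value under $id, promoting the slot to an
# array reference once the same key is seen a second time.
sub add_value {
  my ($data, $id, $value) = @_;
  if (ref $data->{$id}) {
    push @{ $data->{$id} }, $value;              # third and later values
  }
  elsif (defined $data->{$id}) {
    $data->{$id} = [ $data->{$id}, $value ];     # second value: make a list
  }
  else {
    $data->{$id} = $value;                       # first value stays a scalar
  }
  return $data;
}

my %record;
add_value(\%record, 'geneid', '672');
add_value(\%record, 'syn', 'IRIS');
add_value(\%record, 'syn', 'PSCP');
add_value(\%record, 'syn', 'BRCAI');

print "geneid=$record{geneid} syn=@{ $record{syn} }\n";
```

Code that consumes the structure therefore has to check ref() before treating
a value as a scalar.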

=cut

sub parse
{
  my ($self, $input, $compact, $noreset) = @_;
  $input || croak "must have input!\n";
  $self->{input} = $input;
  $self->{filename} = "input string" unless $self->{filename};
  $self->{linenumber} = 1 unless $self->{linenumber} && $noreset;
  $self->{depth} = 0;
  my $result;
  eval
  {
    $result = $self->_parse(); # no need to reset $self->{depth} or linenumber
  };
  if($@)
  {
    if($@ !~ /^Data Error:/)
    {
      croak "non-conforming data broke parser on line $self->{linenumber} in $self->{filename}\n".
            "possible cause includes randomly inserted brackets in input file before line $self->{linenumber}\n".
            "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" .
            substr($self->{input}, pos($self->{input}), $self->{maxerrstr}) . "\nRaw error mesg: $@\n";
    }
    else { die $@ }
  }
  compactds($result, $compact) if($compact && defined $result);
  return $result;
}

=head2 input_file

  Parameters: $filename for file that contains Entrez Gene record(s)
  Example:    $parser->input_file($filename);
  Function:   Takes in name of a file containing Entrez Gene records.
              opens the file and stores file handle
  Returns:    none.
  Notes:      Attempts to open file larger than 2 GB even on Perl that
                does not support 2 GB file (accomplished by calling
                "cat" and piping output. On OS that does not have "cat"
                error message will be displayed)
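
The fallback described in the notes can be sketched as follows, assuming a
Unix-like system with "cat" on the path. open_maybe_large is a hypothetical
name, not part of this module. The sketch uses three-argument and list-form
open (the latter needs Perl 5.8 or later) so that the file name is never
interpreted by a shell.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: try a normal open first; if the OS reports that the
# file is too large for this perl, read it through "cat" instead.
sub open_maybe_large {
  my ($filename) = @_;
  my $fh;
  return $fh if open($fh, '<', $filename);
  my $err = "$!";
  if ($err =~ /too large/i) {
    # List-form pipe open: no shell involved, so odd file names are safe.
    return $fh if open($fh, '-|', 'cat', $filename);
    $err = "$!";
  }
  die "can't open $filename! -- $err\n";
}

# Usage with a small temporary file standing in for a real data file.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
print {$tmp} "Entrezgene ::= {\n";
close $tmp;

my $fh = open_maybe_large($tmpname);
my $first_line = <$fh>;
close $fh;
```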

=cut

sub input_file
{
  my ($self, $filename) = @_;
  # in case user's Perl system can't handle large file. Assuming Unix, otherwise raise error.
  # 3-arg open and list-form pipe open (Perl 5.8+) keep $filename away from the shell
  open($self->{fh}, '<', $filename) ||
  ($! =~ /too large/i && open($self->{fh}, '-|', 'cat', $filename)) ||
    croak "can't open $filename! -- $!\n";
  $self->{filename} = $filename;
}

=head2 next_seq

  Parameters: $trimopt (optional) that specifies how the data structure
                returned should be trimmed. 2 is recommended
  Example:    my $value = $parser->next_seq(2);
  Function:   Uses the file handle generated by input_file, parses the next
                record and returns a data structure.
  Returns:    A data structure containing all data items from the Entrez
                Gene record.
  Notes:      Must pass in a filename through new() or input_file() first!
              For details on how to use the $trimopt data trimming option
                please see comment for the GI::Parser::Util::compactds
                method. An option of 2 is recommended.

=cut

sub next_seq
{
  my ($self, $compact) = @_;
  $self->{fh} || croak "you must pass in a file name through new() or input_file() first before calling next_seq!\n";
  local $/ = "Entrezgene ::= {"; # set record separator
  # loop (rather than a bare if) so that "next" can legally skip empty chunks,
  # such as the one read before the first record header
  while(defined(my $record = readline($self->{fh})))
  {
    chomp $record;
    next unless $record =~ /\S/;
    my $tmp = ($record =~ /^\s*Entrezgene ::= (\{.*)/si)? $1 : "{" . $record; # get rid of the 'Entrezgene ::= ' at the beginning of Entrez Gene record
    return $self->parse($tmp, $compact, 1); # 1 specifies no resetting of line number
  }
  return; # no more records in the file
}

# NCBI's Apr 05, 2005 format change forced heavy use of lookahead, which is sure
# to slow the parser down.  But can't code efficiently without it.
sub _parse
{
  my ($self, $flag) = @_;
  my $data;
  while(1)
  {
    # changing orders of regex if/elsif statements made little difference. current order is close to optimal
    if($self->{input} =~ /\G[ \t]*,?[ \t]*\n/cg) # cleanup leftover
    {
      $self->{linenumber}++;
      next;
    }
    if($self->{input} =~ /\G[ \t]*}/cg)
    {
      if(!($self->{depth}--) && $self->{input} =~ /\S/)
      {
        croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n";
      }
      return $data
    }
    elsif($self->{input} =~ /\G[ \t]*\{/cg) # literal '{' escaped for newer Perl versions
    {
      $self->{depth}++;
      push(@$data, $self->_parse())
    }
    elsif($self->{input} =~ /\G[ \t]*([\w-]+)(\s*)/cg)
    {
      my ($id, $lines) = ($1, $2);
      # we're prepared for NCBI to make the format even worse:
      $self->{linenumber} += $lines =~ s/[\r\n]+//g;
      my $tmp;
      if(($self->{input} =~ /\G"((?:[^"]|"")*)"(?=\s*[,}])/cg && ++$tmp) ||
         $self->{input} =~ /\G([\w-]+)(?=\s*[,}])/cg)
      {
        my $value = $1;
        if($tmp) # slight speed optimization, not really necessary since regex is fast enough
        {
          $value =~ s/""/"/g;
          $self->{linenumber} += $value =~ s/[\r\n]+//g;
        }
        if(ref($data->{$id})) { push(@{$data->{$id}}, $value) } # hash value is not a terminal (or has multiple values): use an array so same-keyed values do not overwrite each other
        elsif(defined($data->{$id})) { $data->{$id} = [$data->{$id}, $value] } # hash value has a second terminal value now! (defined, so that a first value of 0 is not lost)
        else { $data->{$id} = $value } # the first terminal value
      }
      elsif($self->{input} =~ /\G\{/cg) # literal '{' escaped for newer Perl versions
      {
        $self->{depth}++;
        push(@{$data->{$id}}, $self->_parse());
      }
      elsif($self->{input} =~ /\G(?=[,}])/cg) { push(@$data, $id) }
      else # must be "id value value" format
      {
        $self->{depth}++;
        push(@{$data->{$id}}, $self->_parse(1))
      }
      if($flag)
      {
        if(!($self->{depth}--) && $self->{input} =~ /\S/)
        {
          croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n";
        }
        return $data;
      }
    }
    elsif($self->{input} =~ /\G[ \t]*"((?:[^"]|"")*)"(?=\s*[,}])/cg)
    {
      my $value = $1;
      $value =~ s/""/"/g;
      $self->{linenumber} += $value =~ s/[\r\n]+//g;
      push(@$data, $value)
    }
    else # end of input
    {
      my ($pos, $len) = (pos($self->{input}), length($self->{input}));
      if($pos != $len && $self->{input} =~ /\G\s*\S/cg) # problem with parsing, must be non-conforming data
      {
        croak "Data Error: non-conforming data found on line $self->{linenumber} in $self->{filename}!\n" .
              "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" .
              substr($self->{input}, $pos, $self->{maxerrstr}) . "\n";
      }
      elsif($self->{depth} > 0)
      {
        croak "Data Error: missing '}' found at end of input in $self->{filename}!";
      }
      elsif($self->{depth} < 0)
      {
        croak "Data Error: extra (mismatched) '}' found at end of input in $self->{filename}!";
      }
      return $data;
    }
  }
}

1;