[Bioperl-l] Entrezgene parser: new parser available from
sourceforge (attached too) (fwd)
Stefan A Kirov
skirov at utk.edu
Tue Apr 5 15:06:47 EDT 2005
As often happens, NCBI introduced some small, but interesting changes to
their ASN entrezgene format. Therefore Mingyi had to change the underlying
low-level parser. Anyone who uses his parser directly will have to update.
This will also delay the release of the first version of the bioperl
entrezgene parser, which I anticipated to be on Thursday. I still hope I
will commit the code on Friday.
Stefan
---------- Forwarded message ----------
Date: Tue, 05 Apr 2005 13:35:57 -0400
From: Mingyi Liu <mingyi.liu at gpc-biotech.com>
To: Stefan A Kirov <skirov at utk.edu>
Subject: new parser available from sourceforge (attached too)
Hi, Stefan,
I attached the new version to this email. Unfortunately as expected,
the new version is much slower due to the use of lookahead regexes
(needed to accommodate the meaningless/buggy changes NCBI introduced).
About 30% slower in my test. Still can't find a good reason why they
introduced those 3 different types of changes. Can't be one bug.
Anyways. Thanks for letting me know so early!
Mingyi
-------------- next part --------------
=head1 NAME
GI::Parser::EntrezGene - Regular expression-based Perl Parser for NCBI Entrez Gene.
=head1 SYNOPSIS
use GI::Parser::EntrezGene;
my $parser = GI::Parser::EntrezGene->new();
open(IN, "Homo_sapiens") || die "can't open the Entrez Gene human genome ASN.1 file! -- $!\n";
$/ = "Entrezgene ::= {";
while(<IN>)
{
chomp;
next unless /\S/;
# parse the entry
my $text = (/^\s*Entrezgene ::= ({.*)/si)? $1 : "{" . $_;
my $value = $parser->parse($text, 2); # $value contains data structure for the
# record being parsed. 2 indicates the recommended
# trimming mode of the data structure
}
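The record-splitting idiom used above can be tried without the module or a real download. The following is only a minimal sketch: the two sample records are made up and far smaller than real Entrez Gene records, and an in-memory filehandle stands in for the data file.

```perl
use strict;
use warnings;

# Two tiny made-up records, only to show the splitting; real Entrez Gene
# records are far larger.
my $sample = <<'END';
Entrezgene ::= {
  track-info {
    geneid 1
  }
}
Entrezgene ::= {
  track-info {
    geneid 2
  }
}
END

my @records;
{
    local $/ = "Entrezgene ::= {";    # same record separator as in the synopsis
    open(my $in, '<', \$sample) or die "can't open in-memory sample -- $!\n";
    while (my $chunk = <$in>) {
        chomp $chunk;                 # removes the trailing separator, if present
        next unless $chunk =~ /\S/;   # the chunk before the first record is empty
        push @records, "{" . $chunk;  # restore the brace consumed by the separator
    }
    close $in;
}

printf "%d records\n", scalar @records;
```

Each element of @records then has the shape that parse() expects: it starts at the opening brace and no longer carries the 'Entrezgene ::=' prefix.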
=head1 PREREQUISITE
GI::Parser::EntrezGene requires a utility module GI::Parser::Util, which can be
downloaded at the same location as this module
( http://sourceforge.net/projects/egparser/ ).
=head1 INSTALLATION
Put EntrezGene.pm, Util.pm into your perl module directory (for example, if your
Perl modules are located in /usr/lib/perl5/site_perl/5.6.1, then you should put
EntrezGene.pm, Util.pm into /usr/lib/perl5/site_perl/5.6.1/GI/Parser directory).
=head1 DESCRIPTION
GI::Parser::EntrezGene is a regular expression-based Perl Parser for NCBI Entrez
Gene genome databases ( http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene ). It
parses an ASN.1-formatted Entrez Gene record and returns a data structure that
contains all data items from the gene record.
As of March 7th, 2005, the parser version 1.0 was tested on Entrez Gene human,
mouse and rat genome annotation files (which took around 660, 520, 195 seconds
respectively to parse on one 2.4 GHz Intel Xeon processor). Note that the addition
of validation and error reporting in 1.03 slows the parser down: it now needs around
12 minutes (instead of the previous 11 minutes) to parse the human genome. V1.03 can
process the "All_Data" file that contains all Entrez Gene genomes in 98 minutes.
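As a rough illustration of the returned data structure, the sketch below shows how a key that occurs more than once can be stored: the first value as a plain scalar, later values collected into an array reference. The helper name add_value and the sample keys are made up for this example, the sketch tests with defined() so that a first value of 0 is not lost, and it does not model any trimming done later by GI::Parser::Util::compactds.

```perl
use strict;
use warnings;

# add_value is a hypothetical helper name; the logic follows a
# scalar-then-array convention for keys that occur more than once.
sub add_value {
    my ($data, $id, $value) = @_;
    if (ref $data->{$id}) {                 # already an array: append
        push @{ $data->{$id} }, $value;
    }
    elsif (defined $data->{$id}) {          # second value: promote scalar to array
        $data->{$id} = [ $data->{$id}, $value ];
    }
    else {                                  # first value: store as plain scalar
        $data->{$id} = $value;
    }
    return $data;
}

my %record;
add_value(\%record, 'geneid', 1);
add_value(\%record, 'syn', 'A');
add_value(\%record, 'syn', 'B');
add_value(\%record, 'syn', 'C');

# Callers therefore need to check ref() before using a value.
for my $key (sort keys %record) {
    my $v = $record{$key};
    print "$key: ", (ref $v ? join(', ', @$v) : $v), "\n";
}
```

Code that walks such a structure should test ref() on each value instead of assuming a scalar or an array.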
=head1 SEE ALSO
The parse_entrez_gene_example.pl is a very important and complete demo on using
this module to extract all data items from Entrez Gene records. Do check it out
in the package (included since an update to V1.04 release)! In fact, this script
took me about 3-4 times more time to make for my project than the parser itself.
Note that the included example script was edited to leave out project-specific
stuff.
For details on the various parsers I generated for Entrez Gene, and example scripts
that use/benchmark the modules, please see
http://sourceforge.net/projects/egparser/ or refer to our paper on them
(see CITATION section).
Note that GI::Parser::EntrezGene is the fastest module in the bunch.
GI::Parser::EntrezGenePRD is the slowest (by far! when parsing long records).
=head1 AUTHOR
Dr. Mingyi Liu <mingyi.liu at gpc-biotech.com>
=head1 COPYRIGHT
The GI::Parser::EntrezGene module and its related modules and scripts are copyright
(c) 2005 Mingyi Liu, GPC Biotech AG and Altana Research Institute. All rights reserved.
I created these modules when working on a collaboration project between these two
companies. Therefore, special thanks to the two companies for allowing the release of
the code to the public.
You may use and distribute them under the terms of the GNU General Public
License (GPL, http://www.gnu.org/copyleft/gpl.html ).
=head1 CITATION
Mingyi Liu and Andrei Grigoriev (2005) "Fast Parsers for Entrez Gene" Bioinformatics.
Submitted (status subject to change until publication journal/issue is finalized).
=head1 OPERATING SYSTEMS SUPPORTED
Any OS that Perl runs on.
=head1 CHANGE LOG
=over
=item *
version 1.05: added support to parse the NCBI 4/5/2005 download, which
inexplicably added a useless space before ',' on all lines, broke
some lines into two yet condensed others (brackets) to one line.
This unfortunately slows down my parser because I have to use
lookahead regexes to fix the parser for this weird new format.
I also fixed a minor mistake in the error reporting function.
=item *
version 1.04: added attempt at opening large file (2 GB) on Perl that does
not support it; added 'file' option to new(); added file
name in error reporting message; updated documentation
=item *
version 1.03: added validating capability such that anything that does not
conform to the current NCBI Entrez Gene ASN.1 format raises
an error and stops the program. The position of the offending
data item is reported.
=item *
version 1.02: added input_file function that accepts filename input, and
next_seq function that returns the next record
=item *
version 1.01: unescaped double quote escapes in double quoted strings
=item *
version 1.0: released
=back
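The lookahead approach mentioned for version 1.05 can be sketched in isolation. The pattern below mirrors the quoted-value regex used in this module, wrapped in a hypothetical helper named quoted_value; the sample strings are made up. The lookahead requires a ',' or '}' after the closing quote, with or without whitespace in between, but does not consume it, so both the old and the new layout match. The sketch also shows the doubled-quote unescaping added in 1.01.

```perl
use strict;
use warnings;

# Match a double-quoted value at the current position, but only if it is
# followed (after optional whitespace) by ',' or '}'. The lookahead checks the
# delimiter without consuming it. Returns the unescaped value, or undef in
# scalar context when nothing matches.
sub quoted_value {
    my ($input_ref) = @_;
    if ($$input_ref =~ /\G\s*"((?:[^"]|"")*)"(?=\s*[,}])/cg) {
        (my $value = $1) =~ s/""/"/g;   # doubled quotes stand for a literal quote
        return $value;
    }
    return;
}

my $old_style = q|"breast cancer 1",|;
my $new_style = q|"breast cancer 1" ,|;           # space before ',' as in the 4/5/2005 files
my $escaped   = q|"the ""early onset"" form" }|;  # doubled quotes inside the value

my @values = map { my $s = $_; quoted_value(\$s) } ($old_style, $new_style, $escaped);
print "$_\n" for @values;
```

Because the delimiter is left in place, the code that runs next still sees the ',' or '}' and can handle it there.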
=head1 METHODS
=cut
package GI::Parser::EntrezGene;
use strict;
use Carp qw(carp croak);
use GI::Parser::Util;
use vars qw ($VERSION);
$VERSION = '1.05';
=head2 new
Parameters: maxerrstr => 20 (optional) - maximum number of characters after
offending element, used by error reporting, default is 20
file => $filename (optional) - name of the file to be parsed.
call next_seq to parse!
Example: my $parser = GI::Parser::EntrezGene->new();
Function: Instantiate a parser object
Returns: Object reference
Notes:
=cut
sub new
{
my $class = shift;
$class = ref($class) if(ref($class));
my $self = { maxerrstr => 20, @_ };
bless $self, $class;
$self->input_file($self->{file}) if($self->{file});
return $self;
}
=head2 maxerrstr
Parameters: $maxerrstr (optional) - maximum number of characters after
offending element, used by error reporting, default is 20
Example: $parser->maxerrstr(20);
Function: get/set maxerrstr.
Returns: maxerrstr.
Notes:
=cut
sub maxerrstr
{
my ($self, $value) = @_;
$self->{maxerrstr} = $value if defined $value && $value > 0; # also avoids a warning when called as a getter
return $self->{maxerrstr};
}
=head2 parse
Parameters: $string that contains Entrez Gene record,
$trimopt (optional) that specifies how the data structure
returned should be trimmed. 2 is recommended
$noreset (optional) that specifies that the line number should not
be reset
Example: my $value = $parser->parse($text, 2);
Function: Takes in a string representing Entrez Gene record, parses
the record and returns a data structure.
Returns: A data structure containing all data items from the Entrez
Gene record.
Notes: DEPRECATED as external function!!! Do not call this function
directly!
$string should not contain 'Entrezgene ::=' at beginning!
For details on how to use the $trimopt data trimming option
please see comment for the GI::Parser::Util::compactds
method. An option of 2 is recommended.
=cut
sub parse
{
my ($self, $input, $compact, $noreset) = @_;
$input || croak "must have input!\n";
$self->{input} = $input;
$self->{filename} = "input string" unless $self->{filename};
$self->{linenumber} = 1 unless $self->{linenumber} && $noreset;
$self->{depth} = 0;
my $result;
eval
{
$result = $self->_parse(); # no need to reset $self->{depth} or linenumber
};
if($@)
{
if($@ !~ /^Data Error:/)
{
croak "non-conforming data broke parser on line $self->{linenumber} in $self->{filename}\n".
"possible cause includes randomly inserted brackets in input file before line $self->{linenumber}\n".
"first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" .
substr($self->{input}, pos($self->{input}) || 0, $self->{maxerrstr}) . "\nRaw error message: $@\n";
}
else { die $@ }
}
compactds($result, $compact) if($compact && defined $result);
return $result;
}
=head2 input_file
Parameters: $filename for file that contains Entrez Gene record(s)
Example: $parser->input_file($filename);
Function: Takes in name of a file containing Entrez Gene records.
opens the file and stores file handle
Returns: none.
Notes: Attempts to open files larger than 2 GB even on a Perl that
does not support such large files (accomplished by calling
"cat" and piping its output; on an OS that does not have "cat",
an error message will be displayed)
=cut
sub input_file
{
my ($self, $filename) = @_;
# in case user's Perl system can't handle large file. Assuming Unix, otherwise raise error
open($self->{fh}, $filename) ||
($! =~ /too large/i && open($self->{fh}, "cat $filename |")) ||
croak "can't open $filename! -- $!\n";
$self->{filename} = $filename;
}
=head2 next_seq
Parameters: $trimopt (optional) that specifies how the data structure
returned should be trimmed. 2 is recommended
Example: my $value = $parser->next_seq(2);
Function: Uses the file handle generated by input_file, parses the next
record and returns a data structure.
Returns: A data structure containing all data items from the Entrez
Gene record.
Notes: Must pass in a filename through new() or input_file() first!
For details on how to use the $trimopt data trimming option
please see comment for the GI::Parser::Util::compactds
method. An option of 2 is recommended.
=cut
sub next_seq
{
my ($self, $compact) = @_;
local $/ = "Entrezgene ::= {"; # set record separator
$self->{fh} || croak "you must pass in a file name through new() or input_file() first before calling next_seq!\n";
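while(defined(my $record = readline $self->{fh})) # loop, so that 'next' can legally skip empty chunks (e.g. before the first record)
{
chomp $record;
next unless $record =~ /\S/;
my $tmp = ($record =~ /^\s*Entrezgene ::= ({.*)/si)? $1 : "{" . $record; # get rid of the 'Entrezgene ::= ' at the beginning of Entrez Gene record
return $self->parse($tmp, $compact, 1); # 1 specifies no resetting of the line number
}
return; # no more records in the file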
}
# NCBI's Apr 05, 2005 format change forced heavy use of lookahead, which certainly
# slows the parser down. But it can't be coded efficiently without it.
sub _parse
{
my ($self, $flag) = @_;
my $data;
while(1)
{
# changing orders of regex if/elsif statements made little difference. current order is close to optimal
if($self->{input} =~ /\G[ \t]*,?[ \t]*\n/cg) # cleanup leftover
{
$self->{linenumber}++;
next;
}
if($self->{input} =~ /\G[ \t]*}/cg)
{
if(!($self->{depth}--) && $self->{input} =~ /\S/)
{
croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n";
}
return $data
}
elsif($self->{input} =~ /\G[ \t]*{/cg)
{
$self->{depth}++;
push(@$data, $self->_parse())
}
elsif($self->{input} =~ /\G[ \t]*([\w-]+)(\s*)/cg)
{
my ($id, $lines) = ($1, $2);
# we're prepared for NCBI to make the format even worse:
$self->{linenumber} += $lines =~ s/[\r\n]+//g;
my $tmp;
if(($self->{input} =~ /\G"((?:[^"]|"")*)"(?=\s*[,}])/cg && ++$tmp) ||
$self->{input} =~ /\G([\w-]+)(?=\s*[,}])/cg)
{
my $value = $1;
if($tmp) # slight speed optimization, not really necessary since regex is fast enough
{
$value =~ s/""/"/g;
$self->{linenumber} += $value =~ s/[\r\n]+//g;
}
if(ref($data->{$id})) { push(@{$data->{$id}}, $value) } # hash value is not a terminal (or have multiple values), create array to avoid multiple same-keyed hash overwrite each other
elsif(defined $data->{$id}) { $data->{$id} = [$data->{$id}, $value] } # hash value has a second terminal value now! (defined() so that a first value of 0 is not overwritten)
else { $data->{$id} = $value } # the first terminal value
}
elsif($self->{input} =~ /\G{/cg)
{
$self->{depth}++;
push(@{$data->{$id}}, $self->_parse());
}
elsif($self->{input} =~ /\G(?=[,}])/cg) { push(@$data, $id) }
else # must be "id value value" format
{
$self->{depth}++;
push(@{$data->{$id}}, $self->_parse(1))
}
if($flag)
{
if(!($self->{depth}--) && $self->{input} =~ /\S/)
{
croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n";
}
return $data;
}
}
elsif($self->{input} =~ /\G[ \t]*"((?:[^"]|"")*)"(?=\s*[,}])/cg)
{
my $value = $1;
$value =~ s/""/"/g;
$self->{linenumber} += $value =~ s/[\r\n]+//g;
push(@$data, $value)
}
else # end of input
{
my ($pos, $len) = (pos($self->{input}), length($self->{input}));
if($pos != $len && $self->{input} =~ /\G\s*\S/cg) # problem with parsing, must be non-conforming data
{
croak "Data Error: non-conforming data found on line $self->{linenumber} in $self->{filename}!\n" .
"first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" .
substr($self->{input}, $pos, $self->{maxerrstr}) . "\n";
}
elsif($self->{depth} > 0)
{
croak "Data Error: missing '}' found at end of input in $self->{filename}!";
}
elsif($self->{depth} < 0)
{
croak "Data Error: extra (mismatched) '}' found at end of input in $self->{filename}!";
}
return $data;
}
}
}
1;