[Bioperl-l] Homologene parser...
Andrew Macgregor
andrew@anatomy.otago.ac.nz
Wed, 20 Feb 2002 10:40:50 +1300
In case this is useful to anyone on the list, this is what I have
come up with to parse the homologene file hmlg.trip.ftp. Having
attended Damian Conway's one day tutorial at the O'Reilly
Bioinformatics conference I was pretty keen to try out
Parse::RecDescent, so it uses that - I wasn't disappointed.
I'm pretty sure this works, it seems to parse the entire file without
any problems. At the moment the script below simply prints out what
it parses. To be useful you need to replace the action parts of the
grammar with whatever you want to do with the data. I don't think the
"...end_of_record" is needed now - I was originally passing the
parser the entire file. The script only works on the triplet file but
it is pretty easy to adapt the grammar to work on the hmlg.ftp file
as well (i.e remove delimiter, title, change how text is feed to
parser).
I'm pretty sure this works fine, but I haven't had time to really
check it so use with care. I'm keen to get any feedback at all,
including perceived merits, demerits of using Parse::RecDescent for
this sort of thing.
I'm not sure I can see where this would fit into bioperl apart
perhaps from scripts central, but if someone does (Jason, Ewan?) and
wants to point me in the right direction I could work on something.
Cheers, Andrew.
#!/usr/bin/perl -W
# Homologene (hmlg.trip.ftp) parser
# Andrew Macgregor andrew.macgregor@stonebow.otago.ac.nz
# parser only tested against hmlg.trip.ftp
# provided "as is" without any warranty of any kind
use strict;
use Parse::RecDescent;
my $grammar = q {
record : delimiter ortholog(s) title(s) ...end_of_record
| <error>
delimiter : /^>/ {print ">\n"; }
ortholog : organism1 "|" organism2 "|" similarity_type "|"
locuslink_id_org1(?) "|"
unigene_id_org1(?) "|" accession_org1(?) "|"
locuslink_id_org2(?) "|"
unigene_id_org2(?) "|" accession_org2(?) "|"
percentage(?)
{
print "$item{organism1}|";
print "$item{organism2}|";
print "$item{similarity_type}|";
print "@{$item{locuslink_id_org1}}|";
print "@{$item{unigene_id_org1}}|";
print "@{$item{accession_org1}}|";
print "@{$item{locuslink_id_org2}}|";
print "@{$item{unigene_id_org2}}|";
print "@{$item{accession_org2}}|";
print "@{$item{percentage}}\n";
}
title : "TITLE" unigene "=" gene_symbol description(?)
{
print "TITLE
$item{unigene}=$item{gene_symbol}\t@{$item{description}}\n";
}
end_of_record : /\Z/
organism1 : organism
organism2 : organism
similarity_type : /t|f|b|B|c/
locuslink_id_org1 : locuslink_id
unigene_id_org1 : unigene_id
accession_org1 : accession
locuslink_id_org2 : locuslink_id
unigene_id_org2 : unigene_id
accession_org2 : accession
percentage : <skip: qr/[ \t]*/>/.+/
unigene : organism "." unigene_id { $return =
"$item{organism}.$item{unigene_id}" }
| "Dm." locuslink_id | locuslink_id
gene_symbol : /[\w-]+/
description : <skip: qr/[ \t]*/>/.+/
organism : /At|Bt|Dm|Dr|Hs|Hv|Mm|Os|Rn|Ta|Xl|Zm/
locuslink_id : /LL.[0-9]+/
unigene_id : /[0-9]+/
| locuslink_id
accession : /\w+/
};
my $parser = new Parse::RecDescent ($grammar);
open (HOMOLOGENE, "hmlg.trip.ftp") or die "Can't open hmlg.trip.ftp: $!";
# read from the homologene file building up a record then passing it
to the parser
my ($record, $complete);
while (my $text = <HOMOLOGENE>) {
if ($text =~ /^>/) {
$parser->record($record) if defined $complete;
$complete = 1;
$record = "";
$record .= $text;
}
else {
$record .= $text
}
}
$parser->record($record) if defined $complete; # takes care of
the last record