[Bioperl-l] Homologene parser...

Wed, 20 Feb 2002 10:40:50 +1300

In case this is useful to anyone on the list, this is what I have 
come up with to parse the homologene file hmlg.trip.ftp. Having 
attended Damian Conway's one day tutorial at the O'Reilly 
Bioinformatics conference I was pretty keen to try out 
Parse::RecDescent, so it uses that - I wasn't disappointed.

I'm pretty sure this works, it seems to parse the entire file without 
any problems. At the moment the script below simply prints out what 
it parses. To be useful you need to replace the action parts of the 
grammar with whatever you want to do with the data. I don't think the 
"...end_of_record" is needed now - I was originally passing the 
parser the entire file. The script only works on the triplet file but 
it is pretty easy to adapt the grammar to work on the hmlg.ftp file 
as well (i.e remove delimiter, title, change how text is feed to 
parser).

I'm pretty sure this works fine, but I haven't had time to really 
check it so use with care. I'm keen to get any feedback at all, 
including perceived merits, demerits of using Parse::RecDescent for 
this sort of thing.

I'm not sure I can see where this would fit into bioperl apart 
perhaps from scripts central, but if someone does (Jason, Ewan?) and 
wants to point me in the right direction I could work on something.

Cheers, Andrew.

#!/usr/bin/perl -W

#   Homologene (hmlg.trip.ftp) parser
#   Andrew Macgregor andrew.macgregor@stonebow.otago.ac.nz
#   parser only tested against hmlg.trip.ftp
#   provided "as is" without any warranty of any kind

use strict;
use Parse::RecDescent;

my $grammar = q {

     record              :   delimiter ortholog(s) title(s) ...end_of_record
                             | <error>

     delimiter           :   /^>/ {print ">\n"; }

     ortholog            :   organism1 "|" organism2 "|" similarity_type "|"
                             locuslink_id_org1(?) "|" 
unigene_id_org1(?) "|" accession_org1(?) "|"
                             locuslink_id_org2(?) "|" 
unigene_id_org2(?) "|" accession_org2(?) "|"
                             percentage(?)

                         {
                             print "$item{organism1}|";
                             print "$item{organism2}|";
                             print "$item{similarity_type}|";
                             print "@{$item{locuslink_id_org1}}|";
                             print "@{$item{unigene_id_org1}}|";
                             print "@{$item{accession_org1}}|";
                             print "@{$item{locuslink_id_org2}}|";
                             print "@{$item{unigene_id_org2}}|";
                             print "@{$item{accession_org2}}|";
                             print "@{$item{percentage}}\n";
                         }

     title               :   "TITLE" unigene "=" gene_symbol description(?)
                             {
                                 print "TITLE 
$item{unigene}=$item{gene_symbol}\t@{$item{description}}\n";
                             }

     end_of_record       :   /\Z/

     organism1           :   organism

     organism2           :   organism

     similarity_type     :   /t|f|b|B|c/

     locuslink_id_org1   :   locuslink_id

     unigene_id_org1     :   unigene_id

     accession_org1      :   accession

     locuslink_id_org2   :   locuslink_id

     unigene_id_org2     :   unigene_id

     accession_org2      :   accession

     percentage          :   <skip: qr/[ \t]*/>/.+/

     unigene             :   organism "." unigene_id { $return = 
"$item{organism}.$item{unigene_id}" }
                             | "Dm." locuslink_id | locuslink_id

     gene_symbol         :   /[\w-]+/

     description         :   <skip: qr/[ \t]*/>/.+/

     organism            :   /At|Bt|Dm|Dr|Hs|Hv|Mm|Os|Rn|Ta|Xl|Zm/

     locuslink_id        :   /LL.[0-9]+/

     unigene_id          :   /[0-9]+/
                             | locuslink_id

     accession           :   /\w+/

};

my $parser = new Parse::RecDescent ($grammar);
open (HOMOLOGENE, "hmlg.trip.ftp") or die "Can't open hmlg.trip.ftp: $!";

# read from the homologene file building up a record then passing it 
to the parser
my ($record, $complete);

while (my $text = <HOMOLOGENE>) {

     if ($text =~ /^>/) {
         $parser->record($record) if defined $complete;
         $complete = 1;
         $record = "";
         $record .= $text;
     }
     else {
         $record .= $text
     }
}
$parser->record($record) if defined $complete;      # takes care of 
the last record