[Bioperl-l] Homologene parser...

Wed, 20 Feb 2002 08:15:26 +0000

Andrew Macgregor wrote:
> 
> In case this is useful to anyone on the list, this is what I have
> come up with to parse the homologene file hmlg.trip.ftp. Having
> attended Damian Conway's one day tutorial at the O'Reilly
> Bioinformatics conference I was pretty keen to try out
> Parse::RecDescent, so it uses that - I wasn't disappointed.
>
>
> I'm pretty sure this works, it seems to parse the entire file without
> any problems. At the moment the script below simply prints out what
> it parses. To be useful you need to replace the action parts of the
> grammar with whatever you want to do with the data. I don't think the
> "...end_of_record" is needed now - I was originally passing the
> parser the entire file. The script only works on the triplet file but
> it is pretty easy to adapt the grammar to work on the hmlg.ftp file
> as well (i.e. remove delimiter, title, change how text is fed to
> parser).
> 
> I'm pretty sure this works fine, but I haven't had time to really
> check it so use with care. I'm keen to get any feedback at all,
> including perceived merits, demerits of using Parse::RecDescent for
> this sort of thing.

Now this is the way to parse text databases! ... says someone with a few
years' experience in parsing files using the Icarus language in SRS. Recursive
parsing is the cleanest, most robust way of parsing semi-structured
biological databases. For some time now I've been wanting to start using
Parse::RecDescent, but have not had time.

I am not sure, but there should be a homologene parser written in Icarus
somewhere on the EBI SRS server. I cannot find it, though...

Generally, if you want to parse a biological database flat file with
Parse::RecDescent, have a look at the Icarus parsers. They are visible on
every database's info page.

> I'm not sure I can see where this would fit into bioperl apart
> perhaps from scripts central, but if someone does (Jason, Ewan?) and
> wants to point me in the right direction I could work on something.

Have a look at the Bio::SeqIO/* parsers. They have a next_seq() method. The
code below would go in there (grammar into the BEGIN block?) and instead of
printing values out, the code should create relevant objects (in the case of
homologene, you'd have to write them first).
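
A minimal sketch of that idea, assuming Parse::RecDescent is installed: the
action sets $return to a hash reference instead of printing. The grammar is
cut down to the first three ortholog fields, and the hash keys (org1, org2,
type) and the sample line are made up for illustration. Real BioPerl objects
for homologene would still have to be written.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

# Reduced, illustrative grammar: only the first three fields of an
# ortholog line. The action returns a hash reference rather than printing.
my $grammar = q{
    ortholog        : organism "|" organism "|" similarity_type
                      { $return = { org1 => $item[1],
                                    org2 => $item[3],
                                    type => $item[5] } }

    organism        : /At|Bt|Dm|Dr|Hs|Hv|Mm|Os|Rn|Ta|Xl|Zm/

    similarity_type : /t|f|b|B|c/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Could not build parser from grammar";

# Hypothetical sample input, not taken from hmlg.trip.ftp.
my $ortholog = $parser->ortholog("Hs|Mm|B");
die "Parse failed" unless defined $ortholog;

print "$ortholog->{org1} vs $ortholog->{org2} ($ortholog->{type})\n";
```

The same pattern should extend to the full ortholog rule: each optional
field matched with (?) arrives in the action as an array reference, so it
can be stored as-is or unpacked before building the object.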

I am not quite sure how the flow of the code fits into that model, but a
revision of the model is being considered right now, so this came at the
right time.
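
One way the flow might fit, as a rough sketch only: wrap the record-building
loop in an iterator that hands back one ">"-delimited record per call,
similar in spirit to calling a next-style method repeatedly. The name
record_iterator and the sample lines are made up for illustration; only core
Perl is used, and each returned record could then be passed to
$parser->record().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one ">"-delimited
# record per call, or undef once the input is exhausted.
sub record_iterator {
    my ($fh) = @_;
    my $pending;    # delimiter line of the next record, if already read
    return sub {
        my $record = $pending;
        undef $pending;
        while (my $line = <$fh>) {
            if ($line =~ /^>/) {
                if (defined $record) {
                    # Start of the following record: stash it and stop.
                    $pending = $line;
                    return $record;
                }
                $record = $line;      # start of the first record
            }
            elsif (defined $record) {
                $record .= $line;     # lines before the first ">" are ignored
            }
        }
        return $record;               # last record, or undef at end of input
    };
}

# Demo on an in-memory file; the sample content is invented.
my $sample = ">\nHs|Mm|B\nTITLE Hs.1=ABC\n>\nHs|Rn|B\n";
open(my $fh, '<', \$sample) or die "Can't open in-memory sample: $!";

my $next = record_iterator($fh);
my @records;
while (defined(my $rec = $next->())) {
    push @records, $rec;
}
close($fh);

print scalar(@records), " records read\n";
```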

	-Heikki

> Cheers, Andrew.
> 
> #!/usr/bin/perl -W
> 
> #   Homologene (hmlg.trip.ftp) parser
> #   Andrew Macgregor andrew.macgregor@stonebow.otago.ac.nz
> #   parser only tested against hmlg.trip.ftp
> #   provided "as is" without any warranty of any kind
> 
> use strict;
> use Parse::RecDescent;
> 
> my $grammar = q {
> 
>      record              :   delimiter ortholog(s) title(s) ...end_of_record
>                              | <error>
> 
>      delimiter           :   /^>/ {print ">\n"; }
> 
>      ortholog            :   organism1 "|" organism2 "|" similarity_type "|"
>                              locuslink_id_org1(?) "|" unigene_id_org1(?) "|" accession_org1(?) "|"
>                              locuslink_id_org2(?) "|" unigene_id_org2(?) "|" accession_org2(?) "|"
>                              percentage(?)
> 
>                          {
>                              print "$item{organism1}|";
>                              print "$item{organism2}|";
>                              print "$item{similarity_type}|";
>                              print "@{$item{locuslink_id_org1}}|";
>                              print "@{$item{unigene_id_org1}}|";
>                              print "@{$item{accession_org1}}|";
>                              print "@{$item{locuslink_id_org2}}|";
>                              print "@{$item{unigene_id_org2}}|";
>                              print "@{$item{accession_org2}}|";
>                              print "@{$item{percentage}}\n";
>                          }
> 
>      title               :   "TITLE" unigene "=" gene_symbol description(?)
>                              {
>                                  print "TITLE $item{unigene}=$item{gene_symbol}\t@{$item{description}}\n";
>                              }
> 
>      end_of_record       :   /\Z/
> 
>      organism1           :   organism
> 
>      organism2           :   organism
> 
>      similarity_type     :   /t|f|b|B|c/
> 
>      locuslink_id_org1   :   locuslink_id
> 
>      unigene_id_org1     :   unigene_id
> 
>      accession_org1      :   accession
> 
>      locuslink_id_org2   :   locuslink_id
> 
>      unigene_id_org2     :   unigene_id
> 
>      accession_org2      :   accession
> 
>      percentage          :   <skip: qr/[ \t]*/>/.+/
> 
>      unigene             :   organism "." unigene_id
>                              { $return = "$item{organism}.$item{unigene_id}" }
>                              | "Dm." locuslink_id | locuslink_id
> 
>      gene_symbol         :   /[\w-]+/
> 
>      description         :   <skip: qr/[ \t]*/>/.+/
> 
>      organism            :   /At|Bt|Dm|Dr|Hs|Hv|Mm|Os|Rn|Ta|Xl|Zm/
> 
>      locuslink_id        :   /LL.[0-9]+/
> 
>      unigene_id          :   /[0-9]+/
>                              | locuslink_id
> 
>      accession           :   /\w+/
> 
> };
> 
> my $parser = new Parse::RecDescent ($grammar);
> open (HOMOLOGENE, "hmlg.trip.ftp") or die "Can't open hmlg.trip.ftp: $!";
> 
> # read from the homologene file, building up a record and then
> # passing it to the parser
> my ($record, $complete);
> 
> while (my $text = <HOMOLOGENE>) {
> 
>      if ($text =~ /^>/) {
>          $parser->record($record) if defined $complete;
>          $complete = 1;
>          $record = "";
>          $record .= $text;
>      }
>      else {
>          $record .= $text
>      }
> }
> # takes care of the last record
> $parser->record($record) if defined $complete;
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________