[Bioperl-l] bad entries in interpro

Tue Nov 23 19:30:21 EST 2004

Hi everyone,

A few days ago, Mikko Arvas sent an e-mail to this list asking how to
ignore bad entries in the matches.xml file from the InterPro database.
Hilmar Lapp replied, asking him to locate the position in the file that
raises the following error message:

>> not well-formed (invalid token) at line 2, column 53, byte 131 at 
>> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm 
>> line 187

I saw no answer on the list, so I'm sending the problematic entry
below:

<protein id="O00408" name="CN2A_HUMAN" length="941" 
 crc64="9797609B487FD64E">
    <interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide
    phosphodiesterase" type="Domain" parent_id="IPR003607">

The problem seems to be the "&apos;" entity in the name attribute on the
second line.
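
One way to check this in isolation might be to feed a closed-off copy of
the entry straight to XML::Parser inside an eval. This is only a sketch:
it assumes XML::Parser is installed, the closing tags are added by hand
so the fragment is complete, and parses_ok is a hypothetical helper name,
not part of BioPerl.

```perl
#!/usr/bin/perl -w
use strict;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml is well-formed, 0 otherwise.
sub parses_ok {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    # parse() dies on malformed input, so trap the error with eval.
    my $ok = eval { $parser->parse($xml); 1 };
    warn "parse failed: $@" unless $ok;
    return $ok ? 1 : 0;
}

# The entry from above, with closing tags added by hand (an assumption).
my $entry = <<'XML';
<protein id="O00408" name="CN2A_HUMAN" length="941"
 crc64="9797609B487FD64E">
    <interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide
    phosphodiesterase" type="Domain" parent_id="IPR003607">
    </interpro>
</protein>
XML

print parses_ok($entry) ? "entry parses\n" : "entry does not parse\n";
```

If the entry parses on its own, the entity by itself may not be the
culprit, and the way the record is extracted from the file before
parsing would be the next thing to look at.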

I also tested if an eval clause could be used to bypass such entries
without crashing a script. The example script below worked fine and
reported a problem with the entry above without crashing.

Would it be too difficult to make interpro.pm able to parse names like
the one above?

Robson

##################################################
#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file=>$ARGV[0],
     -format=>"interpro");

my $i = 1;
while (1) {
   my $seq;
   eval {
     $seq = $in->next_seq;
   };
   # Check for a parse error first: $seq is also undefined when next_seq
   # dies, so testing it first would end the loop without any report.
   # This assumes the parser moves past the bad entry on the next call.
   if ($@) {
     print STDERR "Problem parsing sequence $i: $@\n";
     $i++;
     next;
   }
   last if (!defined $seq);
   print STDERR $seq->id,"\n";
   print "<=== ",$seq->id,"===>\n";
   foreach my $f ($seq->get_all_SeqFeatures) {
     print $f->gff_string,"\n";
     foreach my $key ($f->annotation->get_all_annotation_keys) {
       foreach my $value ($f->annotation->get_Annotations($key)) {
         print $key,":",$value->as_text,"\n";
       }
     }
   }
   $i++;
}

exit 0;