[Bioperl-l] Interpro parsing problems - and solutions

Holland, Richard Richard.Holland at agresearch.co.nz
Tue Jan 6 21:17:12 EST 2004


InterProParser.t has the following line:

my $ipp = Bio::OntologyIO->new( -format => 'interpro',
                                -file =>
 
$io->catfile('t','data','interpro_short.xml'),
                                -ontology_engine => 'simple' );

As you can see it is specifying 'simple' for the ontology_engine - even
if it did not, we should get a simple engine anyway, but until
yesterday's fix we got an exception instead if this option was not
specified. However I have left the test untouched as I'm not sure
whether we should be testing default behaviour or not.

I have tested all my changes from yesterday with this script, they
passed, and I have committed them to the bioperl-live repository.

cheers,
Richard

---
Richard Holland
Bioinformatics Database Developer
ITS, Agresearch Invermay x3279



-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: Wednesday, 7 January 2004 5:56 a.m.
To: Holland, Richard
Subject: Re: [Bioperl-l] Interpro parsing problems - and solutions


Thanks for the fixes Richard. It is tremendously helpful if people 
submit patches.

Do you have a cvs account? Would you be interested in applying those 
fixes (and possibly more :) yourself?

As for the engine default, I thought the test would run without the 
type set. Maybe I'm mistaken on this one. Have you already tested all 
your changes against the InterProParser.t test script?

	-hilmar

On Monday, January 5, 2004, at 03:58  PM, Holland, Richard wrote:

> Hi all. A long one this but I hope it's worth the read.
>
> I found a possible bug in Bio/OntologyIO/InterProParser.pm. The 
> default behaviour for OntologyEngineI implementors is supposed to be a

> 'simple' OntologyEngine. However, the code in InterProParser does not 
> default to anything if no engine is specified - it tries to set up an 
> 'undef' engine, which fails with a not-yet-implemented message. The 
> other parsers based on dagflat.pm do exhibit the default behaviour, so

> to make InterProParser.pm do the same I added the following line at 
> line 130 (just before "if(lc($eng_type) eq 'simple')"):
>
> 	$eng_type = 'simple' unless $eng_type;
> 	
> Another possible bug in the same module - InterProParser.pm assumes
> that
> it has been passed a filename not a filehandle. As OntologyIO allows 
> for
> both cases, I have adapted InterProParser.pm to suit:
>
> 	sub parse{
> 	   my $self = shift;
>
> 	   my $ret;
> 	   if ($self->file()) {
> 	        $ret = $self->{_parser}->parse( Source => {
> 	                SystemId => $self->file() } );
> 	   } elsif ($self->_fh()) {
> 	        $ret = $self->{_parser}->parse( Source => {
> 	                ByteStream => $self->_fh() } );
> 	   } else {
> 	        $ret = undef;
> 	        $self->throw("You have not provided any InterPro XML to 
> parse.\n");
> 	   }
> 	   $self->_is_parsed(1);
> 	   return $ret;
> 	}
>
> Also, in Bio/OntologyIO/Handlers/InterProHandler.pm (note different 
> module!), a definite bug this time, it does not know about two of the 
> interpro types - Active_site and Binding_site. Adding lines to 
> start_element as follows seems to work:
>
> 	sub start_element {
> 	  my ($self, $element) = @_;
> 	  my $ont = $self->ontology();
> 	  my $fact = $self->term_factory();
>
> 	  if ($element->{Name} eq 'interprodb') {
> 	    # TWO NEW LINES FOLLOW
> 	    $ont->add_term($fact->create_object(-identifier =>
"Active_site",
> 	                                        -name => "Active Site")
> );
> 	    $ont->add_term($fact->create_object(-identifier =>
> "Binding_site",
> 	                                        -name => "Binding Site")
> );
> 	    # CONTINUE WITH ORIGINAL CODE
> 	    $ont->add_term($fact->create_object(-identifier => "Family",
> 	                                        -name => "Family") );
> 	    $ont->add_term($fact->create_object(-identifier => "Domain",
> 	                                        -name => "Domain") );
> 	    $ont->add_term($fact->create_object(-identifier => "Repeat",
> 	                                        -name => "Repeat") );
> 	    $ont->add_term($fact->create_object(-identifier => "PTM",
> 	                                 -name => "post-translational
> modification"));
>
> There is a problem with the relation loader in 
> Bio/Ontology/SimpleOntologyEngine.pm as well - get_term_by_identifier 
> returns an array of @terms, but the get_relationships subroutine calls

> it twice and both times expects it to only return a single term. Other

> calls to get_term_by_identifier in this module are correct. I have 
> adapted the two incorrect references in get_relationships to loop 
> through all terms returned by get_term_by_identifier instead, as
> follows:
>
>         # if a term is supplied, add a relationship for the parent to 
> the term
>         # except if the parent is the term itself (we added that one
> before)
>         if($term && ($parent_id ne $term->identifier())) {
>             my @parent_terms = 
> $self->get_term_by_identifier($parent_id);
>             foreach my $parent_term (@parent_terms) {
>                push(@rels,
>                     $relfact->create_object(-object_term    =>
> $parent_term,
>                                             -subject_term   => $term,
>                                             -predicate_term =>
>
> $parent_entry->{$term->identifier},
>                                             -ontology       =>
> $term->ontology()
>                                             )
>                     );
>             }
>
>         } else {
>             # otherwise, i.e., no term supplied, or the parent equals 
> the
>             # supplied term
>             my @parent_terms = $term ?
>                 ($term) : $self->get_term_by_identifier($parent_id);
>             foreach my $child_id (keys %$parent_entry) {
>                 my $rel_info = $parent_entry->{$child_id};
>
>                 foreach my $parent_term (@parent_terms) {
>                    push(@rels,
>                         $relfact->create_object(-object_term    =>
> $parent_term,
>                                                 -subject_term   =>
>
> $self->get_term_by_identifier(
>
> $child_id),
>                                                 -predicate_term => 
> $rel_info,
>                                                 -ontology 
> =>$parent_term->ontology
>                                                 )
>                         );
>                 }
>             }
>         }
>
> There is also a bug in Bio/OntologyIO/Handlers/InterProHandler.pm 
> which does not set the current ontology of new temporary terms when 
> used in relationships. This causes "cannot call method 'name' on an 
> undefined value" errors later on when get_relationships tries to find 
> out the name of the ontology of each relationship. It also does not 
> set the 'name' values for these terms, so that you get 'cannot insert 
> null into SG_TERM.NAME' errors when the terms are later committed to 
> the database,
> because the name column of sg_term has a 'not null' constraint, at 
> least
> in the Oracle implementation of BioSQL. These problems can both be 
> fixed
> by altering the call to the create_object function, in sub
> _create_relationship, to include the -ontology and -name parameters:
>
>     $term_temp = $ont->engine->add_term( $fact->create_object( 
> -InterPro_id => $ref_id,
> 							          -name
> => $ref_id,
> 	
> -ontology => $ont ) );
>
> And in sub start_element in the same module the same fix needs 
> applying (no need for -ontology here as the existing code following 
> this statement does this for us explicitly using a term->ontology() 
> call):
>
>     $self->_term(
>                  (!defined $term_temp)
>                  ? $ont->add_term( $fact->create_object(-InterPro_id 
> => $id,
>
> -name => $id ) )
>                  : $term_temp
>                 );
>
> Note that this call defaults the name to be the same as the InterPro
> ID.
> Most temporary terms will be overridden with permanent data later in 
> the
> process, but for those that slip through without permanent definitions
> (possibly a bug in the InterPro XML datafile?) the above change is
> necessary to allow them to be inserted into the database and their
> relations to other terms kept valid.
>
> Lastly, the definition field of the term table is too short to take,
> for
> example, IPR001967 - I have extended it to the Oracle varchar2 maximum
> of 4000 chars (up from 2000) in my BioSQL instance, but found it still
> to be too short, so I added a quick fix to Bio/Ontology/Term.pm to
> truncate definitions to be no more than this length, but would like to
> know if this is a more general problem:
>
> 	sub definition {
> 	    my $self = shift;
>
> 	    return $self->{'definition'} = substr(shift,0,4000) if @_;
> 	    return $self->{'definition'};
> 	} # definition
>
> Finally, Bio/OntologyIO/Handlers/InterProHandler.pm fails to set the 
> ontology for the base IS_A, CONTAINS and FOUND_IN relationship types. 
> This causes a null foreign key error on SG_TERM.ONT_OID when it tries 
> to insert them into the database. After checking
> Bio/Ontology/SimpleGOEngine.pm and finding that it does set the 
> ontology
> names for these relationships, I modified InterProHandler to do the 
> same
> by adding the following three lines in the constructor (new) method
> directly after the instantiation of the three relationship types:
>
>   $is_a_rel->ontology($self->ontology());
>   $contains_rel->ontology($self->ontology());
>   $found_in_rel->ontology($self->ontology());
>
> With all these changes in place, 
> "bioperl-db/scripts/biosql/load_ontology.pl --format interpro 
> interpro.xml" works correctly for the current release of interpro.xml.

> It _does_ give many "unique constraint violated" errors on 
> TERM_RELATIONSHIP while inserting the relationships into BioSQL, but I

> suspect this is because the get_relationships function in 
> InterProHandler.pm does not check to see if a relation has already 
> been pushed before pushing it onto the array of relations it returns. 
> To fix this would slow down the code considerably so it's probably 
> best to just live with the messages.
>
> Are these acceptable modifications or is there a better way to fix
> these
> problems? I admit my Perl may not be the best but hopefully you get
the
> idea of how my fixes work and may be able to code them in a better
way.
>
> If any of these changes are acceptable could someone implement them 
> for me? I do not have access to CVS.
>
> Finally, if you think that I have been spending too much time trying 
> to fix these bugs when in fact there is an alternative and fully 
> functional way to load InterPro XML data into BioPerl/BioSQL, please 
> feel free to slap me around a bit with a wet haddock. This is my first

> ever contribution to an open source project so I could do with some 
> advice on
> how to do things better next time.
>
> cheers,
> Richard
>
> PS. Totally unrelated but also a database field length problem - the 
> description field in bioentry is too short to hold SwissProt P05067 - 
> in my case I have lengthened the field to 1024 chars which does the 
> trick.
>
> PPS. Certain annotations in RefSeqs cause a similar overflow in 
> seqfeature_qualifier_value.value. I'm guessing that without using 
> CLOBs for all cases where free text is allowed, we'll never see the 
> end of these kinds of problems in the Oracle version of BioSQL?
>
> ---
> Richard Holland
> Bioinformatics Database Developer
> ITS, Agresearch Invermay x3279
>
>
> ======================================================================
> =
> Attention: The information contained in this message and/or
attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or
privileged
> material. Any review, retransmission, dissemination or other use of,
or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by
AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
>
=======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================



More information about the Bioperl-l mailing list