[Bioperl-l] Re: bad interpro entries
Dave Howorth
dhoworth at mrc-lmb.cam.ac.uk
Wed Dec 8 07:32:17 EST 2004
Mikko Arvas wrote:
> thank you so much, everybody, for your help! But still no progress.
> I have SuSE 8.1, bioperl 1.4, XML::Parser 2.34 and the latest
> match.xml from:
> ftp://ftp.ebi.ac.uk/pub/databases/interpro
> match.xml.gz 2004-11-29
>
> Like Dave suggested just parsing with XML::Parser works fine with:
> But if I do this:
> my $infeat = Bio::SeqIO->new('-file'   => "<$opt_i",
>                              '-format' => 'interpro');
> while (my $feat = $infeat->next_seq) {
>     print $feat->accession_number() . "\n";
> }
>
> I still get:
> not well-formed (invalid token) at line 2, column 53, byte 131 at
> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm
> line 187
>
> from protein id o00408.
> PS. Here is the whole entry again, just in case:
Well, I tested this entry with the validator and with that little test
script, and it appears to be good data. How did you obtain it? Was it the
way Hilmar suggested?
> There is no other editing of the chunks going on though except for a
> haphazard substitution of certain double-quotes. In order to see the
> chunk before it gets sent to the parser instance edit
> Bio/SeqIO/interpro.pm and before the line
>
> $self->parse_xml($xml_fragment);
>
> put a print statement that prints out the content of $xml_fragment.
> That should also give the exact source XML that trips up the parser.
If you printed it another way, I'd suggest trying what Hilmar suggested
next. If you did print it that way, call in the wizards!
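
Once the chunk has been dumped, the byte offset reported in the error (byte
131 above) can be used to look at what surrounds the offending token. The
following is only a minimal sketch: context_at_byte is a hypothetical helper,
not part of bioperl or XML::Parser, the sample fragment is made up, and it
assumes the reported offset is relative to the chunk handed to the parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes surrounding a given offset,
# with anything outside printable ASCII shown as a \xNN escape so that
# stray control characters or odd quotes become visible.
sub context_at_byte {
    my ($data, $offset, $radius) = @_;
    $radius = 20 unless defined $radius;
    my $start = $offset - $radius;
    $start = 0 if $start < 0;
    my $snippet = substr($data, $start, 2 * $radius);
    $snippet =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord($1))/ge;
    return $snippet;
}

# Made-up fragment for illustration only; NOT the real o00408 entry.
my $fragment = qq{<protein id="X00000">\n<match id="a\x01b"/>\n</protein>\n};

# Show ten bytes either side of offset 34 in the sample fragment.
print context_at_byte($fragment, 34, 10), "\n";
```

With the real data, $fragment would be the printed content of $xml_fragment
and the offset would be the one from the error message.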
Cheers, Dave
--
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960