[Bioperl-l] Re: bad entries in interpro again
dhoworth at mrc-lmb.cam.ac.uk
Thu Dec 2 09:04:46 EST 2004
Mikko Arvas wrote:
> Sorry about that I should have tested it before mailing. The problem is
> not non-ascii characters it seems to be specifically the combination of
> two & inside individual <>. I tried various combinations and other
> non-ascii characters (even in abundance) don't break it and a single &
> does neither.
> Here is again the problematic line:
> <interpro id="IPR002073" name="3'5'-cyclic nucleotide
> phosphodiesterase" type="Domain" parent_id="IPR003607">
> And its error:
> not well-formed (invalid token) at line 2, column 54, byte 132 at
> line 187
> So which way to proceed?
I think some extra details would make it easier to see what is going on.
Which file are you scanning? A new version of InterPro has been released
since your original post, so I suggest giving a URL on the InterPro FTP
site so that everybody can be sure of looking at the same file. I have just
run the Sun XML validator on
ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz (after unpacking
it) and it validates as correct XML.
Which version of XML::Parser are you using? I have just parsed that file
with no errors, using XML::Parser 2.34 on SuSE 9.1 and this test script:
my $pl = new XML::Parser();
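For anyone who wants to repeat the check, a minimal sketch of a stand-alone well-formedness test might look like the following. It is not the exact script used above: the check_file helper name and the ErrorContext setting are choices made for illustration, and it assumes XML::Parser is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Parse one file and return the parser's error message,
# or an empty string if the file is well-formed.
# (check_file is an illustrative name, not part of XML::Parser.)
sub check_file {
    my ($path) = @_;
    # ErrorContext => 2 asks the parser to include two lines of
    # surrounding input in its error message.
    my $parser = XML::Parser->new(ErrorContext => 2);
    # parsefile() dies on the first well-formedness error.
    eval { $parser->parsefile($path); 1 } or return $@ || 'unknown error';
    return '';
}

# Check every file named on the command line.
for my $path (@ARGV) {
    my $err = check_file($path);
    print "$path: ", ($err eq '' ? "well-formed\n" : "FAILED\n$err\n");
}
```

Run it on the unpacked file, e.g. `perl check.pl match.xml` (the script name is arbitrary).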
So on the surface, the problem seems to lie with neither the InterPro
data nor the XML parser.
The file contains many lines identical to the one cited, all of which are
valid XML in accordance with the InterPro DTD, but none of them is on
line 2. The error is reported at line 2, byte 132, so it looks as though
different data has been passed to XML::Parser.
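One way to confirm what data the parser is really receiving is to print the input next to the error message. The sketch below assumes the data is available as a string at the point where parsing fails; diagnose is a hypothetical helper, not part of XML::Parser or BioPerl.

```perl
use strict;
use warnings;
use XML::Parser;

# Try to parse $xml. On success return an empty string. On failure
# return the parser's error followed by the first few input lines,
# numbered, so that "line 2" in the message can be located.
# (diagnose is an illustrative name, not an existing API.)
sub diagnose {
    my ($xml, $max_lines) = @_;
    $max_lines ||= 5;
    my $parser = XML::Parser->new();
    # parse() accepts a string and dies on the first error.
    return '' if eval { $parser->parse($xml); 1 };
    my $err = $@;
    my @lines = split /\n/, $xml;
    # Keep only the first $max_lines lines of the input.
    $#lines = $max_lines - 1 if @lines > $max_lines;
    my $n = 0;
    my $dump = join '', map { sprintf "%3d: %s\n", ++$n, $_ } @lines;
    return "Parser error: $err\nInput as seen by the parser:\n$dump";
}
```

Printing the result of diagnose($xml) just before the failing parse call should show what the parser is counting as line 2.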
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH