[Bioperl-l] (no subject)
Mikko Arvas
Mikko.Arvas at vtt.fi
Wed Dec 8 07:01:40 EST 2004
Hi,
Thank you all so much for your help! But still no progress.
I have SuSE 8.1, BioPerl 1.4, XML::Parser 2.34 and the latest
match.xml from:
ftp://ftp.ebi.ac.uk/pub/databases/interpro
match.xml.gz 2004-11-29
As Dave suggested, plain parsing with XML::Parser works fine:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
# With no handlers set, parsefile only checks well-formedness and dies on error.
my $pl = XML::Parser->new;
$pl->parsefile('match.xml');
But if I do this:
use Bio::SeqIO;
my $infeat = Bio::SeqIO->new('-file' => "<$opt_i",
'-format' => 'interpro' );
while (my $feat = $infeat->next_seq) {
    print $feat->accession_number() . "\n";
}
I still get:
not well-formed (invalid token) at line 2, column 53, byte 131 at
/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
from protein id o00408.
And I can still make the problem go away by removing the second & from the line
<interpro id="IPR002073" name="3'5'-cyclic nucleotide
phosphodiesterase" type="Domain" parent_id="IPR003607">
I can see no difference in the quoting of this entry between the new and old
versions of match.xml.
There are about 2286 lines in match.xml with two & characters, and if I simply run:
tr "&" "_" < match.xml > match_user_friendly.xml
I can parse match_user_friendly.xml until the script above happily fills
all the available memory and crashes (but that is another story).
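
The count and the substitution described above can be tried on a small file first. What follows is only a sketch under stated assumptions: sample.xml and its contents are made up for illustration and are not taken from match.xml, and the output file name is hypothetical too.

```shell
# Hypothetical sample standing in for match.xml (contents are made up).
cat > sample.xml <<'EOF'
<a name="x &amp; y &amp; z"/>
<b name="one &amp; two"/>
<c name="plain"/>
EOF

# Count the lines that contain at least two '&' characters.
two_amp=$(grep -c '&.*&' sample.xml)
echo "lines with two or more &: $two_amp"

# The same substitution as the tr command above.
tr '&' '_' < sample.xml > sample_user_friendly.xml

# Confirm that no '&' is left in the output
# (grep -c exits non-zero when it counts 0 matches, hence the || true).
remaining=$(grep -c '&' sample_user_friendly.xml || true)
echo "lines still containing &: $remaining"
```

Note that tr rewrites every &, including the ones that start a valid entity, so the names in the output file no longer match the originals.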
So is this only my system, or does somebody else have the same problem too?
If it is only mine, I'll just be lazy and use tr; enough time has been spent already.
Cheers,
mikko
PS. Here is the whole entry, just in case:
<protein id="O00408" name="CN2A_HUMAN" length="941" crc64="9797609B487FD64E">
<interpro id="IPR002073" name="3'5'-cyclic nucleotide
phosphodiesterase" type="Domain" parent_id="IPR003607">
<match id="PF00233" name="PDEase_I" dbname="PFAM">
<location start="655" end="892" status="T" evidence="HMMPfam" score="0.0" />
</match>
<match id="PR00387" name="PDIESTERASE1" dbname="PRINTS">
<location start="651" end="664" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
<location start="682" end="695" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
<location start="696" end="711" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
<location start="724" end="740" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
<location start="804" end="817" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
<location start="821" end="837" status="T" evidence="FPrintScan"
score="7.399999999999999E-30" />
</match>
<match id="PS00126" name="PDEASE_I" dbname="PROSITE">
<location start="696" end="707" status="T" evidence="AddProsite"
score="8.0E-5" />
</match>
<match id="SSF48547" name="PDEase" dbname="SSF">
<location start="573" end="898" status="T" evidence="HMMPfam"
score="4.38E-43" />
</match>
</interpro>
<interpro id="IPR003018" name="GAF" type="Domain">
<match id="PF01590" name="GAF" dbname="PFAM">
<location start="241" end="377" status="T" evidence="HMMPfam"
score="5.7E-10" />
<location start="409" end="548" status="T" evidence="HMMPfam"
score="1.3E-25" />
</match>
<match id="PS50813" name="GAF" dbname="PREFILE">
<location start="396" end="550" status="T" evidence="PrfScan"
score="11.073" />
</match>
<match id="SM00065" name="GAF" dbname="SMART">
<location start="241" end="387" status="T" evidence="Smart" score="7.3E-18" />
<location start="409" end="558" status="T" evidence="Smart" score="6.1E-38" />
</match>
</interpro>
<interpro id="IPR003607" name="Metal-dependent phosphohydrolase, HD region"
type="Domain">
<match id="SM00471" name="HDc" dbname="SMART">
<location start="653" end="822" status="T" evidence="Smart" score="1.0E-6" />
</match>
</interpro>
</protein>
Mikko Arvas
VTT Biotechnology
e-mail: mikko.arvas at vtt.fi
tel: +358-(0)9-456 5827
mobile: +358-(0)44-381 0502
fax: +358-(0)9-455 2103
mail: Tietotie 2, Espoo
P.O. Box 1500
FIN-02044 VTT, Finland