[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Hilmar Lapp hlapp at gmx.net
Tue Nov 23 01:29:27 EST 2004


On Monday, November 22, 2004, at 12:58  PM, Jason Stajich wrote:

>> Same goes for interpro:
>>
>> my $infeat = Bio::SeqIO->new('-file'   => '<match.xml',
>>                              '-format' => 'interpro');
>> while (my $feat = $infeat->next_seq) {
>>     # store features etc. in here
>> }
>>
>> After happily processing a lot of features it gives:
>> not well-formed (invalid token) at line 2, column 53, byte 131 at 
>> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm 
>> line 187

Can you locate the position that raises the error? I have seen errors
like this thrown on non-ASCII characters.
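
One way to look for such characters is to scan the file for bytes outside the ASCII range and print where they occur. This is only a minimal sketch: the subroutine name find_non_ascii is hypothetical, and the line and column in the parser's message may be relative to the fragment handed to the parser rather than to the whole file, so the numbers need not match.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line, column, byte value] for every
# non-ASCII byte in a file. Columns are 0-based byte offsets in the line.
sub find_non_ascii {
    my ($path) = @_;
    # Read raw bytes so no encoding layer hides or alters the input.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @hits;
    while (my $line = <$fh>) {
        while ($line =~ /([^\x00-\x7F])/g) {
            # pos() is the offset just past the match, so subtract one.
            push @hits, [ $., pos($line) - 1, ord($1) ];
        }
    }
    close($fh);
    return @hits;
}

# Report the hits for a file given on the command line, e.g. match.xml.
if (@ARGV) {
    for my $hit (find_non_ascii($ARGV[0])) {
        printf "line %d, column %d: byte 0x%02X\n", @$hit;
    }
}
```

Non-ASCII bytes are not necessarily errors (valid UTF-8 contains them), so the output is a list of candidates to inspect, not a list of bad entries.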

>>
>> I guess it's no wonder that such big DBs have errors or are out of 
>> sync with perl modules etc. and I don't mind losing one seq or 
>> feature here or there. The files are rather big so fixing them 
>> manually is a bit painful. But I need to somehow get most things 
>> processed, is there a way to skip these bad entries or would you have 
>> some other smart ideas?
>>

Because XML::Parser is built on top of expat, there is really no way to
recover from a well-formedness error in a way that would let you resume
parsing the document.
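
To illustrate, here is a small sketch of what wrapping the parse in eval gives you: the error can be caught and reported, but the parse itself does not continue past the offending byte. The subroutine name count_until_error and the sample document are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: parse a string, count the start tags seen, and
# return that count together with the error message (empty on success).
sub count_until_error {
    my ($xml) = @_;
    my $seen = 0;
    my $parser = XML::Parser->new(
        Handlers => { Start => sub { $seen++ } },
    );
    # parse() dies on the first well-formedness error; eval catches it.
    my $ok  = eval { $parser->parse($xml); 1 };
    my $err = $ok ? '' : $@;
    return ($seen, $err);
}

# Made-up document with a control character that XML 1.0 does not allow.
my $xml = "<root><a/><b/>\x01<c/></root>";
my ($seen, $err) = count_until_error($xml);
print "start tags seen before the parse stopped: $seen\n";
print "error: $err\n" if $err;
```

The elements after the bad byte are never reported to the handlers, which is why catching the exception alone does not amount to skipping a bad entry.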

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



