[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Thu Nov 18 06:53:47 EST 2004

Hi,

I want to get all available InterPro matches for S. cerevisiae and some 
other species. So I need to parse UniProt files to find the set of IDs for a 
given species and then get the InterPro matches for them. But the UniProt 
release file uniprot_trembl.dat gives an error towards the end of the file 
in the next_seq call:

use Bio::SeqIO;
my $inseq = Bio::SeqIO->new('-file' => '<uniprot_trembl.dat',
                                       '-format' => 'swiss');
while (my $seq = $inseq->next_seq) { # check species etc. in here
}

After happily processing a lot of sequences it gives:
Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato severe 
leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
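
The message suggests that the organism name ends up inside a pattern without 
being escaped, so "[Guatemala 96-1]" is read as a character class. The 
following is only a minimal plain-Perl sketch of that effect and of the usual 
\Q...\E remedy; it is not the actual swiss.pm code, and the variable names are 
made up for illustration:

```perl
use strict;
use warnings;

# Organism name taken from the error message above.
my $name = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Interpolated as-is, "[Guatemala 96-1]" is compiled as a character class,
# and "6-1" is an invalid descending range, so compiling the pattern dies.
my $raw       = eval { qr/^$name$/ };
my $raw_error = $@;

# \Q...\E (quotemeta) escapes the metacharacters, so the name matches literally.
my $escaped = qr/^\Q$name\E$/;

print "unescaped pattern failed: ", ($raw_error ? "yes" : "no"), "\n";
print "escaped pattern matches:  ", ($name =~ $escaped ? "yes" : "no"), "\n";
```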

Same goes for interpro:

my $infeat = Bio::SeqIO->new('-file' => '<match.xml',
                                             '-format' => 'interpro' );
while (my $feat = $infeat->next_seq) { # store features etc. in here
}

After happily processing a lot of features it gives:
not well-formed (invalid token) at line 2, column 53, byte 131 at 
/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187

I guess it's no wonder that such big DBs have errors or are out of sync with 
the Perl modules, and I don't mind losing a sequence or feature here and 
there. The files are rather big, so fixing them manually is a bit painful. 
But I need to get most of the entries processed somehow. Is there a way to 
skip these bad entries, or would you have some other smart ideas?
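
To make "skipping bad entries" concrete for the flat-file case, here is a 
rough sketch of one possible approach, not something BioPerl provides as 
such: read the file one "//"-terminated record at a time and parse each 
record inside eval, so a failure costs only that record. The helper name 
parse_records_skipping_errors and the stand-in parser are hypothetical, and 
the toy input is not real UniProt data.

```perl
use strict;
use warnings;

# Hypothetical helper: read "//"-terminated flat-file records from $fh and
# call $parse on each one inside eval, so one bad record is skipped
# instead of aborting the whole run.
sub parse_records_skipping_errors {
    my ($fh, $parse) = @_;
    my (@parsed, @failed);
    my $count = 0;
    local $/ = "\n//\n";    # Swiss-Prot style entries end with a "//" line
    while (my $record = <$fh>) {
        $count++;
        my $result = eval { $parse->($record) };
        if ($@) {
            push @failed, $count;    # remember which record was skipped
            next;
        }
        push @parsed, $result;
    }
    return (\@parsed, \@failed);
}

# Toy input and a stand-in parser that dies on the middle record.
my $data = "ID   A\n//\nID   BAD\n//\nID   C\n//\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my ($ok, $bad) = parse_records_skipping_errors($fh, sub {
    my ($rec) = @_;
    die "cannot parse record\n" if $rec =~ /BAD/;
    my ($id) = $rec =~ /^ID\s+(\S+)/;
    return $id;
});
close $fh;

print "parsed: @$ok\n";
print "skipped record numbers: @$bad\n";
```

In real use the callback could, for example, hand each record to Bio::SeqIO 
through an in-memory filehandle (the -fh argument); that part is untested 
here. This does not carry over to match.xml, where a well-formedness error 
stops the XML parser for the whole document.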

I have BioPerl 1.4 and the latest Bio::SeqIO (for swiss.pm to work correctly) 
from CVS, on SuSE 8.1.

Thanks a million for any help!
Cheers,
mikko
Mikko Arvas
VTT Biotechnology

e-mail:            mikko.arvas at vtt.fi
tel:                 +358-(0)9-456 5827
mobile:           +358-(0)44-381 0502
fax:                +358-(0)9-455 2103
mail:               Tietotie 2, Espoo
                       P.O. Box 1500
                       FIN-02044 VTT, Finland