[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro
Mikko Arvas
Mikko.Arvas at vtt.fi
Thu Nov 18 06:53:47 EST 2004
Hi,
I want to get all available InterPro matches for S. cerevisiae and some
other species. To do that I need to parse the UniProt files to find the set
of IDs for a given species and then get the InterPro matches for those IDs.
But the UniProt release file uniprot_trembl.dat triggers an error towards
the end of the file in the next_seq call:
use Bio::SeqIO;
my $inseq = Bio::SeqIO->new('-file' => '<uniprot_trembl.dat',
                            '-format' => 'swiss');
while (my $seq = $inseq->next_seq) {
    # check species etc. in here
}
After happily processing a lot of sequences it gives:
Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato severe
leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
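My guess at the cause, as a minimal sketch: the message looks like what Perl
prints when a string containing "[" and "]" is interpolated into a pattern
without escaping. I have not checked where swiss.pm does this, so the
variable and the patterns below are only my assumption, not the module's
actual code.

```perl
use strict;
use warnings;

# Species name taken from the error message above.
my $name = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Interpolated as-is, "[Guatemala 96-1]" is read as a character class and
# "6-1" as a descending range, so compiling the pattern dies at run time.
my $raw_ok = eval { my $hit = ($name =~ /^$name$/); 1 };
my $raw_error = $@;

# With \Q...\E the metacharacters are escaped and the name matches itself.
my $quoted_ok = eval { ($name =~ /^\Q$name\E$/) ? 1 : 0 };

print "unescaped pattern compiled: ", ($raw_ok    ? "yes" : "no"), "\n";
print "escaped pattern matched:    ", ($quoted_ok ? "yes" : "no"), "\n";
```

If that is really what happens, the entry itself is fine and the fix belongs
in the parsing code rather than in the data file.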
The same goes for InterPro:
my $infeat = Bio::SeqIO->new('-file' => '<match.xml',
                             '-format' => 'interpro');
while (my $feat = $infeat->next_seq) {
    # store features etc. in here
}
After happily processing a lot of features it gives:
not well-formed (invalid token) at line 2, column 53, byte 131 at
/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
I guess it's no wonder that such big databases contain errors or are out of
sync with the Perl modules, and I don't mind losing a sequence or feature
here and there. The files are rather big, so fixing them manually is a bit
painful. But I need to get most of the entries processed somehow. Is there a
way to skip these bad entries, or would you have some other smart ideas?
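To make the question concrete, here is a minimal sketch of the kind of
skipping I have in mind for the flat file: read one "//"-terminated entry at
a time and parse each entry inside eval. parse_records and its callback
arguments are hypothetical names of mine, and the Bio::SeqIO callback in the
comment is an untested assumption about how it could be plugged in.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Swiss-Prot style flat file into entries on the
# "//" terminator line and run $parse on each entry inside eval, so that an
# exception raised by one entry does not end the whole run.
sub parse_records {
    my ($fh, $parse, $on_error) = @_;
    my ($ok, $bad) = (0, 0);
    local $/ = "\n//\n";                  # read one entry per <$fh>
    while (my $record = <$fh>) {
        next unless $record =~ /\S/;      # ignore a trailing blank chunk
        if (eval { $parse->($record); 1 }) {
            $ok++;
        }
        else {
            $bad++;
            $on_error->($record, $@) if $on_error;
        }
    }
    return ($ok, $bad);
}

# Assumed BioPerl callback (not run here): parse one entry from an in-memory
# filehandle, then do the species check on the resulting object.
# my $parse = sub {
#     my ($record) = @_;
#     open my $rfh, '<', \$record or die "cannot open in-memory handle: $!";
#     my $seq = Bio::SeqIO->new(-fh => $rfh, -format => 'swiss')->next_seq;
#     # check species etc. in here
# };
```

Creating a new Bio::SeqIO object per entry is probably slow on a file this
size, and this would not help with match.xml, since an XML parser stops at
the first point where the document is not well-formed.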
I have BioPerl 1.4 and the latest Bio::SeqIO from CVS (for swiss.pm to work
correctly) on SuSE 8.1.
Thanks a million for any help!
Cheers,
mikko
Mikko Arvas
VTT Biotechnology
e-mail: mikko.arvas at vtt.fi
tel: +358-(0)9-456 5827
mobile: +358-(0)44-381 0502
fax: +358-(0)9-455 2103
mail: Tietotie 2, Espoo
P.O. Box 1500
FIN-02044 VTT, Finland