[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro
Mikko Arvas
Mikko.Arvas at vtt.fi
Fri Nov 26 08:32:46 EST 2004
Thanks a lot! It's fine now.
Mikko
At 15:58 22.11.2004 -0500, Jason Stajich wrote:
>On Nov 18, 2004, at 6:53 AM, Mikko Arvas wrote:
>
>>Hi,
>>
>>I want to get all available Interpro matches for S. cerevisiae and some
>>other species. So I need to parse Uniprot files to find a set of IDs for
>>a given species and then get the Interpro matches from them. But the
>>Uniprot release uniprot_trembl.dat gives an error towards the end of the
>>file in next_seq call:
>>
>>use Bio::SeqIO;
>>my $inseq = Bio::SeqIO->new('-file'   => '<uniprot_trembl.dat',
>>                            '-format' => 'swiss');
>>while (my $seq = $inseq->next_seq) {
>>    # check species etc. in here
>>}
>>
>>After happily processing a lot of sequences it gives:
>>Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato severe
>>leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
>>
>>Same goes for interpro:
>>
>>my $infeat = Bio::SeqIO->new('-file'   => '<match.xml',
>>                             '-format' => 'interpro');
>>while (my $feat = $infeat->next_seq) {
>>    # store features etc. in here
>>}
>>
>>After happily processing a lot of features it gives:
>>not well-formed (invalid token) at line 2, column 53, byte 131 at
>>/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
>>
>>I guess it's no wonder that such big DBs have errors or are out of sync
>>with the Perl modules etc., and I don't mind losing one seq or feature here
>>or there. The files are rather big, so fixing them manually is a bit painful.
>>But I need to somehow get most things processed. Is there a way to skip
>>these bad entries, or would you have some other smart ideas?
>
>I think this has to do with some unsafe code in the swiss.pm module, which
>compares the species name against a list of Unknown species name values
>and is trying to interpret the 96-1 as a range in a regexp. Putting a \Q
>in front of the variable where this is being compared should be enough to
>fix it. This is the grep on line 986.
>
>- return if grep { /^$binomial$/ } @Unknown_names;
>+ return if grep { /^\Q$binomial$/ } @Unknown_names;
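
To illustrate why the unescaped name breaks the match, here is a minimal standalone sketch. It does not use BioPerl: the `@unknown_names` values and the `is_unknown` helper are made up for the example, and only the species name is taken from the error message above. The sketch ends the quoted part with an explicit `\E` before the `$` anchor.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the parser's data; not taken from swiss.pm.
my @unknown_names = ('Unknown', 'None', 'unidentified');
my $binomial      = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Unescaped: "[Guatemala 96-1]" is read as a character class, and "6-1" is
# an invalid range, so compiling the pattern dies. eval traps the error.
my $unsafe_ok    = eval { my @hits = grep { /^$binomial$/ } @unknown_names; 1 };
my $unsafe_error = $unsafe_ok ? '' : $@;

# Escaped: \Q...\E makes every character of the name literal.
sub is_unknown {
    my ($name, @list) = @_;
    return scalar grep { /^\Q$name\E$/ } @list;
}

print "unsafe pattern failed: ", ($unsafe_ok ? "no" : "yes"), "\n";
print "virus name in list: ",
    (is_unknown($binomial, @unknown_names) ? "yes" : "no"), "\n";
print "'None' in list: ",
    (is_unknown('None', @unknown_names) ? "yes" : "no"), "\n";
```

With the name quoted, the comparison behaves like a plain string equality test against each list entry, so `$_ eq $name` would be an alternative that avoids the regex altogether.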
>
>There was one more place in the code that did this as well which I think I
>have fixed.
>
>I'm checking this in to CVS, so do a cvs update and see if your problem
>persists. I've tested it against the uniprot_trembl.dat.
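
On the original question of skipping bad entries: one general Perl approach is to read a flat file one entry at a time and wrap the parsing of each entry in `eval`, so that a single failure is logged and skipped. The sketch below is standalone and only an illustration under stated assumptions: the toy data, `parse_record` and the `//` record terminator are hypothetical stand-ins, not BioPerl's API, and `parse_record` reproduces the unescaped-pattern failure on purpose.

```perl
use strict;
use warnings;

# Toy data in a Swiss-Prot-like layout: each entry ends with a "//" line.
my $data = join '',
    "ID   GOOD1\nOS   Saccharomyces cerevisiae\n//\n",
    "ID   BAD1\nOS   Tomato severe leaf curl virus-[Guatemala 96-1]\n//\n",
    "ID   GOOD2\nOS   Homo sapiens\n//\n";

# Hypothetical parser; it dies on entries it cannot handle.
sub parse_record {
    my ($record) = @_;
    my ($id) = $record =~ /^ID\s+(\S+)/m or die "no ID line\n";
    my ($os) = $record =~ /^OS\s+(.+)$/m or die "no OS line\n";
    # Deliberately unescaped, to reproduce the failure from this thread.
    my $is_unknown = 'Unknown' =~ /^$os$/;
    return { id => $id, species => $os, unknown => $is_unknown ? 1 : 0 };
}

open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my (@parsed, @skipped);
{
    local $/ = "\n//\n";    # one whole entry per read
    while (my $record = <$fh>) {
        my $entry = eval { parse_record($record) };
        if (defined $entry) {
            push @parsed, $entry;
        }
        else {
            my $error = $@;
            my ($id) = $record =~ /^ID\s+(\S+)/m;
            push @skipped, defined $id ? $id : 'unknown entry';
            warn "skipping $skipped[-1]: $error";
        }
    }
}
close $fh;

print "parsed: ", join(', ', map { $_->{id} } @parsed), "\n";
print "skipped: ", join(', ', @skipped), "\n";
```

Whether wrapping `$inseq->next_seq` in `eval` works the same way depends on the parser leaving its stream at the start of the next entry after an error, which is an assumption to verify for the format in question.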
>
>Not sure what the problem is with the interpro parser; someone else will
>need to look into that.
>
>>
>>I have BioPerl 1.4 and the latest Bio::SeqIO (for swiss.pm to work
>>correctly) from CVS on SuSE 8.1.
>>
>>Thanks a million for any help!
>>Cheers,
>>mikko
>>Mikko Arvas
>>VTT Biotechnology
>>
>>e-mail: mikko.arvas at vtt.fi
>>tel: +358-(0)9-456 5827
>>mobile: +358-(0)44-381 0502
>>fax: +358-(0)9-455 2103
>>mail: Tietotie 2, Espoo
>> P.O. Box 1500
>> FIN-02044 VTT, Finland
>>
>--
>Jason Stajich
>jason.stajich at duke.edu
>http://www.duke.edu/~jes12/
>
Mikko Arvas
VTT Biotechnology
e-mail: mikko.arvas at vtt.fi
tel: +358-(0)9-456 5827
mobile: +358-(0)44-381 0502
fax: +358-(0)9-455 2103
mail: Tietotie 2, Espoo
P.O. Box 1500
FIN-02044 VTT, Finland