[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Mikko Arvas Mikko.Arvas at vtt.fi
Fri Nov 26 08:32:46 EST 2004


Thanks a lot! It's fine now.
Mikko

At 15:58 22.11.2004 -0500, Jason Stajich wrote:

>On Nov 18, 2004, at 6:53 AM, Mikko Arvas wrote:
>
>>Hi,
>>
>>I want to get all available InterPro matches for S. cerevisiae and some 
>>other species. So I need to parse UniProt files to find the set of IDs for 
>>a given species and then get the InterPro matches for them. But the 
>>UniProt release file uniprot_trembl.dat gives an error towards the end of 
>>the file in the next_seq call:
>>
>>my $inseq = Bio::SeqIO->new('-file' => '<uniprot_trembl.dat',
>>                                       '-format' => 'swiss');
>>while (my $seq = $inseq->next_seq) {
>>    # check species etc. in here
>>}
>>
>>After happily processing a lot of sequences it gives:
>>Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato severe 
>>leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
>>
>>Same goes for interpro:
>>
>>my $infeat = Bio::SeqIO->new('-file' => '<match.xml',
>>                                             '-format' => 'interpro' );
>>while (my $feat = $infeat->next_seq) {
>>    # store features etc. in here
>>}
>>
>>After happily processing a lot of features it gives:
>>not well-formed (invalid token) at line 2, column 53, byte 131 at 
>>/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
>>
>>I guess it's no wonder that such big DBs have errors or are out of sync 
>>with Perl modules etc., and I don't mind losing one seq or feature here or 
>>there. The files are rather big, so fixing them manually is a bit painful. 
>>But I need to get most things processed somehow. Is there a way to skip 
>>these bad entries, or would you have some other smart ideas?
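One generic way to skip entries that make a parser die is to wrap each parse call in an eval block and carry on when it fails. The following is only a minimal sketch of that idea: parse_record() and the three in-memory records are made up for illustration and stand in for the real parser. Whether Bio::SeqIO's next_seq can resume cleanly at the next entry after dying in the middle of one is an assumption that would need checking.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a record parser: it dies on malformed input,
# much as next_seq died on the "96-1" species name.
sub parse_record {
    my ($raw) = @_;
    die "bad record: $raw\n" unless $raw =~ /^(\w+)=(\w+)$/;
    return { id => $1, species => $2 };
}

# Parse every record, keeping the good ones and collecting the error
# message for each record that had to be skipped.
sub parse_all {
    my @raw = @_;
    my (@good, @bad);
    for my $raw (@raw) {
        my $rec = eval { parse_record($raw) };
        my $err = $@;    # copy at once, before anything can reset $@
        if (defined $rec) {
            push @good, $rec;
        }
        else {
            push @bad, $err;
        }
    }
    return (\@good, \@bad);
}

my ($good, $bad) = parse_all('P1=yeast', 'broken entry', 'P2=human');
printf "parsed %d, skipped %d\n", scalar @$good, scalar @$bad;
```

Collecting the error messages rather than discarding them makes it possible to report afterwards which entries were lost.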
>
>I think this has to do with some unsafe code in the swiss.pm module, which
>compares the species name against a list of Unknown species name values 
>and so interprets the 96-1 as a range in a regexp.  Wrapping the variable 
>in \Q...\E where it is compared should be enough to fix it (the \E keeps 
>the trailing $ as an anchor).  This is the grep on line 986.
>
>- return if grep { /^$binomial$/ } @Unknown_names;
>+ return if grep { /^\Q$binomial\E$/ } @Unknown_names;
>
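To see the effect of the quoting in isolation, here is a small sketch that does not need BioPerl. The species name is the one from the error message; the contents of @unknown_names and the helper is_unknown() are invented for the example and are not the module's actual list or code. The quoted span is closed with \E so that the trailing $ stays an end-of-string anchor.

```perl
use strict;
use warnings;

# Made-up stand-in for the module's list of "unknown" species names.
my @unknown_names = ('Unknown', 'unidentified organism');
my $binomial = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Unquoted interpolation: "[Guatemala 96-1]" is compiled as a character
# class and "6-1" is not a valid range, so the match dies at run time.
my $unquoted_ok = eval {
    my @hits = grep { /^$binomial$/ } @unknown_names;
    1;
};
print "unquoted match failed: $@" unless $unquoted_ok;

# Quoted interpolation: \Q...\E escapes every metacharacter in $name,
# so the comparison is a plain string match anchored at both ends.
sub is_unknown {
    my ($name, @list) = @_;
    return scalar(grep { /^\Q$name\E$/ } @list);
}

printf "binomial is unknown: %s\n",
    is_unknown($binomial, @unknown_names) ? 'yes' : 'no';
```

Since the quoted pattern is anchored at both ends, it behaves like a string equality test on each list element.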
>There was one more place in the code that did this as well, which I think I 
>have fixed.
>
>I'm checking this in to CVS, so do a cvs update and see if your problem 
>persists.  I've tested it against uniprot_trembl.dat.
>
>Not sure what the problem is with the interpro parser; someone else will 
>need to look into that.
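For the XML error, the message gives a byte position (byte 131), so one way to start looking is to print the bytes around that position and see what the parser tripped over. This is only a diagnostic sketch: bytes_around() is a made-up helper, the radius of 40 is arbitrary, and treating the reported byte number as a zero-based offset into the file is an assumption.

```perl
use strict;
use warnings;

# Return the bytes from $radius before $offset up to $radius after it,
# with everything outside printable ASCII shown as \xNN.
sub bytes_around {
    my ($file, $offset, $radius) = @_;
    my $start  = $offset > $radius ? $offset - $radius : 0;
    my $length = ($offset - $start) + $radius;

    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    seek $fh, $start, 0 or die "cannot seek in $file: $!";
    defined(read $fh, my $buf, $length) or die "cannot read $file: $!";
    close $fh;

    # Make control characters and high-bit bytes visible.
    $buf =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $buf;
}

# Usage: perl this_script.pl match.xml
print bytes_around($ARGV[0], 131, 40), "\n" if @ARGV;
```

A stray control character or a byte sequence that is not valid in the file's declared encoding would show up as \xNN in the output.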
>
>>
>>I have BioPerl 1.4 and the latest Bio::SeqIO (for swiss.pm to work 
>>correctly) from CVS on SuSE 8.1.
>>
>>Thanks a million for any help!
>>Cheers,
>>mikko
>>Mikko Arvas
>>VTT Biotechnology
>>
>>e-mail:            mikko.arvas at vtt.fi
>>tel:                 +358-(0)9-456 5827
>>mobile:           +358-(0)44-381 0502
>>fax:                +358-(0)9-455 2103
>>mail:               Tietotie 2, Espoo
>>                       P.O. Box 1500
>>                       FIN-02044 VTT, Finland
>--
>Jason Stajich
>jason.stajich at duke.edu
>http://www.duke.edu/~jes12/
>
