[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Jason Stajich jason.stajich at duke.edu
Mon Nov 22 15:58:20 EST 2004


On Nov 18, 2004, at 6:53 AM, Mikko Arvas wrote:

> Hi,
>
> I want to get all available Interpro matches for S. cerevisiae and 
> some other species. So I need to parse Uniprot files to find a set of 
> IDs for a given species and then get the Interpro matches from them. 
> But the Uniprot release uniprot_trembl.dat gives an error towards the 
> end of the file in next_seq call:
>
> use Bio::SeqIO;
> my $inseq = Bio::SeqIO->new('-file' => '<uniprot_trembl.dat',
>                                       '-format' => 'swiss');
> while (my $seq = $inseq->next_seq) {
>     # check species etc. in here
> }
>
> After happily processing a lot of sequences it gives:
> Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato 
> severe leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
>
> Same goes for interpro:
>
> my $infeat = Bio::SeqIO->new('-file' => '<match.xml',
>                                             '-format' => 'interpro' );
> while (my $feat = $infeat->next_seq) {
>     # store features etc. in here
> }
>
> After happily processing a lot of features it gives:
> not well-formed (invalid token) at line 2, column 53, byte 131 at 
> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm 
> line 187
>
> I guess it's no wonder that such big DBs have errors or are out of sync 
> with perl modules etc. and I don't mind losing one seq or feature here 
> or there. The files are rather big so fixing them manually is a bit 
> painful. But I need to somehow get most things processed, is there a 
> way to skip these bad entries or would you have some other smart 
> ideas?

I think this has to do with some unsafe code in the swiss.pm module, which 
compares the species name against a list of Unknown species name values. 
The name is interpolated into a regexp unescaped, so the 96-1 inside the 
square brackets is interpreted as a character range.  Wrapping the 
variable in \Q...\E should be enough to fix it; the \E is needed so that 
the trailing $ stays an anchor instead of being quoted along with the 
name.  This is the grep on line 986.

- return if grep { /^$binomial$/ } @Unknown_names;
+ return if grep { /^\Q$binomial\E$/ } @Unknown_names;

There was one more place in the code that did this as well which I 
think I have fixed.
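
To see the failure in isolation, here is a minimal sketch outside of 
swiss.pm.  The species string is taken from the error message above; the 
contents of @Unknown_names and the two helper subs are made up for 
illustration and are not the actual list or code in the module.

```perl
use strict;
use warnings;

# Illustrative stand-ins only; not the real list from swiss.pm.
my @Unknown_names = ('unknown', 'Unknown.', 'Unidentified');

# Unquoted comparison: $binomial is interpolated as a regexp, so
# "[Guatemala 96-1]" is parsed as a character class with a bad range.
sub is_unknown_unsafe {
    my ($binomial) = @_;
    return scalar grep { /^$binomial$/ } @Unknown_names;
}

# Quoted comparison: \Q...\E escapes the metacharacters in $binomial;
# the \E ends the quoting so the trailing $ remains an anchor.
sub is_unknown_safe {
    my ($binomial) = @_;
    return scalar grep { /^\Q$binomial\E$/ } @Unknown_names;
}

my $name = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# The unquoted version throws at match time; trap it to show the message.
my $ok = eval { is_unknown_unsafe($name); 1 };
print "unsafe version died: $@" unless $ok;

print "safe version says: ",
    (is_unknown_safe($name) ? "unknown name" : "regular name"), "\n";
```

The same effect can be had without a regexp at all by comparing with eq, 
which avoids the quoting question entirely.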

I'm checking this in to CVS, so do a cvs update and see if your problem 
persists.  I've tested it against uniprot_trembl.dat.

I'm not sure what the problem is with the interpro parser; someone else 
will need to look into that.
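
On the more general question of skipping bad entries: one common 
approach, sketched here under assumptions, is to trap the exception from 
each read with eval and move on.  The helper process_records and its 
arguments are hypothetical and not part of BioPerl.  Whether a given 
Bio::SeqIO parser resumes cleanly at the next record after dying is not 
something this sketch guarantees, hence the cap on consecutive failures.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl.
#   $next    : code ref returning the next record, or undef at end of input
#   $handler : code ref called once for each record that was read
#   $max_bad : give up after this many failed reads in a row (default 10)
sub process_records {
    my ($next, $handler, $max_bad) = @_;
    $max_bad = 10 unless defined $max_bad;
    my ($good, $bad, $bad_in_a_row) = (0, 0, 0);

    while (1) {
        my $rec;
        my $ok = eval { $rec = $next->(); 1 };
        if (!$ok) {
            # The read died: report it, count it, and try the next record.
            warn "skipping bad entry: $@";
            $bad++;
            die "too many consecutive failures, giving up\n"
                if ++$bad_in_a_row >= $max_bad;
            next;
        }
        last unless defined $rec;    # normal end of input
        $bad_in_a_row = 0;
        $handler->($rec);            # errors in the handler are not trapped
        $good++;
    }
    return ($good, $bad);
}
```

With a SeqIO stream this might be called as 
process_records(sub { $inseq->next_seq }, sub { my $seq = shift; ... }), 
keeping the species check inside the second sub.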

>
> I have bioperl 1.4 and the latest Bio::SeqIO (for swiss.pm to work 
> correctly) from CVS on SuSE 8.1.
>
> Thanks a million for any help!
> Cheers,
> mikko
> Mikko Arvas
> VTT Biotechnology
>
> e-mail:            mikko.arvas at vtt.fi
> tel:                 +358-(0)9-456 5827
> mobile:           +358-(0)44-381 0502
> fax:                +358-(0)9-455 2103
> mail:               Tietotie 2, Espoo
>                       P.O. Box 1500
>                       FIN-02044 VTT, Finland
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/


