[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro
Jason Stajich
jason.stajich at duke.edu
Mon Nov 22 15:58:20 EST 2004
On Nov 18, 2004, at 6:53 AM, Mikko Arvas wrote:
> Hi,
>
> I want to get all available Interpro matches for S. cerevisiae and
> some other species. So I need to parse Uniprot files to find a set of
> IDs for a given species and then get the Interpro matches from them.
> But the Uniprot release uniprot_trembl.dat gives an error towards the
> end of the file in the next_seq call:
>
> my $inseq = Bio::SeqIO->new('-file'   => '<uniprot_trembl.dat',
>                             '-format' => 'swiss');
> while (my $seq = $inseq->next_seq) {
>     # check species etc. in here
> }
>
> After happily processing a lot of sequences it gives:
> Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato
> severe leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
>
> Same goes for interpro:
>
> my $infeat = Bio::SeqIO->new('-file'   => '<match.xml',
>                              '-format' => 'interpro');
> while (my $feat = $infeat->next_seq) {
>     # store features etc. in here
> }
>
> After happily processing a lot of features it gives:
> not well-formed (invalid token) at line 2, column 53, byte 131 at
> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm
> line 187
>
> I guess it's no wonder that such big DBs have errors or are out of sync
> with Perl modules etc., and I don't mind losing one seq or feature here
> or there. The files are rather big, so fixing them manually is a bit
> painful. But I need to somehow get most things processed. Is there a
> way to skip these bad entries, or would you have some other smart
> ideas?
I think this has to do with some unsafe code in the swiss.pm module,
which compares the species name against a list of Unknown species name
values by interpolating it into a regexp. The brackets in the name are
then read as a character class, and the 96-1 inside them as a range.
Putting a \Q in front of the variable where this is being compared
should be enough to fix it. This is the grep on line 986.
- return if grep { /^$binomial$/ } @Unknown_names;
+ return if grep { /^\Q$binomial\E$/ } @Unknown_names;
There was one more place in the code that did this as well which I
think I have fixed.
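To illustrate why the quoting matters, here is a minimal standalone
sketch. It is not the actual swiss.pm code: the helper name
is_unknown_name and the contents of the list are made up for the
example. Note the explicit \E, which ends the quoting so that the
closing $ is still treated as an anchor.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's list of "unknown" species names.
my @unknown_names = ('Unknown', 'None', 'unidentified');

# Return a true value if $name exactly matches one of the list entries.
# \Q...\E makes characters such as [ ] - . in $name match literally.
sub is_unknown_name {
    my ($name) = @_;
    my $count = grep { /^\Q$name\E$/ } @unknown_names;
    return $count;
}

my $virus = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Unquoted interpolation: the pattern fails to compile at run time,
# so trap the exception instead of letting the script die.
my $ok = eval { my @hits = grep { /^$virus$/ } @unknown_names; 1 };
print "unquoted pattern died: $@" unless $ok;

# Quoted interpolation: the same name is compared as a literal string.
print "quoted lookup: ",
      (is_unknown_name($virus) ? "unknown" : "known"), "\n";
```

Since the comparison is an exact match, a plain string comparison
with eq would avoid the regexp altogether.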
I'm checking this in to CVS, so do a cvs update and see if your problem
persists. I've tested it against uniprot_trembl.dat.
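On the question of skipping bad entries: one possible stopgap is to
trap the exception around each next_seq call and carry on with the
next record. The sketch below is only that, a sketch. The helper name
next_seq_skipping_errors and the $max_skips limit are made up here,
and it only helps if the parser has already read past the offending
entry when it dies, which is an assumption and not something checked
against swiss.pm.

```perl
use strict;
use warnings;

# Fetch the next record from any object that has a next_seq() method,
# skipping records whose parsing throws an exception.
# Returns undef at end of input. Dies after more than $max_skips
# failures in one call, so a stream that fails forever cannot loop
# endlessly.
sub next_seq_skipping_errors {
    my ($in, $max_skips) = @_;
    $max_skips = 100 unless defined $max_skips;
    my $skipped = 0;
    while (1) {
        my $seq;
        my $ok = eval { $seq = $in->next_seq; 1 };
        return $seq if $ok;    # a record, or undef at end of input
        warn "Skipping unparsable entry: $@";
        die "Too many consecutive parse failures\n"
            if ++$skipped > $max_skips;
    }
}

# Intended use with the loop from the original post (needs Bio::SeqIO):
#
#   my $inseq = Bio::SeqIO->new('-file'   => '<uniprot_trembl.dat',
#                               '-format' => 'swiss');
#   while (my $seq = next_seq_skipping_errors($inseq)) {
#       # check species etc. in here
#   }
```

This will not rescue the interpro case: once an XML parser reports
that the file is not well-formed, it cannot resume from that point.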
I'm not sure what the problem is with the interpro parser; someone else
will need to look into that.
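The error message does give a position (line 2, column 53, byte 131),
so a possible first step is to look at the raw bytes near that offset.
An "invalid token" error from XML::Parser commonly points to a stray
character, such as an unescaped & or a byte that is not valid in the
document's encoding, though that is only a guess for this file. The
helper below is a hypothetical diagnostic, not part of BioPerl; the
name bytes_around and the default context of 20 bytes are made up.

```perl
use strict;
use warnings;

# Return a printable dump of the bytes around $offset in $file.
# Printable ASCII is shown as-is, every other byte as \xNN.
sub bytes_around {
    my ($file, $offset, $context) = @_;
    $context = 20 unless defined $context;
    my $start = $offset > $context ? $offset - $context : 0;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    seek $fh, $start, 0 or die "Cannot seek in $file: $!";
    my $n = read $fh, my $buf, ($offset - $start) + $context;
    die "Cannot read $file: $!" unless defined $n;
    close $fh;

    $buf =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $buf;
}

# Example, with the file name and offset from the error message:
#
#   print bytes_around('match.xml', 131), "\n";
```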
>
> I have bioperl 1.4 and the latest Bio::SeqIO (for swiss.pm to work
> correctly) from CVS on SuSE 8.1.
>
> Thanks a million for any help!
> Cheers,
> mikko
> Mikko Arvas
> VTT Biotechnology
>
> e-mail: mikko.arvas at vtt.fi
> tel: +358-(0)9-456 5827
> mobile: +358-(0)44-381 0502
> fax: +358-(0)9-455 2103
> mail: Tietotie 2, Espoo
> P.O. Box 1500
> FIN-02044 VTT, Finland
>
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/