[Bioperl-l] Parsing a netblast file
Jason Stajich
jason at cgt.duhs.duke.edu
Thu Jul 31 09:36:22 EDT 2003
> Through trial and error I have narrowed down the problem to the negative
> sign in the database details. Here is the section in question from a
> netblast result file:
>
> Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS,
> or phase 0, 1 or 2 HTGS sequences)
> 1,819,241 sequences; -24,217,474 total letters
Integer overflow. The number of letters in nt is greater than the
largest value (2147483647) that a signed 32-bit integer can represent.
Looks like nt length is 8,782,847,770, so it seems to have been larger than
INT_MAX for a while; I'm surprised they haven't updated their code. Do you
have the latest version of netblast on your machine? A bug report to NCBI
is probably a good idea if you are running the latest version.
Some C code to illustrate what happens:
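#include <stdio.h>   /* printf */
#include <limits.h>  /* INT_MAX */
int main (void)
{
  int i = INT_MAX;
  unsigned int ui = INT_MAX;
  printf ("max integer size is %d\n", i);
  printf ("INT_MAX stored in an unsigned int is %u\n", ui);
  /* i+1 overflows a signed int: formally undefined behaviour in C,
     but on most machines it wraps around to a negative number */
  printf ("max integer+1 size is %d\n", i+1);
  /* an unsigned int still has room for twice INT_MAX */
  printf ("INT_MAX*2 as an unsigned int is %u\n", ui*2);
  return 0;
}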
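To see how a large positive count turns into the kind of negative number
quoted above, here is a minimal sketch that keeps only the low 32 bits of a
64-bit value and reads them back as signed. The function name wrap_to_int32
is made up for this illustration, and it assumes the count was squeezed into
a 32-bit two's-complement field; how netblast actually stores the number is
not known from this thread.

```c
#include <stdint.h>

/* Hypothetical helper: what a 64-bit count looks like after being
   stored in a 32-bit signed field (two's-complement wraparound assumed). */
int32_t wrap_to_int32(int64_t count)
{
    /* keep only the low 32 bits; unsigned conversion is well defined */
    uint32_t low = (uint32_t)(uint64_t)count;

    if (low <= (uint32_t)INT32_MAX)
        return (int32_t)low;            /* still fits, value unchanged */

    /* the upper half of the unsigned range maps onto the negatives */
    return (int32_t)((int64_t)low - INT64_C(4294967296));
}
```

For example, wrap_to_int32(4270749822) returns -24217474. That input is
only an illustration of the arithmetic (4,270,749,822 - 2^32), not a figure
reported by NCBI.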
>
> I don't know why, but all netblast result files I have looked at show a
> negative value for the total number of letters. If I remove the '-' sign,
> the blast result file parses just fine with the above script.
>
> Why does a netblast result file have a minus sign for the database size?
> Why won't the parser work if there is a minus sign?
> Is there a way to make the parser work despite the minus sign?
>
We'd just need to tweak the regexp a little bit to handle a leading -.
What version of bioperl are you running, so I can provide a patch which is
appropriate for your version?
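
The bioperl regexp itself is not shown in this thread, and the real fix
belongs in the Perl parser. Purely to illustrate the idea of allowing an
optional leading minus sign, here is a rough sketch in C using POSIX
regex.h. The pattern and the function name parse_total_letters are
assumptions made for this example, not the parser's actual code.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: extract the "total letters" figure from a BLAST
   database summary line, tolerating a leading '-'.
   Returns 1 and stores the value on a match, 0 otherwise. */
int parse_total_letters(const char *line, long long *letters)
{
    /* "-?" is the tweak: an optional minus sign before the digits */
    const char *pattern = "sequences; *(-?[0-9][0-9,]*) +total letters";
    regex_t re;
    regmatch_t m[2];
    char digits[64];
    size_t n = 0;
    int ok = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, line, 2, m, 0) == 0) {
        /* copy the captured number, dropping the thousands separators */
        for (regoff_t k = m[1].rm_so;
             k < m[1].rm_eo && n + 1 < sizeof digits; k++) {
            if (line[k] != ',')
                digits[n++] = line[k];
        }
        digits[n] = '\0';
        *letters = strtoll(digits, NULL, 10);
        ok = 1;
    }

    regfree(&re);
    return ok;
}
```

Given the line "1,819,241 sequences; -24,217,474 total letters" from the
report above, this stores -24217474 and returns 1. Whether the parser should
keep the negative value or flag it as bogus is a separate decision.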
-jason
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu