EMBOSS produces an error when doing wildcard searches
David Mathog
mathog at mendel.bio.caltech.edu
Fri Feb 15 16:53:44 UTC 2002
>
> Dear all,
>
> has anybody else experienced problems when iterating over a database
> with a wildcard search?
> For example, when I state
> infoseq embl:\* ,
> everything works as expected, but when I say
> infoseq embl:p\* ,
> after a while the output stops and the following error message is
> displayed:
It doesn't work here either.
EMBOSS-2.2.0 on Solaris 5.8 on a Sparc.
We keep our Genbank database in GCG format. It was indexed with:
dbigcg -dbname genbank \
-directory $WORKDIR \
-indexdirectory $EMBOSSINDEXDIR \
-release "$THERELEASE" \
-date "$THEMONTH/$THEDAY/$THE2YEAR" \
-idformat GENBANK \
-filename "*.seq"
When this is run:
% infoseq genbank:\* -outfile=/tmp/testwild.txt
it stops (no error messages) after 32993 lines with the last entry at:
AC003118 which is just slightly into gb_htg. More interestingly:
wc /export/home/gcg/data/gcggenbank/*seqcat
32940 417123 3441822 /export/home/gcg/data/gcggenbank/gb_ba.seqcat
93349 1323961 11229588 /export/home/gcg/data/gcggenbank/gb_htg.seqcat
108723 1578814 13275437 /export/home/gcg/data/gcggenbank/gb_in.seqcat
32993 - 32940 = 53.
Looks like the indexing lookup functions do not transition properly from
one file to another. Or it could be that a signed short is used somewhere
it shouldn't be, since 32993 isn't that much more than 32768.
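The arithmetic behind these two hypotheses can be checked with a short sketch. It uses only the numbers quoted above; the variable names are illustrative, and the assumption that one output line corresponds to one .seqcat line is the same rough approximation made in the text, not something confirmed against the EMBOSS source.

```python
import ctypes

# Line counts reported by wc for the first two .seqcat files (from the post).
seqcat_lines = {"gb_ba": 32940, "gb_htg": 93349}
stop_line = 32993  # lines written by infoseq before it stopped

# Hypothesis 1: the failure is near a file boundary.
# How far past the end of the first file did the output get?
past_first_file = stop_line - seqcat_lines["gb_ba"]

# Hypothesis 2: a signed 16-bit counter overflowed.
# How far past the first value a signed short cannot hold did the output get?
INT16_MAX = 2**15 - 1
past_int16 = stop_line - (INT16_MAX + 1)

# Value a signed 16-bit counter holds once it is pushed one past its maximum.
wrapped = ctypes.c_int16(INT16_MAX + 1).value

print(past_first_file, past_int16, wrapped)  # 53 225 -32768
```

Neither offset is zero, so the numbers alone do not single out one explanation: the stop point is 53 lines past the first file and 225 past the 16-bit limit.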
Or maybe it's a memory leak? Run it again and watch the process...
Doesn't look like it, it grew slowly to (top output):
22941 root 1 0 10 9552K 5968K run 2:14 98.31% infoseq
and stabilized there. So not a memory leak.
Let the run complete and see if it fails at the same place... Yes, it did.
Is it just infoseq or other programs as well? Try fuzznuc with:
% fuzznuc -sequence=genbank:\* '-pattern=<N' -outf=fuzznucout.txt -mismatch=0
Oh great, there's another bug - this pattern doesn't match anything!
Try it the other way:
% fuzznuc -sequence=genbank:\* '-pattern=N>' -outf=fuzznucout.txt -mismatch=0
That works. Let's see where it stops... Looks bad folks, the output file
has 32992 lines this time and the last line is, you guessed it:
AC003118 94882 G
Definitely a bug in the low level wildcard retrieval routines somewhere.
Alan, please move this one way up the priority list - it probably means
all wildcarded database operations are broken. The worst part is that
here these are failing as if they were completing normally - there are
no error messages to tip off the user that things have gone wrong.
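Because the failure is silent, one way to catch it is to compare the number of rows a run produced with the number of entries the database should hold. The following is only a sketch under stated assumptions: `looks_truncated` and its parameters are hypothetical names, not part of EMBOSS; it assumes the output has one header line; and it treats the wc line counts of the three .seqcat files shown above as an approximation of the entry count.

```python
def looks_truncated(output_lines, expected_entries, header_lines=1):
    """Return True if the output holds fewer data rows than expected entries.

    output_lines     -- iterable of lines read from the program's output file
    expected_entries -- number of database entries the wildcard should match
    header_lines     -- assumed number of non-data lines at the top
    """
    # Count non-blank lines, then discount the assumed header.
    data_rows = sum(1 for line in output_lines if line.strip()) - header_lines
    return data_rows < expected_entries


# Approximate expected size from the three .seqcat line counts quoted above.
expected = 32940 + 93349 + 108723

# Simulate the fuzznuc run that stopped after 32992 lines (no header assumed).
simulated_output = ["row"] * 32992
print(looks_truncated(simulated_output, expected, header_lines=0))  # True
```

A check like this only flags a shortfall; it says nothing about where in the retrieval code the entries were lost.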
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech