Blast format change at NCBI

David Mathog mathog at mendel.bio.caltech.edu
Tue Feb 12 17:42:30 UTC 2002


> NCBI's FTP server will be releasing BLAST formatted databases in a new
> format (version 4) from February 19th.
> 
> This format is not supported by EMBOSS dbiblast indexing (or by NCBI's
> blastall before the current version 2.2.2)
> 
> See ftp://ftp.ncbi.nih.gov/blast/db/NewFormattedDatabases/README
> 
> If you have the fasta format files, you should be able to use dbifasta to
> index the same entries.
> 
> The blast 2.2.2 release notes say "Fastacmd will dump out an entire BLAST
> database in FASTA format if the new -D option is used", but I have not
> tried this yet to see how well it fits with EMBOSS.
> 
> Has anyone else looked into this yet?

Sorry, no. I imagine that it will break GCG's BLAST implementation as
well, though.

I have used fastacmd (which is undocumented, of course) in the past with
success, but obviously not with -D. Usage was:

  fastacmd -d nt -s ""

This snippet of code (from an older NCBI version) is the only
documentation I've found for the fastacmd command line options:

static Args myargs [NUMARG] = {
    { "Database", 
      NULL, NULL, NULL, TRUE, 'd', ARG_STRING, 0.0, 0, NULL},
    { "Search string: GIs, accessions and locuses may be used delimited\n"
      "      by comma or space)",
      NULL, NULL, NULL, TRUE, 's', ARG_STRING, 0.0, 0, NULL},
    { "Input file with GIs/accessions/locuses for batch retrieval",
      NULL, NULL, NULL, TRUE, 'i', ARG_STRING, 0.0, 0, NULL},
    { "Retrieve duplicated accessions",
      "F", NULL, NULL, TRUE, 'a', ARG_BOOLEAN, 0.0, 0, NULL},
    { "Line length for sequence", 
      "80", NULL, NULL, TRUE, 'l', ARG_INT, 0.0, 0, NULL}
};
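
As a rough illustration of how the options in that table fit together, here
is a minimal Python sketch that assembles a fastacmd argument list. It uses
only the flags listed above (-d, -s, -i, -a, -l). The function name, the
parameter names, and the choice to pass the boolean as "T" (the usual NCBI
toolkit convention) are my assumptions, not anything taken from NCBI
documentation, and -D is left out because I have not tried it.

```python
def build_fastacmd_args(database, search=None, id_file=None,
                        duplicates=False, line_length=80):
    """Build an argv list for fastacmd (hypothetical helper).

    Only options that appear in the args table above are emitted.
    """
    args = ["fastacmd", "-d", database]
    if search is not None:
        # -s: GIs, accessions or locuses, delimited by comma or space
        args += ["-s", search]
    if id_file is not None:
        # -i: file of GIs/accessions/locuses for batch retrieval
        args += ["-i", id_file]
    if duplicates:
        # -a: retrieve duplicated accessions; "T" for true is an assumption
        args += ["-a", "T"]
    if line_length != 80:
        # -l: line length for sequence output; 80 is the default in the table
        args += ["-l", str(line_length)]
    return args
```

The resulting list could be handed to subprocess.run() on a machine where
fastacmd is installed and on the PATH.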

A more pressing problem for us regarding BLAST is that nr (entire) and
nt_lcl (fragments of 1/9th of nt) have both just grown beyond the point
where the data can fit, along with BLAST and Linux, in the 512 MB on each
of our 9 DS10 nodes. Consequently, local BLAST runs on these databases
now crawl, since all data must be read from disk for each run.

The NCBI must have something like 10 GB in their main server these days
to keep both nr and nt (and whatever else) in memory. I'd _kill_ for MPI
or PVM BLAST, but it seems like the NCBI is not going to see the need to
write that until they reach the memory limit on their huge server. I'd
add it myself, but a horrible experience tracking down the gi list memory
leak bug in one version of BLAST taught me once and for all that their
code structure is so complex that the task would be extraordinarily
difficult. Others must have come to the same conclusion, since nobody
else has done it either.
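
To make the memory arithmetic concrete, here is a small Python sketch of how
one might estimate the number of database fragments needed so that each
fragment fits in a node's RAM next to BLAST and the operating system. The
function name is hypothetical and every number passed to it (database size,
reserved memory) would be a guess for illustration, not a measurement from
our cluster.

```python
def fragments_needed(db_mb, ram_mb, reserved_mb):
    """Smallest number of equal fragments so each fits in usable RAM.

    db_mb       -- total size of the formatted database, in MB
    ram_mb      -- physical memory per node, in MB
    reserved_mb -- memory assumed to be taken by BLAST and the OS, in MB
    """
    usable = ram_mb - reserved_mb
    if usable <= 0:
        raise ValueError("no memory left for database data")
    # Integer ceiling division: round up so no fragment exceeds usable RAM.
    return -(-db_mb // usable)
```

If the result is larger than the number of nodes, a one-fragment-per-node
layout no longer keeps the data in memory, which is the situation described
above.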

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech



