[EMBOSS] Re: Genbank GI fetching?

Niels Larsen niels at genomics.dk
Sun Jun 17 23:56:40 UTC 2007


Hi David,

Thanks again for the hints. Great.

I found that dbxflat behaves well, runs fast, and makes small indices
when only id,acc are asked for. But GenBank/EMBL have become 500 GB+
monsters uncompressed, so I made this primitive scheme in addition:
split the flatfiles into many smaller compressed files, organised in
directories named after the first 4 digits of the GI number. Then, with
grep and zcat as "accessors" and 5-10 MB chunks, the average access
time is 0.1-0.2 seconds. That is much worse than dbxflat, but better
than fetching entries from NCBI, and the whole thing takes 100 GB
instead of 500, close to its distributed compressed size. I would have
used EMBL if EBI's remote services worked reliably. Btw, the seqret
documentation doesn't say so, but stdin: works just as stdout: does:

zcat 2.gz | seqret -filter stdin:AAIY01677200 -sbegin1 11 -osformat2 embl -firstonly
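
For illustration, the chunked lookup described above could be sketched
roughly as follows. This is Python rather than the grep/zcat pipeline
actually used, and it is only a sketch under assumptions: the function
names (gi_directory, find_record), the root directory argument, the
".gz" chunk naming, and the choice to match the GI on the GenBank
VERSION line are all hypothetical and not taken from the real setup.

```python
import gzip
import os


def gi_directory(root, gi):
    """Return the directory assumed to hold the chunks for a GI number.

    Assumption: directories are named after the first 4 digits of the GI.
    """
    return os.path.join(root, str(gi)[:4])


def find_record(root, gi):
    """Scan the compressed chunks for one GI; return the record text or None.

    Assumption: chunks are gzip-compressed GenBank flatfiles in which each
    record ends with a "//" line and carries "GI:<number>" on its VERSION line.
    """
    directory = gi_directory(root, gi)
    if not os.path.isdir(directory):
        return None
    wanted = "GI:%d" % gi
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".gz"):
            continue
        with gzip.open(os.path.join(directory, name), "rt") as handle:
            record, found = [], False
            for line in handle:
                record.append(line)
                # Exact token match, so GI:1234 does not match GI:12345.
                if line.startswith("VERSION") and wanted in line.split():
                    found = True
                if line.startswith("//"):
                    if found:
                        return "".join(record)
                    record, found = [], False
    return None
```

The returned text could then be piped to seqret on stdin, as in the
command above. Each lookup decompresses at most the chunks of one
directory, which is where the 5-10 MB chunk size matters for access time.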

Is adding entries to existing indices on the to-do list for dbxflat?

Niels L




