[EMBOSS] Re: Genbank GI fetching?
Niels Larsen
niels at genomics.dk
Sun Jun 17 23:56:40 UTC 2007
Hi David,
Thanks again for the hints. Great.
I found that dbxflat behaves well, runs fast, and makes small indices
when only id,acc are asked for. But Genbank/EMBL have become 500 GB+
monsters uncompressed, so I made this primitive scheme in addition:
split the flatfiles into many smaller compressed files, organised in
directories named after the first 4 digits of the GI number. Then, with
grep and zcat as "accessors" and 5-10 MB chunks, the average access
time is 0.1-0.2 seconds. That is much worse than dbxflat, but better
than fetching entries from NCBI, and the whole thing takes 100 GB
instead of 500, close to its distributed compressed size. I would have
used EMBL if EBI's remote services worked reliably. Btw, the seqret
documentation doesn't say so, but stdin: works just as stdout: does:
zcat 2.gz | seqret -filter stdin:AAIY01677200 -sbegin1 11 -osformat2 embl -firstonly
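
For anyone wanting to try the GI-prefix layout described above, here is
a rough sketch in Python rather than the grep/zcat accessors I actually
use. The function names, the chunk file name and the one-chunk-per-bucket
simplification are made up for illustration; the only format assumptions
are that each Genbank record ends with a "//" line and carries its GI as
a "GI:<number>" token on the VERSION line.

```python
import gzip
import os


def bucket_dir(root, gi):
    """Directory for a GI: named after the first 4 digits of the GI number."""
    return os.path.join(root, str(gi)[:4])


def store_records(root, records, chunk_name="chunk0.gz"):
    """Write {gi: flatfile_text} into gzip chunks, one chunk per bucket.

    chunk_name is a hypothetical name; a real setup would cap each chunk
    at a few MB and start a new file when the cap is reached.
    """
    by_bucket = {}
    for gi, text in records.items():
        by_bucket.setdefault(bucket_dir(root, gi), []).append(text)
    for directory, texts in by_bucket.items():
        os.makedirs(directory, exist_ok=True)
        with gzip.open(os.path.join(directory, chunk_name), "wt") as fh:
            fh.writelines(texts)


def fetch_by_gi(root, gi):
    """Scan the chunks in the GI's bucket and return the matching record.

    Returns the full record text (through its "//" line), or None.
    """
    directory = bucket_dir(root, gi)
    if not os.path.isdir(directory):
        return None
    token = "GI:%s" % gi
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".gz"):
            continue
        with gzip.open(os.path.join(directory, name), "rt") as fh:
            record, hit = [], False
            for line in fh:
                record.append(line)
                # Compare whole tokens so GI:1234 does not match GI:12345.
                if line.startswith("VERSION") and token in line.split():
                    hit = True
                if line.startswith("//"):
                    if hit:
                        return "".join(record)
                    record, hit = [], False
    return None
```

The lookup only ever decompresses the chunks in one directory, which is
why small chunks keep the access time down; the returned text could then
be piped to seqret on stdin: as in the command above.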
Is adding to indices on the todo-list for dbxflat?
Niels L