[Biopython] how Entrez.parse() internally work

Wed Dec 9 21:25:46 UTC 2015

On Wed, Dec 9, 2015 at 7:22 PM,  <c.buhtz at posteo.jp> wrote:
> On 2015-12-09 13:23 Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> There is one call to Entrez.efetch using the retstart and retmax
>> values given. The NCBI will return a stream of data (like a file
>> handle) containing one record after another.
>
> "stream"? Not sure if I understand that.

The English word "stream" (small river) is sometimes used, as
in stream processing or streaming video, to mean that you deal
with the data as it arrives, WITHOUT random access, i.e. no
seeking within the file, just reading forwards in one go.
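
To make the idea concrete, here is a minimal sketch in plain Python
(no Biopython needed). The `iter_fasta` function and the sample data
are made up for illustration, and an `io.StringIO` object stands in
for the handle a network call would return. It is not how
Entrez.parse() is implemented, only the same forward-only pattern:

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) pairs, reading the handle strictly forwards."""
    title, chunks = None, []
    for line in handle:  # consumes one line at a time, never seeks back
        line = line.rstrip("\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Stand-in for a handle returned by a network call (made-up data).
handle = io.StringIO(">rec1\nACGT\nAC\n>rec2\nGGCC\n")

records = iter_fasta(handle)
first = next(records)   # reads only as far as the start of the next record
rest = list(records)    # carries on from where the previous read stopped

# The stream has been used up, so a second pass finds nothing to read.
second_pass = list(iter_fasta(handle))
```

The point is the last line: once the data has gone past, you cannot
go back for it unless you saved it yourself.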

> In my case there is roughly 5 GB of data in one complete eFetch
> call (if retmax were 99999). When are these 5 GB transferred from
> the NCBI to me?

Asking for 5 GB in a single request like that will almost
certainly fail. You should request much smaller batches of
data, by making multiple calls to efetch with an increasing
retstart value.

Also, I would cache this data locally as files on disk to avoid
having to re-download it if you need to re-run your script.

See the example "Searching for and downloading sequences
using the history" in the tutorial (which includes retries if a
batch download fails).
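
The batching, caching and retry advice above could be combined
roughly as in the sketch below. This is not the tutorial's code. The
names `download_in_batches`, `make_entrez_fetch` and `fake_fetch`,
the file naming scheme, and the default batch settings are all my
own assumptions. The wrapper assumes you already have `webenv` and
`query_key` from an earlier history-based search, and that efetch
returns text when called with retmode="text". The demonstration at
the end uses a fake fetch function so it runs without network access:

```python
import os
import tempfile
import time


def download_in_batches(fetch, total, batch_size, cache_dir, retries=3, wait=5.0):
    """Fetch records [0, total) in batches, caching each batch as a file.

    `fetch(start, size)` is a caller-supplied function returning text.
    Batches whose cache file already exists are not downloaded again.
    """
    os.makedirs(cache_dir, exist_ok=True)
    paths = []
    for start in range(0, total, batch_size):
        size = min(batch_size, total - start)
        path = os.path.join(cache_dir, "batch_%08d.txt" % start)
        paths.append(path)
        if os.path.exists(path):
            continue  # already cached by an earlier run
        for attempt in range(1, retries + 1):
            try:
                data = fetch(start, size)
                break
            except OSError:  # urllib's HTTPError and URLError are OSErrors
                if attempt == retries:
                    raise
                time.sleep(wait)
        # Write under a temporary name first, so an interrupted run never
        # leaves a half-written file that looks like a finished batch.
        tmp = path + ".part"
        with open(tmp, "w") as out:
            out.write(data)
        os.replace(tmp, path)
    return paths


def make_entrez_fetch(webenv, query_key, db="nucleotide", rettype="fasta"):
    """Build a fetch(start, size) function around Entrez.efetch."""
    from Bio import Entrez  # lazy import: only needed for real downloads

    def fetch(start, size):
        handle = Entrez.efetch(db=db, rettype=rettype, retmode="text",
                               retstart=start, retmax=size,
                               webenv=webenv, query_key=query_key)
        try:
            return handle.read()
        finally:
            handle.close()

    return fetch


# Offline demonstration: a fake fetch that fails once, then succeeds.
calls = []


def fake_fetch(start, size):
    calls.append((start, size))
    if len(calls) == 1:
        raise OSError("simulated network error")
    return "".join(">rec%d\nACGT\n" % i for i in range(start, start + size))


with tempfile.TemporaryDirectory() as tmp_dir:
    first_run = download_in_batches(fake_fetch, 5, 2, tmp_dir, wait=0.0)
    calls_after_first = len(calls)
    second_run = download_in_batches(fake_fetch, 5, 2, tmp_dir, wait=0.0)
    calls_after_second = len(calls)  # unchanged: every batch was cached
    with open(first_run[-1]) as cached:
        last_batch = cached.read()
```

For a real download you would set Entrez.email first and pass
`make_entrez_fetch(webenv, query_key)` in place of `fake_fetch`,
with a batch size small enough that a single failed request is cheap
to repeat.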

> When I call Entrez.efetch(retmax=999999)?
> Or is physically/really only one record (some KBytes, not much)
> transferred from NCBI to me during each iteration (or next())?

It should be a few Kbytes at a time as each record is parsed.
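
If you want to check this kind of behaviour yourself, one approach
is to wrap the handle in a small counter and watch how much has been
read after each next(). The sketch below uses a made-up
`CountingHandle` class, a toy `parse_records` parser and invented
data, not Entrez.parse() itself, and it counts what the parser asks
for rather than what the operating system has buffered from the
network:

```python
import io


class CountingHandle:
    """Wrap a file-like object and count the characters handed out."""

    def __init__(self, handle):
        self._handle = handle
        self.count = 0

    def readline(self):
        line = self._handle.readline()
        self.count += len(line)
        return line


def parse_records(handle):
    """Yield records terminated by a '//' line, reading only as far as needed."""
    lines = []
    while True:
        line = handle.readline()
        if not line:
            break  # end of stream
        if line.strip() == "//":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if lines:
        yield "".join(lines)  # tolerate a missing final terminator


# Made-up data standing in for what a server would send.
data = "ID one\nSEQ ACGT\n//\nID two\nSEQ GGCC\n//\nID three\nSEQ TTAA\n//\n"
handle = CountingHandle(io.StringIO(data))
records = parse_records(handle)

first = next(records)
read_after_first = handle.count    # only the first record has been consumed
second = next(records)
read_after_second = handle.count   # grows by one more record
```

The counter goes up one record at a time instead of jumping to the
full size of the data on the first call, which is the behaviour
described above.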

> I don't want to get a call from the NCBI because of too much load
> on their servers. ;)

Since you plan to download a very large amount of data,
you should email the NCBI for advice about this.

Alternatively, you might be able to download this another
way (e.g. a lot of the NCBI datasets are on their FTP
server), or access a local mirror, an equivalent database
like PDBj or DDBJ, or a service like TogoWS.

Regards,

Peter