<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<div class="">Hi,</div>

<div class=""><br class="">

</div>

<div class="">As Peter says, it’s probably dependent on the database/file format you want to download. I’ve had success by downloading NCBI records in batches, and keeping them in a local list, then treating them as though they were a single returned list of

 records.</div>

<div class=""><br class="">

</div>

<div class="">For a recent script I used this approach to batch large result sets:</div>

<div class=""><br class="">

</div>

<div class="">i) Wrap the Entrez search in a retry function (args.retries comes from argparse. It’s lazy not to make it a function argument - I should fix that).</div>

<div class=""><br class="">

# Retry Entrez requests<br class="">

def entrez_retry(fn, *fnargs, **fnkwargs):<br class="">

    """Retries the passed function up to the number of times specified<br class="">

    by args.retries<br class="">

    """<br class="">

    tries, success = 0, False<br class="">

    while not success and tries < args.retries:<br class="">

        try:<br class="">

            output = fn(*fnargs, **fnkwargs)<br class="">

            success = True<br class="">

        except Exception:<br class="">

            tries += 1<br class="">

            logger.warning("Entrez query %s(%s, %s) failed (%d/%d)" %<br class="">

                           (fn, fnargs, fnkwargs, tries, args.retries))<br class="">

            logger.warning(last_exception())<br class="">

    if not success:<br class="">

        logger.error("Too many Entrez failures (exiting)")<br class="">

        sys.exit(1)<br class="">

    return output</div>

<div class=""><br class="">

</div>
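<div class="">The note in (i) mentions making the retry count a function argument. A minimal sketch of that variant follows; it is not the original script. The names retry_call, retries and delay are hypothetical, and it raises an exception instead of calling sys.exit(), on the assumption that the caller should decide how to handle a final failure.</div>
<pre class="">
```python
import logging
import time

logger = logging.getLogger(__name__)


def retry_call(fn, *fnargs, retries=3, delay=0.0, **fnkwargs):
    """Call fn(*fnargs, **fnkwargs), retrying up to 'retries' times.

    retries and delay are illustrative parameters (assumptions, not part
    of the original script). Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fn(*fnargs, **fnkwargs)
        except Exception as err:  # broad on purpose; narrow it in real code
            last_error = err
            logger.warning("%s failed (%d/%d): %s",
                           getattr(fn, "__name__", fn), attempt, retries, err)
            time.sleep(delay)  # optional pause before the next attempt
    raise RuntimeError("Too many failures") from last_error
```
</pre>
<div class="">With Biopython this would be called along the lines of retry_call(Entrez.esearch, db="assembly", term=query, retries=args.retries).</div>
<div class=""><br class="">
</div>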

<div class="">ii) Wrap pulling record IDs from NCBI in batches, using the webhistory:</div>

<div class=""><br class="">

# Get results from NCBI web history, in batches<br class="">

def entrez_batch_webhistory(record, expected, batchsize, *fnargs, **fnkwargs):<br class="">

    """Recovers the Entrez data from a prior NCBI webhistory search, in <br class="">

    batches of defined size, using Efetch. Returns all results as a list.<br class="">

    - record: Entrez webhistory record<br class="">

    - expected: number of expected search returns<br class="">

    - batchsize: how many search returns to retrieve in a batch<br class="">

    - *fnargs: arguments to Efetch<br class="">

    - **fnkwargs: keyword arguments to Efetch<br class="">

    """<br class="">

    results = []<br class="">

    for start in range(0, expected, batchsize):<br class="">

        batch_handle = entrez_retry(Entrez.efetch,<br class="">

                                    retstart=start, retmax=batchsize,<br class="">

                                    webenv=record["WebEnv"],<br class="">

                                    query_key=record["QueryKey"],<br class="">

                                    *fnargs, **fnkwargs)<br class="">

        batch_record = Entrez.read(batch_handle)<br class="">

        results.extend(batch_record)<br class="">

    return results</div>

<div class=""><br class="">

</div>
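<div class="">The paging logic in (ii) can be tried without contacting NCBI. The sketch below assumes a hypothetical fetch_batch(start, size) callable standing in for the Efetch call; fetch_in_batches, fake_fetch and fake_ids are illustrative names, not part of the original script.</div>
<pre class="">
```python
def fetch_in_batches(fetch_batch, expected, batchsize):
    """Collect 'expected' records by calling fetch_batch(start, size)
    once per batch, and return them merged into a single list."""
    results = []
    for start in range(0, expected, batchsize):
        # The last batch may be smaller than batchsize.
        size = min(batchsize, expected - start)
        results.extend(fetch_batch(start, size))
    return results


# Stand-in for Efetch (an assumption for illustration): slices a local
# list instead of querying NCBI.
fake_ids = [str(n) for n in range(1050)]


def fake_fetch(start, size):
    return fake_ids[start:start + size]


merged = fetch_in_batches(fake_fetch, len(fake_ids), 250)
print(len(merged))  # 1050, gathered in five batches
```
</pre>
<div class=""><br class="">
</div>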

<div class="">

<div class="">iii) Run the complete query, saving record IDs to the webhistory (this could at times identify thousands of records). In the example here, record has ["WebEnv"] and ["QueryKey"] fields that allow you to recover the results later. It also has a ["Count"] field that tells you how many records in total you should expect back. In my experience this caps at 100,000, even when there are more records to return. I have no robust, reliable way to overcome this.</div>

<div class=""><br class="">

</div>

<div class="">    # Use NCBI history for the search.<br class="">

    handle = entrez_retry(Entrez.esearch, db="assembly", term=query,<br class="">

                          <span class="Apple-tab-span" style="white-space: pre;">

</span>format="xml", usehistory="y")</div>

<div class="">    record = Entrez.read(handle)</div>

</div>

<div class="">    # Recover assembly UIDs from the web history<br class="">

    asm_ids = entrez_batch_webhistory(record, int(record["Count"]), 250,<br class="">

                                      db="assembly", retmode="xml")</div>

<div class=""><br class="">

</div>
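<div class="">If keeping every batch in one merged list is not wanted, the double loop Peter describes in the quoted message below is an alternative. A minimal sketch, assuming each batch is simply an iterable of already-parsed records (iter_records and the sample data are hypothetical):</div>
<pre class="">
```python
def iter_records(batches):
    """Yield each record from each batch in turn (outer loop over
    batches, inner loop over the records in that batch)."""
    for batch in batches:
        for record in batch:
            yield record


# Illustrative data only: three small batches of placeholder records.
batches = [["rec1", "rec2"], ["rec3"], ["rec4", "rec5"]]
for rec in iter_records(batches):
    print(rec)
```
</pre>
<div class="">The calling code then sees one stream of records, regardless of how many batches were fetched.</div>
<div class=""><br class="">
</div>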

<div class="">YMMV, but I hope this is helpful.</div>

<div class=""><br class="">

</div>

<div class="">Cheers,</div>

<div class=""><br class="">

</div>

<div class="">L.</div>

<br class="">

<div>

<blockquote type="cite" class="">

<div class="">On 2 Dec 2015, at 22:04, Peter Cock <<a href="mailto:p.j.a.cock@googlemail.com" class="">p.j.a.cock@googlemail.com</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<div dir="ltr" class="">Hi,

<div class=""><br class="">

</div>

<div class="">Currently Biopython does not attempt to do anything about</div>

<div class="">limiting retmax on your behalf.  The suggested retmax limit of 500</div>

<div class="">is probably specific to that database and/or file format (or so I</div>

<div class="">would imagine - some records like uilists are tiny in comparison).</div>

<div class=""><br class="">

</div>

<div class="">Are you using the results as XML? It probably is possible to</div>

<div class="">merge the XML files, but it might be more hassle than it's worth.</div>

<div class=""><br class="">

</div>

<div class="">I would suggest a double loop ought to work fine - loop over</div>

<div class="">the collection of XML files, and then for each file loop over the</div>

<div class="">records returned from the parser.</div>

<div class=""><br class="">

</div>

<div class="">Regards,</div>

<div class=""><br class="">

</div>

<div class="">Peter<br class="">

<div class="gmail_extra"><br class="">

<div class="gmail_quote">On Wed, Dec 2, 2015 at 9:39 PM, <span dir="ltr" class="">

<<a href="mailto:c.buhtz@posteo.jp" target="_blank" class="">c.buhtz@posteo.jp</a>></span> wrote:<br class="">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I asked the Entrez support how I should treat the server's resources<br class="">

with "respect". :)<br class="">

<br class="">

The first answer gave no concrete numbers, but in the second one they<br class="">

told me that asking for 500 (retmax for eSearch) is a "reasonable" value<br class="">

because the eBot (a tool they offer on their website) uses it, too.<br class="">

<br class="">

Now I have nearly 13,000 PIDs whose article info I want to fetch via<br class="">

eFetch. It is a lot. ;)<br class="">

<br class="">

But I am not sure how to do that with Biopython. If I split that<br class="">

into batches of 500, I would get 26 different record objects back.<br class="">

I don't like that. I would prefer one big record object I can analyse.<br class="">

<br class="">

Do you see a way to merge these record objects? Or maybe there is<br class="">

another way to do that?<br class="">

Or does Biopython.Entrez already handle that problem internally (like the<br class="">

only-3-queries-per-second rule or the HTTP-POST decision)?<br class="">

<br class="">

Any suggestions?<br class="">

<span class="HOEnZb"><font color="#888888" class="">--<br class="">

GnuPGP-Key ID 0751A8EC<br class="">

_______________________________________________<br class="">

Biopython mailing list  -  <a href="mailto:Biopython@mailman.open-bio.org" class="">

Biopython@mailman.open-bio.org</a><br class="">

<a href="http://mailman.open-bio.org/mailman/listinfo/biopython" rel="noreferrer" target="_blank" class="">http://mailman.open-bio.org/mailman/listinfo/biopython</a><br class="">

</font></span></blockquote>

</div>

<br class="">

</div>

</div>

</div>

</div>

</blockquote>

</div>

<br class="">

<div apple-content-edited="true" class="">

<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<span class="Apple-style-span" style="border-collapse: separate; border-spacing: 0px;">

<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<span class="Apple-style-span" style="border-collapse: separate; orphans: 2; text-indent: 0px; widows: 2; border-spacing: 0px;">

<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<span class="Apple-style-span" style="border-collapse: separate; orphans: 2; text-indent: 0px; widows: 2; border-spacing: 0px;">

<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<span class="Apple-style-span" style="border-collapse: separate; orphans: 2; text-indent: 0px; widows: 2; border-spacing: 0px;">

<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

--<br class="">

Dr Leighton Pritchard<br class="">

Information and Computing Sciences Group; Weeds, Pests and Diseases Theme<br class="">

DG31, James Hutton Institute (Dundee)<br class="">

Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA<br class="">

e: <a href="mailto:leighton.pritchard@hutton.ac.uk" class="">leighton.pritchard@hutton.ac.uk</a>       w:

<a href="http://www.hutton.ac.uk/staff/leighton-pritchard" class="">http://www.hutton.ac.uk/staff/leighton-pritchard</a><br class="">

gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827<br class="">

<br class="">

</div>

</span></div>

</span></div>

</span></div>

</span></div>

</div>

</div>

<br class="">

<br /><br />

</body>

</html>