[Biopython] Searching for and downloading sequences using the history
Peter
biopython at maubp.freeserve.co.uk
Fri Sep 18 18:51:32 UTC 2009
On Fri, Sep 18, 2009 at 6:15 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
> On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto wrote:
>> Hi all,
>>
>> I'm trying to download all of the ESTs from a species. I'm following the
>> example in the tutorial, which seems to be exactly what I need, but I'm
>> running into this problem:
>> ...
>
> I just found this:
> http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html
>
> So I tested this:
> >>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951)
> >>> search_results = Entrez.read(search_handle)
> >>> search_handle.close()
> >>> print search_results["Count"]
> 193951
> >>> len(search_results["IdList"])
> 100000
>
> Still not the complete list. Maybe there is a maximum number of results you
> can get. I see there is a retstart parameter, so I'm guessing the only way to
> get all of the IDs is to divide my search into batches using retstart.
OK, good - you found the retmax parameter. It looks like the NCBI
still limit ESearch results to 100000 IDs per request. I don't know if
EFetch (via the history) would also be limited to 100000 or not, but
this is still a pretty large amount of EST data to try to download
this way.
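
For illustration, the retstart idea from the quoted message could be
sketched roughly as below (Python 3 syntax). The helper names
batch_ranges, collect_ids and esearch_page are hypothetical, not part of
Biopython. Only Entrez.esearch and Entrez.read are real calls, and
whether the NCBI will accept every retstart offset for a result set this
large is an assumption that has not been checked here.

```python
def batch_ranges(total, batch_size):
    """Yield (retstart, retmax) pairs covering records 0..total-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(batch_size, total - start)


def collect_ids(search_page, total, batch_size=100000):
    """Call search_page(retstart, retmax) once per batch and join the IDs."""
    ids = []
    for retstart, retmax in batch_ranges(total, batch_size):
        ids.extend(search_page(retstart, retmax))
    return ids


def esearch_page(db, term, retstart, retmax):
    """One ESearch request via Biopython (hypothetical helper, needs network)."""
    from Bio import Entrez  # imported lazily; set Entrez.email before use
    handle = Entrez.esearch(db=db, term=term,
                            retstart=retstart, retmax=retmax)
    try:
        return list(Entrez.read(handle)["IdList"])
    finally:
        handle.close()
```

With the names from the quoted session, this might be called as
collect_ids(lambda s, m: esearch_page(dbname, query_term, s, m), 193951),
which would issue two requests: one for the first 100000 IDs and one for
the remaining 93951.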
I would first suggest you refine your Entrez search to use "species
name[orgn]" rather than just "species name" (i.e. explicitly search
on the organism rather than all fields). That may reduce the dataset
a bit. Even better, search using an NCBI taxonomy ID to be
absolutely explicit.
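
As a small sketch of the two query styles (Python 3 syntax), the
hypothetical helper below builds the search term. The function name is
made up for illustration, and the "txidNNNN[orgn]" form for taxonomy IDs
is my understanding of the Entrez query syntax rather than something
stated in this thread, so check it against a test search first.

```python
def organism_term(species_name=None, taxon_id=None):
    """Build an Entrez term restricted to the organism field."""
    if taxon_id is not None:
        # Assumed Entrez syntax for searching by NCBI taxonomy ID.
        return "txid%d[orgn]" % int(taxon_id)
    if species_name:
        # Restrict the name to the organism field instead of all fields.
        return "%s[orgn]" % species_name.strip()
    raise ValueError("give either species_name or taxon_id")
```

For example, organism_term(species_name="Homo sapiens") gives
"Homo sapiens[orgn]" and organism_term(taxon_id=9606) gives
"txid9606[orgn]". The result would be passed as the term argument of
Entrez.esearch.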
Secondly, this seems like an awfully large amount of data to
try to download via Entrez. Email the NCBI to ask if this is
OK (and if so what batch size you should use for EFetch calls),
or if they have an alternative suggestion (e.g. some FTP site).
Peter
P.S. You could try wrapping each EFetch call in a
try/except in order to retry any individual retrieval which fails.
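
A minimal sketch of that try/except idea (Python 3 syntax) might look
like this. The name fetch_with_retry and its retries, delay and sleep
parameters are hypothetical. The fetch argument is assumed to be a
zero-argument function that performs one EFetch request and returns its
data.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch() up to `retries` times, pausing `delay` seconds between tries."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as err:
            # Broad on purpose for the sketch; narrowing this to the
            # network errors you actually see would be safer.
            last_error = err
            if attempt < retries:
                sleep(delay)
    # Every attempt failed, so re-raise the most recent error.
    raise last_error
```

Each batch would then be wrapped separately, so that one failed request
only repeats that batch rather than the whole download.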