[Biopython] Searching for and downloading sequences using the history

Carlos Javier Borroto carlos.borroto at gmail.com
Fri Sep 18 17:15:41 UTC 2009


On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
> Hi all,
>
> I'm trying to download all of the ESTs from a species. I'm following the
> example in the tutorial, which seems to be exactly what I need, but I'm
> running into this problem:
>
>>>> from Bio import Entrez
>>>> Entrez.email = "carlos.borroto at gmail.com"
>>>> dbname = "nucest"
>>>> query_term = "Genus specie"
>>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y")
>>>> search_results = Entrez.read(search_handle)
>>>> search_handle.close()
>>>> len(search_results["IdList"])
> 20
>>>> print search_results["Count"]
> 193951
>
> So the assert statement is failing:
>>>> gi_list = search_results["IdList"]
>>>> count = int(search_results["Count"])
>>>> assert count == len(gi_list)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AssertionError
>

I just found this:
http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html

So I tested this:
>>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951)
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> print search_results["Count"]
193951
>>> len(search_results["IdList"])
100000

That is still not the complete list. Maybe there is a maximum number of
results you can get per request. I see there is a retstart parameter, so
I'm guessing the only way to get all of the IDs is to split my search
into batches and step through them with retstart.

Am I right? I'm going to implement this and share it here.
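
A rough sketch of that retstart batching, in case it helps anyone else.
This is untested against the live server. The helper names (page_offsets,
fetch_all_ids, entrez_search_page) are made up for illustration and are not
part of Biopython; only Entrez.esearch and Entrez.read are real calls. The
default batch size of 100000 is just the cap I observed above and is an
assumption, since the server may enforce a different limit.

```python
def page_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_all_ids(search_page, total, batch_size=100000):
    """Collect every ID by calling search_page(retstart, retmax) once per batch.

    `search_page` is any callable returning a list of IDs for one batch.
    batch_size=100000 is an assumption based on the cap seen in this thread.
    """
    ids = []
    for retstart, retmax in page_offsets(total, batch_size):
        ids.extend(search_page(retstart, retmax))
    return ids


def entrez_search_page(db, term):
    """Build a batch fetcher backed by Entrez.esearch (needs network access).

    Hypothetical helper: set Entrez.email before calling the returned function.
    """
    # Imported here so the paging logic above can be tried without Biopython.
    from Bio import Entrez

    def search_page(retstart, retmax):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=retstart, retmax=retmax)
        try:
            return Entrez.read(handle)["IdList"]
        finally:
            handle.close()

    return search_page
```

With Entrez.email set and the total taken from search_results["Count"], the
call would be something like
fetch_all_ids(entrez_search_page("nucest", query_term), total=193951).
Passing the fetcher in as a callable keeps the offset arithmetic separate
from the network call, so the batching can be checked offline first.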

regards,
-- 
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020



