[Biopython] Skipping over blank/erroneous Entrez.esummary()

Brad Chapman chapmanb at 50mail.com
Wed Oct 7 20:29:11 UTC 2009


Hi Austin;
That is strange. That change may have unintended consequences
downstream. Could you send along a GI number that is causing
problems? If you revert that change and run the code printing out GI
numbers at each step, let me know the specific ones that are leading
to the initial error.

Once we have something reproducible to work with, we should be able
to track it down and provide a fix.
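
One way to narrow a batch of 20 down to the offending GI numbers is to split it repeatedly and re-test each half. The following is only a minimal sketch in current Python 3, not part of Biopython: find_failing_ids and parse_batch are hypothetical names, and parse_batch is assumed to wrap whatever fetch-and-parse call your script already makes (for example Entrez.esummary followed by Entrez.read). The fake_parse function is a stand-in so the example runs without network access.

```python
def find_failing_ids(ids, parse_batch):
    """Return the IDs whose records make parse_batch raise ValueError.

    parse_batch is a hypothetical callable that takes a list of IDs,
    fetches their summaries and parses them.
    """
    if not ids:
        return []
    try:
        parse_batch(ids)
        return []  # the whole batch parsed cleanly
    except ValueError:
        if len(ids) == 1:
            print("Failing GI:", ids[0])
            return list(ids)
        middle = len(ids) // 2
        # Split the batch and test each half separately.
        return (find_failing_ids(ids[:middle], parse_batch)
                + find_failing_ids(ids[middle:], parse_batch))


def fake_parse(ids):
    # Stand-in for the real request: pretends GI "333" has an empty TaxId.
    for gi in ids:
        if gi == "333":
            int("")  # raises the same kind of ValueError as the traceback


bad = find_failing_ids(["111", "222", "333", "444"], fake_parse)
```

Each failing batch costs extra requests, so with the real service you would want to keep the usual delay between calls.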

Thanks,
Brad

> I'm confused now.  In the latest version
> 
> http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
> 
> Missing values are empty strings so if I did something like
> 
> record = Entrez.read(handle)
> 
> for item in record:
>     myList.append(item['TaxId'])
> 
> myList should be something like :
> [ '1234', '2434', '', '9970' ]
> where myList[2] is the result of a missing value
> 
> However, when I run my script, I find no blank entries, even though I
> know some records should have missing values. That throws things off
> later when I zip the tax IDs with their corresponding accession
> numbers:
> 
> zip (accessions, taxids)
> 
> I'm all for using '1' (root) or '-1' for missing values.
> 
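On the zip problem: a sketch of one way to keep the two lists aligned and to notice early when they are not. This assumes each parsed summary behaves like a dict and that a missing TaxId shows up either as an empty string or as an absent key. pair_taxids and MISSING are hypothetical names chosen for illustration, and plain dicts stand in for the parsed records.

```python
MISSING = None  # assumed marker for "no TaxId"; pick whatever suits your table


def pair_taxids(accessions, records):
    """Pair each accession with its TaxId, keeping positions aligned."""
    if len(accessions) != len(records):
        # A silent length mismatch is what makes zip() misalign the table.
        raise ValueError("expected one record per accession")
    pairs = []
    for accession, record in zip(accessions, records):
        taxid = record.get("TaxId", "")
        pairs.append((accession, taxid if taxid != "" else MISSING))
    return pairs


# Plain dicts used as stand-ins for parsed summary records.
example = pair_taxids(
    ["ACC1", "ACC2", "ACC3"],
    [{"TaxId": "1234"}, {"TaxId": ""}, {}],
)
```

Checking the lengths first turns a quietly shifted table into an immediate error, which is usually easier to debug.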
> 
> 2009/10/7  <biopython-request at lists.open-bio.org>:
> > Today's Topics:
> >
> >   1. Skipping over blank/erroneous Entrez.esummary() results
> >      (Austin Davis-Richardson)
> >   2. Re: Skipping over blank/erroneous Entrez.esummary() results
> >      (Michiel de Hoon)
> >   3. Re: Combine nexus files but not concatenating them (Peter)
> >   4. Re: Skipping over blank/erroneous Entrez.esummary() results
> >      (Peter)
> >   5. Re: Skipping over blank/erroneous Entrez.esummary() results
> >      (Brad Chapman)
> >   6. Re: Skipping over blank/erroneous Entrez.esummary() results
> >      (Michiel de Hoon)
> >   7. Re: Skipping over blank/erroneous Entrez.esummary() results
> >      (Brad Chapman)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 6 Oct 2009 17:07:52 -0400
> > From: Austin Davis-Richardson <harekrishna at gmail.com>
> > Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
> >        results
> > To: biopython at lists.open-bio.org
> > Message-ID:
> >        <d8e68faf0910061407v90f050dw1c16f2f5f97aa697 at mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > Howdy,
> >
> > I'm using BioPython to generate a table of accession numbers and their
> > corresponding TaxIDs.  The fastest way I can do this is 20 at a time
> > (20 per 3 seconds rather than 1 per 3 seconds).
> >
> > However, this results in a problem.
> >
> > whenever my script receives a result from NCBI that is blank such as
> > there being no value for TaxID, BioPython crashes with the error:
> >
> >  File "taxcollector3.py", line 39, in getTaxID
> >    record = Entrez.read(handle)
> >  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> > line 259, in read
> >    record = handler.run(handle)
> >  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> > line 90, in run
> >    self.parser.ParseFile(handle)
> >  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> > line 191, in endElement
> >    value = IntegerElement(value)
> > ValueError: invalid literal for int() with base 10: ''
> >
> >
> > My code looks like this, where gids is a string of comma-separated GIDs
> > (I get the GIDs from the accession numbers using
> > Entrez.esearch(db="nucleotide", rettype="text", term=accessions)):
> >
> >                        handle = Entrez.esummary(db="nucleotide", id=gids)
> >                        record = Entrez.read(handle)
> >
> >
> > The only solution I can come up with is searching one at a time, but
> > this is very slow.  (I have about 300,000 accession numbers)
> >
> > Does anyone know perhaps a patch or a solution for this?  Or maybe an
> > easier way to get a TaxID from an accession number?
> >
> > Thanks,
> > Austin Davis-Richardson
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT)
> > From: Michiel de Hoon <mjldehoon at yahoo.com>
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary()       results
> > To: biopython at lists.open-bio.org,       Austin Davis-Richardson
> >        <harekrishna at gmail.com>
> > Message-ID: <362834.37683.qm at web62401.mail.re1.yahoo.com>
> > Content-Type: text/plain; charset=iso-8859-1
> >
> > You could try the following (with biopython 1.52):
> >
> > handle = Entrez.esummary(db="nucleotide", id=gids)
> > records = Entrez.parse(handle)
> > while True:
> >    try:
> >        record = records.next()
> >    except StopIteration:
> >        break
> >    except:
> >        print "Skipping record"
> >
> >
> > We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
> >
> >
> > --Michiel.
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 7 Oct 2009 10:29:36 +0100
> > From: Peter <biopython at maubp.freeserve.co.uk>
> > Subject: Re: [Biopython] Combine nexus files but not concatenating
> >        them
> > To: Denzel Li <denzel.dz.li at gmail.com>
> > Cc: Biopython Mailing List <biopython at lists.open-bio.org>
> > Message-ID:
> >        <320fb6e00910070229n1b78542dj82998de13cf7eed7 at mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li <denzel.dz.li at gmail.com> wrote:
> >> Hi Peter:
> >> Thank you for the help. Both functions work well. By the way, will
> >> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
> >>
> >> Best,
> >> Denzel
> >
> > Hi Denzel,
> >
> > I CC'd the list - please try to keep replies sent there.
> >
> > I'm glad Bio.Nexus is working well for you.
> >
> > Regarding the finer details of the NEXUS file format and the Biopython
> > code, I am not an expert - we need Frank or Cymon to comment. If
> > you could give us a couple of examples of what you are asking for it
> > would probably be much clearer (to me at least).
> >
> > Regards,
> >
> > Peter
> >
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Wed, 7 Oct 2009 12:17:30 +0100
> > From: Peter <biopython at maubp.freeserve.co.uk>
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary()       results
> > To: Michiel de Hoon <mjldehoon at yahoo.com>
> > Cc: biopython at lists.open-bio.org,       Austin Davis-Richardson
> >        <harekrishna at gmail.com>
> > Message-ID:
> >        <320fb6e00910070417w26236a62ifece2e2610256609 at mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> >>
> >> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
> >>
> >
> > Does "correctly" mean a default value? I see Brad has just committed a change to
> > use -1 in this case, but perhaps None is also a good choice? Can we
> > alternatively leave this bit of the data structure empty?
> >
> > Peter
> >
> >
> > ------------------------------
> >
> > Message: 5
> > Date: Wed, 7 Oct 2009 07:17:37 -0400
> > From: Brad Chapman <chapmanb at 50mail.com>
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary()       results
> > To: Austin Davis-Richardson <harekrishna at gmail.com>
> > Cc: biopython at lists.open-bio.org
> > Message-ID: <20091007111737.GC84267 at sobchak.mgh.harvard.edu>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi Austin;
> >
> >> I'm using BioPython to generate a table of accession numbers and their
> >> corresponding TaxIDs.  The fastest way I can do this is 20 at a time
> >> (20 per 3 seconds rather than 1 per 3 seconds).
> >>
> >> However, this results in a problem.
> >>
> >> whenever my script receives a result from NCBI that is blank such as
> >> there being no value for TaxID, BioPython crashes with the error:
> >>
> >>   File "taxcollector3.py", line 39, in getTaxID
> >>     record = Entrez.read(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> >> line 259, in read
> >>     record = handler.run(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> >> line 90, in run
> >>     self.parser.ParseFile(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> >> line 191, in endElement
> >>     value = IntegerElement(value)
> >> ValueError: invalid literal for int() with base 10: ''
> >
> > In addition to Michiel's workaround, I checked in a small change
> > which could at least circumvent the error you are reporting:
> >
> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > It affects only one file, so if you don't want to pull the latest
> > from GitHub, you can download just that file and replace it in your
> > Biopython library:
> >
> > http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py
> >
> > Ideally, we should have a test case to cover this. Could you let us
> > know specific GIs that are causing the problem? The group of 20 is
> > fine if you haven't narrowed it further than that. This'll also help
> > us check if there are any other problems with these records.
> >
> > Thanks for reporting this,
> > Brad
> >
> >
> > ------------------------------
> >
> > Message: 6
> > Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT)
> > From: Michiel de Hoon <mjldehoon at yahoo.com>
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary()       results
> > To: Austin Davis-Richardson <harekrishna at gmail.com>,    Brad Chapman
> >        <chapmanb at 50mail.com>
> > Cc: biopython at lists.open-bio.org
> > Message-ID: <826538.32828.qm at web62406.mail.re1.yahoo.com>
> > Content-Type: text/plain; charset=iso-8859-1
> >
> >> In addition to Michiel's workaround, I checked in a small
> >> change
> >> which could at least circumvent the error you are
> >> reporting:
> >>
> >> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > Sorry, but that change introduces two bugs. First, we should be able to distinguish between -1 and missing values. More importantly, we want to be able to add attributes to value. Since -1 is an integer instead of an object, it won't allow that.
> >
> > Can you revert this change?
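
The attribute point can be illustrated with a short sketch. The IntegerElement class below is a hypothetical stand-in written for this example, not the Bio.Entrez source; it only assumes that the parser's integer type is a subclass of int, which is what lets it carry extra attributes where a plain -1 cannot.

```python
class IntegerElement(int):
    """Stand-in for an int subclass that can carry extra attributes."""


value = IntegerElement("42")
value.attributes = {"Name": "TaxId"}  # works: subclass instances have a __dict__

try:
    plain = -1
    plain.attributes = {}  # plain ints cannot take new attributes
    plain_accepts_attributes = True
except AttributeError as error:
    plain_accepts_attributes = False
    print("plain int:", error)
```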
> >
> > --Michiel
> >
> > ------------------------------
> >
> > Message: 7
> > Date: Wed, 7 Oct 2009 08:32:27 -0400
> > From: Brad Chapman <chapmanb at 50mail.com>
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary()       results
> > To: Michiel de Hoon <mjldehoon at yahoo.com>
> > Cc: Austin Davis-Richardson <harekrishna at gmail.com>,
> >        biopython at lists.open-bio.org
> > Message-ID: <20091007123227.GD84267 at sobchak.mgh.harvard.edu>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Peter and Michiel;
> >
> >> > In addition to Michiel's workaround, I checked in a small
> >> > change which could at least circumvent the error you are
> >> > reporting:
> >> >
> >> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > Peter:
> >> Does "correctly" mean a default value? I see Brad has just committed a change to
> >> use -1 in this case, but perhaps None is also a good choice? Can we
> >> alternatively leave this bit of the data structure empty?
> >
> > Michiel:
> >> Sorry, but that change introduces two bugs. First, we should be able
> >> to distinguish between -1 and missing values. More importantly, we
> >> want to be able to add attributes to value. Since -1 is an integer
> >> instead of an object, it won't allow that.
> >>
> >> Can you revert this change?
> >
> > Thanks guys -- not the best choice. How do you feel about just passing
> > it along as an empty string and only doing the integer conversion if we
> > actually have data to convert?
> >
> > http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
> >
> > So now missing values are empty strings, as passed, instead of any
> > sort of integer interpretation of them.
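
The idea of converting only when there is data can be sketched as follows. This is an illustration of the approach described above, not the code from the commit; convert_integer is a hypothetical name.

```python
def convert_integer(text):
    """Convert text to int, passing an empty string through unchanged."""
    if text == "":
        return text  # missing value: keep it as an empty string
    return int(text)


taxid = convert_integer("9606")
missing = convert_integer("")
```

The trade-off is that callers now receive either an int or an empty string, so downstream code has to check for the empty string before doing arithmetic or comparisons.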
> >
> > Brad
> >
> >
> > ------------------------------
> >
> > _______________________________________________
> > Biopython mailing list  -  Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >
> > End of Biopython Digest, Vol 82, Issue 3
> > ****************************************
> >
> 
> 
> 
> -- 
> AGDR
> 


