[BioPython] Entrez.efetch

Wed Oct 8 11:33:51 UTC 2008

Hi,

I have been using Biopython for a week or so. The package is amazing; I wonder how I could have overlooked it for so long.
I am new not only to Biopython but also to this mailing list, so forgive me if this is not the right forum for a question like this.

Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
(I use Python 2.5 and the latest Biopython 1.48)
I want to run the following little test code, using efetch to get chromosome 4 of Drosophila melanogaster as a GenBank file:

---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO

print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"

handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"

handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------

The last line crashes; here is the output of the code:

---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in <module>
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------

It seems that downloading the file to disk corrupts the GenBank file, while downloading directly into Biopython's SeqIO.read() function works properly. I don't get it!
When I download this chromosome manually from the NCBI website, I do find a difference in one line, namely line 3 of the GenBank file. In the manually downloaded file, line 3 reads: "ACCESSION   NC_004353 REGION: 1..1351857", while the file produced by my code has only: "ACCESSION   NC_004353". So without that region information, the Biopython parser apparently runs into a premature end of file.
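
One thing that might be worth ruling out first: the test code never closes filehandle before reopening the file, so the end of the download could still be sitting in a write buffer when the file is read back. Below is a minimal sketch of that check, under assumptions: save_handle is a hypothetical helper, and io.StringIO with made-up GenBank-like text stands in for the handle returned by Entrez.efetch, so no network access is needed.

```python
import io
import os
import tempfile


def save_handle(handle, path):
    """Copy everything from a readable handle to path.

    The with-block closes (and therefore flushes) the output file
    before the function returns. save_handle is a hypothetical helper.
    """
    with open(path, "w") as out:
        out.write(handle.read())


# Stand-in for the handle returned by Entrez.efetch (assumption: made-up
# text that only imitates the overall shape of a GenBank record).
fake_genbank = (
    "LOCUS       NC_004353\n"
    "ORIGIN\n"
    + "        1 acgtacgtac\n" * 5000
    + "//\n"
)
handle = io.StringIO(fake_genbank)

path = os.path.join(tempfile.mkdtemp(), "NCBI_DroMel")
save_handle(handle, path)

# Reopen the file only after the writer has been closed.
with open(path) as saved:
    round_trip = saved.read()

print(round_trip == fake_genbank)    # is the saved copy complete?
print(round_trip.endswith("//\n"))   # is the record terminator present?
```

If the real download also round-trips completely once the output file is closed, the missing REGION text is probably unrelated to the crash.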

For now I use the cPickle module instead to save the whole SeqRecord instance. That works fine, so I don't need an immediate solution for the problem posted above, but I thought it might be of interest.

Any hints?

Regards, Stephan