[BioPython] Entrez.efetch
Stephan
stephan80 at mac.com
Wed Oct 8 11:33:51 UTC 2008
Hi,
I have been using Biopython for a week or so. The package is amazing; I wonder how I could have overlooked it for so long.
I am new not only to Biopython but also to this mailing list, so forgive me if this is not the right forum for a question like this.
Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
(I use Python 2.5 and the latest Biopython, 1.48.)
I want to run the following little test code, using efetch to get chromosome 4 of Drosophila melanogaster as a GenBank file:
---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO
print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"
handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------
The last line crashes; here is the output of the code:
---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in <module>
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------
It seems that downloading the file to disk corrupts the GenBank file, while passing the download directly to Biopython's SeqIO.read() function works properly. I don't get it!
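One thing I have not ruled out: the test code never closes filehandle before the file is reopened for reading, so the tail of the download may still be sitting in the write buffer when the parser starts. Below is a minimal sketch of writing and closing before re-reading. The helper name save_handle_to_file is hypothetical, an in-memory io.StringIO stands in for the handle returned by Entrez.efetch so that the sketch runs without network access, and it is written for a current Python rather than 2.5.

```python
import io
import os
import tempfile


def save_handle_to_file(handle, path):
    """Copy everything from a readable handle into path, then close the file."""
    out = open(path, "w")
    try:
        out.write(handle.read())
    finally:
        out.close()  # closing flushes any buffered data to disk


# Stand-in for the Entrez.efetch handle (placeholder text, not real GenBank data).
fake_text = u"LOCUS PLACEHOLDER\n" + u"acgt" * 5000 + u"\n//\n"
fake_handle = io.StringIO(fake_text)

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "NCBI_DroMel")
save_handle_to_file(fake_handle, path)

# Re-open only after the writer has been closed.
check = open(path)
try:
    saved_text = check.read()
finally:
    check.close()

print(len(saved_text) == len(fake_text))

# Tidy up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

If the unclosed file is indeed the cause, adding filehandle.close() after the write in the test code above should be enough.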
When I download this chromosome manually from the NCBI website, I do find a difference in one line, namely line 3 of the GenBank file. In the manually downloaded file, line 3 reads "ACCESSION NC_004353 REGION: 1..1351857", while the file produced by my code has only "ACCESSION NC_004353". I suspect that, without that region information, the Biopython parser runs into a premature end of file.
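For anyone who wants to repeat the comparison, here is a small sketch of how the differing lines can be located. The function differing_lines is a hypothetical helper, and lines 1 and 2 below are placeholders rather than real file content; only the two ACCESSION lines are taken from my files. It compares line by line and ignores any extra lines in the longer input.

```python
def differing_lines(lines_a, lines_b):
    """Return (line_number, a, b) for each position where the two inputs differ."""
    diffs = []
    for number, (a, b) in enumerate(zip(lines_a, lines_b), 1):
        if a != b:
            diffs.append((number, a, b))
    return diffs


# Placeholder content except for line 3, which is quoted from the two files.
manual = [
    "line 1 (placeholder)",
    "line 2 (placeholder)",
    "ACCESSION NC_004353 REGION: 1..1351857",
]
scripted = [
    "line 1 (placeholder)",
    "line 2 (placeholder)",
    "ACCESSION NC_004353",
]

for number, a, b in differing_lines(manual, scripted):
    print("line %d: %r != %r" % (number, a, b))
```

With real files, the two lists could come from calling readlines() on each file after it has been fully written and closed.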
For now I use the cPickle module instead to save the whole SeqRecord instance. That works fine, so I don't need an immediate solution to the problem above, but I thought it might be of interest.
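In case it is useful, this is roughly what the pickle workaround looks like. The helper names save_object and load_object are hypothetical, and a plain dictionary with placeholder sequence data stands in for the SeqRecord so that the sketch runs without Biopython installed. The import falls back to pickle where cPickle is not available.

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

import os
import tempfile


def save_object(obj, path):
    """Pickle obj to path in binary mode and close the file."""
    out = open(path, "wb")
    try:
        pickle.dump(obj, out, pickle.HIGHEST_PROTOCOL)
    finally:
        out.close()


def load_object(path):
    """Load and return the pickled object stored at path."""
    src = open(path, "rb")
    try:
        return pickle.load(src)
    finally:
        src.close()


# Stand-in for a SeqRecord (placeholder sequence, not real data).
record_like = {"id": "NC_004353", "seq": "ACGT" * 10}

path = os.path.join(tempfile.mkdtemp(), "record.pkl")
save_object(record_like, path)
restored = load_object(path)
print(restored == record_like)
```

One caveat I am aware of: a pickle stores a reference to the class rather than the class itself, so a pickled SeqRecord may not load cleanly under a different Biopython version.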
Any hints?
Regards, Stephan