[Biopython] error with entrez id code

Wed Oct 5 23:21:29 UTC 2011

Hi All

I've written a program to identify Entrez gene IDs from the output of a 
blastall run that I performed.  The code is as follows:

from Bio import SeqIO
from Bio import Entrez
import os
import os.path
import re
import csv

dirname1="/Users/dally/Desktop/BlastFiles/annotate_me/"
dirname2="/Users/dally/Desktop/BlastFiles/annotated/"

allfiles=os.listdir(dirname1)
fanddir=[os.path.join(dirname1,fname) for fname in allfiles]
OutFileName="Contig_annotation.csv"
c=csv.writer(open(os.path.join(dirname2,OutFileName),"wb"))

for f in fanddir:
     print f
     InFile=open(f,'rU')
     LineNumber=0
     for Line in InFile:
         print LineNumber#, ':', Line
         ElementList=Line.split('\t')
         geneid=ElementList[1]
         #print geneid
         Sections=geneid.split('|')
         NewID=Sections[3]

         from Bio import Entrez
         from Bio import SeqFeature
         Entrez.email = "dally at projects.sdsu.edu"
         # rettype="gb" requests GenBank format; retmode="xml" would request XML instead
         handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb")
         record=SeqIO.read(handle,"genbank")
         handle.close()
         #print record.id
         lineage=record.annotations["taxonomy"]

         c.writerow([ElementList[0],ElementList[1],ElementList[2],ElementList[3],
                     ElementList[4],ElementList[5],ElementList[6],ElementList[7],
                     ElementList[8],ElementList[9],ElementList[10],
                     NewID, record.id, record.description,
                     record.annotations["source"], lineage[0], lineage[1],lineage[2],
                     record.annotations["keywords"]])
         LineNumber=LineNumber+1

     InFile.close()  # close each input file once all of its lines are processed

The gene identifier looks like this: gi|2252639|gb|AC002292.1|AC002292.  
But I'm only interested in the fourth component (AC002292.1).  The script 
runs through a file with approximately 8000-10000 identifiers and, for 
each one, extracts information from the associated GenBank record.
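
To make the identifier handling concrete, here is a minimal standalone sketch 
of that step.  The helper name extract_accession is purely illustrative (it is 
not in my script), and it assumes the pipe-delimited gi|...|gb|... layout 
shown above:

```python
def extract_accession(blast_id):
    """Return the fourth pipe-delimited field of a BLAST subject id.

    Assumes the layout gi|<gi number>|<db>|<accession.version>|<locus>.
    """
    sections = blast_id.split('|')
    if len(sections) < 4:
        # Fail loudly rather than fetching with a wrong id
        raise ValueError("unexpected identifier layout: %r" % blast_id)
    return sections[3]


example_id = "gi|2252639|gb|AC002292.1|AC002292"
accession = extract_accession(example_id)
print(accession)
```

Checking the number of fields first means a malformed line is reported with 
its identifier instead of failing with a bare IndexError.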

The code seemed to run fine on my first file for the first 1287 lines, 
but then I got this error:

> Traceback (most recent call last):
>   File "Ally_EntrezID_Search_Final_Script.py", line 38, in <module>
>     record=SeqIO.read(handle,"genbank")
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 604, in read
>     first = iterator.next()
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 532, in parse
>     for r in i:
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 440, in parse_records
>     record = self.parse(handle, do_features)
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 423, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed
>     misc_lines, sequence_string = self.parse_footer()
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 921, in parse_footer
>     line = self.handle.readline()
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline
>     data = self._sock.recv(self._rbufsize)
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 533, in read
>     return self._read_chunked(amt)
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 586, in _read_chunked
>     value.append(self._safe_read(amt))
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 637, in _safe_read
>     raise IncompleteRead(''.join(s), amt)
> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected)

I'm new to Python and Biopython programming, so any advice would be 
greatly appreciated.

Thanks.

Dilara