[Biopython-dev] Notification: incoming/30

Fri May 4 12:58:57 EDT 2001

JitterBug notification

chapmanb moved PR#30 from incoming to fixed-bugs
Message summary for PR#30
	From: dimlight at lgci.co.kr
	Subject: PRIVATE: About Genbank Iterator
	Date: Thu, 3 May 2001 20:18:39 -0400
	0 replies 	0 followups
	Notes: Problem was GenBank record NM_006141.1, which was lacking a REFERENCE section.
Fixed the parser to be able to handle this case, fixes in CVS.
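
The condition described in the notes can be checked without running the parser at all. The following is a minimal sketch, not the fix that went into CVS: it scans a GenBank flat file line by line and collects the LOCUS name of every record that has no REFERENCE line. The function name `records_without_reference` and the sample text are illustrative assumptions, and the code targets a current Python rather than the Python 2.0 used in the report.

```python
import io

def records_without_reference(handle):
    """Return LOCUS names of records that contain no REFERENCE line.

    Assumes the usual flat-file layout: each record starts with a
    LOCUS line and ends with a line containing only "//".
    """
    missing = []
    locus = None
    has_reference = False
    for line in handle:
        if line.startswith("LOCUS"):
            parts = line.split()
            locus = parts[1] if len(parts) > 1 else "(unnamed)"
            has_reference = False
        elif line.startswith("REFERENCE"):
            has_reference = True
        elif line.rstrip() == "//":
            # End of record: remember it if no REFERENCE line was seen.
            if locus is not None and not has_reference:
                missing.append(locus)
            locus = None
    return missing

# Tiny made-up sample; real records carry many more fields.
sample = io.StringIO(
    "LOCUS       REC_A   10 bp\n"
    "REFERENCE   1  (bases 1 to 10)\n"
    "ORIGIN\n"
    "//\n"
    "LOCUS       REC_B   10 bp\n"
    "ORIGIN\n"
    "//\n"
)
print(records_without_reference(sample))
```

Run against the real file, the handle would be `open("hs.gbff")`; the scan reads each line once and keeps only the names in memory.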

====> ORIGINAL MESSAGE FOLLOWS <====

From dimlight at lgci.co.kr Thu May  3 20:18:39 2001
Received: from localhost (localhost [127.0.0.1])
	by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f440Ic208720
	for <biopython-bugs at pw600a.bioperl.org>; Thu, 3 May 2001 20:18:39 -0400
Date: Thu, 3 May 2001 20:18:39 -0400
Message-Id: <200105040018.f440Ic208720 at pw600a.bioperl.org>
From: dimlight at lgci.co.kr
To: biopython-bugs at bioperl.org
Subject: PRIVATE: About Genbank Iterator

Full_Name: Wankyu Kim
Module: GenBank,SeqFeature
Version: biopython-1.00a1
OS: win98
Submission from: cache14.bora.net (210.120.192.31)

I tried parsing a GenBank-formatted file and printing every element to the screen.
I downloaded the RefSeq flat file in GenBank format from the following site.

ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.gbff.gz

After unzipping the hs.gbff.gz file, I tried parsing every element of each
RefSeq record.
It seemed to work very well, and I could see the parsed elements scrolling by
on and on...
but on parsing the 5287th record, I got the following error message.

Traceback (innermost last):
  File "C:\Python20\genbank_element.py", line 11, in ?
    cur_record = gb_iterator.next()
  File "c:\python20\Bio\GenBank\__init__.py", line 156, in next
    return self._parser.parse(File.StringHandle(data))
  File "c:\python20\Bio\GenBank\__init__.py", line 233, in parse
    self._scanner.feed(handle, self._consumer)
  File "c:\python20\Bio\GenBank\__init__.py", line 1004, in feed
    self._parser.parseFile(handle)
  File "c:\python20\Martel\Parser.py", line 206, in parseFile
    self.parseString(fileobj.read())
  File "c:\python20\Martel\Parser.py", line 234, in parseString
    self._err_handler.fatalError(result)
  File "c:\python20\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
ParserPositionException: error parsing at or beyond character 446
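
The traceback gives only a character offset. Judging from the `File.StringHandle(data)` call in the trace, that offset appears to be relative to the text of the single record handed to the parser, not to the whole file. A small helper like the one below shows the text around such an offset so the offending line can be located. This is a hypothetical sketch that is not part of Biopython or Martel; the names `context_at` and `line_number_at` and the `width` parameter are assumptions made for illustration.

```python
def context_at(text, offset, width=40):
    """Return the slice of text surrounding a character offset.

    The offset is clamped to the text, so an out-of-range value still
    returns the nearest available context.
    """
    offset = max(0, min(offset, len(text)))
    start = max(0, offset - width)
    end = min(len(text), offset + width)
    return text[start:end]

def line_number_at(text, offset):
    """Return the 1-based number of the line that contains the offset."""
    offset = max(0, min(offset, len(text)))
    return text.count("\n", 0, offset) + 1

# Made-up record text, only to show the calls.
record_text = "LOCUS       X\nDEFINITION  example\nFEATURES\n"
print(line_number_at(record_text, 20))
print(repr(context_at(record_text, 20, width=8)))
```

With the text of the failing record in hand, `context_at(record_text, 446)` would show what the parser was reading when it stopped.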

I had similar errors on RedHat 6.2 too.
Please cut & paste my code and test it. It will take hours to test.
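
Because the failure shows up only at the 5287th record, a full run is slow to repeat. One way to shorten the cycle is to copy the single failing record into its own small file and feed only that to the parser. The sketch below does this under the assumption that every record ends with a line containing only `//`. The helper name `nth_record` is hypothetical, and the code targets a current Python.

```python
import io

def nth_record(handle, n):
    """Return the text of the n-th record (1-based), or None if absent.

    Records are assumed to be terminated by a line containing only "//".
    Lines are kept in memory only for the record being extracted.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    count = 1
    lines = []
    for line in handle:
        if count == n:
            lines.append(line)
        if line.rstrip() == "//":
            if count == n:
                return "".join(lines)
            count += 1
    # Fewer than n complete records were found.
    return None

# Made-up two-record sample, only to show the call.
sample = io.StringIO("LOCUS A\n//\nLOCUS B\nORIGIN\n//\n")
print(nth_record(sample, 2))
```

For the file in this report the call would be `nth_record(open("hs.gbff"), 5287)`, with the returned text written to a small file for testing.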

< Code >

from Bio import GenBank
from Bio import SeqFeature

gb_file = "hs.gbff"

gb_handle = open(gb_file, 'r')
feature_parser = GenBank.FeatureParser()
gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
k = 0
while 1:
  cur_record = gb_iterator.next()
  # The iterator returns None at the end of the file; stop before counting.
  if cur_record is None:
    break

  k = k + 1
  print
  print "record no", k
  print

  print "cur_record.seq:", cur_record.seq.tostring()
  print
  print "cur_record.id",cur_record.id
  print
  print "cur_record.name", cur_record.name
  print
  print "cur_record.description", cur_record.description
  print

  print "cur_record.annotations"
  print "gi : ", cur_record.annotations['gi']
  print "organism : ", cur_record.annotations['organism']
  print "taxonomy : ", cur_record.annotations['taxonomy'][:]
  print "keywords : ", cur_record.annotations['keywords']
  print "data_file_division : ", cur_record.annotations['data_file_division']
  print "date : ", cur_record.annotations['date']

  # A record may have no REFERENCE section, so fall back to an empty list.
  for reference in cur_record.annotations.get('references', []):
    print reference.journal
    print reference.title
    print reference.authors
    print reference.medline_id
    print reference.pubmed_id
    print reference.comment

  print len(cur_record.features)

  # Print the type, location, and qualifiers of each feature.
  for feature in cur_record.features:
    print "type:", '\t\t', feature.type
    print "location:", '\t', feature.location
    for key in feature.qualifiers.keys():
      print key, '\t', feature.qualifiers[key]

print
print
print "Congratulations!!! You've gone through the RefSeq file"
print
print