[Biopython-dev] [Bug 1909] Format issue with GenBank with segmented BACs (eg GI:55276707)

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Tue Dec 20 07:32:41 EST 2005


http://bugzilla.open-bio.org/show_bug.cgi?id=1909


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID




------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2005-12-20 07:32 -------
A GenBank format entry for GI:55276707 can be downloaded from here:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=55276707

It's a 401 kb GenBank file, containing THREE separate GenBank records (three
segments), starting:

LOCUS       AY643842S1             12998 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum vulgare subsp. vulgare clone BAC 519K7 hardness locus
            region.
ACCESSION   AY643842
VERSION     AY643842.1  GI:55276708
KEYWORDS    .
SEGMENT     1 of 3
..

Using the old Martel GenBank parser (e.g. Biopython 1.41), the following works
perfectly:

print "Method 1 - Using for record in Iterator"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
for gb_record in GenBank.Iterator(input_file, GenBank.RecordParser()) :
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

Or:

print "Method 2 - Using Iterator.next()"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
gb_iterator = GenBank.Iterator(input_file, GenBank.RecordParser())
while True:
    gb_record = gb_iterator.next()
    if gb_record is None : break
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

This bit of code will reproduce the error reported:

print "Method 3 - No Iterator object, this fails"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
gb_record = GenBank.RecordParser().parse(input_file)
..

The error message says "unparsed text remains" beyond position 18263 because
the file actually contains two more records after that point.
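
A quick way to check how many records a file holds before choosing a parsing
approach is to count the LOCUS lines.  The following is only a rough sketch in
plain Python (no Biopython): the helper name count_genbank_records and the
tiny sample text are made up for illustration, and it assumes that every
record starts with a line beginning "LOCUS".

```python
def count_genbank_records(text):
    """Count lines that begin a GenBank record (lines starting with 'LOCUS')."""
    return sum(1 for line in text.splitlines() if line.startswith("LOCUS"))


# Tiny made-up stand-in for a multi-record file (not the real AY643842 data).
sample = (
    "LOCUS       DEMO1S1\n"
    "SEGMENT     1 of 2\n"
    "//\n"
    "LOCUS       DEMO1S2\n"
    "SEGMENT     2 of 2\n"
    "//\n"
)

record_count = count_genbank_records(sample)
if record_count > 1:
    print("More than one record found, so iterate over the file")
```

For a real file you would pass in the text read from disk instead of the
sample string.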

Your text editor may have a "goto character" command (TextPad does; it can be
tried from www.textpad.com, but it is not free).

The following snippet of code is another way to find out where a Martel parser
is failing from a position in a file, in this case 18263:

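print "Debug:"
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
raw_text = input_file.read()  # whole file as a single string
input_file.close()
# Show the 100 characters starting at the reported position
print raw_text[18263:18263+100] + "..."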

Debug:
LOCUS       AY643842S2            129099 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum ...

That is, the parser is complaining about the presence of a second record (from
its LOCUS line onwards) in the GenBank file.

Resolution
==========
If you can't be sure in advance that there is only one record, always use the
GenBank.Iterator object.
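
To illustrate the idea of taking one record at a time, here is a minimal
sketch in plain Python.  It is NOT the Biopython implementation: the generator
name iter_genbank_record_texts and the sample data are hypothetical, and it
relies only on the fact that each GenBank record ends with a line containing
just "//".  It yields the raw text of each record rather than a parsed object.

```python
import io


def iter_genbank_record_texts(lines):
    """Yield the raw text of each record, one record at a time.

    A record is taken to end at a line consisting only of '//'.
    """
    current = []
    for line in lines:
        current.append(line)
        if line.rstrip() == "//":
            yield "".join(current)
            current = []
    # Yield any trailing non-blank text that lacks a terminator,
    # so that nothing is silently dropped.
    leftover = "".join(current)
    if leftover.strip():
        yield leftover


# Made-up two-record input standing in for a real file handle.
sample = (
    "LOCUS       DEMO1S1\n"
    "SEGMENT     1 of 2\n"
    "//\n"
    "LOCUS       DEMO1S2\n"
    "SEGMENT     2 of 2\n"
    "//\n"
)

record_texts = list(iter_genbank_record_texts(io.StringIO(sample)))
for record_text in record_texts:
    # The first line of each chunk is that record's LOCUS line.
    print("Found record: " + record_text.splitlines()[0])
```

Each chunk could then be handed to a single-record parser, which is roughly
the division of labour between an iterator and a record parser.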

Note
====
With the current version of the GenBank parser (in CVS, not yet released),
method 3 above will work and give you just the first record.  It does not
warn you in any way that a second or third record is available.

P.S.
====
My testing and the original report were done on Windows.  If you run this on
Unix, the different line endings mean that the exact position of the second
record will change slightly.
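
The effect of line endings on the position can be seen with a small sketch in
plain Python.  The helper name record_start_offsets and the sample text are
invented for illustration, and the strings carry explicit line endings; how a
real file's position is counted also depends on how the file was opened and
read.

```python
import re


def record_start_offsets(text):
    """Return the character offset of every line that starts with 'LOCUS'."""
    return [m.start() for m in re.finditer(r"^LOCUS", text, flags=re.MULTILINE)]


# The same made-up two-record text with Unix (LF) and Windows (CRLF) endings.
unix_text = "LOCUS A\nSEGMENT 1 of 2\n//\nLOCUS B\n//\n"
windows_text = unix_text.replace("\n", "\r\n")

unix_offsets = record_start_offsets(unix_text)
windows_offsets = record_start_offsets(windows_text)

# Each line before the second record adds one extra character under CRLF,
# so the shift equals the number of lines that precede the second LOCUS.
shift = windows_offsets[1] - unix_offsets[1]
print("Second record starts %d characters later with CRLF" % shift)
```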





