[BioPython] Cannot parse ApE plasmid editor GenBank file

Mon Jun 25 14:31:49 UTC 2007

Hi Peter,
  I have re-tried the current CVS version of Biopython with a file regenerated
by the fixed version of the ApE editor. Unfortunately, I got:

$ python generate_image_from_genbank.py 
Traceback (most recent call last):
  File "generate_image_from_genbank.py", line 7, in ?
    genbank_entry = parser.parse(fhandle)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed
    self._feed_first_line(consumer, self.line)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 876, in _feed_first_line
    raise SyntaxError('Did not recognise the LOCUS line layout:\n' + line)
SyntaxError: Did not recognise the LOCUS line layout:
LOCUS       pBL-RLuc-GBB+3-III        5391 bp ds-DNA   circular     14-JUN-2007

What's wrong with the LOCUS line now? BioPerl from CVS can read it, and
I thought it was already following the current specs. ;-)
Thanks for your help,
Martin

Peter wrote:
> Martin MOKREJŠ wrote:
>> Hi Peter, Chris and others, here I am passing the answer from Wayne
>> back, sorry for the difficult cross-communication.
> 
> Thank you both, Martin & Wayne.
> 
> Wayne Davis wrote:
>> [the] locus line I'm using is the old standard (some older parsers
>> wanted it that way).
> 
> That's worth knowing - thank you.  Given that, maybe we (Biopython)
> should try to parse these files (which aside from the missing
> identifier in the LOCUS line should be fairly simple). On the other
> hand, I doubt many people still use this particular old format.
> 
> Wayne Davis wrote:
>>> I've updated to write the new standard, if your
>>> program isn't flexible enough to read the old style locus lines.
> 
> That's good news.  Martin - will this solve your problem, or do you 
> think we should also update Biopython to cope with these "old style"
> LOCUS lines (which also lack identifiers)?
> 
> Wayne Davis wrote:
>>> We encourage software developers to switch to a token-based LOCUS
>>> parsing approach, rather than a column-specific approach. If this
>>> is done, then future changes to the LOCUS line that affect only the
>>> spacing of its data values will not require any modifications to
>>> software.
> 
> Easier said than done, as some fields can also contain white space. 
> However, Howard Salis has some interesting code to tackle this attached 
> to Biopython bug 2294.
> 
> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
> 
> Just to correct myself, many of those fields are described as mandatory 
> single entries further up in the documentation - so using a dot/period 
> (as Wayne has done for the ApE plasmid editor) does seem the best solution.
> 
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2  Entry Organization
>> ...
>>   The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS      ... Mandatory keyword/exactly one record.
>> DEFINITION ... Mandatory keyword/one or more records.
>> ACCESSION  ... Mandatory keyword/one or more records.
>> VERSION    ... Mandatory keyword/exactly one record.
>> ...
> 
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated
> entries (so not mandatory in general). COMMENT is optional.
> 
> Peter
> 
> 
> 
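
To make the token-based idea quoted above concrete, here is a minimal
sketch that splits the LOCUS line from my traceback on whitespace instead
of reading fixed columns. It is only an illustration: parse_locus_tokens
is a hypothetical helper, not part of Biopython, and it assumes the length
is the token directly before a literal "bp" token. Joining everything
between the keyword and the length into the name is one possible way to
cope with the white-space problem Peter mentions; it is not what
Bio.GenBank does.

```python
# Hypothetical helper for illustration only; not part of Biopython.
LOCUS_LINE = ("LOCUS       pBL-RLuc-GBB+3-III        5391 bp ds-DNA"
              "   circular     14-JUN-2007")


def parse_locus_tokens(line):
    """Split a LOCUS line into tokens rather than fixed columns.

    Assumption: the sequence length is the token right before "bp".
    Everything between the LOCUS keyword and the length is joined into
    the name, so a name containing spaces would still be kept whole.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("Not a LOCUS line: %r" % line)
    if "bp" not in tokens:
        raise ValueError("No 'bp' token in LOCUS line: %r" % line)
    unit_index = tokens.index("bp")
    if unit_index < 2:
        raise ValueError("No length before 'bp' in LOCUS line: %r" % line)
    return {
        "name": " ".join(tokens[1:unit_index - 1]),
        "length": int(tokens[unit_index - 1]),
        # Remaining tokens (molecule type, topology, date, ...) are left
        # uninterpreted in this sketch.
        "rest": tokens[unit_index + 1:],
    }


if __name__ == "__main__":
    print(parse_locus_tokens(LOCUS_LINE))
```

The sketch says nothing about why the column-based scanner rejects the
line; it only shows that the fields can be separated without relying on
their column positions.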

-- 
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pGL3R.gb.gz
Type: application/x-tar
Size: 3117 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20070625/f6e08a5a/attachment-0002.tar>