[BioPython] Cannot parse ApE plasmid editor GenBank file

Thu Jun 7 14:44:17 UTC 2007

Hi Peter,

Peter wrote:
> Martin MOKREJŠ wrote:
>> Hi Peter, Chris and others, here I am passing the answer from Wayne
>> back, sorry for the difficult cross-communication.
> 
> Thank you both, Martin & Wayne.
> 
> Wayne Davis wrote:
>> [the] locus line I'm using is the old standard (some older parsers
>> wanted it that way).
> 
> That's worth knowing - thank you.  Given that, maybe we (Biopython)
> should try to parse these files (which, aside from the missing
> identifier in the LOCUS line, should be fairly simple). On the other
> hand, I doubt many people still use this particular old format.
> 
> Wayne Davis wrote:
>>> I've updated to write the new standard, if your
>>> program isn't flexible enough to read the old style locus lines.
> 
> That's good news.  Martin - will this solve your problem, or do you 
> think we should also update Biopython to cope with these "old style"
> LOCUS lines (which also lack identifiers)?

I think that if it was ever a valid format, Biopython should cope with it.

> 
> Wayne Davis wrote:
>>> We encourage software developers to switch to a token-based LOCUS
>>> parsing approach, rather than a column-specific approach. If this
>>> is done, then future changes to the LOCUS line that affect only the
>>> spacing of its data values will not require any modifications to
>>> software.
> 
> Easier said than done, as some fields can also contain white space. 
> However, Howard Salis has some interesting code to tackle this attached 
> to Biopython bug 2294.

Please also follow bug #2305 in bioperl on this, and see what the
competition has done in this regard. ;)
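
To make the token-based idea concrete, here is a rough sketch of how such a
LOCUS parser could look. This is only an illustration, not Biopython's (or
bioperl's) actual code: the function name parse_locus_tokens, the returned
dictionary keys and the example lines are all my own assumptions. It anchors
on the "bp"/"aa" unit token, so a missing (or space-containing) identifier in
front of the length does not break it.

```python
import re

# Hypothetical helper names; nothing here is part of Biopython's API.
DATE_RE = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4}$")
TOPOLOGIES = {"linear", "circular"}


def parse_locus_tokens(line):
    """Split a LOCUS line into fields by token rather than by column."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    tokens = tokens[1:]

    # Anchor on the unit token: the sequence length sits just before it.
    for i, tok in enumerate(tokens):
        if tok in ("bp", "aa") and i > 0 and tokens[i - 1].isdigit():
            break
    else:
        raise ValueError("no sequence length found")

    record = {
        # Everything before the length is the name; it may be absent.
        "name": " ".join(tokens[: i - 1]) or None,
        "length": int(tokens[i - 1]),
        "unit": tok,
        "molecule": None,
        "topology": None,
        "division": None,
        "date": None,
    }

    # Classify the remaining tokens by what they look like, not where they are.
    for tok in tokens[i + 1:]:
        if DATE_RE.match(tok):
            record["date"] = tok
        elif tok.lower() in TOPOLOGIES:
            record["topology"] = tok.lower()
        elif record["molecule"] is None:
            record["molecule"] = tok
        else:
            record["division"] = tok
    return record


# Illustrative lines only: one with an identifier, one without.
with_name = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
without_name = "LOCUS                    4000 bp    DNA   circular     07-JUN-2007"
print(parse_locus_tokens(with_name))
print(parse_locus_tokens(without_name))
```

The obvious weak spot is the one Peter mentions: if the molecule type is
missing but a division is present, this sketch will mislabel the division as
the molecule type, so a real parser would need stricter checks per field.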

> 
> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
> 
> Just to correct myself, many of those fields are described as mandatory 
> single entries further up in the documentation - so using a dot/period 
> (as Wayne has done for the ApE plasmid editor) does seem the best solution.

OK, bioperl can now survive the missing dots; I think biopython should
do the same. If one can fix the problem by having the parser supply a
default value internally, why not do it?
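
Something along these lines is what I have in mind, as a rough sketch only.
It assumes the parser has already collected the header into a plain dict
keyed by keyword; that dict, the function name fill_missing_header_fields and
the exact list of fields are my assumptions, not how Biopython is actually
organised.

```python
# Hypothetical sketch: supply the NCBI-style "." placeholder for header
# fields that are absent or empty, instead of failing on them.
PLACEHOLDER = "."

# Assumed list of header keywords to default; adjust to taste.
HEADER_FIELDS = (
    "DEFINITION",
    "ACCESSION",
    "VERSION",
    "KEYWORDS",
    "SOURCE",
    "ORGANISM",
)


def fill_missing_header_fields(fields):
    """Return a copy of `fields` with "." for missing or blank entries."""
    filled = dict(fields)  # do not modify the caller's dict
    for key in HEADER_FIELDS:
        value = filled.get(key)
        if value is None or not value.strip():
            filled[key] = PLACEHOLDER
    return filled


# Made-up input: DEFINITION is present, ACCESSION is blank, the rest absent.
header = {"DEFINITION": "cloning vector", "ACCESSION": "   "}
print(fill_missing_header_fields(header))
```

Fields that already have a value are left untouched, so files which follow
the spec are not affected.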

> 
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2  Entry Organization
>> ...
>>   The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS      ... Mandatory keyword/exactly one record.
>> DEFINITION ... Mandatory keyword/one or more records.
>> ACCESSION  ... Mandatory keyword/one or more records.
>> VERSION    ... Mandatory keyword/exactly one record.
>> ...
> 
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated
> entries (so not mandatory in general). COMMENT is optional.

Martin