[BioPython] Cannot parse ApE plasmid editor GenBank file

Peter biopython at maubp.freeserve.co.uk
Tue Jun 5 19:58:46 UTC 2007


Martin MOKREJŠ wrote:
> Hi Peter, Chris and others, here I am passing the answer from Wayne
> back, sorry for the difficult cross-communication.

Thank you both, Martin & Wayne.

Wayne Davis wrote:
> [the] locus line I'm using is the old standard (some older parsers
> wanted it that way).

That's worth knowing - thank you.  Given that, maybe we (Biopython)
should try to parse these files (which, aside from the missing
identifier in the LOCUS line, should be fairly simple). On the other
hand, I doubt many people still use this particular old format.

Wayne Davis wrote:
>> I've updated to write the new standard, if your
>> program isn't flexible enough to read the old style locus lines.

That's good news.  Martin - will this solve your problem, or do you
think we should also update Biopython to cope with these "old style"
LOCUS lines (which also lack identifiers)?

Wayne Davis wrote:
>> We encourage software developers to switch to a token-based LOCUS
>> parsing approach, rather than a column-specific approach. If this
>> is done, then future changes to the LOCUS line that affect only the
>> spacing of its data values will not require any modifications to
>> software.

Easier said than done, as some fields can also contain white space. 
However, Howard Salis has some interesting code to tackle this attached 
to Biopython bug 2294.
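
To illustrate the idea, here is a rough sketch of a token-based LOCUS
parser. This is not Howard's code from bug 2294 and not what Biopython
does; the function name, the field names and the heuristics (anchor on
the "<length> bp" pair, treat everything before it as the possibly
empty name) are my own assumptions for the sake of the example.

```python
import re

UNITS = {"bp", "aa"}
TOPOLOGIES = {"linear", "circular"}
# Dates look like 05-JUN-2007 (assumed format for this sketch).
DATE_RE = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4}$")


def parse_locus_tokens(line):
    """Split a LOCUS line into fields by tokens, not by column (sketch)."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    tokens = tokens[1:]

    fields = dict.fromkeys(
        ["name", "length", "unit", "molecule", "topology", "division", "date"]
    )

    # The date, if present, is the last token.
    if tokens and DATE_RE.match(tokens[-1]):
        fields["date"] = tokens.pop()

    # Anchor on the first "<digits> bp" (or "aa") pair.
    for i in range(1, len(tokens)):
        if tokens[i] in UNITS and tokens[i - 1].isdigit():
            break
    else:
        raise ValueError("no '<length> bp' or '<length> aa' pair found")

    fields["length"] = int(tokens[i - 1])
    fields["unit"] = tokens[i]
    # Whatever precedes the length is the name; it may contain spaces,
    # or be missing altogether as in the "old style" lines.
    fields["name"] = " ".join(tokens[: i - 1]) or None

    # After the unit: molecule type, optional topology, optional division.
    rest = tokens[i + 1:]
    topo = [k for k, tok in enumerate(rest) if tok in TOPOLOGIES]
    if topo:
        k = topo[0]
        fields["topology"] = rest[k]
        before, after = rest[:k], rest[k + 1:]
    else:
        # Assumption: without a topology, the first token is the molecule.
        before, after = rest[:1], rest[1:]
    fields["molecule"] = " ".join(before) or None
    fields["division"] = " ".join(after) or None
    return fields
```

Working inwards from the length/unit pair means a name containing
white space, or no name at all, does not shift the remaining fields.
It will still be fooled by unusual lines, e.g. a name that itself ends
in something like "123 bp".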

Peter wrote:
>> The next six lines of that example file (elh/pNEX3.gb) have no
>> values - as Chris Fields pointed out on the Biopython mailing list,
>> the NCBI likes to use a dot/period as a place holder.
>> 
>> The spec does explicitly say that the KEYWORDS can be omitted, but
>> seems to assume the other lines are expected. Biopython should be
>> happy if these lines are just omitted.

Just to correct myself, many of those fields are described as mandatory 
single entries further up in the documentation - so using a dot/period 
(as Wayne has done for the ApE plasmid editor) does seem the best solution.

Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
> 3.4.2  Entry Organization
> ...
>   The following is a brief description of each entry field. Detailed
> information about each field may be found in Sections 3.4.4 to 3.4.15.
> 
> LOCUS ... Mandatory keyword/exactly one record.
> DEFINITION ... Mandatory keyword/one or more records.
> ACCESSION ... Mandatory keyword/one or more records.
> VERSION ... Mandatory keyword/exactly one record.
> ...

KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated
entries (so not mandatory in general). COMMENT is optional.
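
As a quick illustration of what that means for anyone writing these
files, here is a small sketch that reports which of the four keywords
quoted above are missing from a record's header. The function name is
made up for this example, and it only checks for presence (it does not
enforce "exactly one record"). A line such as "DEFINITION  ." with the
dot/period placeholder counts as present.

```python
# The four keywords the quoted section lists as mandatory.
MANDATORY = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")


def missing_mandatory_keywords(record_text):
    """Return the mandatory header keywords absent from a record (sketch)."""
    seen = set()
    for line in record_text.splitlines():
        # Stop at the end of the header; the feature table and the
        # sequence are not of interest here.
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break
        # Top-level keywords start in the first column; continuation
        # lines and sub-keywords such as ORGANISM are indented.
        if line and not line[0].isspace():
            seen.add(line.split(None, 1)[0])
    return [key for key in MANDATORY if key not in seen]
```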

Peter



