[BioPython] Cannot parse ApE plasmid editor GenBank file
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Jun 7 14:44:17 UTC 2007
Hi Peter,
Peter wrote:
> Martin MOKREJŠ wrote:
>> Hi Peter, Chris and others, here I am passing the answer from Wayne
>> back, sorry for the difficult cross-communication.
>
> Thank you both, Martin & Wayne.
>
> Wayne Davis wrote:
>> [the] locus line I'm using is the old standard (some older parsers
>> wanted it that way).
>
> That's worth knowing - thank you. Given that, maybe we (Biopython)
> should try and parse these files (which aside from the missing
> identifier in the LOCUS line should be fairly simple). On the other
> hand, I doubt many people still use this particular old format.
>
> Wayne Davis wrote:
>>> I've updated to write the new standard, if your
>>> program isn't flexible enough to read the old style locus lines.
>
> That's good news. Martin - will this solve your problem, or do you
> think we should also update Biopython to cope with these "old style"
> LOCUS lines (which also lack identifiers)?
I think that if it was ever a valid format, Biopython should cope with it.
>
> Wayne Davis wrote:
>>> We encourage software developers to switch to a token-based LOCUS
>>> parsing approach, rather than a column-specific approach. If this
>>> is done, then future changes to the LOCUS line that affect only the
>>> spacing of its data values will not require any modifications to
>>> software.
>
> Easier said than done, as some fields can also contain white space.
> However, Howard Salis has some interesting code to tackle this attached
> to Biopython bug 2294.
Please also follow bug #2305 in BioPerl on this and see what the
competition has done in this regard. ;)
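To illustrate the token-based idea, here is a minimal Python sketch. It is
not Biopython's or BioPerl's actual parser: parse_locus and the dictionary
keys are names made up for this example, and the old-style sample line in
the usage note is invented, not real ApE output. The sketch anchors on the
"bp"/"aa" unit, so column positions do not matter and the name may be
missing or contain spaces.

```python
import re

# GenBank-style date, e.g. 21-JUN-1999
DATE_RE = re.compile(r"\d{2}-[A-Z]{3}-\d{4}")


def parse_locus(line):
    """Token-based LOCUS parsing sketch (hypothetical helper, not a library API)."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")

    # Find the length unit preceded by a number; everything else hangs off it.
    for i, tok in enumerate(tokens):
        if tok in ("bp", "aa") and i >= 2 and tokens[i - 1].isdigit():
            break
    else:
        raise ValueError("no sequence length found")

    record = {
        # Tokens between LOCUS and the length form the name; may be empty.
        "name": " ".join(tokens[1:i - 1]) or None,
        "length": int(tokens[i - 1]),
        "unit": tok,
    }

    # Classify the remaining tokens by what they look like, not where they sit.
    for t in tokens[i + 1:]:
        if DATE_RE.fullmatch(t):
            record["date"] = t
        elif t.lower() in ("linear", "circular"):
            record["topology"] = t.lower()
        elif "molecule" not in record:
            record["molecule"] = t
        else:
            record["division"] = t
    return record
```

Called on "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
this yields the name SCU49845, while an identifier-less line such as
"LOCUS                    5028 bp    DNA   circular  SYN       07-JUN-2007"
yields a name of None instead of raising an error. It assumes the molecule
type precedes the division, which is a simplification.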
>
> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
>
> Just to correct myself, many of those fields are described as mandatory
> single entries further up in the documentation - so using a dot/period
> (as Wayne has done for the ApE plasmid editor) does seem the best solution.
OK, BioPerl can now survive the missing dots; I think Biopython should
do the same. If one can fix the problem by having the parser supply a
default value internally, why not do it?
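As a rough sketch of what I mean by a default value, assuming the parser
has already collected the header fields into a dictionary: the function
and constant names here are hypothetical, and the choice of which fields
to default is my assumption, not something Biopython does today.

```python
# The dot/period placeholder that NCBI uses for empty fields.
PLACEHOLDER = "."

# Assumed set of header fields worth defaulting; adjust as needed.
DEFAULTED_FIELDS = ("DEFINITION", "ACCESSION", "VERSION", "KEYWORDS", "SOURCE")


def fill_missing_fields(header):
    """Return a copy of `header` with absent or blank fields set to '.'."""
    filled = dict(header)
    for key in DEFAULTED_FIELDS:
        value = filled.get(key)
        if value is None or not str(value).strip():
            filled[key] = PLACEHOLDER
    return filled
```

With this, a header that only has a DEFINITION comes back with "." for the
other fields, so the later parsing stages never see a missing key.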
>
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2 Entry Organization
>> ...
>> The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS ... Mandatory keyword/exactly one record.
>> DEFINITION ... Mandatory keyword/one or more records.
>> ACCESSION ... Mandatory keyword/one or more records.
>> VERSION ... Mandatory keyword/exactly one record.
>> ...
>
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated
> entries (so not mandatory in general). COMMENT is optional.
Martin