[BioPython] Cannot parse ApE plasmid editor GenBank file

Tue Jun 5 18:57:14 UTC 2007

Hi Peter, Chris and others,
  here I am passing Wayne's answer back; sorry for the awkward
cross-communication. Chris, I hope you will update the BioPerl bug I
opened on this once things are clearer. I do not know whether Wayne will have
enough time to answer all your comments, on the email lists and in Bugzilla.
A few days ago he said they were organizing a meeting, so ... Anyway,
here is the official answer:

Wayne Davis wrote:
> locus line I'm using is the old standard (some older parsers wanted it 
> that way).
> I've updated to write the new standard, if your program isn't flexible 
> enough to read the old style locus lines. We'll see if anyone is using 
> the older parsers still.
> from the document laying out the new standard:
> 
>  We encourage software developers to switch to a token-based LOCUS parsing
> approach, rather than a column-specific approach. If this is done, then future
> changes to the LOCUS line that affect only the spacing of its data values will
> not require any modifications to software.
> 
> I've made the default behavior to put "." in the empty fields. I left 
> those fields there because there are other parsers that require them.
> In my new version you can change the default genbank record values by 
> adding a line to your preferences file like this:
> empty_genbank_header<TAB>{LOCUS       } {} {DEFINITION  } {.} 
> {ACCESSION   } {.} {VERSION     } {.} {SOURCE      } {.} {  ORGANISM  } {.}
> 
> or
> empty_genbank_header<TAB>{LOCUS       } {}
> 
> 
> My access to our web server is temporarily unavailable, but I'll post 
> the update as soon as I can.
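
To illustrate the token-based approach recommended in the passage Wayne
quotes, here is a minimal Python sketch. It is not the actual Biopython or
BioPerl parser: the function name parse_locus_tokens and the heuristics for
telling the fields apart are assumptions made for illustration only. It
splits the LOCUS line on whitespace and finds the length by looking for the
"bp" (or "aa") unit, so the spacing between fields no longer matters.

```python
import re

# Matches GenBank-style dates such as 12-OCT-2006.
DATE_RE = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4}$")
TOPOLOGIES = {"linear", "circular"}


def parse_locus_tokens(line):
    """Split a LOCUS line on whitespace and classify the tokens.

    Hypothetical helper for illustration; not part of Biopython.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)

    # The length is the numeric token just before the unit ("bp" or "aa").
    unit_index = None
    for i in range(2, len(tokens)):
        if tokens[i] in ("bp", "aa") and tokens[i - 1].isdigit():
            unit_index = i
            break
    if unit_index is None:
        raise ValueError("no '<length> bp' pair found: %r" % line)

    record = {
        # Anything between LOCUS and the length is the locus name (may be absent).
        "name": " ".join(tokens[1:unit_index - 1]) or None,
        "length": int(tokens[unit_index - 1]),
        "unit": tokens[unit_index],
        "molecule": None,
        "topology": None,
        "division": None,
        "date": None,
    }

    # Heuristic classification of the remaining tokens (an assumption).
    for token in tokens[unit_index + 1:]:
        if token in TOPOLOGIES:
            record["topology"] = token
        elif DATE_RE.match(token):
            record["date"] = token
        elif record["molecule"] is None:
            record["molecule"] = token  # e.g. "DNA" or "ds-DNA"
        else:
            record["division"] = token  # e.g. a three-letter division code
    return record


# The LOCUS line from elh/pNEX3.gb as it appears in Peter's email below.
ape_line = "LOCUS               2981 bp ds-DNA     linear       12-OCT-2006"
print(parse_locus_tokens(ape_line))
```

With this approach the missing locus name simply comes back as None instead
of shifting every other field. Note that the heuristic would mistake a
division code for the molecule type if the molecule type were absent.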

Martin

Peter wrote:
> Hi Wayne & all the Biopython mailing list,
> 
> Martin has been trying to parse some GenBank files produced by ApE 
> plasmid editor, and Biopython (and BioPerl?) don't like them.
> 
> Hopefully between us we can sort this out :)
> 
> By the way, is this the current ApE plasmid editor webpage? It 
> times out for me:
> 
> http://www.biology.utah.edu/jorgensen/wayned/ape/
> 
> Martin MOKREJŠ wrote:
>> I would appreciate it if you could tell me what exactly was wrong 
>> with the files generated by the ApE editor (author Cc:ed).
> 
> OK then, looking at file elh/pNEX3.gb which starts:
> 
> LOCUS               2981 bp ds-DNA     linear       12-OCT-2006
> DEFINITION
> ACCESSION
> VERSION
> SOURCE
>   ORGANISM
> COMMENT
> COMMENT     ApEinfo:methylated:1
> FEATURES             Location/Qualifiers
>      misc_feature    225..257
>                      /ApEinfo_label=pNEX3-compatibile
> ...
> 
> I think the size (2981 bp), sequence type (ds-DNA, linear) and date 
> (12-OCT-2006) are not in the correct positions (i.e. column numbers). 
> Also the locus ID is missing, which is not ideal. 
> Trying to do examples in an email is tricky as the line wrapping spoils 
> the effect.
> 
> Interestingly all these files seem to have their LOCUS line fields in 
> the same place - perhaps the ApE plasmid editor is following an out of 
> date version of the GenBank file format which I haven't seen before? If 
> so, we (Biopython) should be able to deal with this too.
> 
> For the current version of the LOCUS line spec, see:
> ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
> 
> In particular:
>> The detailed format for the LOCUS line format is as follows:
>>
>> Positions  Contents
>> ---------  --------
>> 01-05      'LOCUS'
>> 06-12      spaces
>> 13-28      Locus name
>> 29-29      space
>> 30-40      Length of sequence, right-justified
>> 41-41      space
>> 42-43      bp
>> 44-44      space
>> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>>            ms- (mixed-stranded)
>> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
>>            mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
>>            snoRNA. Left justified.
>> 54-55      space
>> 56-63      'linear' followed by two spaces, or 'circular'
>> 64-64      space
>> 65-67      The division code (see Section 3.3)
>> 68-68      space
>> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
> 
> Note that the protein variant "GenPept" is slightly different.
> 
> The next six lines of that example file (elh/pNEX3.gb) have no values - 
> as Chris Fields pointed out on the Biopython mailing list, the NCBI 
> likes to use a dot/period as a place holder.
> 
> The spec does explicitly say that the KEYWORDS can be omitted, but seems 
> to assume the other lines are expected. Biopython should be happy if 
> these lines are just omitted.
> 
> See also:
> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
> 
>> Hope this helps,
> 
> You might have upset some people by emailing an attachment to the entire 
> Biopython mailing list, but it wasn't too big at least ;)
> 
> Regards,
> 
> Peter
> 
> 
> 
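
For comparison, the column table Peter quotes from gbrel.txt can be turned
into a strict column-specific parser. The sketch below is only an
illustration of that layout, not Biopython's code: the names LOCUS_COLUMNS,
format_locus and parse_locus_columns, the locus name EXAMPLE1 and the
division code SYN are all made up for the example. The ApE line is rebuilt
with the spacing it has in this email, which mail software may have altered.

```python
# 1-based, inclusive column ranges copied from the quoted gbrel.txt table.
LOCUS_COLUMNS = {
    "keyword": (1, 5),
    "name": (13, 28),
    "length": (30, 40),
    "unit": (42, 43),
    "strand": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}


def format_locus(name, length, strand, molecule, topology, division, date):
    """Build a 79-column LOCUS line laid out according to LOCUS_COLUMNS."""
    return (
        "LOCUS" + " " * 7                # columns 1-12
        + name.ljust(16) + " "           # columns 13-29
        + str(length).rjust(11) + " "    # columns 30-41
        + "bp" + " "                     # columns 42-44
        + strand.ljust(3)                # columns 45-47
        + molecule.ljust(6) + "  "       # columns 48-55
        + topology.ljust(8) + " "        # columns 56-64
        + division.ljust(3) + " "        # columns 65-68
        + date                           # columns 69-79
    )


def parse_locus_columns(line):
    """Slice a LOCUS line at fixed columns; reject it if the length is not numeric."""
    fields = {
        key: line[start - 1:end].strip()
        for key, (start, end) in LOCUS_COLUMNS.items()
    }
    if fields["keyword"] != "LOCUS":
        raise ValueError("not a LOCUS line")
    if not fields["length"].isdigit():
        raise ValueError(
            "columns 30-40 do not hold a length: %r" % fields["length"]
        )
    fields["length"] = int(fields["length"])
    return fields


# A line built to the current layout parses cleanly.
good = format_locus("EXAMPLE1", 2981, "ds-", "DNA", "linear", "SYN", "12-OCT-2006")
print(parse_locus_columns(good))

# The ApE-style line, with the spacing seen in this email, does not.
ape_line = "LOCUS" + " " * 15 + "2981 bp ds-DNA     linear       12-OCT-2006"
try:
    parse_locus_columns(ape_line)
except ValueError as error:
    print("column parser rejected the line:", error)
```

A parser like this breaks as soon as the spacing changes, which is why the
NCBI text quoted by Wayne suggests splitting the line into tokens instead.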

-- 
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs