[BioPython] Cannot parse ApE plasmid editor GenBank file
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Tue Jun 5 18:57:14 UTC 2007
Hi Peter, Chris and others,
here I am passing Wayne's answer back, sorry for the awkward
cross-communication. Chris, I hope you will update the BioPerl bug I
opened on this once things are clearer. I do not know whether Wayne will have
enough time to answer all your comments on the mailing lists and in Bugzilla.
A few days ago he said they are organizing a meeting, so ... Anyway,
the official answer:
Wayne Davis wrote:
> The locus line I'm using is the old standard (some older parsers wanted it
> that way).
> I've updated ApE to write the new standard, in case your program isn't
> flexible enough to read the old-style locus lines. We'll see if anyone is
> still using the older parsers.
> From the document laying out the new standard:
>
> We encourage software developers to switch to a token-based LOCUS parsing
> approach, rather than a column-specific approach. If this is done, then future
> changes to the LOCUS line that affect only the spacing of its data values will
> not require any modifications to software.
>
> I've made the default behavior to put "." in the empty fields. I left
> those fields there because there are other parsers that require them.
> In my new version you can change the default genbank record values by
> adding a line to your preferences file like this:
> empty_genbank_header<TAB>{LOCUS } {} {DEFINITION } {.}
> {ACCESSION } {.} {VERSION } {.} {SOURCE } {.} { ORGANISM } {.}
>
> or
> empty_genbank_header<TAB>{LOCUS } {}
>
>
> My access to our web server is temporarily unavailable, but I'll post
> the update as soon as I can.
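
To make the token-based suggestion in the quoted spec concrete, here is a
minimal Python sketch. It is not Biopython's or BioPerl's actual parser: the
function name parse_locus_tokens, the regular expressions and the dictionary
keys are all made up for illustration. It anchors on the "bp"/"aa" unit token,
so spacing does not matter and a missing locus name (as in the ApE files) is
tolerated.

```python
import re

# Optional strand prefix followed by one of the molecule types listed in the spec.
MOLECULE_RE = re.compile(
    r"^(ss-|ds-|ms-)?(NA|DNA|RNA|tRNA|rRNA|mRNA|uRNA|snRNA|snoRNA)$"
)
DATE_RE = re.compile(r"^\d{2}-[A-Z]{3}-\d{4}$")


def parse_locus_tokens(line):
    """Parse a LOCUS line by whitespace-separated tokens (hypothetical helper)."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    fields = tokens[1:]

    # Anchor on the unit token; the sequence length is the token just before it.
    unit_at = next((i for i, tok in enumerate(fields) if tok in ("bp", "aa")), None)
    if unit_at is None or unit_at == 0 or not fields[unit_at - 1].isdigit():
        raise ValueError("no '<length> bp' pair in LOCUS line: %r" % line)

    info = {
        # A token before the length is taken as the locus name; it may be absent.
        "name": fields[unit_at - 2] if unit_at >= 2 else None,
        "length": int(fields[unit_at - 1]),
        "unit": fields[unit_at],
        "strand": None,
        "molecule": None,
        "topology": None,
        "division": None,
        "date": None,
    }

    # Classify the remaining tokens by what they look like, not where they sit.
    for tok in fields[unit_at + 1:]:
        molecule = MOLECULE_RE.match(tok)
        if molecule:
            info["strand"], info["molecule"] = molecule.groups()
        elif tok in ("linear", "circular"):
            info["topology"] = tok
        elif DATE_RE.match(tok):
            info["date"] = tok
        elif len(tok) == 3 and tok.isalpha() and tok.isupper():
            info["division"] = tok
    return info


if __name__ == "__main__":
    # ApE-style line without a locus name; the spacing here is arbitrary.
    print(parse_locus_tokens("LOCUS   2981 bp ds-DNA  linear   12-OCT-2006"))
```

An ApE-style line without a locus name and a line in the current layout go
through the same code path, which is the point of the recommendation. A locus
name that itself happens to be "bp" or "aa" would confuse this sketch.
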
Martin
Peter wrote:
> Hi Wayne & all the Biopython mailing list,
>
> Martin has been trying to parse some GenBank files produced by ApE
> plasmid editor, and Biopython (and BioPerl?) don't like them.
>
> Hopefully between us we can sort this out :)
>
> By the way, is this the current ApE plasmid editor webpage? It
> times out for me:
>
> http://www.biology.utah.edu/jorgensen/wayned/ape/
>
> Martin MOKREJŠ wrote:
>> I would appreciate it if you could tell me what exactly was wrong
>> with the files generated by the ApE editor (author Cc:ed).
>
> OK then, looking at file elh/pNEX3.gb which starts:
>
> LOCUS 2981 bp ds-DNA linear 12-OCT-2006
> DEFINITION
> ACCESSION
> VERSION
> SOURCE
> ORGANISM
> COMMENT
> COMMENT ApEinfo:methylated:1
> FEATURES Location/Qualifiers
> misc_feature 225..257
> /ApEinfo_label=pNEX3-compatibile
> ...
>
> I think the size (2981 bp), sequence type (ds-DNA, linear) and date
> (12-OCT-2006) are not in the correct positions (i.e. column numbers).
> Also, the locus ID is missing, which is not ideal.
> Trying to do examples in an email is tricky as the line wrapping spoils
> the effect.
>
> Interestingly all these files seem to have their LOCUS line fields in
> the same place - perhaps the ApE plasmid editor is following an out of
> date version of the GenBank file format which I haven't seen before? If
> so, we (Biopython) should be able to deal with this too.
>
> For the current version of the LOCUS line spec, see:
> ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>
> In particular:
>> The detailed format for the LOCUS line format is as follows:
>>
>> Positions  Contents
>> ---------  --------
>> 01-05      'LOCUS'
>> 06-12      spaces
>> 13-28      Locus name
>> 29-29      space
>> 30-40      Length of sequence, right-justified
>> 41-41      space
>> 42-43      bp
>> 44-44      space
>> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>>            ms- (mixed-stranded)
>> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>>            mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
>>            snoRNA. Left justified.
>> 54-55      space
>> 56-63      'linear' followed by two spaces, or 'circular'
>> 64-64      space
>> 65-67      The division code (see Section 3.3)
>> 68-68      space
>> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
>
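
For comparison, a column-based reader following the table above might look
like the Python sketch below. LOCUS_COLUMNS, EXAMPLE_LINE and
parse_locus_columns are names invented for this example, and the sample line
is assembled in code so that the column positions survive email line wrapping.
It is not how Biopython actually implements this.

```python
# 1-based, inclusive column ranges copied from the LOCUS line table.
LOCUS_COLUMNS = {
    "name": (13, 28),
    "length": (30, 40),
    "unit": (42, 43),
    "strand": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}

# Illustrative 79-column LOCUS line, built field by field to get the widths right.
EXAMPLE_LINE = (
    "LOCUS" + " " * 7          # columns 01-12
    + "SCU49845".ljust(16)     # columns 13-28, locus name
    + " "
    + "5028".rjust(11)         # columns 30-40, length, right-justified
    + " bp "                   # columns 41-44
    + "   "                    # columns 45-47, no strand information
    + "DNA".ljust(6)           # columns 48-53, left-justified
    + "  "
    + "linear".ljust(8)        # columns 56-63
    + " "
    + "PLN"                    # columns 65-67, division code
    + " "
    + "21-JUN-1999"            # columns 69-79
)


def parse_locus_columns(line):
    """Slice a LOCUS line at fixed columns (hypothetical helper)."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line: %r" % line)
    # Pad short lines so every slice is defined.
    padded = line.rstrip("\n").ljust(79)
    fields = {}
    for key, (start, end) in LOCUS_COLUMNS.items():
        # Convert the 1-based inclusive range to a Python slice.
        fields[key] = padded[start - 1:end].strip()
    return fields


if __name__ == "__main__":
    print(parse_locus_columns(EXAMPLE_LINE))
```

A reader like this returns nonsense as soon as a field is shifted by even one
column, which would explain why files with a different LOCUS layout are
rejected.
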
> Note that the protein variant, "GenPept", is slightly different.
>
> The next six lines of that example file (elh/pNEX3.gb) have no values.
> As Chris Fields pointed out on the Biopython mailing list, the NCBI
> likes to use a dot/period as a placeholder.
>
> The spec does explicitly say that the KEYWORDS line can be omitted, but seems
> to assume the other lines are present. Biopython should be happy if
> these lines are just omitted.
>
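
Going by the remark above that omitting the empty lines should be acceptable,
one possible stop-gap for existing ApE files is to strip value-less header
lines before handing the file to a parser. This is only a sketch: the name
drop_empty_header_lines and the keyword list are assumptions made for this
example, and whether a given Biopython version then accepts the result has not
been checked here.

```python
# Header keywords that are dropped when they carry no value (assumed list).
DROPPABLE_KEYWORDS = (
    "DEFINITION", "ACCESSION", "VERSION", "SOURCE", "ORGANISM", "COMMENT",
)


def drop_empty_header_lines(lines):
    """Return the lines with value-less header keywords removed (hypothetical helper).

    Only lines before the FEATURES table are considered, and a line is dropped
    only if it consists of the bare keyword. SOURCE and ORGANISM are treated
    independently, so an empty SOURCE followed by a filled-in ORGANISM would
    leave ORGANISM without its parent line.
    """
    cleaned = []
    in_header = True
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("FEATURES"):
            in_header = False
        if in_header and stripped in DROPPABLE_KEYWORDS:
            continue
        cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    header = [
        "LOCUS 2981 bp ds-DNA linear 12-OCT-2006",
        "DEFINITION",
        "ACCESSION",
        "COMMENT ApEinfo:methylated:1",
        "FEATURES Location/Qualifiers",
    ]
    for kept in drop_empty_header_lines(header):
        print(kept)
```

Filling the empty fields with "." instead, as Wayne's new default does, would
be the other obvious variant.
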
> See also:
> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
>
>> Hope this helps,
>
> You might have upset some people by emailing an attachment to the entire
> Biopython mailing list, but it wasn't too big at least ;)
>
> Regards,
>
> Peter
>
>
>
--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs