[BioPython] Cannot parse ApE plasmid editor GenBank file
Peter
biopython at maubp.freeserve.co.uk
Tue Jun 5 18:29:52 UTC 2007
Hi Wayne & all the Biopython mailing list,
Martin has been trying to parse some GenBank files produced by ApE
plasmid editor, and Biopython (and BioPerl?) don't like them.
Hopefully between us we can sort this out :)
By the way, is this still the current ApE plasmid editor webpage? It
times out for me:
http://www.biology.utah.edu/jorgensen/wayned/ape/
Martin MOKREJŠ wrote:
> I would appreciate if you could tell me then what was exactly wrong with
> the generated files by ApE editor (author Cc:ed).
OK then, looking at file elh/pNEX3.gb which starts:
LOCUS 2981 bp ds-DNA linear 12-OCT-2006
DEFINITION
ACCESSION
VERSION
SOURCE
ORGANISM
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
misc_feature 225..257
/ApEinfo_label=pNEX3-compatibile
...
I think the size (2981 bp), sequence type (ds-DNA, linear) and date
(12-OCT-2006) are not in the correct positions (i.e. column numbers).
Also the locus ID is missing, which is not ideal.
Trying to do examples in an email is tricky as the line wrapping spoils
the effect.
Interestingly, all these files seem to have their LOCUS line fields in
the same place. Perhaps the ApE plasmid editor is following an
out-of-date version of the GenBank file format which I haven't seen
before? If so, we (Biopython) should be able to deal with this too.
For the current version of the LOCUS line spec, see:
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
In particular:
> The detailed format for the LOCUS line format is as follows:
>
> Positions Contents
> --------- --------
> 01-05 'LOCUS'
> 06-12 spaces
> 13-28 Locus name
> 29-29 space
> 30-40 Length of sequence, right-justified
> 41-41 space
> 42-43 bp
> 44-44 space
> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or
> ms- (mixed-stranded)
> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
> mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
> snoRNA. Left justified.
> 54-55 space
> 56-63 'linear' followed by two spaces, or 'circular'
> 64-64 space
> 65-67 The division code (see Section 3.3)
> 68-68 space
> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
Note that the protein variant "GenPept" is slightly different.
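To make the column positions concrete (and to dodge the email line
wrapping problem), here is a minimal Python sketch that slices a LOCUS
line at the fixed columns from the table quoted above. This is not
Biopython's parser: the names LOCUS_FIELDS, split_locus_line and
build_locus_line are hypothetical, and the locus name "pNEX3" and
division "SYN" in the usage lines are made up for illustration, since
the ApE file supplies neither.

```python
# Column ranges (1-based, inclusive) copied from the quoted gbrel.txt table.
LOCUS_FIELDS = {
    "name": (13, 28),
    "length": (30, 40),
    "units": (42, 43),
    "strand": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}


def split_locus_line(line):
    """Slice a LOCUS line at the fixed columns; values are stripped strings."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    # Pad short lines so every slice is well defined.
    padded = line.rstrip("\r\n").ljust(79)
    return {
        key: padded[start - 1:end].strip()
        for key, (start, end) in LOCUS_FIELDS.items()
    }


def build_locus_line(name, length, strand, molecule, topology, division, date):
    """Assemble a LOCUS line with each field in its documented columns."""
    return (
        "LOCUS".ljust(12)            # 01-12: 'LOCUS' plus spaces
        + name.ljust(16) + " "       # 13-28: locus name, 29: space
        + str(length).rjust(11)      # 30-40: length, right-justified
        + " bp "                     # 41: space, 42-43: bp, 44: space
        + strand.ljust(3)            # 45-47: ss-, ds-, ms- or spaces
        + molecule.ljust(6) + "  "   # 48-53: molecule type, 54-55: spaces
        + topology.ljust(8) + " "    # 56-63: linear/circular, 64: space
        + division.ljust(3) + " "    # 65-67: division code, 68: space
        + date                       # 69-79: dd-MMM-yyyy
    )


# Hypothetical values: the original file has no locus name or division.
line = build_locus_line("pNEX3", 2981, "ds-", "DNA", "linear", "SYN", "12-OCT-2006")
fields = split_locus_line(line)
print(line)
print(fields)
```

A line whose fields sit in other columns will come back with an empty
or nonsensical "units" value instead of "bp", which is a cheap way to
spot a non-standard LOCUS line.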
The next six lines of that example file (elh/pNEX3.gb) have no values.
As Chris Fields pointed out on the Biopython mailing list, the NCBI
likes to use a dot/period as a placeholder for an empty field.
The spec does explicitly say that the KEYWORDS line can be omitted, but
seems to assume the other lines are present. Biopython should be happy
if these lines are just omitted.
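As a stop-gap, the empty header lines could be stripped before parsing.
The sketch below is only an illustration of that idea: the helper name
drop_empty_header_lines and the EMPTY_OK set are hypothetical, and I
have not tested whether the cleaned output is enough for Biopython to
accept these files, since the LOCUS line is still non-standard.

```python
# Header keywords that appear with no value in the ApE example file.
EMPTY_OK = {"DEFINITION", "ACCESSION", "VERSION", "SOURCE", "ORGANISM", "COMMENT"}


def drop_empty_header_lines(lines):
    """Return the lines with valueless header keywords removed.

    Only lines before the FEATURES table are considered, and only when
    the whole line is a bare keyword with nothing after it.
    """
    cleaned = []
    in_header = True
    for line in lines:
        if line.startswith("FEATURES"):
            in_header = False
        if in_header and line.strip() in EMPTY_OK:
            continue  # bare keyword with no value: skip it
        cleaned.append(line)
    return cleaned


# Usage with the (whitespace-collapsed) start of the example file.
example = [
    "LOCUS 2981 bp ds-DNA linear 12-OCT-2006",
    "DEFINITION",
    "ACCESSION",
    "VERSION",
    "SOURCE",
    "ORGANISM",
    "COMMENT",
    "COMMENT ApEinfo:methylated:1",
    "FEATURES Location/Qualifiers",
]
for kept in drop_empty_header_lines(example):
    print(kept)
```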
See also:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
> Hope this helps,
You might have upset some people by emailing an attachment to the entire
Biopython mailing list, but it wasn't too big at least ;)
Regards,
Peter