[BioPython] Cannot parse ApE plasmid editor GenBank file

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Fri Jun 8 10:31:36 UTC 2007



Chris Fields wrote:
> 
> On Jun 7, 2007, at 9:44 AM, Martin MOKREJŠ wrote:
> 
>> Hi Peter,
>>> ...
>>> That's good news.  Martin - will this solve your problem, or do you
>>> think we should also update Biopython to cope with these  "old style"
>>> LOCUS lines (which also lack identifiers)?
>>
>> I think that if it was ever a valid format it should cope with it.
> 
> I think it's better to explicitly state that the parser is compliant 
> with a particular GenBank release and can likely parse other similarly 
> formatted GenBank records from third-party software.  If the parser 
> chokes on a bad record then you can point out the deficiency in the 
> record and (if possible) try to make it more flexible w/o borking the 
> parser later on.  The release notes are there for a good reason!
> 
> The LOCUS line format, however, has been relatively stable over time.  
> Here are the release notes for a GenBank release from late 1992:
> 
> ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes
> 
> and the LOCUS line is:
> 
> Positions       Contents
> 
> 1-12    LOCUS
> 13-22    Locus name
> 23-29    Length of sequence, right-justified
> 31-32    bp
> 34-36    Blank, ss- (single-stranded), ds- (double-stranded), or
>      ms- (mixed-stranded)
> 37-40    Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>     mRNA (messenger RNA), or uRNA (small nuclear RNA)
> 43-52    Blank (implies linear) or circular
> 53-55    The division code (see Section 3.3)
> 63-73    Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
> 
> The spacing is more explicitly laid out in later versions.  The best 
> part is the Entrez CD order form (clipped out by scissors to be 
> snail-mailed) at the end of the file!
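
To make the table above concrete, here is a rough Python sketch of what a strict, column-based reading of that release 74 layout could look like. The helper name parse_old_locus, the dict keys and the sample record are all made up for illustration; this is not how Biopython's parser is written.

```python
# Column ranges (1-based, inclusive) copied from the release 74 table quoted above.
OLD_LOCUS_COLUMNS = {
    "name": (13, 22),
    "length": (23, 29),
    "units": (31, 32),
    "strandedness": (34, 36),
    "molecule_type": (37, 40),
    "topology": (43, 52),
    "division": (53, 55),
    "date": (63, 73),
}


def parse_old_locus(line):
    """Slice an old-style LOCUS line by fixed columns (hypothetical helper)."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    fields = {
        key: line[start - 1:end].strip()
        for key, (start, end) in OLD_LOCUS_COLUMNS.items()
    }
    fields["length"] = int(fields["length"])
    # The table says a blank topology field implies a linear molecule.
    if not fields["topology"]:
        fields["topology"] = "linear"
    return fields


# Made-up record, padded so that every field lands in its column.
sample = (f"{'LOCUS':<12}{'HUMABC':<10}{1234:>7} bp ds-{'DNA':<4}"
          f"  {'circular':<10}PRI{'':7}15-MAR-1991")
print(parse_old_locus(sample))
```

A reader this strict goes wrong as soon as a locus name overflows its ten columns, because every later field shifts to the right.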

In principle I do agree with you, but let me emphasize that I fully agree with Wayne,
who wrote to me yesterday that the GenBank format is simply the way to write down
your data, and that we often really do not need all the fields required for data
submission into the GenBank database:

<quote>
From the definition of the format (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), only DEFINITION, KEYWORDS, SOURCE and ORIGIN (if it contains data) lines end with a period. The periods should be added to the ends of non-period containing lines for those fields only. That is where ApE doesn't conform to the file definition.

I put in the fields DEFINITION, ACCESSION, VERSION, SOURCE, and ORGANISM  because those are listed as mandatory by the release notes. Looks like I missed that REFERENCES is also mandatory. The release notes do not say that the fields must contain data or that they must end with a period (except where noted above). I put them in, figuring that a parser that was working from the file definition might require those fields to be present. It seems like a well written parser could handle null data in the field better than handling the absence of an explicitly required field, since there is nothing in the standard that states what data, if any, must be present, but there is an explicit requirement for the field.


Ok, I'll add an option to take out blank fields (even though that will break compliance with the definition, as I understand it). One could interpret the file standard as only applying to files intended for use in the NCBI database, so the required fields are only an issue for entering into their database, not for file parsers.

Working on ApE isn't what I really do, so I might not get around to it immediately.

Still, while I acknowledge that ApE has been writing files that do not comply completely with the standard (needing the required periods on the end of some of the mandatory fields), your parser should be able to handle null data lines and spaces in the locus name.

for parsing the locus info here is the tcl regexp that I use ($a is the full LOCUS line, x returns the full matched line):
regexp {LOCUS       (.*) ([0-9]*) bp (   |ss-|ds-|ms-)(NA    |DNA   |RNA   |tRNA  |rRNA  |mRNA  |uRNA  |snRNA |snoRNA)[ ]*(linear  |circular|        )[ ]*([ A-Z]{3})[ ]*(..-...-....)} $a x name size stranded type circular div date

you have to do a trim on the name that you get out of this, since it is space padded, as per the file definition. Let me know if you see an exception that is a valid LOCUS line but would break this.
</quote>
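
For anyone who wants to try Wayne's pattern from Python, here is a minimal sketch. It assumes the pattern carries over to Python's re module unchanged (I see no Tcl-specific syntax in it, though the two engines may pick between ambiguous matches differently). The function name parse_locus, the dict keys and the sample LOCUS line are made up for illustration and belong to neither Biopython nor ApE.

```python
import re

# Wayne's Tcl pattern, carried over as-is and split across lines for readability.
LOCUS_RE = re.compile(
    r"LOCUS       (.*) ([0-9]*) bp (   |ss-|ds-|ms-)"
    r"(NA    |DNA   |RNA   |tRNA  |rRNA  |mRNA  |uRNA  |snRNA |snoRNA)"
    r"[ ]*(linear  |circular|        )[ ]*([ A-Z]{3})[ ]*(..-...-....)"
)


def parse_locus(line):
    """Split a LOCUS line into its fields (hypothetical helper)."""
    match = LOCUS_RE.search(line)
    if match is None:
        raise ValueError("LOCUS line does not match the pattern")
    # The captured fields are space padded, so trim each one.
    name, size, stranded, mol_type, topology, division, date = (
        group.strip() for group in match.groups()
    )
    return {
        "name": name,
        "size": int(size) if size else None,
        "stranded": stranded,
        "type": mol_type,
        "topology": topology,
        "division": division,
        "date": date,
    }


# Made-up LOCUS line with a space in the locus name.
sample = (f"{'LOCUS':<12}{'my plasmid':<16} {2958:>11} bp ds-{'DNA':<6}"
          f"  {'circular':<8} SYN 08-JUN-2007")
print(parse_locus(sample))
```

Because the name group is a greedy (.*), a space inside the locus name ends up in the name and does not break the match.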

Martin




