[Biojava-l] Genbank I/O parser

Fri Sep 13 20:04:36 UTC 2013

Dear List,

For a smaller project of mine, I have written a GenBank parser to read and
save GenBank files. I would like to share the code, but I am having a hard
time finding my way around the BioJava source (I have no experience with
larger software projects).

I have noticed that GenbankSequenceParser.java ignores several GenBank
entries, namely KEYWORDS, SOURCE, REFERENCE, and COMMENT. Is this true, or
am I missing something? It further seems that the qualifiers of each
feature are ignored. Again, I may be missing something.

Is this because the Sequence object cannot handle this information?

In general, it seems that the current GenBank parser ignores a lot of
information: accession numbers other than the first one, the GI-version, and
the date of the submission. (It makes clever use of a regex to parse the
first line. I didn't think of that; my parser was instead inspired by the
cruder approach of BioPerl, so that it can handle slightly malformed first
lines.)
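
To illustrate what a tolerant, token-based reading of the first line could
look like, here is a minimal sketch. It is neither BioJava's regex nor
BioPerl's code; the class name LocusLineSketch and its fields are
hypothetical, and it only picks out the locus name, the length, and the
date, leaving the other columns alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not BioJava or BioPerl code.
class LocusLineSketch {
    private static final Pattern DATE = Pattern.compile("\\d{1,2}-[A-Za-z]{3}-\\d{4}");
    private static final Pattern COUNT = Pattern.compile("\\d{1,9}");

    final String name;  // "" if the line has no second token
    final int length;   // -1 if no "<number> bp" or "<number> aa" pair was found
    final String date;  // null if no dd-MMM-yyyy token was found

    private LocusLineSketch(String name, int length, String date) {
        this.name = name;
        this.length = length;
        this.date = date;
    }

    /** Splits on whitespace instead of relying on fixed column positions. */
    static LocusLineSketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        String name = tokens.length > 1 ? tokens[1] : "";
        int length = -1;
        String date = null;
        for (int i = 2; i < tokens.length; i++) {
            String token = tokens[i];
            boolean isUnit = token.equalsIgnoreCase("bp") || token.equalsIgnoreCase("aa");
            if (isUnit && i >= 3 && COUNT.matcher(tokens[i - 1]).matches()) {
                // The token before the unit is the sequence length.
                length = Integer.parseInt(tokens[i - 1]);
            } else if (DATE.matcher(token).matches()) {
                date = token;
            }
        }
        return new LocusLineSketch(name, length, date);
    }
}
```

Because nothing depends on column offsets, a line with collapsed or extra
spaces still yields the same three values, and missing fields are reported
as missing rather than causing a failure.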

The parser I have written extracts all the information described in the
GenBank format description (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), even
if the file is not well formed. Saving the file results in a well-formed
GenBank file. The only requirement of my parser is that the three blocks
(Annotation, Features, and Sequence) are in the correct order. My parser
returns a list of SequenceObject, and is thus capable of handling several
GenBank entries in a single file (as hinted at in the GenBank format
description).
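
As a rough sketch of the multi-entry handling: entries in a GenBank flat
file end with a line containing only "//", so a file can be cut into raw
entries before any further parsing. The class GenbankEntrySplitter below is
a hypothetical name for illustration, not my actual code and not part of
BioJava.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration: cuts a flat file into raw entries.
class GenbankEntrySplitter {

    /** Returns one list of lines per entry; an entry ends at a line that is just "//". */
    static List<List<String>> split(List<String> lines) {
        List<List<String>> entries = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().equals("//")) {
                if (!current.isEmpty()) {
                    entries.add(current);
                }
                current = new ArrayList<>();
            } else if (!line.trim().isEmpty() || !current.isEmpty()) {
                // Blank lines between entries are skipped; blank lines inside one are kept.
                current.add(line);
            }
        }
        // Tolerate a missing terminator after the last entry.
        if (!current.isEmpty()) {
            entries.add(current);
        }
        return entries;
    }
}
```

Each returned list can then be handed to the per-entry parser, which only
needs to find the three blocks in order.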

Like the current genbank parser in BioJava, I have not implemented handling
of the CONTIG element.

My implementation is slightly different, and probably less efficient than
the current one, as mine uses a lot of while loops. The advantage is that
it makes few assumptions about the input.

The first block of the genbank file is the most complex, consisting of
several Keywords which can occur several times and span several lines. For
each of the recurring keywords, a List is generated, and for those (few)
keywords which can occur only once a string or int is returned.

The keywords SOURCE and REFERENCE are more complex, as they also contain
subkeywords. I deal with this by storing them in a list of hashmaps.
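
A minimal sketch of this idea, under a few assumptions that are mine for
the example: values start at column 13 (index 12), continuation lines have
a blank keyword field, and subkeywords such as ORGANISM or AUTHORS are
indented. The class name AnnotationBlockSketch is hypothetical. Every
occurrence of a top-level keyword becomes one map, so repeated keywords
like REFERENCE end up as a list of maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; the column position is an assumption.
class AnnotationBlockSketch {
    static final int VALUE_COLUMN = 12;

    /** Maps each top-level keyword to one record (keyword plus subkeywords) per occurrence. */
    static Map<String, List<Map<String, String>>> parse(List<String> lines) {
        Map<String, List<Map<String, String>>> result = new LinkedHashMap<>();
        Map<String, String> record = null; // record of the current top-level keyword
        String key = null;                 // keyword or subkeyword being extended
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int cut = Math.min(line.length(), VALUE_COLUMN);
            String head = line.substring(0, cut);
            String value = line.substring(cut).trim();
            String name = head.trim();
            if (name.isEmpty()) {
                // Continuation line: append to the value read last.
                if (record != null) {
                    record.merge(key, value, (a, b) -> (a + " " + b).trim());
                }
                continue;
            }
            boolean topLevel = !Character.isWhitespace(head.charAt(0));
            if (topLevel || record == null) {
                record = new LinkedHashMap<>();
                result.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
            }
            key = name;
            record.merge(key, value, (a, b) -> (a + " " + b).trim());
        }
        return result;
    }
}
```

With this layout, result.get("REFERENCE").get(1).get("AUTHORS") would give
the authors of the second reference, and keywords that occur only once are
simply lists of length one.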

My parser reads locations in all their complexity, including joins with
different accession ids. All qualifiers are stored in a LinkedHash. (I just
realized this was a bad idea and will change it to a List, to keep the
original order and allow repeated qualifier keys.)
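
To show why a List fits better than a map here, the following sketch keeps
qualifiers as ordered key/value pairs, so that a repeated key such as
/db_xref survives. The names QualifierListSketch and Qualifier are
hypothetical, and joining continuation lines with a single space is a
simplification (a /translation value, for instance, would need to be joined
without spaces).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual parser.
class QualifierListSketch {

    static final class Qualifier {
        final String key;
        final String value; // "" for flag qualifiers such as /pseudo

        Qualifier(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Parses the qualifier lines of one feature, keeping order and repeated keys. */
    static List<Qualifier> parse(List<String> lines) {
        List<String[]> raw = new ArrayList<>(); // {key, value as written}
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] last = raw.isEmpty() ? null : raw.get(raw.size() - 1);
            boolean insideQuote = last != null && hasOpenQuote(last[1]);
            if (line.startsWith("/") && !insideQuote) {
                int eq = line.indexOf('=');
                String key = eq < 0 ? line.substring(1) : line.substring(1, eq);
                String value = eq < 0 ? "" : line.substring(eq + 1);
                raw.add(new String[] {key, value});
            } else if (last != null) {
                // Continuation of the previous value, even if it starts with '/'.
                last[1] = last[1] + " " + line;
            }
        }
        List<Qualifier> result = new ArrayList<>();
        for (String[] kv : raw) {
            result.add(new Qualifier(kv[0], unquote(kv[1])));
        }
        return result;
    }

    /** An odd number of double quotes means the quoted value is still open. */
    static boolean hasOpenQuote(String value) {
        int quotes = 0;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

A map keyed by qualifier name would keep only one of two /db_xref entries,
whereas the list returns both in the order they appeared in the file.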

The writer looks for elements in a specific order and adds the appropriate
whitespace to generate a well-formed GenBank file. With all my example
files, the output is an exact copy of the input (checked with the diff
command).

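
The whitespace handling on the writing side can be sketched roughly as
follows. The assumptions are mine for the example: the keyword field is
padded to 12 columns, lines are wrapped at 80 characters, and continuation
lines are indented by 12 spaces. The class name AnnotationWriterSketch is
hypothetical. Note that this sketch collapses runs of whitespace inside a
value, so unlike the real writer it would not reproduce every input
byte for byte.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; column widths are assumptions.
class AnnotationWriterSketch {
    static final int VALUE_COLUMN = 12;
    static final int LINE_WIDTH = 80;

    /** Formats one keyword (indent 0) or subkeyword (indent > 0) with wrapped value lines. */
    static List<String> format(String keyword, int indent, String value) {
        StringBuilder head = new StringBuilder();
        for (int i = 0; i < indent; i++) {
            head.append(' ');
        }
        head.append(keyword);
        while (head.length() < VALUE_COLUMN) {
            head.append(' ');
        }
        StringBuilder blank = new StringBuilder();
        for (int i = 0; i < VALUE_COLUMN; i++) {
            blank.append(' ');
        }

        List<String> out = new ArrayList<>();
        StringBuilder line = new StringBuilder(head);
        boolean lineHasText = false;
        for (String word : value.trim().split("\\s+")) {
            if (lineHasText && line.length() + 1 + word.length() > LINE_WIDTH) {
                // The word does not fit: finish this line and start a continuation line.
                out.add(line.toString());
                line = new StringBuilder(blank);
                lineHasText = false;
            }
            if (lineHasText) {
                line.append(' ');
            }
            line.append(word);
            lineHasText = true;
        }
        out.add(line.toString());
        return out;
    }
}
```

For example, format("ORGANISM", 2, "Homo sapiens") gives the single line
"  ORGANISM  Homo sapiens", and longer values spill onto indented
continuation lines.
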
If I can get some pointers on how to integrate this into the current
codebase, I would be happy to start adding it.

I have no idea what elements other file formats provide, or how these
can be unified, but I am open to discussion.

Cheers,
Ulrik

PS. My project also includes drawing linear and circular sequences with
features. Is there a side project for this kind of thing? I have seen
some drawing of linear sequences, but could not get to the project. For my
drawing of circular sequences, I would borrow from plasmapper, which
cannot be used directly due to some of its design decisions.