[Biojava-l] Genbank I/O parser

Wed Sep 18 04:54:32 UTC 2013

Hi Ulrik,

Sorry for the slow response. The GenBank parser is a very recent feature
(see https://github.com/biojava/biojava/pull/41). As such, I am not
surprised that some details are still missing. The second issue you are
running into is that our feature framework is not as well developed as it
should be. Did you see this?

http://www.biojava.org/docs/api/org/biojava3/core/sequence/features/package-summary.html

So far, it seems that only the UniProt parser supports database cross-references.
Perhaps we can extend the GenBank parser in a similar way?

Andreas

On Fri, Sep 13, 2013 at 1:04 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:

> Dear List,
>
> For a smaller project of mine I have written a GenBank parser to read and
> save GenBank files. I would like to share the code, but I am having a hard
> time finding my way around the BioJava source (I have no experience with
> larger software projects).
>
> I have noticed that GenbankSequenceParser.java ignores various GenBank
> entries: KEYWORDS, SOURCE, REFERENCE, and COMMENT. Is this true, or am I
> missing something? It further seems that the qualifiers for each feature
> are ignored. Again, I may be missing something.
>
> Is this because the Sequence object cannot handle this information?
>
> In general, it seems that the current GenBank parser ignores a lot of
> information, such as accession numbers other than the first one, the
> GI-version, and the date of the submission. (Clever use of regex to parse
> the first line. I didn't think of that, but was inspired by the cruder
> approach of BioPerl, so as to handle slightly malformed first lines.)
>
> The parser I have written extracts all the information described in the
> GenBank format description (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt),
> even if the file is not well formed. Saving the GenBank file results in a
> well-formed GenBank file. The only requirement of my parser is that the
> three blocks Annotation, Features, and Sequence are in the correct order.
> My parser returns a list of SequenceObject, and is thus capable of handling
> several GenBank entries in a single file (as hinted at in the GenBank
> format description).
>
> Like the current genbank parser in BioJava, I have not implemented handling
> of the CONTIG element.
>
> My implementation is slightly different, and probably less efficient than
> the current one, as mine uses a lot of while loops. The advantage of this
> is that the assumptions are limited.
>
> The first block of the genbank file is the most complex, consisting of
> several Keywords which can occur several times and span several lines. For
> each of the recurring keywords, a List is generated, and for those (few)
> keywords which can occur only once a string or int is returned.
>
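To make the idea above concrete, here is a minimal sketch of collecting header lines per keyword. It is not the code from either parser; the class and method names are made up for illustration, and it assumes that a line starting in the first column opens a new keyword while an indented line continues the previous one. Subkeyword lines are simply folded into their parent value here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of BioJava.
public class AnnotationBlockSketch {

    /**
     * Groups header lines by top-level keyword. Each occurrence of a keyword
     * adds one entry to that keyword's list; indented lines are treated as
     * continuations and joined with a single space.
     */
    static Map<String, List<String>> parseHeader(List<String> lines) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        String currentKey = null;
        StringBuilder currentValue = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // A keyword in the first column: store the previous entry, start a new one.
                flush(entries, currentKey, currentValue);
                String[] parts = line.split("\\s+", 2);
                currentKey = parts[0];
                currentValue = new StringBuilder(parts.length > 1 ? parts[1].trim() : "");
            } else if (currentValue != null) {
                // An indented line: append to the entry currently being read.
                if (currentValue.length() > 0) {
                    currentValue.append(' ');
                }
                currentValue.append(line.trim());
            }
        }
        flush(entries, currentKey, currentValue);
        return entries;
    }

    private static void flush(Map<String, List<String>> entries, String key, StringBuilder value) {
        if (key != null) {
            entries.computeIfAbsent(key, k -> new ArrayList<>()).add(value.toString());
        }
    }
}
```

With this shape every keyword maps to a list, and the caller decides which of them may legally occur only once.
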
> The keywords SOURCE and REFERENCE are more complex, as they also contain
> subkeywords. I deal with this by storing them in a list of hashmaps.
>
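A list of maps for keywords with subkeywords could look roughly like the sketch below. It is an illustration only, with hypothetical names, and it assumes the usual flat-file layout in which keywords start in column 1, subkeywords are indented, and values start in column 13; a real implementation would have to be more tolerant of malformed input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of BioJava.
public class SubkeywordSketch {

    // Assumption: the value part of a line starts in column 13 (index 12).
    static final int VALUE_COLUMN = 12;

    /** Returns one map per occurrence of the given keyword, holding its subkeywords. */
    static List<Map<String, String>> parseBlocks(List<String> lines, String keyword) {
        List<Map<String, String>> blocks = new ArrayList<>();
        Map<String, String> current = null;
        String lastKey = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String label = line.substring(0, Math.min(VALUE_COLUMN, line.length())).trim();
            String value = line.length() > VALUE_COLUMN ? line.substring(VALUE_COLUMN).trim() : "";
            if (!Character.isWhitespace(line.charAt(0))) {
                if (label.equals(keyword)) {
                    // A new occurrence of the keyword opens a new map.
                    current = new LinkedHashMap<>();
                    blocks.add(current);
                    current.put(label, value);
                    lastKey = label;
                } else {
                    // Any other top-level keyword ends the current block.
                    current = null;
                    lastKey = null;
                }
            } else if (current != null) {
                if (label.isEmpty()) {
                    // Continuation line: append to the most recent (sub)keyword.
                    current.merge(lastKey, value, (a, b) -> a.isEmpty() ? b : a + " " + b);
                } else {
                    // Indented label: a subkeyword of the current block.
                    current.put(label, value);
                    lastKey = label;
                }
            }
        }
        return blocks;
    }
}
```

Calling `parseBlocks(headerLines, "REFERENCE")` would then give one map per reference, with keys such as AUTHORS and TITLE next to the REFERENCE line itself.
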
> My parser reads locations in all their complexity, including joins with
> different accession IDs. All qualifiers are stored in a LinkedHash. (I just
> realized this was a bad idea and will change it to a List, to keep the
> original order and allow repeated qualifier keys.)
>
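The point about repeated qualifier keys can be shown with a small sketch. The `Qualifier` class and the helper below are hypothetical, not existing BioJava types; the idea is only that a list of key/value pairs keeps the file order and lets a key such as db_xref appear more than once, whereas a map keyed on the qualifier name would keep only the last value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of BioJava.
public class QualifierListSketch {

    /** One qualifier as it appears in the feature table, e.g. /db_xref="..." */
    static final class Qualifier {
        final String key;
        final String value;

        Qualifier(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Returns all values stored under a key, in their original order. */
    static List<String> valuesFor(List<Qualifier> qualifiers, String key) {
        List<String> values = new ArrayList<>();
        for (Qualifier q : qualifiers) {
            if (q.key.equals(key)) {
                values.add(q.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up example values, for illustration only.
        List<Qualifier> qualifiers = new ArrayList<>();
        qualifiers.add(new Qualifier("gene", "exampleGene"));
        qualifiers.add(new Qualifier("db_xref", "DB1:0001"));
        qualifiers.add(new Qualifier("db_xref", "DB2:0002"));
        System.out.println(valuesFor(qualifiers, "db_xref"));
    }
}
```

Writing the list back out in order is then enough to reproduce the qualifiers as they were read.
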
> The writer looks for elements in a specific order and adds appropriate
> whitespace to generate a well-formed GenBank file. With all my example
> files, the output is an exact copy of the input (checked with the diff
> command).
>
> If I can get some pointers on how to integrate this into the current
> codebase, I would be happy to start adding.
>
> I have no idea of what elements other file formats provide, and how this
> can be unified, but am open for discussion.
>
> Cheers,
> Ulrik
>
> PS. My project also includes drawing linear and circular sequences with
> features. Is there a side project running for these things? I have seen
> some drawing of linear sequences, but could not get to the project. For my
> drawing of circular sequences, I would borrow from plasmapper, which
> cannot be used directly due to some design decisions in plasmapper.