[Biojava-dev] genbank parser for Ensembl genbank files

Thu Aug 1 17:33:21 UTC 2013

Thanks for the patch, Brian. I can of course patch the code based on this,
but if you submit it via GitHub you will also get credit for it.
So I'll wait for your pull request.

Andreas

On Wed, Jul 31, 2013 at 5:11 AM, Brian Repko <brian.repko at learnthinkcode.com
> wrote:

> BioJava folks:
>
>
>
> Just a heads up that I was trying to parse the Ensembl genbank files
> and failed. I have a patched version of GenbankFormat that I will
> create a pull request from, but you can probably work out the changes
> from this email (which serves as documentation on the internet until I
> get the pull request done...).
>
>
>
> So the first problem is that Ensembl's genbank files wrap lines at a
> fixed width, and this includes dbxrefs.  Dbxrefs that wrap will not work
> with biojava, since the line wrap is converted into a space and then the
> dbxref regex rejects the value because it has a space in it.
>
> The other is that Ensembl includes the sequence in the genbank file, and
> even when one sets elideSymbols to true, the parser will still parse it
> and blow up with out of memory errors.  There is a simple fix: stop
> parsing symbols when elideSymbols is true...
>
>
>
> That allowed it to work. I've seen a few messages asking how to parse
> Ensembl genbank files, so I thought I'd post. The new work on the
> Biojava3 Genbank parser would / could / should work with Ensembl as well.
>
>
>
> diff at line 445
>
>     if (key.equals("db_xref")) {
>         // ----------------- PATCH START -------------------
>         // strip spaces from dbxref if it continues across lines
>         val = val.replaceAll("\\s+","");
>         // ----------------- PATCH END ---------------------
>         Matcher m = dbxp.matcher(val);
>
>
>
> diff at line 595
>
>     // --------------- start patch to skip ORIGIN section reading if elideSymbols is true
>     if (getElideSymbols()
>             && firstSecKey.equals(START_SEQUENCE_TAG)
>             && !line.startsWith(END_SEQUENCE_TAG)) {
>         continue;
>     }
>     // --------------- end patch
>     Matcher m = sectp.matcher(line);
>
>
>
> Brian
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
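
For readers who want to try the two fixes described in the quoted message
outside of BioJava, here is a minimal standalone sketch. It is not the
GenbankFormat code: the class name, the method names, and the DBXREF pattern
are hypothetical stand-ins, and the "ORIGIN" and "//" markers are assumed to
correspond to START_SEQUENCE_TAG and END_SEQUENCE_TAG in the patch. It only
illustrates the idea of stripping whitespace from a wrapped db_xref value
before matching, and of skipping sequence lines when symbols are elided.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EnsemblGenbankSketch {

    // Hypothetical stand-in for the parser's db_xref pattern:
    // "database:accession" with no whitespace allowed anywhere.
    static final Pattern DBXREF = Pattern.compile("([^:\\s]+):(\\S+)");

    // Assumed values of the section markers used in the patch.
    static final String START_SEQUENCE_TAG = "ORIGIN";
    static final String END_SEQUENCE_TAG = "//";

    // Remove whitespace introduced when a wrapped value was joined with spaces.
    static String normalizeDbXref(String val) {
        return val.replaceAll("\\s+", "");
    }

    // Returns {database, accession}, or null if the value does not match.
    static String[] parseDbXref(String val) {
        Matcher m = DBXREF.matcher(normalizeDbXref(val));
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // Drops the lines between the ORIGIN line and the end-of-record line
    // when elideSymbols is true, so the sequence is never accumulated.
    static List<String> filterLines(List<String> lines, boolean elideSymbols) {
        List<String> kept = new ArrayList<>();
        boolean inSequence = false;
        for (String line : lines) {
            if (line.startsWith(END_SEQUENCE_TAG)) {
                inSequence = false;
            }
            if (elideSymbols && inSequence) {
                continue; // skip sequence data entirely
            }
            kept.add(line);
            if (line.startsWith(START_SEQUENCE_TAG)) {
                inSequence = true;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A db_xref value that was wrapped mid-accession and rejoined with a space.
        String[] parts = parseDbXref("Uniprot/SWISSPROT:P12 345");
        System.out.println(parts[0] + " -> " + parts[1]);

        List<String> record = Arrays.asList(
                "LOCUS       example", "ORIGIN", "        1 acgtacgtac", "//");
        System.out.println(filterLines(record, true));
    }
}
```

Matching the raw wrapped value against DBXREF without normalizing it first
fails, which is the behaviour the first part of the patch works around.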