[Biojava-l] Problem while parsing GenBank-like files and persisting them using Hibernate

Thu Jul 17 19:14:39 UTC 2008

I can't remember if I answered something like this before or not...
anyhow here goes just in case!

> 1. Is there a way to read in files downloaded from Ensembl using only the
> designated BioJavaX classes?

You could use the original (non-'rich') BioJava parsers and do some
plain-text parsing of your own on the 'unrich' data. The 'rich'
parsers adhere strictly to the official format, which does not include
the Ensembl extensions (exon etc.). Any attempt to 'enrich' the data
therefore tries to force it into the standard format, which, as you
have seen, causes non-standard bits either to be skipped or to be
converted into some kind of catch-all data type (such as 'any').
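
As a rough illustration of the "plain-text parsing of your own" idea,
here is a minimal sketch that scans a GenBank-style feature table and
collects each feature key with its location. It does not use BioJava
at all; the class name FeatureKeyScanSketch and the scan method are
made up for this example, and it assumes the usual flat-file layout
(feature keys indented by five spaces, qualifiers indented further).
Multi-line locations are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava or BioJavaX.
class FeatureKeyScanSketch {

    // A feature line: exactly five spaces, the key, padding, then the location.
    // Qualifier lines are indented further, so they do not match.
    private static final Pattern FEATURE_LINE =
            Pattern.compile("^ {5}(\\S+)\\s+(\\S.*)$");

    // Returns {key, location} pairs found inside the FEATURES section.
    static List<String[]> scan(List<String> lines) {
        List<String[]> hits = new ArrayList<>();
        boolean inFeatures = false;
        for (String line : lines) {
            if (line.startsWith("FEATURES")) {
                inFeatures = true;
                continue;
            }
            // Any other line starting in column 1 (e.g. ORIGIN) ends the table.
            if (inFeatures && !line.isEmpty()
                    && !Character.isWhitespace(line.charAt(0))) {
                inFeatures = false;
            }
            if (!inFeatures) {
                continue;
            }
            Matcher m = FEATURE_LINE.matcher(line);
            if (m.matches()) {
                hits.add(new String[] { m.group(1), m.group(2).trim() });
            }
        }
        return hits;
    }
}
```

You would run something like this over the raw file alongside the
normal parser and attach whatever it finds to the parsed sequence
yourself.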

> 2. How can I extend the terms so that not only "SOME X-specific terms" are
> included, but some more? And how do I tell the parser to use and apply these
> terms? Or more generally, can I somehow read in an ontology (for instance
> the GO), persist it in BioSQL and make use of the terms contained therein?

It's a bit hard. I could have made this code easier to extend, I think
- I wasn't planning on non-standard versions when I wrote it!
Essentially the way to do this is to locate the appropriate
XYZFormat.Terms class in an IDE such as Eclipse or NetBeans, then find
a term similar to the one you want to use (in your case, you want to
add 'exon' so find something similar in the GenbankFormat.Terms
class), highlight it and do a 'find all usages'. That'll pretty
quickly point you to the parts of the code which use the term. Add
your new term to the XYZFormat.Terms class, then insert extra code in
all the parts that 'find all usages' highlighted.

> 3. How can I persist a sequence from Ensembl within a BioSQL database using
> Hibernate even though they use different accession numbers?

Find the regex that matches accessions and modify it to accept
Ensembl-style accessions. Then use 'find all usages' on the regex to
find the places that use it, and modify them to pick up the correct
groups from the regex and assign them to the data model. This matters
particularly if you reordered brackets etc. and therefore renumbered
the groups in the regex.
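
To make the group-renumbering point concrete, here is a small
stand-alone sketch. The pattern is NOT the one in the BioJavaX source;
it is a made-up example that assumes Ensembl stable IDs look like
"ENS", some letters, then 11 digits, and that GenBank-style accessions
are one or two letters plus digits with an optional ".version". It
uses named groups (available from Java 7), so the calling code keeps
working even if brackets are added or reordered later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not part of BioJava or BioJavaX.
class AccessionPatternSketch {

    // Assumed formats only: Ensembl-style (ENS + letters + 11 digits) or
    // GenBank-style (1-2 letters, optional underscore, 5-8 digits),
    // each optionally followed by ".version".
    private static final Pattern ACCESSION = Pattern.compile(
            "^(?<acc>ENS[A-Z]*\\d{11}|[A-Z]{1,2}_?\\d{5,8})(?:\\.(?<ver>\\d+))?$");

    // Returns {accession, version}; version is null when none is present.
    static String[] parse(String raw) {
        Matcher m = ACCESSION.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised accession: " + raw);
        }
        // Named groups are looked up by name, so reordering brackets
        // in the pattern does not change which text ends up where.
        return new String[] { m.group("acc"), m.group("ver") };
    }
}
```

With numbered groups you would instead have to re-check every
m.group(1), m.group(2) etc. in the code that uses the regex, which is
exactly what the 'find all usages' step is for.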

cheers,
Richard