[Biojava-l] Problem while parsing GenBank-like files and persisting them using Hibernate

Tue Apr 22 06:43:29 UTC 2008

Dear all,

Recently I downloaded some GenBank-like files from the Ensembl web site
(http://www.ensembl.org/index.html) and noticed that the format used
on this site diverges slightly from what one gets from NCBI.
In particular, the ACCESSION number is not valid according to the pattern
matcher in class org.biojavax.bio.seq.io.GenbankFormat, so the files
cannot be parsed using RichSequence.IOTools.
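
To make the accession problem concrete, here is a minimal sketch in
plain Java that checks ACCESSION lines before a file is handed to a
parser. Everything in it is an assumption for illustration: the class
name AccessionCheck, the sample Ensembl-style accession, and above all
the regular expression, which is only a loose approximation of
NCBI-style accessions and is NOT the pattern used inside GenbankFormat.

```java
import java.util.regex.Pattern;

final class AccessionCheck {

    // ASSUMPTION: loose stand-in for NCBI-style accessions such as
    // "U49845" or "NM_000546.5". Not copied from GenbankFormat.
    private static final Pattern NCBI_STYLE =
            Pattern.compile("[A-Z]{1,6}_?[0-9]+(\\.[0-9]+)?");

    // Returns the first token after the ACCESSION keyword,
    // or null if the line is not an ACCESSION line or has no value.
    static String extractAccession(String line) {
        if (!line.startsWith("ACCESSION")) {
            return null;
        }
        String rest = line.substring("ACCESSION".length()).trim();
        if (rest.isEmpty()) {
            return null;
        }
        return rest.split("\\s+")[0];
    }

    // True if the accession matches the approximate NCBI-style pattern.
    static boolean isNcbiStyle(String accession) {
        return accession != null && NCBI_STYLE.matcher(accession).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; the second imitates an Ensembl export.
        String[] lines = {
            "ACCESSION   U49845",
            "ACCESSION   chromosome:NCBI36:1:1:100000:1"
        };
        for (String line : lines) {
            String accession = extractAccession(line);
            System.out.println(accession + " -> NCBI-style: "
                    + isNcbiStyle(accession));
        }
    }
}
```

Such a check only reports which files carry an unusual accession; it
does not make them parseable by the BioJavaX classes.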
This issue has been discussed on this list before, but the suggested
solution was to use files from NCBI instead of those from Ensembl.
However, the Ensembl files are important because they contain
additional annotation not provided by NCBI, for instance the feature
"exon".
The old parsers from the org.biojava.bio.seq.io package are able to read
the files from this site. The Sequence objects can be enriched
afterwards and written to another GenBank file. However, this again
results in a file that cannot be stored in a BioSQL database using
Hibernate because of the invalid accession number. The next problem is
that even the old parsers do not treat this "rich" information from the
Ensembl files properly: the feature "exon" becomes "any" when the
sequence is enriched and written to a new GenBank file. Hence the
benefit of the Ensembl annotation is lost during parsing and
conversion. By the way, Ensembl also offers EMBL-like files and other
formats, which have the same problems as mentioned above.
On the other hand, no matter which parser in BioJavaX I look up in the
API documentation, I always find a corresponding "Term" class,
documented as "Implements some ...-specific terms", where the dots
stand for the format in question, such as UniProt, GenBank, Embl and
so forth. None of these Term classes provides any setters or
add-methods that would allow defining a new term like "exon". The
structure of the parsers seems very sophisticated to me, and it is not
easy to extend the parsers or term classes for one's own purposes.
Therefore, I would like to ask the following questions:
1. Is there a way to read in files downloaded from Ensembl using only 
the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms" 
are included, but some more? And how do I tell the parser to use and 
apply these terms? Or more generally, can I somehow read in an ontology 
(for instance the GO), persist it in BioSQL and make use of the terms 
contained therein?
3. How can I persist a sequence from Ensembl in a BioSQL database
using Hibernate even though Ensembl uses a different accession number
format?
I am grateful for any answers.

Cheers
Andreas