[Biojava-l] Problem while parsing GenBank-like files and persisting them using Hibernate
Andreas Dräger
andreas.draeger at uni-tuebingen.de
Tue Apr 22 06:43:29 UTC 2008
Dear all,
Recently I downloaded some GenBank-like files from the Ensembl web site
(http://www.ensembl.org/index.html) and noticed that the format used
on this site differs slightly from what one gets from NCBI.
In particular, the ACCESSION number is not valid according to the pattern
matcher in class org.biojavax.bio.seq.io.GenbankFormat, and the files
therefore cannot be parsed using RichSequence.IOTools.
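To make the mismatch concrete, here is a minimal sketch of a strict accession check. The regular expression is an assumption modelled on common NCBI accession shapes (one or two letters followed by digits, or a RefSeq-style two-letter prefix with an underscore, optionally with a version suffix); it is not the pattern actually used in GenbankFormat. The class name AccessionCheck and the coordinate-style value in the comment are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

public class AccessionCheck {
    // ASSUMPTION: an illustrative NCBI-style rule, not the pattern from GenbankFormat.
    // Accepts e.g. "U49845", "AB123456", "NM_000546", each with an optional ".version".
    private static final Pattern NCBI_STYLE =
            Pattern.compile("^([A-Z]{1,2}\\d{5,8}|[A-Z]{2}_\\d{6,9})(\\.\\d+)?$");

    /** Returns true if the accession has one of the assumed NCBI-style shapes. */
    public static boolean isNcbiStyle(String accession) {
        return accession != null && NCBI_STYLE.matcher(accession.trim()).matches();
    }

    public static void main(String[] args) {
        // A hypothetical coordinate-style value, as a stand-in for a non-NCBI accession.
        String[] samples = {"U49845", "NM_000546.5", "chromosome:NCBI36:13:100:200:1"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNcbiStyle(s));
        }
    }
}
```

A check of this kind accepts the first two samples and rejects the third, which is the sort of failure described above.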
This issue has already been discussed on this list before, but the
suggested solution was to use files from NCBI instead of those from
Ensembl. However, the files from Ensembl are important because they
contain additional annotation that NCBI does not provide, for instance
the feature "exon".
The old parsers from the biojava.seq.io package are able to read in the
files from this site. The Sequence objects can be enriched afterwards
and be written to another GenBank file. However, this again results in a
file that cannot be stored in a BioSQL database using Hibernate because
of the invalid accession number. The next problem is that even the old
parsers do not treat this "rich" information from the Ensembl files
properly. The feature "exon" becomes "any" when the sequence is enriched
and written to a new GenBank file. Hence the benefit of the Ensembl
annotation gets lost during parsing and conversion. By the way, Ensembl
also offers EMBL-like files and other formats, which show the same
problems as mentioned above.
On the other hand, no matter which parser in BioJavaX I look up in
the API documentation, I can always find a corresponding "Term" class,
which states that this class "Implements some ...-specific terms", where
the dots stand for the format in question, such as UniProt, GenBank,
Embl and so forth. None of these Term classes provides any setters or
add-methods that would allow defining a new term like "exon". The
structure of the parsers seems very sophisticated to me, and it is
not easy to extend the parsers or term classes for one's own purposes.
Therefore, I would like to ask the following questions:
1. Is there a way to read in files downloaded from Ensembl using only
the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms"
are included, but some more? And how do I tell the parser to use and
apply these terms? Or more generally, can I somehow read in an ontology
(for instance the GO), persist it in BioSQL and make use of the terms
contained therein?
3. How can I persist a sequence from Ensembl in a BioSQL database
using Hibernate even though Ensembl uses a different accession number
format?
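As an illustration of the kind of workaround questions 1 and 3 are asking about, the following sketch rewrites the ACCESSION line of a flat file before it is handed to any parser. It assumes, without verification, that characters other than letters, digits and underscores are what make the accession unacceptable. The class AccessionRewriter and the sample values are hypothetical, and whether the rewritten file then passes GenbankFormat or can be persisted in BioSQL is not established here.

```java
import java.util.ArrayList;
import java.util.List;

public class AccessionRewriter {
    private static final String TAG = "ACCESSION";

    /**
     * Returns a copy of the given lines in which the value on each ACCESSION
     * line has every character outside [A-Za-z0-9_] replaced by an underscore.
     * ASSUMPTION: such characters are the reason the accession is rejected.
     */
    public static List<String> rewrite(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(TAG)) {
                String value = line.substring(TAG.length()).trim();
                String safe = value.replaceAll("[^A-Za-z0-9_]", "_");
                // GenBank keyword lines start their value in column 13.
                out.add(TAG + "   " + safe);
            } else {
                out.add(line);
            }
        }
        return out;
    }
}
```

The original accession string would have to be kept elsewhere (for example as a note on the sequence) if it is needed later, since this rewrite is lossy.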
I am grateful for any answers.
Cheers
Andreas