[Biojava-dev] How to parse large Genbank files?

Mark Schreiber markjschreiber at gmail.com
Wed Aug 5 13:16:03 UTC 2009


Would it be better for the biojava SimpleRichSequence to be backed by a
String and do symbol operations on the fly? Alternatively the default
hibernate mapping could be to a more stringy sequence.

Arguably in the absence of JPA and entity beans Hibernate should probably be
talking to biojava via DTOs. An efficient BioSQL loader would directly use
the DTOs or Entity beans (which could implement biojava interfaces) and not
go through all the symbol hassle.

Might be worth considering for BJ3.

- Mark

On Aug 5, 2009 8:45 PM, "Florian Mittag" <florian.mittag at uni-tuebingen.de>
wrote:

On Tuesday, 28. July 2009 14:52, Richard Holland wrote:
> > Btw: Should we move this to Biojava-dev?...

done ;)

> If you want to explore my ideas for a replacement Sequence model, the
> code and docs are here (...

So far, I have mostly been interested in a quick and dirty solution. I first
attempted to create a new class, StringSymbolList, that would use the String
as the representation of the sequence and convert to Symbols only on demand.
Since SimpleRichSequence uses SimpleSymbolList hard-coded, I wanted to
implement a new RichSequence as well, but I was back-stabbed by Hibernate:
the bindings are set to SimpleRichSequence, so when retrieving objects from
the DB it uses the original BioJava classes again.
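
To illustrate the String-backed idea, here is a minimal sketch, not the actual StringSymbolList. Nucleotide and StringBackedSequence are hypothetical stand-ins and do not implement the BioJava Symbol or SymbolList interfaces; only the 1-based indexing and the method names symbolAt, subStr and seqString are borrowed from the SymbolList convention.

```java
/** Hypothetical stand-in for a nucleotide symbol; not BioJava's Symbol interface. */
enum Nucleotide {
    A, C, G, T, N;

    /** Maps a character to a symbol; anything unrecognised becomes N in this sketch. */
    static Nucleotide fromChar(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default:  return N;
        }
    }
}

/** Keeps the whole sequence as one String and creates symbols only when asked. */
final class StringBackedSequence {
    private final String seq;

    StringBackedSequence(String seq) {
        this.seq = seq;
    }

    int length() {
        return seq.length();
    }

    /** 1-based access: index 1 is the first residue. */
    Nucleotide symbolAt(int index) {
        if (index < 1 || index > seq.length()) {
            throw new IndexOutOfBoundsException("index " + index + " not in 1.." + seq.length());
        }
        return Nucleotide.fromChar(seq.charAt(index - 1));
    }

    /** 1-based, inclusive range, answered straight from the backing String. */
    String subStr(int start, int end) {
        return seq.substring(start - 1, end);
    }

    /** Returns the backing String directly; no per-symbol array is built or copied. */
    String seqString() {
        return seq;
    }
}
```

With this layout a whole chromosome costs roughly one String, and symbol objects exist only for the positions that are actually queried.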

My solution now works. It consists of my own implementations of
GenbankFormat, RichSequenceBuilder, and RichSequence, a new class called
StringSymbolList as described above, and a change to SimpleRichSequence,
adding the method:

@Override
public String seqString() {
   return seqstring;
}

which circumvents most of the array copying stuff.

I also noticed that processing the Genbank files became slower with every
file, so I closed the Hibernate session after each chromosome and opened a
new one. (I also tried session.clear(), but somehow this didn't work).
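
The session-per-chromosome pattern can be sketched roughly as follows. FakeSession and PerFileLoader are hypothetical stand-ins so that the example compiles without Hibernate on the classpath; with Hibernate itself the corresponding calls would be SessionFactory.openSession() and Session.close(). The point is only that whatever a session tracks is released before the next file is read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a persistence session that tracks every saved entity. */
final class FakeSession implements AutoCloseable {
    private final List<Object> tracked = new ArrayList<>();

    void save(Object entity) {
        tracked.add(entity);
    }

    int trackedCount() {
        return tracked.size();
    }

    /** Closing the session drops everything it was tracking. */
    @Override
    public void close() {
        tracked.clear();
    }
}

final class PerFileLoader {
    /**
     * Loads each file (here just a list of records) in its own session and
     * returns the largest number of entities any single session tracked.
     */
    static int loadAll(List<List<String>> files, Supplier<FakeSession> sessions) {
        int peak = 0;
        for (List<String> records : files) {
            // A fresh session per file: try-with-resources closes it before the next file.
            try (FakeSession session = sessions.get()) {
                for (String record : records) {
                    session.save(record);
                }
                peak = Math.max(peak, session.trackedCount());
            }
        }
        return peak;
    }
}
```

In this sketch the peak is bounded by the largest single file rather than by the total number of records loaded.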

For now, it seems like everything is fine and I have no more OutOfMemory
exceptions.

- Florian

> > cheers,
> > Richard
> >
> > > - Florian
> > >
> > >> On Mon, Jul 27, 2009 at 8:16 PM, Florian
> > >>
> > >> ...
> >>>>ng _fi les but it wont work. While there are no problems parsing 1804
> >>>> and 24, chromosome 23 leads to a OutOfMemory exception
> >>>> although I gave it 2GB o...
--

Dipl. Inf. Florian Mittag Universität Tuebingen WSI-RA, Sand 1 72076
Tuebingen, Germany Phone: +49 7...



