[Biojava-l] How to parse large Genbank files?

Tue Jul 28 12:14:54 UTC 2009

Hi!

On Tuesday, 28. July 2009 05:05, you wrote:
> While you maybe can't do it without code changes you can probably do
> it within the existing framework.  If you look at the readGenbank()
> code in RichSequence.IOTools you will find that the BioJava file
> parsing consists of many pluggable components which are all defined by
> interfaces. Anything that implements one of those interfaces can be
> plugged into the parsing framework.  So if you want you can change
> the Format object to one of your custom design (which implements
> Format), you can also change the event listeners and the
> SequenceBuilders. In your case the SequenceBuilder might be something
> to look at, it sounds like you don't need to create all the extra
> Sequence objects for every feature so you could modify that part.

Yeah, I see what you mean. I wanted to start with something simple because I 
didn't want to code everything myself, but it seems I won't get around 
that if I want to optimize it.

> Also, in the Format objects there are often methods called elideXXX()
> which let you tell the Format object to skip over bits that you don't
> want.

I think I want everything, since I want to store everything in the BioSQL db 
afterwards. I don't think I can skip anything.

> Finally, I suspect the problem with memory use is that the String,
> char[], SymbolList, Sequence copying is both inefficient and worse
> still is probably not releasing resources in a timely fashion. Eg once
> the parser framework converts a char[] to a SymbolList it probably no
> longer needs that char[] reference and might be able to null it. Then
> when memory gets low the GC can clean out all the cruft.
>
> If I have a chance I will run a profiler to see what is sucking up the
> memory (and what can be released) and also see if all that copying is
> making a significant impact on CPU cycles (if not it's probably more
> effort than it's worth to change). The memory thing definitely needs
> to change though.

It turned out our workgroup has a floating JProfiler license, so I did some 
tests and got some clues on where to optimize further. The NetBeans profiler 
reported that most of the memory was consumed by char[], but it only showed 
about 130MB of usage, even though the 2GB heap was full.
So our idea was that the memory management overhead might be another sink 
where the memory vanished. JProfiler then returned more plausible results, 
with nearly 800MB used by char[].
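
For what it's worth, here is a rough back-of-envelope sketch of how a file of 
about 190MB can fill a 2GB heap. Everything in it is an assumption rather than 
a measurement: I treat the whole file as about 190 million nucleotides (an upper 
bound), count 2 bytes per Java char, and try both 4 and 8 bytes per object 
reference. Object headers and any per-Symbol overhead are ignored.

```java
// Back-of-envelope estimate only. All per-element sizes are assumptions,
// and object headers and per-Symbol overhead are ignored.
class HeapEstimate {
    static final long BYTES_PER_CHAR = 2; // a Java char is 16 bits wide

    /** Bytes for n nucleotides held as chars (char[], StringBuffer, ...). */
    static long charBytes(long nucleotides) {
        return nucleotides * BYTES_PER_CHAR;
    }

    /** Bytes for an array of n object references, e.g. a Symbol[]. */
    static long referenceBytes(long nucleotides, int bytesPerReference) {
        return nucleotides * bytesPerReference;
    }

    public static void main(String[] args) {
        long n = 190_000_000L; // assumed length, an upper bound from the file size
        long mb = 1_000_000L;
        System.out.println("one char copy:         " + charBytes(n) / mb + " MB");
        System.out.println("two char copies:       " + 2 * charBytes(n) / mb + " MB");
        System.out.println("Symbol[] (4-byte ref): " + referenceBytes(n, 4) / mb + " MB");
        System.out.println("Symbol[] (8-byte ref): " + referenceBytes(n, 8) / mb + " MB");
    }
}
```

Under these assumptions two simultaneous char copies of the sequence are 
already in the same ballpark as the JProfiler figure, and a single Symbol[] on 
top of that could plausibly use up most of the remaining heap.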

I tweaked both the way I call the parser and the GenbankFormat itself, and now 
all files except chromosome 1 (300MB) will parse successfully.

To reduce the memory used by SymbolLists, I did the following:

// Use packed symbol storage in the sequence builder
PackedSymbolListFactory pslf = new PackedSymbolListFactory();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder(pslf);
GenbankFormat gf = new GenbankFormat();
// Parse with the format directly, passing in the custom builder
gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
RichSequence seq = listener.makeRichSequence();

The PackedSymbolListFactory seemed to save some memory, but it still 
wasn't enough.
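
To illustrate why packing can help at all, here is a minimal sketch of the 
general idea: unambiguous DNA needs only 2 bits per base, compared to 16 bits 
per char or a whole reference per Symbol. This is NOT how BioJava's packed 
symbol lists are implemented (I have not checked their internals), and the 
class name TwoBitDna is made up for this example. It also cannot represent 'n' 
or other ambiguity codes, which real chromosome files contain.

```java
// Illustration only: packs unambiguous DNA (a, c, g, t) into 2 bits per base.
// Hypothetical class, not part of BioJava.
class TwoBitDna {
    private final long[] words; // 32 bases per 64-bit word
    private final int length;

    TwoBitDna(String dna) {
        length = dna.length();
        words = new long[(length + 31) / 32];
        for (int i = 0; i < length; i++) {
            words[i / 32] |= encode(dna.charAt(i)) << ((i % 32) * 2);
        }
    }

    private static long encode(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default:
                throw new IllegalArgumentException("not an unambiguous base: " + c);
        }
    }

    /** Returns the base at position i (0-based) as a lower-case char. */
    char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = (int) ((words[i / 32] >>> ((i % 32) * 2)) & 3L);
        return "acgt".charAt(code);
    }

    int length() {
        return length;
    }

    /** Bytes used by the packed payload, ignoring the array header. */
    long packedBytes() {
        return words.length * 8L;
    }

    public static void main(String[] args) {
        TwoBitDna dna = new TwoBitDna("gattaca");
        System.out.println(dna.charAt(0) + " " + dna.length() + " " + dna.packedBytes());
    }
}
```

The saving only applies to the final storage, though. As long as the parser 
builds full-size intermediate copies first, the peak memory use stays high.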

I then modified the readSection() method of GenbankFormat. What it usually 
does is put each line of nucleotide sequence into a String[], which 
it then adds to the ArrayList returned by the method. Since there are only 60 
nucleotides (so 60 characters plus whitespace) per line, the resulting list 
gets very large for a whole chromosome.
I modified it to build one large String containing only the nucleotide 
characters, instead of returning the list of arrays and then having the 
readRichSequence() method build this large String.
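
The idea behind that change can be sketched roughly as follows. This is not 
the actual readSection() code, only a simplified stand-in: the class and method 
names are made up, it assumes it is handed the lines that follow the ORIGIN 
keyword, and the expectedLength hint (which could come from the LOCUS line, for 
example) is hypothetical.

```java
// Sketch of the idea only, not the actual GenbankFormat.readSection() code.
class SequenceLines {
    /**
     * Concatenates the letters of all sequence lines up to the "//" terminator.
     * expectedLength is a hypothetical hint so that the builder does not have
     * to grow and copy its buffer repeatedly.
     */
    static String joinSequence(Iterable<String> lines, int expectedLength) {
        StringBuilder sb = new StringBuilder(Math.max(expectedLength, 16));
        for (String line : lines) {
            if (line.startsWith("//")) {
                break; // end of record
            }
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (Character.isLetter(c)) { // skips position numbers and blanks
                    sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        java.util.List<String> lines = java.util.Arrays.asList(
            "        1 gatcctccat atacaacggt",
            "       21 atctccacct",
            "//");
        System.out.println(joinSequence(lines, 30));
    }
}
```

Note that the final toString() still makes one more copy of the characters, so 
even this version briefly holds the sequence twice.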

All of this still isn't enough: the program still fails at sl.toArray(), so I 
agree with Richard that we should keep the sequence as a String (maybe using 
the Symbol(List) mechanisms to check for invalid characters) and only convert 
it to Symbol objects if really necessary.
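
Just to show what I mean by checking a String without creating Symbol objects, 
here is a minimal sketch. It does not use BioJava's alphabet or tokenization 
classes at all. The class name, the method name and the choice to accept the 
IUPAC nucleotide codes case-insensitively are my own assumptions for this 
example.

```java
// Sketch only: validates a nucleotide string in place, without creating one
// object per symbol. Hypothetical class, not part of BioJava.
class DnaValidator {
    // IUPAC nucleotide codes in lower case (assumption: case is ignored)
    private static final String ALLOWED = "acgtunrykmswbdhv";

    /** Returns the index of the first invalid character, or -1 if all are valid. */
    static int firstInvalidIndex(CharSequence seq) {
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toLowerCase(seq.charAt(i));
            if (ALLOWED.indexOf(c) < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("acgtnnnnACGT"));
        System.out.println(firstInvalidIndex("acgt-acgt"));
    }
}
```

The conversion to Symbol objects could then be deferred until somebody 
actually asks for them.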

Btw: Should we move this to Biojava-dev?
And where do I sign up for BioJava3 development? ;-)

- Florian

> On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag
> <florian.mittag at uni-tuebingen.de> wrote:
> > Hi Mark!
> >
> > On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> >> I don't think anyone has done much or anything to optimize these
> >> parsers. The process you outline sounds extremely inefficient. It is
> >> also likely to lead to memory leaks due to the number of copy
> >> operations.
> >
> > I wouldn't necessarily say that it leads to memory leaks, but it
> > definitely leads to high memory consumption (2GB is not enough for a
> > 200MB file). Also, my outline of the process is based on only 2 hours of
> > reading the code, so I actually expected to be corrected on this.
> > Unfortunately, it seems I did get the right idea and it IS extremely
> > inefficient.
> >
> > I mean, I understand that this is a high level of abstraction that might
> > come in handy in many situations, but it certainly is more of an obstacle
> > in my specific case.
> >
> >> As always with java, don't try and optimize without a profiler which
> >> will tell you which methods are taking a long time and which objects
> >> take the most memory.
> >
> > I think we should continue this discussion on the biojava-dev list or in
> > a private conversation, as it will probably get very detailed and
> > technical.
> >
> >
> > My question to this list again:
> > Is there a way to achieve my goal of parsing a 200MB Genbank file with
> > the current biojava version without code changes?
> >
> >
> > - Florian
> >
> >> On 25 Jul 2009, 1:33 AM, "Florian Mittag"
> >> <florian.mittag at uni-tuebingen.de> wrote:
> >>
> >> Hi!
> >>
> > >> I think this problem deserves its own thread, so I'll start one:
> > >>
> > >> I want to store all human chromosomes in a BioSQL database after
> > >> loading the information from .gbk files. I get the files from NCBI
> > >> using URIs of the following form, where the id ranges from nc_000001
> > >> to nc_000024, plus nc_001804:
> >>
> > >> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
> >>
> > >> I then try to parse the files as described in
> > >> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
> > >> but it won't work. While there are no problems parsing 1804 and 24,
> > >> chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of
> > >> heap space.
> > >>
> > >> Here is a stack trace (the line numbers might differ, because I have
> > >> already tried to improve the memory efficiency of GenbankFormat.java):
> >>
> > >> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >>        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
> > >>        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
> > >>        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
> > >>        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> > >>        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
> > >>        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
> > >>        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
> >>
> >> The line in GenbankFormat.java is:
> >>
> >> rlistener.addSymbols(
> >>        symParser.getAlphabet(),
> >>        (Symbol[])(sl.toList().toArray(new Symbol[0])),
> >>        0, sl.length());
> >>
> > >> Sometimes it fails at the sl.toList().toArray() part, sometimes it fails
> > >> later inside the addSymbols method, but it always fails.
> >>
> >> How can this be? I mean, the file is only 190MB in size, so 2GB of
> >> memory should be more than enough. Browsing through the source code, I
> >> discovered what I think of as very inefficient handling of sequences:
> >>
> >> 1) the sequence string is read from file into a StringBuffer
> >> 2) it is converted to a string (with whitespaces removed)
> >> 3) a SimpleSymbolList is created out of the string
> >> 4) the SymbolList is converted to a List of Symbols
> >> 5) the List is converted to an array of Symbols
> >> 6) the array is passed to addSymbols
> >> 7) there it is added to a ChunkedSymbolListFactory
> >> 8) if at some point the sequence is requested, a SymbolList is created
> >> and then converted to a string.
> >>
> > >> You see, there is a lot of copying and converting, but in the end I have
> > >> the same string I started with. Well, I would have the string if it ever
> > >> reached the end, but it crashes before completing this process.
> >>
> >>
> > >> Am I doing something wrong, or is there great potential for improving
> > >> the parsing of Genbank files?
> >>
> >>
> >> Regards,
> >>   Florian
> >
> > --
> > Dipl. Inf. Florian Mittag
> > Universität Tuebingen
> > WSI-RA, Sand 1
> > 72076 Tuebingen, Germany
> > Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
