[Biojava-l] How to parse large Genbank files?

Mon Jul 27 12:16:33 UTC 2009

Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> I don't think anyone has done much or anything to optimize these parsers.
> The process you outline sounds extremely inefficient. It is also likely to
> lead to memory leaks due to the number of copy operations.

I wouldn't necessarily say that it leads to memory leaks, but it definitely
leads to high memory consumption (2GB is not enough for a 200MB file).
Also, my outline of the process is based on only two hours of reading the
code, so I actually expected to be corrected on this.
Unfortunately, it seems like I did get the right idea and it IS extremely 
inefficient.
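
To put a rough number on it, here is a back-of-the-envelope estimate of the
heap needed if the intermediate copies from my outline (quoted below) are all
alive at the same time. This is only a sketch under assumptions: about 155
million residues for chromosome 23, 2 bytes per char, 4 or 8 bytes per object
reference, and no extra copy made by toList(). The class and method names are
made up for illustration and are not part of biojava.

```java
class CopyChainEstimate {
    // Rough heap estimate (in bytes) for a sequence of n residues, assuming
    // every intermediate representation is held in memory at once.
    static long estimateBytes(long n, int refBytes) {
        long stringBuffer = 2L * n;           // char data in the StringBuffer
        long string = 2L * n;                 // String copy without whitespace
        long refs = (long) refBytes * n;      // one Symbol reference per residue
        long symbolList = refs;               // references held by the SymbolList
        long symbolArray = refs;              // Symbol[] produced by toArray()
        long chunkedFactory = refs;           // copy kept by the chunked factory
        return stringBuffer + string + symbolList + symbolArray + chunkedFactory;
    }

    public static void main(String[] args) {
        long n = 155L * 1000 * 1000;          // assumed residue count
        System.out.println("4-byte refs: " + estimateBytes(n, 4) + " bytes");
        System.out.println("8-byte refs: " + estimateBytes(n, 8) + " bytes");
    }
}
```

With 4-byte references this comes to 16 bytes per residue, i.e. roughly 2.5GB
for 155 million residues, which would already exceed a 2GB heap. The real
numbers depend on the JVM and on how long each copy stays reachable.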

I mean, I understand that this is a high level of abstraction that might come 
in handy in many situations, but it certainly is more of an obstacle in my 
specific case.
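
To illustrate what I mean by a lower level of abstraction: the sequence lines
could be handed on in fixed-size chunks instead of being copied as a whole
several times. The following is a minimal sketch in plain Java, not biojava
code. ChunkedSequenceReader and ChunkHandler are hypothetical names, and it
assumes the reader is already positioned at the first sequence line after
ORIGIN.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ChunkedSequenceReader {
    /** Hypothetical callback that receives the residues chunk by chunk. */
    interface ChunkHandler {
        // The buffer is reused, so the handler must consume it before returning.
        void handle(char[] buf, int length);
    }

    // Reads Genbank-style sequence lines until the "//" terminator, keeps only
    // letters, and passes them on in chunks of at most chunkSize characters.
    // Returns the total number of residues seen.
    static long stream(Reader in, int chunkSize, ChunkHandler handler)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        char[] chunk = new char[chunkSize];
        int filled = 0;
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null && !line.startsWith("//")) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (!Character.isLetter(c)) {
                    continue; // skip position numbers and whitespace
                }
                chunk[filled++] = c;
                total++;
                if (filled == chunkSize) {
                    handler.handle(chunk, filled);
                    filled = 0;
                }
            }
        }
        if (filled > 0) {
            handler.handle(chunk, filled); // flush the last partial chunk
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        String lines = "        1 gatcctccat atacaacggt\n"
                     + "       21 atctccacct\n"
                     + "//\n";
        final StringBuilder seen = new StringBuilder();
        long n = stream(new StringReader(lines), 8, new ChunkHandler() {
            public void handle(char[] buf, int length) {
                seen.append(buf, 0, length);
            }
        });
        System.out.println(n + " residues: " + seen);
    }
}
```

In this form only one chunk is held at a time besides whatever the handler
decides to keep, so memory use no longer grows with each conversion step.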

> As always with java, don't try and optimize without a profiler which will
> tell you which methods are taking a long time and which objects take the
> most memory.

I think we should continue this discussion on the biojava-dev list or in a 
private conversation, as it will probably get very detailed and technical.

My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with the 
current biojava version without code changes?

- Florian

> On 25 Jul 2009, 1:33 AM, "Florian Mittag" <florian.mittag at uni-tuebingen.de>
> wrote:
>
> Hi!
>
> I think this is a problem worthy of its own thread, so I'll start one:
>
> I want to store all human chromosomes in a BioSQL database after loading
> the information from .gbk files. I get the files from NCBI using URIs of
> the following form, where the id ranges from nc_000001 to nc_000024, plus
> nc_001804:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
>
> I then try to parse the files as described in
> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
> but it won't work. While there are no problems parsing 1804 and 24,
> chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of heap
> space.
>
> Here is a stack trace (the line numbers might differ, because I already
> tried to improve the memory efficiency of GenbankFormat.java):
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
>        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
>        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
>        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
>        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
>        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
>
> The line in GenbankFormat.java is:
>
> rlistener.addSymbols(
>        symParser.getAlphabet(),
>        (Symbol[])(sl.toList().toArray(new Symbol[0])),
>        0, sl.length());
>
> Sometimes it fails at the sl.toList().toArray() part, sometimes it fails
> later inside the addSymbols method, but it always fails.
>
> How can this be? I mean, the file is only 190MB in size, so 2GB of memory
> should be more than enough. Browsing through the source code, I discovered
> what I think of as very inefficient handling of sequences:
>
> 1) the sequence string is read from file into a StringBuffer
> 2) it is converted to a string (with whitespace removed)
> 3) a SimpleSymbolList is created out of the string
> 4) the SymbolList is converted to a List of Symbols
> 5) the List is converted to an array of Symbols
> 6) the array is passed to addSymbols
> 7) there it is added to a ChunkedSymbolListFactory
> 8) if at some point the sequence is requested, a SymbolList is created and
> then converted to a string.
>
> You see, there is a lot of copying and converting, but in the end I have
> the same string I started with. Well, I would have the string if the
> process ever reached the end, but it crashes before completing.
>
>
> Am I doing something wrong, or is there great potential for improving the
> parsing of Genbank files?
>
>
> Regards,
>   Florian
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091