[Biojava-l] How to parse large Genbank files?
Florian Mittag
florian.mittag at uni-tuebingen.de
Fri Jul 24 17:29:08 UTC 2009
Hi!
I think this problem deserves its own thread, so I'll start one:
I want to store all human chromosomes in a BioSQL database after loading the
information from .gbk files. I get the files from NCBI using URIs of the
following form, where the id ranges from nc_000001 to nc_000024 plus nc_001804:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
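To make the id range concrete, here is a minimal sketch that builds the 25 URIs. The class and method names (EfetchUrls, urlFor, humanGenomeIds) are made up for illustration; only the URI pattern and the id range come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds the efetch URIs for nc_000001..nc_000024 plus nc_001804.
class EfetchUrls {
    static final String BASE =
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    // Same query parameters as in the URI shown above; only the id varies.
    static String urlFor(String id) {
        return BASE + "?db=nuccore&id=" + id + "&rettype=gbwithparts&retmode=text";
    }

    // Ids are zero-padded to six digits, e.g. 23 becomes nc_000023.
    static List<String> humanGenomeIds() {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= 24; i++) {
            ids.add(String.format(Locale.ROOT, "nc_%06d", i));
        }
        ids.add("nc_001804");
        return ids;
    }

    public static void main(String[] args) {
        for (String id : humanGenomeIds()) {
            System.out.println(urlFor(id));
        }
    }
}
```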
I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1804 and 24, chromosome
23 leads to an OutOfMemoryError, although I gave it 2GB of heap space.
Here is a stack trace (the line numbers might differ, because I already tried
to improve the memory efficiency of GenbankFormat.java):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
    at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
The line in GenbankFormat.java is:
    rlistener.addSymbols(
        symParser.getAlphabet(),
        (Symbol[])(sl.toList().toArray(new Symbol[0])),
        0, sl.length());
Sometimes it fails in the sl.toList().toArray() call, sometimes it fails later
inside the addSymbols method, but it always fails.
How can this be? I mean, the file is only 190MB in size, so 2GB of memory
should be more than enough. Browsing through the source code, I discovered
what looks to me like very inefficient handling of sequences:
1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespace removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created and
then converted to a string.
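A rough back-of-the-envelope estimate shows why these copies could exhaust 2GB. The sketch below is only an estimate under stated assumptions: about 150 million bases (an assumed round figure, not measured from the file), 2 bytes per Java char for the StringBuffer and the String, and 4 or 8 bytes per object reference depending on the JVM. It counts two character-based copies (steps 1 and 2) and only two reference-based copies (for example the List and the array), ignores object headers and buffer over-allocation, and does not claim that all copies are live at the same moment. The class name CopyFootprint is made up for illustration.

```java
// Hypothetical estimator for the memory held by the intermediate copies.
class CopyFootprint {
    // One Java char is 2 bytes, so a char-based copy needs 2 bytes per base.
    static long charCopyBytes(long bases) {
        return bases * 2L;
    }

    // A List or array of Symbol references needs one reference per base.
    static long referenceCopyBytes(long bases, int bytesPerReference) {
        return bases * bytesPerReference;
    }

    // Two char-based copies plus two reference-based copies (assumption).
    static long totalBytes(long bases, int bytesPerReference) {
        return 2 * charCopyBytes(bases)
                + 2 * referenceCopyBytes(bases, bytesPerReference);
    }

    public static void main(String[] args) {
        long bases = 150_000_000L; // assumed sequence length
        for (int refBytes : new int[] {4, 8}) {
            long mb = totalBytes(bases, refBytes) / 1_000_000L;
            System.out.println(refBytes + "-byte references: about " + mb + " MB");
        }
    }
}
```

Under these assumptions the total is already close to or above the 2GB heap, even though the file itself is only 190MB.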
You see, there is a lot of copying and converting, but in the end I have the
same string I started with. Or rather, I would have it if the process ever
reached the end, but it crashes before completing.
Am I doing something wrong, or is there great potential for improving the
parsing of Genbank files?
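One direction that might avoid the full-size copies is to hand the sequence to the consumer in small fixed-size chunks while reading, instead of accumulating it first. The following is a minimal sketch of that idea, not BioJava code: ChunkedOriginReader and the ChunkSink callback are hypothetical names, the reader is assumed to be positioned just after the ORIGIN line, and every letter is treated as a sequence character while digits and whitespace are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: stream GenBank sequence lines to a sink in small chunks.
class ChunkedOriginReader {
    // Hypothetical callback (not a BioJava interface); buf is reused between calls.
    interface ChunkSink {
        void accept(char[] buf, int length);
    }

    // Reads lines until "//" or end of input and returns the number of bases seen.
    static long stream(BufferedReader in, int chunkSize, ChunkSink sink)
            throws IOException {
        char[] chunk = new char[chunkSize];
        int filled = 0;
        long total = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("//")) {
                break; // end-of-record marker
            }
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (!Character.isLetter(c)) {
                    continue; // skip position numbers and spaces
                }
                chunk[filled++] = c;
                total++;
                if (filled == chunkSize) {
                    sink.accept(chunk, filled);
                    filled = 0;
                }
            }
        }
        if (filled > 0) {
            sink.accept(chunk, filled); // flush the last partial chunk
        }
        return total;
    }

    // Convenience for small inputs: concatenates the chunks again (demo only).
    static String collect(String text, int chunkSize) {
        StringBuilder out = new StringBuilder();
        try {
            stream(new BufferedReader(new StringReader(text)), chunkSize,
                    (buf, len) -> out.append(buf, 0, len));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "        1 gatcctccat atacaacggt\n       21 atctccacct\n//\n";
        System.out.println(collect(sample, 8));
    }
}
```

With this shape, the peak memory of the reading stage is bounded by the chunk size rather than by the sequence length; whether the downstream builder can accept symbols chunk by chunk is a separate question.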
Regards,
Florian