[Biojava-l] How to parse large Genbank files?

Mark Schreiber markjschreiber at gmail.com
Sat Jul 25 02:20:14 UTC 2009


Hi-

I don't think anyone has done much, if anything, to optimize these parsers.
The process you outline sounds extremely inefficient. It is also likely to
waste a lot of memory because of the number of copy operations.

As always with Java, don't try to optimize without a profiler, which will
tell you which methods take the longest and which objects use the most
memory.
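
A profiler remains the right tool, but for a quick first look at which step
makes the heap grow, something like the following rough sketch can help. It
uses only the standard Runtime methods; the class HeapProbe and the helper
measure() are names made up for this illustration and are not part of
BioJava. The numbers are approximate because System.gc() is only a hint to
the JVM.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of BioJava: a crude heap-growth probe.
final class HeapProbe {

    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs one step and prints the approximate change in used heap. */
    static <T> T measure(String label, Supplier<T> step) {
        System.gc(); // only a hint; the JVM may ignore it
        long before = usedHeapBytes();
        T result = step.get();
        long after = usedHeapBytes();
        System.out.printf("%s: %+d MB%n", label, (after - before) / (1024 * 1024));
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for one parsing step: allocate 50 million chars.
        char[] bases = measure("allocate 50M chars", () -> new char[50_000_000]);
        System.out.println("length = " + bases.length);
    }
}
```

Wrapping each conversion step of the parser in measure() would show roughly
where the memory goes before reaching for a full profiler.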

- Mark

On 25 Jul 2009, 1:33 AM, "Florian Mittag" <florian.mittag at uni-tuebingen.de>
wrote:

Hi!

I think this problem is worthy of its own thread, so I'll start one:

I want to store all human chromosomes in a BioSQL database after loading the
information from .gbk files. I get the files from NCBI using URIs of the
following form, where the id ranges from nc_000001 to nc_000024, plus
nc_001804:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text

I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1804 and 24,
chromosome 23 leads to an OutOfMemoryError, although I gave it 2GB of heap
space.

Here is a stack trace (the line numbers might differ, because I already
tried to improve GenbankFormat.java in memory efficiency):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
       at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
       at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
       at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
       at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
       at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)

The line in GenbankFormat.java is:

rlistener.addSymbols(
       symParser.getAlphabet(),
       (Symbol[])(sl.toList().toArray(new Symbol[0])),
       0, sl.length());

Sometimes it fails at the sl.toList().toArray() part, sometimes it fails
later inside the addSymbols method, but it always fails.

How can this be? I mean, the file is only 190MB in size, so 2GB of memory
should be more than enough. Browsing through the source code, I discovered
what I think of as very inefficient handling of sequences:

1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespace removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created and
then converted to a string.
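
To put rough numbers on the steps above, here is a back-of-the-envelope
sketch. Everything in it is an assumption made for illustration: the
sequence length of about 155 million bases (an approximate figure for human
chromosome X, not taken from the file), the idea that each character-based
step holds one full copy at 2 bytes per char, and the idea that each
Symbol-based step holds one object reference per base (4 or 8 bytes,
depending on the JVM). How many of these copies are actually alive at the
same time depends on BioJava internals that I have not verified.

```java
// Hypothetical estimate only; CopyCostEstimate is not part of BioJava.
final class CopyCostEstimate {

    static final long CHAR_BYTES = 2; // a Java char is 16 bits

    /**
     * Bytes needed to hold the sequence several times over.
     *
     * @param bases      number of bases in the sequence
     * @param charCopies full copies held as chars (StringBuffer, String)
     * @param refCopies  full copies held as arrays of object references
     * @param refBytes   size of one object reference (4 or 8)
     */
    static long totalBytes(long bases, int charCopies, int refCopies, int refBytes) {
        return bases * (CHAR_BYTES * charCopies + (long) refBytes * refCopies);
    }

    public static void main(String[] args) {
        long bases = 155_000_000L; // assumed length, see the text above
        for (int refBytes : new int[] {4, 8}) {
            long bytes = totalBytes(bases, 2, 2, refBytes);
            System.out.printf("2 char copies + 2 reference copies at %d bytes/ref: %d MB%n",
                    refBytes, bytes / (1024 * 1024));
        }
    }
}
```

Under these assumptions, two reference-based copies with 8-byte references
already need more than 2GB on their own, which would be consistent with the
failure at the toArray() call or shortly after it.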

You see, there is a lot of copying and converting, but in the end I have the
same string I started with. Well, I would have the string if the process
ever reached the end, but it crashes before completing.
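
For comparison, here is a minimal sketch of the kind of streaming I have in
mind: the sequence lines are read through a small buffer and handed on in
fixed-size chunks, so no full-length intermediate copy is ever built. This
is not BioJava code; ChunkedBaseReader and ChunkSink are names invented for
this example, and it assumes the reader is positioned at the sequence block,
where everything that is not a letter (line numbers, whitespace, the closing
"//") can be skipped.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration, not part of BioJava.
final class ChunkedBaseReader {

    /** Receives one chunk; buf is reused, so consume it before returning. */
    interface ChunkSink {
        void accept(char[] buf, int length);
    }

    /**
     * Streams sequence letters from in to sink in chunks of at most
     * chunkSize characters, skipping digits, whitespace and punctuation.
     * Returns the total number of bases passed on.
     */
    static long stream(Reader in, int chunkSize, ChunkSink sink) throws IOException {
        char[] readBuf = new char[8192];
        char[] chunk = new char[chunkSize];
        int filled = 0;
        long total = 0;
        int n;
        while ((n = in.read(readBuf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = readBuf[i];
                if (!Character.isLetter(c)) {
                    continue; // line numbers, spaces, newlines, "//"
                }
                chunk[filled++] = c;
                total++;
                if (filled == chunkSize) {
                    sink.accept(chunk, filled);
                    filled = 0;
                }
            }
        }
        if (filled > 0) {
            sink.accept(chunk, filled); // final, partially filled chunk
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        String lines = "        1 gatcctccat atacaacggt\n       21 atctccacct\n//\n";
        StringBuilder seen = new StringBuilder();
        long total = stream(new StringReader(lines), 8,
                (buf, len) -> seen.append(buf, 0, len));
        System.out.println(total + " bases: " + seen);
    }
}
```

In a real parser the sink would convert each chunk to symbols and add it to
the sequence builder directly, so that at most one chunk of characters is
held in memory at a time.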


Am I doing something wrong, or is there great potential for improving the
parsing of Genbank files?


Regards,
  Florian
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


