[Biojava-dev] How to parse large Genbank files?

Florian Mittag florian.mittag at uni-tuebingen.de
Wed Aug 5 12:45:41 UTC 2009


On Tuesday, 28. July 2009 14:52, Richard Holland wrote:
> > Btw: Should we move this to Biojava-dev?
>> probably, yes! :)

done ;)


> If you want to explore my ideas for a replacement Sequence model, the
> code and docs are here (sequence handling is in the 'core' module with
> DNA-specifics in the 'dna' module):
>
> http://biojava.org/wiki/BioJava3:HowTo
> http://www.biojava.org/wiki/BioJava3_project
>
> (Methods such as file parsers would request Strings (or ideally
> CharSequence - more flexible, and String extends it) as parameters
> whenever they don't care about content - if they care about content
> but don't care in advance about size or random access then they should
> request Iterator<Symbol> which can be used to wrap a String and parse
> on demand, and if they need full functionality then they should
> request List<Symbol>, whose default implementation uses
> ArrayLists, but there's no reason a String-backed one couldn't be
> written as well).

For now, I was mostly interested in a quick and dirty solution. I first
attempted to create a new class, StringSymbolList, that uses the String as
the representation of the sequence and converts to Symbols only on demand.
Since SimpleRichSequence hard-codes its use of SimpleSymbolList, I wanted to
implement a new RichSequence as well, but I was stabbed in the back by
Hibernate: the mappings are bound to SimpleRichSequence, so when objects are
retrieved from the DB it uses the original BioJava classes again.
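
To illustrate the idea of a String-backed list that converts to symbols only on access, here is a minimal sketch. It deliberately does not use the BioJava API: `Base` and `StringBackedBaseList` are hypothetical names invented for this example, and a real StringSymbolList would have to implement BioJava's SymbolList interface instead of `java.util.List`.

```java
import java.util.AbstractList;

/** Hypothetical stand-in for a nucleotide symbol; not a BioJava type. */
enum Base {
    A, C, G, T, N;

    /** Maps a character to a symbol; anything unrecognized becomes N. */
    static Base fromChar(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            default:  return N;
        }
    }
}

/** Read-only list view over a String; symbols are looked up on access, never stored. */
final class StringBackedBaseList extends AbstractList<Base> {
    private final String seq;

    StringBackedBaseList(String seq) {
        this.seq = seq;
    }

    @Override
    public Base get(int index) {
        // The conversion happens here, one symbol at a time.
        return Base.fromChar(seq.charAt(index));
    }

    @Override
    public int size() {
        return seq.length();
    }

    /** Returns the backing String directly, without copying. */
    String seqString() {
        return seq;
    }
}
```

The only per-sequence storage in this sketch is the String itself; no array of symbol references is ever allocated, which is the property the approach relies on.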

My solution now works. It consists of my own implementations of
GenbankFormat, RichSequenceBuilder, and RichSequence, a new class called
StringSymbolList as described above, and a change to SimpleRichSequence
that adds the method:

@Override
public String seqString() {
    return seqstring;
}

which avoids most of the array copying.
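
A rough back-of-the-envelope estimate shows why the copying matters. The sketch below rests on assumptions that are not confirmed by the BioJava source: that the character buffer, the String, the List and the Symbol[] array all hold a full copy and are alive at the same time, that Java chars take 2 bytes, and that object references take 4 or 8 bytes depending on the JVM. The 190MB file size is used as an upper bound for the sequence length; `CopyChainEstimate` is a hypothetical helper, not part of any library.

```java
/** Hypothetical helper for estimating the memory cost of the copy chain. */
final class CopyChainEstimate {

    /** Bytes for n characters held as UTF-16 chars (2 bytes each). */
    static long charBytes(long n) {
        return 2L * n;
    }

    /** Bytes for an array of n object references of the given width. */
    static long refArrayBytes(long n, int refWidth) {
        return (long) refWidth * n;
    }

    /** Estimated peak if all four representations coexist (an assumption). */
    static long peakBytes(long n, int refWidth) {
        return charBytes(n)                  // character buffer read from the file
             + charBytes(n)                  // String copy of the buffer
             + refArrayBytes(n, refWidth)    // backing array of the List of symbols
             + refArrayBytes(n, refWidth);   // Symbol[] produced by toArray
    }
}
```

Under these assumptions, `peakBytes(190000000L, 4)` is about 2.28 GB and `peakBytes(190000000L, 8)` is about 3.8 GB, so even the optimistic case does not fit into a 2GB heap.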

I also noticed that processing the Genbank files became slower with every
file, so I closed the Hibernate session after each chromosome and opened a
new one. (I also tried session.clear(), but somehow this didn't work.)
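
One plausible explanation for the slowdown, not verified here, is that a long-lived session keeps a reference to every object saved through it, so its cache grows with each file. The sketch below shows the session-per-chromosome pattern with a stand-in class; `FakeSession` and `ChromosomeLoader` are hypothetical and only mimic the shape of the pattern. With Hibernate, the corresponding calls would be SessionFactory.openSession() and Session.close().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a persistence session that caches everything it saves. */
final class FakeSession implements AutoCloseable {
    private final List<Object> cache = new ArrayList<Object>();
    private boolean open = true;

    void save(Object entity) {
        if (!open) {
            throw new IllegalStateException("session is closed");
        }
        cache.add(entity); // stays referenced until the session is closed
    }

    int cachedCount() {
        return cache.size();
    }

    @Override
    public void close() {
        cache.clear();
        open = false;
    }
}

final class ChromosomeLoader {
    /** Saves each chromosome in its own session; returns the largest cache size seen. */
    static int loadAll(List<List<String>> chromosomes) {
        int peak = 0;
        for (List<String> features : chromosomes) {
            try (FakeSession session = new FakeSession()) {
                for (String feature : features) {
                    session.save(feature);
                }
                peak = Math.max(peak, session.cachedCount());
            } // closing drops the cache before the next chromosome starts
        }
        return peak;
    }
}
```

With one session per chromosome, the cache never holds more than the largest single chromosome's objects, rather than the sum over all files.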

For now, it seems like everything is fine and I have no more OutOfMemory 
exceptions.

- Florian


>
> cheers,
> Richard
>
> > - Florian
> >
> >> On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag <florian.mittag at uni-tuebingen.de> wrote:
> >>> Hi Mark!
> >>>
> >>> On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> >>>> I don't think anyone has done much or anything to optimize these
> >>>> parsers. The process you outline sounds extremely inefficient. It is
> >>>> also likely to lead to memory leaks due to the number of copy
> >>>> operations.
> >>>
> >>> I wouldn't necessarily say that it leads to memory leaks, but it
> >>> definitely leads to high memory consumption (2GB are not enough for
> >>> a 200MB file). Also, my outline of the process is based on only 2
> >>> hours of reading the code, so I actually expected to be corrected on
> >>> this. Unfortunately, it seems I did get the right idea and it IS
> >>> extremely inefficient.
> >>>
> >>> I mean, I understand that this is a high level of abstraction that
> >>> might come in handy in many situations, but it certainly is more of
> >>> an obstacle in my specific case.
> >>>
> >>>> As always with Java, don't try to optimize without a profiler,
> >>>> which will tell you which methods are taking a long time and which
> >>>> objects take the most memory.
> >>>
> >>> I think we should continue this discussion on the biojava-dev list
> >>> or in a private conversation, as it will probably get very detailed
> >>> and technical.
> >>>
> >>>
> >>> My question to this list again:
> >>> Is there a way to achieve my goal of parsing a 200MB Genbank file
> >>> with the current biojava version without code changes?
> >>>
> >>>
> >>> - Florian
> >>>
> >>>> On 25 Jul 2009, 1:33 AM, "Florian Mittag"
> >>>> <florian.mittag at uni-tuebingen.de> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think this is a problem worthy of its own thread, so I'll start
> >>>> one:
> >>>>
> >>>> I want to store all human chromosomes in a BioSQL database after
> >>>> loading the information from .gbk files. I get the files from NCBI
> >>>> with the following URI, where the id ranges from nc_000001 to
> >>>> nc_000024 plus nc_001804:
> >>>>
> >>>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
> >>>>
> >>>> I then try to parse the files as described in
> >>>> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
> >>>> but it won't work. While there are no problems parsing 1804 and 24,
> >>>> chromosome 23 leads to an OutOfMemory exception although I gave it
> >>>> 2GB of heap space.
> >>>>
> >>>> Here is a stack trace (the line numbers might differ, because I
> >>>> already tried to improve the memory efficiency of
> >>>> GenbankFormat.java):
> >>>>
> >>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>>        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
> >>>>        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
> >>>>        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
> >>>>        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> >>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
> >>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
> >>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
> >>>>
> >>>> The line in GenbankFormat.java is:
> >>>>
> >>>> rlistener.addSymbols(
> >>>>        symParser.getAlphabet(),
> >>>>        (Symbol[])(sl.toList().toArray(new Symbol[0])),
> >>>>        0, sl.length());
> >>>>
> >>>> Sometimes it fails at the sl.toList().toArray() part, sometimes it
> >>>> fails later inside the addSymbols method, but it always fails.
> >>>>
> >>>> How can this be? I mean, the file is only 190MB in size, so 2GB of
> >>>> memory should be more than enough. Browsing through the source
> >>>> code, I
> >>>> discovered what I think of as very inefficient handling of
> >>>> sequences:
> >>>>
> >>>> 1) the sequence string is read from file into a StringBuffer
> >>>> 2) it is converted to a string (with whitespaces removed)
> >>>> 3) a SimpleSymbolList is created out of the string
> >>>> 4) the SymbolList is converted to a List of Symbols
> >>>> 5) the List is converted to an array of Symbols
> >>>> 6) the array is passed to addSymbols
> >>>> 7) there it is added to a ChunkedSymbolListFactory
> >>>> 8) if at some point the sequence is requested, a SymbolList is
> >>>> created
> >>>> and then converted to a string.
> >>>>
> >>>> You see, there is a lot of copying and converting, but in the end
> >>>> I have the same string I started with. Well, I would have the
> >>>> string if it ever reached the end, but it crashes before completing
> >>>> this process.
> >>>>
> >>>>
> >>>> Am I doing something wrong, or is there great potential for
> >>>> improving the parsing of Genbank files?
> >>>>
> >>>>
> >>>> Regards,
> >>>>   Florian
> >>>
> >>> --
> >>> Dipl. Inf. Florian Mittag
> >>> Universität Tuebingen
> >>> WSI-RA, Sand 1
> >>> 72076 Tuebingen, Germany
> >>> Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
> >

-- 
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091



