[Biojava-l] How to parse large Genbank files?

Richard Holland holland at eaglegenomics.com
Tue Jul 28 12:52:00 UTC 2009


>
>
> Btw: Should we move this to Biojava-dev?

probably, yes! :)

> And where do I sign up for BioJava3 development? ;-)

Andreas Prlic has the keys to the project these days. BJ3 does already  
have some new code in place for handling sequences as strings but it's  
in an out-of-the-way bit of the repository and is not part of the main  
roadmap for the project at present. The current focus is on  
modularising the existing bits, so that individual components can be  
refactored to behave better at a future date.

If you want to explore my ideas for a replacement Sequence model, the  
code and docs are here (sequence handling is in the 'core' module with  
DNA-specifics in the 'dna' module):

http://biojava.org/wiki/BioJava3:HowTo
http://www.biojava.org/wiki/BioJava3_project

(Methods such as file parsers would request Strings (or ideally
CharSequence, which is more flexible and which String implements) as
parameters whenever they don't care about content. If they care about
content but don't care in advance about size or random access, they
should request Iterator<Symbol>, which can wrap a String and parse on
demand. If they need full functionality, they should request
List<Symbol>; the default implementation uses ArrayLists, but there's
no reason a String-backed one couldn't be written as well.)
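
To make the Iterator<Symbol> idea concrete, here is a minimal sketch of an
iterator that wraps a CharSequence and converts characters on demand. The
Base enum and the LazyBaseIterator class are hypothetical stand-ins invented
for this example; they are not the BJ3 Symbol types.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a Symbol type; not part of BioJava.
enum Base {
    A, C, G, T, N;

    static Base fromChar(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            default:  return N; // anything unrecognised is treated as unknown
        }
    }
}

// Wraps any CharSequence (String, StringBuilder, CharBuffer, ...) and
// converts one character at a time, skipping whitespace. No second copy
// of the sequence is ever built.
class LazyBaseIterator implements Iterator<Base> {
    private final CharSequence source;
    private int pos = 0;

    LazyBaseIterator(CharSequence source) {
        this.source = source;
    }

    // Advances past whitespace so that pos rests on the next real character.
    private void skipWhitespace() {
        while (pos < source.length()
                && Character.isWhitespace(source.charAt(pos))) {
            pos++;
        }
    }

    @Override
    public boolean hasNext() {
        skipWhitespace();
        return pos < source.length();
    }

    @Override
    public Base next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return Base.fromChar(source.charAt(pos++));
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

A parser that only needs to walk the sequence once could accept such an
iterator and never hold more than the original character data in memory.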

cheers,
Richard

>
> - Florian
>
>> On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag
>> <florian.mittag at uni-tuebingen.de> wrote:
>>> Hi Mark!
>>>
>>> On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
>>>> I don't think anyone has done much, if anything, to optimize these
>>>> parsers. The process you outline sounds extremely inefficient. It is
>>>> also likely to lead to memory leaks due to the number of copy
>>>> operations.
>>>
>>> I wouldn't necessarily say that it leads to memory leaks, but it
>>> definitely leads to high memory consumption (2GB is not enough for a
>>> 200MB file). Also, my outline of the process is based on only 2 hours
>>> of viewing the code, so I actually expected to be corrected on this.
>>> Unfortunately, it seems I did get the right idea and it IS extremely
>>> inefficient.
>>>
>>> I mean, I understand that this is a high level of abstraction that
>>> might come in handy in many situations, but it certainly is more of an
>>> obstacle in my specific case.
>>>
>>>> As always with Java, don't try to optimize without a profiler, which
>>>> will tell you which methods are taking a long time and which objects
>>>> take the most memory.
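
As a crude first check before reaching for a profiler, heap usage can be
logged between parsing stages with the standard Runtime API. This is only a
sketch: HeapProbe is a hypothetical helper name, the numbers are approximate,
and it does not replace a profiler because it cannot say which objects hold
the memory.

```java
class HeapProbe {
    // Approximate number of bytes currently in use on the Java heap.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Prints used and maximum heap in MB, tagged with a stage label.
    static void log(String stage) {
        long mb = 1024L * 1024L;
        System.out.println(stage + ": used " + usedHeapBytes() / mb
                + " MB of max " + Runtime.getRuntime().maxMemory() / mb
                + " MB");
    }
}
```

Calling HeapProbe.log("after read") and HeapProbe.log("after toArray")
around the suspect steps would show roughly where the heap fills up.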
>>>
>>> I think we should continue this discussion on the biojava-dev list or
>>> in a private conversation, as it will probably get very detailed and
>>> technical.
>>>
>>>
>>> My question to this list again:
>>> Is there a way to achieve my goal of parsing a 200MB Genbank file with
>>> the current BioJava version without code changes?
>>>
>>>
>>> - Florian
>>>
>>>> On 25 Jul 2009, 1:33 AM, "Florian Mittag"
>>>> <florian.mittag at uni-tuebingen.de> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I think this is a problem worthy of its own thread, so I'll start one:
>>>>
>>>> I want to store all human chromosomes in a BioSQL database after
>>>> loading the information from .gbk files. I get the files from NCBI
>>>> with the following URI, where the id ranges from nc_000001 to
>>>> nc_000024, plus nc_001804:
>>>>
>>>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
>>>>
>>>> I then try to parse the files as described in
>>>> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
>>>> but it won't work. While there are no problems parsing 1804 and 24,
>>>> chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of
>>>> heap space.
>>>>
>>>> Here is a stack trace (the line numbers might differ, because I
>>>> already tried to improve the memory efficiency of GenbankFormat.java):
>>>>
>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
>>>>        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
>>>>        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
>>>>        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
>>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
>>>>        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
>>>>
>>>> The line in GenbankFormat.java is:
>>>>
>>>> rlistener.addSymbols(
>>>>        symParser.getAlphabet(),
>>>>        (Symbol[])(sl.toList().toArray(new Symbol[0])),
>>>>        0, sl.length());
>>>>
>>>> Sometimes it fails at the sl.toList().toArray() part, sometimes it
>>>> fails later inside the addSymbols method, but it always fails.
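
One way to avoid materialising the whole sequence as a single array is to
hand it to the listener in bounded chunks. The sketch below shows the idea
only. ChunkSink and ChunkedFeeder are hypothetical names; ChunkSink mirrors
the (array, offset, length) shape of the addSymbols call quoted above but is
not the BioJava listener interface, and it works on chars rather than
Symbols. It also assumes the reader is already positioned on the sequence
lines.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical listener, NOT the BioJava interface.
interface ChunkSink {
    // The buffer is reused between calls, so implementations must copy
    // anything they want to keep.
    void addChunk(char[] buffer, int offset, int length);
}

class ChunkedFeeder {
    // Reads sequence text and passes it to the sink in chunks of at most
    // chunkSize characters, keeping letters only (this drops whitespace
    // and the position numbers found on GenBank sequence lines).
    // Returns the total number of characters delivered.
    static long feed(Reader in, ChunkSink sink, int chunkSize)
            throws IOException {
        char[] raw = new char[chunkSize];
        char[] clean = new char[chunkSize];
        long total = 0;
        int read;
        while ((read = in.read(raw, 0, raw.length)) != -1) {
            int kept = 0;
            for (int i = 0; i < read; i++) {
                if (Character.isLetter(raw[i])) {
                    clean[kept++] = raw[i];
                }
            }
            if (kept > 0) {
                sink.addChunk(clean, 0, kept);
                total += kept;
            }
        }
        return total;
    }
}
```

With this shape, peak memory for the hand-off is two buffers of chunkSize
characters, whatever the length of the sequence.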
>>>>
>>>> How can this be? I mean, the file is only 190MB in size, so 2GB of
>>>> memory should be more than enough. Browsing through the source code, I
>>>> discovered what I consider very inefficient handling of sequences:
>>>>
>>>> 1) the sequence string is read from file into a StringBuffer
>>>> 2) it is converted to a String (with whitespace removed)
>>>> 3) a SimpleSymbolList is created out of the String
>>>> 4) the SymbolList is converted to a List of Symbols
>>>> 5) the List is converted to an array of Symbols
>>>> 6) the array is passed to addSymbols
>>>> 7) there it is added to a ChunkedSymbolListFactory
>>>> 8) if at some point the sequence is requested, a SymbolList is created
>>>>    and then converted to a String.
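
A back-of-envelope estimate shows how steps like these could exhaust a 2GB
heap. The sketch below rests on several assumptions: roughly one sequence
character per file byte (an upper bound, since the file also contains
annotation), every conversion making a full copy, all copies being reachable
at the same time, and Symbol objects being shared so that only references
are counted. The storage inside SimpleSymbolList and
ChunkedSymbolListFactory is left out. CopyCostEstimate is a hypothetical
name used only here.

```java
class CopyCostEstimate {
    static final long MB = 1024L * 1024L;

    // Bytes for n chars in a String, StringBuffer or char[] (a Java char
    // is 2 bytes; object headers and spare buffer capacity are ignored).
    static long charBytes(long n) {
        return 2L * n;
    }

    // Bytes for n object references, e.g. in a Symbol[] or the array
    // backing an ArrayList.
    static long refBytes(long n, int bytesPerRef) {
        return n * bytesPerRef;
    }

    // Worst case: the buffers from steps 1, 2, 4 and 5 are all live at once.
    static long peakBytes(long n, int bytesPerRef) {
        return charBytes(n)               // 1) StringBuffer
             + charBytes(n)               // 2) String
             + refBytes(n, bytesPerRef)   // 4) List of Symbols
             + refBytes(n, bytesPerRef);  // 5) array of Symbols
    }

    public static void main(String[] args) {
        long n = 190L * MB; // assumption: about one character per file byte
        // 4-byte references on a 32-bit JVM, 8-byte on a plain 64-bit JVM.
        for (int ref : new int[] {4, 8}) {
            System.out.println(ref + "-byte references: about "
                    + peakBytes(n, ref) / MB + " MB");
        }
    }
}
```

Under these assumptions the estimate comes to about 2280 MB with 4-byte
references and about 3800 MB with 8-byte references, so a 190MB file
running out of a 2GB heap is plausible.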
>>>>
>>>> You see, there is a lot of copying and converting, but in the end I
>>>> have the same string I started with. Well, I would have the string if
>>>> it ever reached the end, because it crashes before completing this
>>>> process.
>>>>
>>>>
>>>> Am I doing something wrong, or is there great potential for improving
>>>> the parsing of Genbank files?
>>>>
>>>>
>>>> Regards,
>>>>   Florian
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Dipl. Inf. Florian Mittag
>>> Universität Tuebingen
>>> WSI-RA, Sand 1
>>> 72076 Tuebingen, Germany
>>> Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
>
> -- 
> Dipl. Inf. Florian Mittag
> Universität Tuebingen
> WSI-RA, Sand 1
> 72076 Tuebingen, Germany
> Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091




