[Biojava-l] Load Genbank files takes ages

Fri Jul 24 17:05:22 UTC 2009

Hi all,

this topic gets a little bit complicated, so I will try to summarize the 
status quo:

If I parse the .gbk files without storing the resulting RichSequence objects 
into the BioSQL database, the program will crash with an OutOfMemory 
exception at chromosome 23, but this will happen fast, so no problem with the 
parsing itself.

If I parse the .gbk files and store them in the DB, the program will enter an 
almost-infinite loop, where it rebuilds objects over and over again (without 
reading any files). Since this process is likely to consume more memory than 
without storing it in the DB, I expect it to crash after a long time with the 
same OutOfMemory exception.
The profiler didn't reveal anything new or helpful. It confirmed my 
observation that the method to construct sequence objects is called over and 
over again, but it didn't reveal why. The memory profiling showed nothing 
either, but since it only occurres when two other chromosomes were parsed and 
stored in the DB before that, I assume it is a problem with Hibernate and its 
caching behavior.

Because of the memory problems, I'll postpone the investigation of the 
almost-infinite loop until I have resolved the memory problem (for which I 
will open a new thread).

Unless anybody has another idea ;-)

Florian

On Friday, 17. July 2009 14:03, Florian Mittag wrote:
> On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> > Have you considered running a profiler?
>
> Yes, I have considered this, but the profilers I know for Eclipse are a
> pain in the a** and don't work, so I will have to use NetBeans or something
> to do profiling.
>
> I noticed another funny thing:
> When I run let our program skip the first two chromosomes (1807 and 24),
> then this is the output:
>
> Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
> dbname: GeneID, raccession: 100132775
>   took 273ms
> dbname: CCDS, raccession: CCDS35344.1
>   took 452ms
> dbname: GeneID, raccession: 644403
>   took 283ms
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.Arrays.copyOfRange(Arrays.java:3209)
> 	at java.lang.String.<init>(String.java:216)
> 	at java.lang.StringBuffer.toString(StringBuffer.java:585)
> 	at
> org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:5
>26) at
> org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.
>java:542) at
> org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:47
>3) at
> org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)
>
> Harddisk activity is pretty high (probably reading the sequence) and the
> OutOfMemoryError occurs after about 2 minutes. It seems like loading the
> other chromosomes before this one somehow changes the behavior.
>
> > Also, are you able to parse that sequence when you don't put it into
> > BioSQL. It could be the parser not the BioSQL binding.
> >
> > - Mark
>
> That's a good idea, I will try this. I don't know if I will have time for
> this today, but I should be able to give an update next week.
>
>
> Florian
>
> > On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag
> >
> > <florian.mittag at uni-tuebingen.de> wrote:
> > > Hi all!
> > >
> > > We try to load Genbank files into our bioseqdb database using BioJava.
> > > I copy-pasted the code together from tutorials and previous posts on
> > > this mailinglist. My problems:
> > >
> > > 1) It eats huge amounts of memory, so that I needed to increase the
> > > heap size to 2GB.
> > >
> > > 2) Loading the first two files works great, but the third one ran for
> > > one two hours without completion. Here is my code:
> > >
> > > --- snip ---
> > > // loop over all downloaded *.gbk files starting with the highest
> > > number System.out.println("Updating chromosome " + chrNo[j] + " ...");
> > >
> > > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> > >
> > > tx = session.beginTransaction();
> > > GenbankFormat gf = new GenbankFormat();
> > > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > > RichSequence seq = null;
> > >
> > > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > > seq = listener.makeRichSequence();
> > >
> > > if( seq != null ) {
> > >        // check, if a sequence with this identifier is already in the
> > > DB Query q = session.createQuery(
> > >                "select be from BioEntry as be where
> > > identifier=:identifier");
> > > q.setString("identifier",seq.getIdentifier()); List entries = q.list();
> > >        for( Object o : entries ) {
> > >                // delete the old sequence in the DB
> > >                BioEntry oldSeq = (BioEntry)o;
> > >                session.delete("BioEntry", oldSeq);
> > >        }
> > >        tx.commit();
> > >
> > >        tx = session.beginTransaction();
> > >        session.save("Sequence", seq);
> > >
> > >        System.out.println("Chromosome " + chrNo[j] + " was
> > > updated.\n"); } else {
> > >        System.out.println("Chromosome " + chrNo[j] + " was NOT
> > > updated.\n"); }
> > >
> > > tx.commit();
> > > --- snap ---
> > >
> > >
> > > This is the generated output:
> > > ---snip ---
> > > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > > Updating chromosome 001807 ...
> > > Chromosome 001807 was updated.
> > > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > > Updating chromosome 000024 ...
> > > Chromosome 000024 was updated.
> > > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > > Updating chromosome 000023 ...
> > > --- snap ---
> > >
> > >
> > > The files for this are downloaded from Genbank and the file sizes are:
> > > NC_001807.gbk   58.4 KB
> > > NC_000024.gbk   70.8 MB
> > > NC_000023.gbk   190.1 MB
> > >
> > > So, I don't see, why loading a 70.8 MB file took less than 2 minutes
> > > and a 190.1 MB file isn't completed after 2 hours. But during this
> > > time, the CPU load is almost 100% and there is no significant network
> > > or harddisk activity.
> > >
> > > When I paused the program (I'm using Eclipse) and looked, where the
> > > whole processing power is going to, I ended up with the following
> > > stacktrace (sorry for the unreadable format):
> > >
> > > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolL
> > >is t) line: 1460
> > > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSeque
> > >nc e) line: 115
> > > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > > SimpleRichSequence.getStringSequence() line: 77
> > > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object)
> > > line: 249 PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValue
> > >s( Object, EntityMode) line: 3567
> > > DefaultFlushEntityEventListener.getValues(Object, EntityEntry,
> > > EntityMode, boolean, SessionImplementor) line: 167
> > > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line:
> > > 120
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntit
> > >ie s(FlushEvent) line: 196
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEvery
> > >th ingToExecutions(FlushEvent) line: 76
> > > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > > SessionImpl.autoFlushIfRequired(Set) line: 970
> > > SessionImpl.list(String, QueryParameters) line: 1115
> > > QueryImpl.list() line: 79
> > > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > > RichObjectFactory.getObject(Class, Object[]) line: 107
> > > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization,
> > > RichSeqIOListener, Namespace) line: 450
> > > UpdateDB_Main.updateChromosome() line: 542
> > >
> > >
> > > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > > 450, the line where it loads a CrossRef object, so I added debug
> > > output:
> > >
> > > --- snip ---
> > > // parameter on old feature
> > > if (key.equals("db_xref")) {
> > >        Matcher m = dbxp.matcher(val);
> > >        if (m.matches()) {
> > >                String dbname = m.group(1);
> > >                String raccession = m.group(2);
> > >                if (dbname.equalsIgnoreCase("taxon")) {
> > >                        [...]
> > >                } else {
> > >                        try {
> > >                                long starttime =
> > > System.currentTimeMillis(); CrossRef cr =
> > > (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]
> > > {dbname, raccession, new Integer(0)});
> > >                                long duration =
> > > System.currentTimeMillis() - starttime; if( duration > 100 ) {
> > >                                        System.out.println("dbname: " +
> > > dbname + ", raccession: " + raccession); System.out.println("  took " +
> > > duration + "ms"); }
> > >                                RankedCrossRef rcr = new
> > > SimpleRankedCrossRef(cr, ++rcrossrefCount);
> > > rlistener.getCurrentFeature().addRankedCrossRef(rcr); --- snap ---
> > >
> > > Which leads to:
> > >
> > > --- snip ---
> > > dbname: GeneID, raccession: 677739
> > >  took 3291ms
> > > dbname: HGNC, raccession: 31847
> > >  took 2427ms
> > > dbname: GeneID, raccession: 55344
> > >  took 2932ms
> > > dbname: HGNC, raccession: 23148
> > >  took 2339ms
> > > dbname: GI, raccession: 94158612
> > >  took 2418ms
> > > dbname: GI, raccession: 8922995
> > >  took 2920ms
> > > [...]
> > > --- snap ---
> > >
> > > Which are all /db_xref properties of the NC_000023.gbk file. Searching
> > > deeper, it looks like for every CrossRef object loaded, the whole
> > > BioEntry object is built and the sequence parsed. But remember, this
> > > only happens on chromosome 23, not on 24, which has /db_xref, too.
> > >
> > > I already spent some time on this, but I can't figure out, what could
> > > be the cause.
> > >
> > >
> > > Thanks
> > >   Florian
> > > _______________________________________________
> > > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l