[Biojava-l] Load Genbank files takes ages

Fri Jul 17 12:03:54 UTC 2009

Thanks for the quick answer!

On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> I wonder if there is some kind of memory leak or infinite loop?

As far as I can tell, there is no infinite loop, just a loop that takes very, 
very long because for every parsed feature the whole RichSequence object is 
reprocessed (especially the conversion of the sequence to characters), which 
takes about 3 seconds.

> Have you considered running a profiler?

Yes, I have considered this, but the profilers I know for Eclipse are a pain 
in the a** and don't work, so I will have to use NetBeans or something to do 
profiling.

I noticed another funny thing:
When I run let our program skip the first two chromosomes (1807 and 24), then 
this is the output:

Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
dbname: GeneID, raccession: 100132775
  took 273ms
dbname: CCDS, raccession: CCDS35344.1
  took 452ms
dbname: GeneID, raccession: 644403
  took 283ms
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:3209)
	at java.lang.String.<init>(String.java:216)
	at java.lang.StringBuffer.toString(StringBuffer.java:585)
	at 
org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:526)
	at 
org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:542)
	at 
org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:473)
	at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)

Harddisk activity is pretty high (probably reading the sequence) and the 
OutOfMemoryError occurs after about 2 minutes. It seems like loading the 
other chromosomes before this one somehow changes the behavior.

> Also, are you able to parse that sequence when you don't put it into
> BioSQL. It could be the parser not the BioSQL binding.
>
> - Mark

That's a good idea, I will try this. I don't know if I will have time for this 
today, but I should be able to give an update next week.

Florian

> On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag
>
> <florian.mittag at uni-tuebingen.de> wrote:
> > Hi all!
> >
> > We try to load Genbank files into our bioseqdb database using BioJava. I
> > copy-pasted the code together from tutorials and previous posts on this
> > mailinglist. My problems:
> >
> > 1) It eats huge amounts of memory, so that I needed to increase the heap
> > size to 2GB.
> >
> > 2) Loading the first two files works great, but the third one ran for one
> > two hours without completion. Here is my code:
> >
> > --- snip ---
> > // loop over all downloaded *.gbk files starting with the highest number
> > System.out.println("Updating chromosome " + chrNo[j] + " ...");
> >
> > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> >
> > tx = session.beginTransaction();
> > GenbankFormat gf = new GenbankFormat();
> > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > RichSequence seq = null;
> >
> > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > seq = listener.makeRichSequence();
> >
> > if( seq != null ) {
> >        // check, if a sequence with this identifier is already in the DB
> >        Query q = session.createQuery(
> >                "select be from BioEntry as be where
> > identifier=:identifier"); q.setString("identifier",seq.getIdentifier());
> >        List entries = q.list();
> >        for( Object o : entries ) {
> >                // delete the old sequence in the DB
> >                BioEntry oldSeq = (BioEntry)o;
> >                session.delete("BioEntry", oldSeq);
> >        }
> >        tx.commit();
> >
> >        tx = session.beginTransaction();
> >        session.save("Sequence", seq);
> >
> >        System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> > } else {
> >        System.out.println("Chromosome " + chrNo[j] + " was NOT
> > updated.\n"); }
> >
> > tx.commit();
> > --- snap ---
> >
> >
> > This is the generated output:
> > ---snip ---
> > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > Updating chromosome 001807 ...
> > Chromosome 001807 was updated.
> > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > Updating chromosome 000024 ...
> > Chromosome 000024 was updated.
> > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > Updating chromosome 000023 ...
> > --- snap ---
> >
> >
> > The files for this are downloaded from Genbank and the file sizes are:
> > NC_001807.gbk   58.4 KB
> > NC_000024.gbk   70.8 MB
> > NC_000023.gbk   190.1 MB
> >
> > So, I don't see, why loading a 70.8 MB file took less than 2 minutes and
> > a 190.1 MB file isn't completed after 2 hours. But during this time, the
> > CPU load is almost 100% and there is no significant network or harddisk
> > activity.
> >
> > When I paused the program (I'm using Eclipse) and looked, where the whole
> > processing power is going to, I ended up with the following stacktrace
> > (sorry for the unreadable format):
> >
> > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolLis
> >t) line: 1460
> > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequenc
> >e) line: 115
> > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > SimpleRichSequence.getStringSequence() line: 77
> > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object)
> > line: 249 PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(
> >Object, EntityMode) line: 3567
> > DefaultFlushEntityEventListener.getValues(Object, EntityEntry,
> > EntityMode, boolean, SessionImplementor) line: 167
> > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntitie
> >s(FlushEvent) line: 196
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEveryth
> >ingToExecutions(FlushEvent) line: 76
> > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > SessionImpl.autoFlushIfRequired(Set) line: 970
> > SessionImpl.list(String, QueryParameters) line: 1115
> > QueryImpl.list() line: 79
> > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > RichObjectFactory.getObject(Class, Object[]) line: 107
> > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization,
> > RichSeqIOListener, Namespace) line: 450
> > UpdateDB_Main.updateChromosome() line: 542
> >
> >
> > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > 450, the line where it loads a CrossRef object, so I added debug output:
> >
> > --- snip ---
> > // parameter on old feature
> > if (key.equals("db_xref")) {
> >        Matcher m = dbxp.matcher(val);
> >        if (m.matches()) {
> >                String dbname = m.group(1);
> >                String raccession = m.group(2);
> >                if (dbname.equalsIgnoreCase("taxon")) {
> >                        [...]
> >                } else {
> >                        try {
> >                                long starttime =
> > System.currentTimeMillis(); CrossRef cr =
> > (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]
> > {dbname, raccession, new Integer(0)});
> >                                long duration = System.currentTimeMillis()
> > - starttime; if( duration > 100 ) {
> >                                        System.out.println("dbname: " +
> > dbname + ", raccession: " + raccession); System.out.println("  took " +
> > duration + "ms"); }
> >                                RankedCrossRef rcr = new
> > SimpleRankedCrossRef(cr, ++rcrossrefCount);
> > rlistener.getCurrentFeature().addRankedCrossRef(rcr); --- snap ---
> >
> > Which leads to:
> >
> > --- snip ---
> > dbname: GeneID, raccession: 677739
> >  took 3291ms
> > dbname: HGNC, raccession: 31847
> >  took 2427ms
> > dbname: GeneID, raccession: 55344
> >  took 2932ms
> > dbname: HGNC, raccession: 23148
> >  took 2339ms
> > dbname: GI, raccession: 94158612
> >  took 2418ms
> > dbname: GI, raccession: 8922995
> >  took 2920ms
> > [...]
> > --- snap ---
> >
> > Which are all /db_xref properties of the NC_000023.gbk file. Searching
> > deeper, it looks like for every CrossRef object loaded, the whole
> > BioEntry object is built and the sequence parsed. But remember, this only
> > happens on chromosome 23, not on 24, which has /db_xref, too.
> >
> > I already spent some time on this, but I can't figure out, what could be
> > the cause.
> >
> >
> > Thanks
> >   Florian
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091