[Biojava-l] Load Genbank files takes ages

Thu Jul 16 15:38:25 UTC 2009

Hi all!

We are trying to load Genbank files into our bioseqdb database using BioJava. I
copy-pasted the code together from tutorials and previous posts on this
mailing list. My problems:

1) It eats huge amounts of memory, so I had to increase the heap size to 2GB.

2) Loading the first two files works great, but the third one ran for two
hours without completing. Here is my code:

--- snip ---
// loop over all downloaded *.gbk files starting with the highest number
System.out.println("Updating chromosome " + chrNo[j] + " ...");

BufferedReader fileIn = new BufferedReader(new FileReader(localFile));

tx = session.beginTransaction();
GenbankFormat gf = new GenbankFormat();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
RichSequence seq = null;

gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
seq = listener.makeRichSequence();

if( seq != null ) {
	// check, if a sequence with this identifier is already in the DB
	Query q = session.createQuery(
		"select be from BioEntry as be where identifier=:identifier");
	q.setString("identifier",seq.getIdentifier());
	List entries = q.list();
	for( Object o : entries ) {
		// delete the old sequence in the DB
		BioEntry oldSeq = (BioEntry)o;
		session.delete("BioEntry", oldSeq);
	}
	tx.commit();

	tx = session.beginTransaction();
	session.save("Sequence", seq);

	System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
} else {
	System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
}

tx.commit();
--- snap ---

This is the generated output:
--- snip ---
Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
Updating chromosome 001807 ...
Chromosome 001807 was updated.
Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
Updating chromosome 000024 ...
Chromosome 000024 was updated.
Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
Updating chromosome 000023 ...
--- snap ---

The files for this are downloaded from Genbank and the file sizes are:
NC_001807.gbk	58.4 KB
NC_000024.gbk	70.8 MB
NC_000023.gbk	190.1 MB

So I don't see why loading a 70.8 MB file took less than 2 minutes while a
190.1 MB file isn't finished after 2 hours. During this time, the CPU load is
almost 100% and there is no significant network or hard disk activity.

When I paused the program (I'm using Eclipse) to see where all the processing
power is going, I ended up with the following stack trace (sorry for the
unreadable format):

CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
SimpleSymbolList(AbstractSymbolList).seqString() line: 102
BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
SimpleRichSequence(ThinRichSequence).seqString() line: 203
SimpleRichSequence.getStringSequence() line: 77
GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BasicPropertyAccessor$BasicGetter.get(Object) line: 145
PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
PojoEntityTuplizer.getPropertyValues(Object) line: 244
JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
SessionImpl.autoFlushIfRequired(Set) line: 970
SessionImpl.list(String, QueryParameters) line: 1115
QueryImpl.list() line: 79
QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
RichObjectFactory.getObject(Class, Object[]) line: 107
GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
UpdateDB_Main.updateChromosome() line: 542
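
As an alternative to pausing in the debugger, the same kind of trace can be
captured from inside the program. The following is only a minimal sketch using
the standard Thread.getAllStackTraces() call. The class name StackSampler and
the method dumpAllThreads are made up for illustration and are not part of
BioJava or of my update program.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of BioJava or my update program.
class StackSampler {

    /** Formats the current stack of every live thread, one frame per line. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            out.append("Thread \"").append(entry.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling it a few times from a background thread while the load is stuck should
show whether the same frames stay on top.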

That leads to GenbankFormat.readRichSequence(). It hangs at about line 450, the
line where it loads a CrossRef object, so I added debug output:

--- snip ---
// parameter on old feature
if (key.equals("db_xref")) {
	Matcher m = dbxp.matcher(val);
	if (m.matches()) {
		String dbname = m.group(1);
		String raccession = m.group(2);
		if (dbname.equalsIgnoreCase("taxon")) {
			[...]
		} else {
			try {
				long starttime = System.currentTimeMillis();
				CrossRef cr = (CrossRef) RichObjectFactory.getObject(
					SimpleCrossRef.class,
					new Object[] {dbname, raccession, new Integer(0)});
				long duration = System.currentTimeMillis() - starttime;
				if( duration > 100 ) {
					System.out.println("dbname: " + dbname + ", raccession: " + raccession);
					System.out.println("  took " + duration + "ms");
				}
				RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
				rlistener.getCurrentFeature().addRankedCrossRef(rcr);
--- snap ---
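
The same timing pattern can be pulled out into a small reusable helper. This is
only a sketch: SlowCallTimer, timed and slowCalls are hypothetical names, and
the busy loop in main merely stands in for the RichObjectFactory.getObject()
call, which is not reproduced here.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of BioJava.
class SlowCallTimer {

    /** Number of calls that exceeded their threshold so far. */
    static int slowCalls = 0;

    /**
     * Runs the call and prints the label and the elapsed time, but only if
     * the call took longer than thresholdMs milliseconds.
     */
    static <T> T timed(String label, long thresholdMs, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            if (elapsedMs > thresholdMs) {
                slowCalls++;
                System.out.println(label);
                System.out.println("  took " + elapsedMs + "ms");
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the slow lookup: spins for about 150 ms.
        String result = timed("dbname: GeneID, raccession: 677739", 100, () -> {
            long end = System.nanoTime() + 150_000_000L;
            while (System.nanoTime() < end) {
                // busy wait
            }
            return "done";
        });
        System.out.println(result);
    }
}
```

Using System.nanoTime() instead of System.currentTimeMillis() avoids jumps when
the wall clock is adjusted. For the 100 ms threshold used here either would do.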

This prints:

--- snip ---
dbname: GeneID, raccession: 677739
  took 3291ms
dbname: HGNC, raccession: 31847
  took 2427ms
dbname: GeneID, raccession: 55344
  took 2932ms
dbname: HGNC, raccession: 23148
  took 2339ms
dbname: GI, raccession: 94158612
  took 2418ms
dbname: GI, raccession: 8922995
  took 2920ms
[...]
--- snap ---

These are all /db_xref qualifiers from the NC_000023.gbk file. Digging deeper,
it looks like for every CrossRef object loaded, the whole BioEntry object is
built and the sequence is parsed. But remember, this only happens with
chromosome 23, not with 24, which has /db_xref qualifiers too.
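
To make clear what I think the stack trace shows: the query issued by
RichObjectFactory.getObject() goes through SessionImpl.autoFlushIfRequired(),
which reads the properties of an entity held by the session, including
getStringSequence(), and that rebuilds the whole sequence string. The toy model
below only illustrates this cost pattern under that assumption. AutoFlushToy,
ToySequence and ToySession are invented names and do not mirror the actual
Hibernate or BioJava code. I have not checked which entity is really being
flushed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these classes are Hibernate or BioJava classes.
class AutoFlushToy {

    /** Stands in for an entity whose getter rebuilds a large string on every call. */
    static class ToySequence {
        private final char[] symbols;
        int getterCalls = 0;

        ToySequence(int length) {
            symbols = new char[length];
            Arrays.fill(symbols, 'a');
        }

        // Similar in spirit to getStringSequence(): work proportional to the length.
        String getStringSequence() {
            getterCalls++;
            return new String(symbols);
        }
    }

    /** Stands in for a session that reads every attached entity before each query. */
    static class ToySession {
        private final List<ToySequence> attached = new ArrayList<>();
        long charsRebuilt = 0;

        void attach(ToySequence sequence) {
            attached.add(sequence);
        }

        // Every query first re-reads all attached entities (the assumed auto-flush check).
        void query() {
            for (ToySequence sequence : attached) {
                charsRebuilt += sequence.getStringSequence().length();
            }
        }
    }

    public static void main(String[] args) {
        ToySequence sequence = new ToySequence(1_000_000);
        ToySession session = new ToySession();
        session.attach(sequence);
        // One query per cross-reference lookup.
        for (int i = 0; i < 100; i++) {
            session.query();
        }
        System.out.println(sequence.getterCalls + " getter calls, "
                + session.charsRebuilt + " characters rebuilt");
    }
}
```

In this model the total work is the number of lookups times the sequence
length, which would fit lookups taking seconds each on a very large sequence.
It does not explain why chromosome 24 is not affected.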

I have already spent some time on this, but I can't figure out what the cause
could be.

Thanks
   Florian