[Biojava-l] Load Genbank files takes ages

Mark Schreiber markjschreiber at gmail.com
Fri Jul 17 01:33:58 UTC 2009


I wonder if there is some kind of memory leak or infinite loop? Have
you considered running a profiler?

Also, are you able to parse that sequence when you don't put it into
BioSQL. It could be the parser not the BioSQL binding.

- Mark

On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag
<florian.mittag at uni-tuebingen.de> wrote:
>
> Hi all!
>
> We try to load Genbank files into our bioseqdb database using BioJava. I
> copy-pasted the code together from tutorials and previous posts on this
> mailinglist. My problems:
>
> 1) It eats huge amounts of memory, so that I needed to increase the heap size
> to 2GB.
>
> 2) Loading the first two files works great, but the third one ran for one two
> hours without completion. Here is my code:
>
> --- snip ---
> // loop over all downloaded *.gbk files starting with the highest number
> System.out.println("Updating chromosome " + chrNo[j] + " ...");
>
> BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
>
> tx = session.beginTransaction();
> GenbankFormat gf = new GenbankFormat();
> SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> RichSequence seq = null;
>
> gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> seq = listener.makeRichSequence();
>
> if( seq != null ) {
>        // check, if a sequence with this identifier is already in the DB
>        Query q = session.createQuery(
>                "select be from BioEntry as be where identifier=:identifier");
>        q.setString("identifier",seq.getIdentifier());
>        List entries = q.list();
>        for( Object o : entries ) {
>                // delete the old sequence in the DB
>                BioEntry oldSeq = (BioEntry)o;
>                session.delete("BioEntry", oldSeq);
>        }
>        tx.commit();
>
>        tx = session.beginTransaction();
>        session.save("Sequence", seq);
>
>        System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> } else {
>        System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> }
>
> tx.commit();
> --- snap ---
>
>
> This is the generated output:
> ---snip ---
> Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> Updating chromosome 001807 ...
> Chromosome 001807 was updated.
> Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> Updating chromosome 000024 ...
> Chromosome 000024 was updated.
> Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> Updating chromosome 000023 ...
> --- snap ---
>
>
> The files for this are downloaded from Genbank and the file sizes are:
> NC_001807.gbk   58.4 KB
> NC_000024.gbk   70.8 MB
> NC_000023.gbk   190.1 MB
>
> So, I don't see, why loading a 70.8 MB file took less than 2 minutes and a
> 190.1 MB file isn't completed after 2 hours. But during this time, the CPU
> load is almost 100% and there is no significant network or harddisk activity.
>
> When I paused the program (I'm using Eclipse) and looked, where the whole
> processing power is going to, I ended up with the following stacktrace (sorry
> for the unreadable format):
>
> CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList)
> line: 1460
> SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence)
> line: 115
> BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> SimpleRichSequence(ThinRichSequence).seqString() line: 203
> SimpleRichSequence.getStringSequence() line: 77
> GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> PojoEntityTuplizer.getPropertyValues(Object) line: 244
> JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object,
> EntityMode) line: 3567
> DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode,
> boolean, SessionImplementor) line: 167
> DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent)
> line: 196
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent)
> line: 76
> DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> SessionImpl.autoFlushIfRequired(Set) line: 970
> SessionImpl.list(String, QueryParameters) line: 1115
> QueryImpl.list() line: 79
> QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> RichObjectFactory.getObject(Class, Object[]) line: 107
> GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization,
> RichSeqIOListener, Namespace) line: 450
> UpdateDB_Main.updateChromosome() line: 542
>
>
> Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450, the
> line where it loads a CrossRef object, so I added debug output:
>
> --- snip ---
> // parameter on old feature
> if (key.equals("db_xref")) {
>        Matcher m = dbxp.matcher(val);
>        if (m.matches()) {
>                String dbname = m.group(1);
>                String raccession = m.group(2);
>                if (dbname.equalsIgnoreCase("taxon")) {
>                        [...]
>                } else {
>                        try {
>                                long starttime = System.currentTimeMillis();
>                                CrossRef cr =
> (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]
> {dbname, raccession, new Integer(0)});
>                                long duration = System.currentTimeMillis() - starttime;
>                                if( duration > 100 ) {
>                                        System.out.println("dbname: " + dbname + ", raccession: " + raccession);
>                                        System.out.println("  took " + duration + "ms");
>                                }
>                                RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>                                rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> --- snap ---
>
> Which leads to:
>
> --- snip ---
> dbname: GeneID, raccession: 677739
>  took 3291ms
> dbname: HGNC, raccession: 31847
>  took 2427ms
> dbname: GeneID, raccession: 55344
>  took 2932ms
> dbname: HGNC, raccession: 23148
>  took 2339ms
> dbname: GI, raccession: 94158612
>  took 2418ms
> dbname: GI, raccession: 8922995
>  took 2920ms
> [...]
> --- snap ---
>
> Which are all /db_xref properties of the NC_000023.gbk file. Searching deeper,
> it looks like for every CrossRef object loaded, the whole BioEntry object is
> built and the sequence parsed. But remember, this only happens on chromosome
> 23, not on 24, which has /db_xref, too.
>
> I already spent some time on this, but I can't figure out, what could be the
> cause.
>
>
> Thanks
>   Florian
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l




More information about the Biojava-l mailing list