[Biojava-l] BLAST parsing explodes in size

Wed Nov 12 04:37:22 EST 2003

Hi Keith,

Thanks for your response. I have pasted the method that does the
parsing below. I also just ran this method on a BLAST output file of
approximately 350 MB. The output generated is this:

Before parsing: 402280
After parsing: 1043162496

The numbers are the memory used by Java in bytes (totalMemory() minus
freeMemory()). That means that during the parsing (all BioJava) the
usage explodes from a mere 402 KB to about 1 GB. After that it hardly
changes anymore.
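
As a rough illustration of how figures like these are obtained, here is
a minimal sketch. The class and method names (HeapUsage, usedBytes,
toMiB) are made up for this example; only the Runtime calls come from
the method posted below.

```java
final class HeapUsage {
    /** Heap bytes currently in use: allocated heap minus the free part of it. */
    static long usedBytes() {
        Runtime r = Runtime.getRuntime();
        return r.totalMemory() - r.freeMemory();
    }

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // The two figures reported above, converted for readability.
        System.out.println("Before parsing: " + toMiB(402280L) + " MiB");
        System.out.println("After parsing:  " + toMiB(1043162496L) + " MiB");
        // A live reading; it includes garbage that has not been collected yet.
        System.out.println("Right now:      " + toMiB(usedBytes()) + " MiB");
    }
}
```

Note that such a reading also counts objects that are already
unreachable but not yet collected, so it is an upper bound on what is
really being retained.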

For your information, I am using the following:
- NCBI Blast 2.2.4
- Java 1.4.2_01
- Linux 
- BioJava from CVS, last updated on 21 October

Hopefully you will now tell me I am doing something stupid ;-)

private void parseBlastOutput(File file) throws Exception{
      Runtime r = Runtime.getRuntime();
      System.out.println("Before parsing: " +
            (r.totalMemory() - r.freeMemory()));
      InputStream is = new FileInputStream(file);
      BlastLikeSAXParser parser = new BlastLikeSAXParser();
      parser.setModeLazy();
      SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
      parser.setContentHandler(adapter);
      List results = new ArrayList();
      SearchContentHandler builder = new BlastLikeSearchBuilder(results,
            new DummySequenceDB("queries"), new DummySequenceDBInstallation());
      adapter.setSearchContentHandler(builder);
      parser.parse(new InputSource(is));

      for (Iterator i = results.iterator(); i.hasNext(); ){
         System.out.println("Iterating: " +
               (r.totalMemory() - r.freeMemory()));
         SeqSimilaritySearchResult result =
               (SeqSimilaritySearchResult) i.next();

         org.biojava.bio.Annotation anno = result.getAnnotation();
         String queryID = (String)anno.getProperty("queryId");
         String database =
               this.parseNameFromDBPath((String) anno.getProperty("databaseId"));
         String lib = this.parseIDForLibrary(queryID);
         BlastSetting bsetting = null;
         if (lib != null && database != null) {
            bsetting = adaptor.fetchSetting(lib, database);
         }
         if (lib == null || database == null || bsetting == null) {
            // No BLAST setting can be found for this library and database.
            System.out.println("HELP!!!!!");
            throw new Exception("Cannot find Blast Setting in database for library "
                  + lib + " and blastdatabase " + database);
         }

         File outFile = new File(destDir, queryID + ".out");
         BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
         // Header row of the tab-separated output file.
         out.write("queryID\tqueryStart\tqueryEnd\tdatabase\tsubjectID\t"
               + "subjectStart\tsubjectEnd\tscore\teValue\tDescription\n");
         List hits = result.getHits();
         //System.out.println("Start writing with " + hits.size() + " hits.");
         for (int j = 0; j < hits.size(); j++) {
            SeqSimilaritySearchHit hit =
                  (SeqSimilaritySearchHit) hits.get(j);
            // Stop at the first hit above the E-value cutoff.
            if (hit.getEValue() > bsetting.getMaxEValue()) {
               break;
            }
            //System.out.println("HIT!!!");
            org.biojava.bio.Annotation hitAnno = hit.getAnnotation();
            String description =
                  hitAnno.containsProperty("subjectDescription")
                  ? (String) hitAnno.getProperty("subjectDescription")
                  : "No Description";

            out.write(queryID + "\t");
            out.write(hit.getQueryStart() + "\t");
            out.write(hit.getQueryEnd() + "\t");
            out.write(database + "\t");
            out.write(hit.getSubjectID() + "\t");
            out.write(hit.getSubjectStart() + "\t");
            out.write(hit.getSubjectEnd() + "\t");
            out.write(hit.getScore() + "\t");
            out.write(hit.getEValue() + "\t");
            out.write(description + "\n");
            out.flush();
            hitAnno = null;description = null;hit=null;
            System.gc();
         }
         out.close();
         hits = null; out = null; outFile = null; bsetting = null; lib = null;
         database = null; queryID = null; anno = null; result = null;
         System.gc();
      }

      file.delete();
   }
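
One thing the method above does is collect every search result in
`results` before the loop starts, so all of them are alive at once. A
possible alternative, sketched below, is to pass the builder a List
that processes each result the moment it is added and stores nothing.
This assumes the builder hands over each finished result by calling
add() on the list, which I have not verified; StreamingResults,
ResultHandler and ForwardingSink are names invented for this sketch,
not BioJava classes.

```java
import java.util.AbstractList;

final class StreamingResults {
    /** Callback invoked once per result (hypothetical, for illustration only). */
    interface ResultHandler<T> {
        void handle(T result) throws Exception;
    }

    /** A List that forwards each added element to a handler and retains none. */
    static final class ForwardingSink<T> extends AbstractList<T> {
        private final ResultHandler<T> handler;
        private int seen = 0;

        ForwardingSink(ResultHandler<T> handler) {
            this.handler = handler;
        }

        @Override
        public boolean add(T result) {
            try {
                handler.handle(result); // e.g. write the hits to a file here
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            seen++;
            return true; // the element is deliberately not stored
        }

        @Override
        public T get(int index) {
            throw new IndexOutOfBoundsException("sink stores no elements");
        }

        @Override
        public int size() {
            return 0; // always empty, so nothing accumulates
        }

        /** Number of elements that have passed through the sink. */
        int seen() {
            return seen;
        }
    }

    public static void main(String[] args) {
        ForwardingSink<String> sink =
                new ForwardingSink<String>(r -> System.out.println("handled " + r));
        sink.add("result-1");
        sink.add("result-2");
        System.out.println(sink.seen() + " handled, " + sink.size() + " retained");
    }
}
```

With something like this, the body of the for loop would move into the
handler, and each result could become garbage as soon as its output
file has been written.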

> -----Original Message-----
> From: Keith James [mailto:kdj at sanger.ac.uk]
> Sent: Wednesday, November 12, 2003 12:25 AM
> To: VERHOEF Frans
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] BLAST parsing explodes in size
> 
> >>>>> "FV" == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:
> 
>     FV> Hi, I am having a problem parsing huge blast
>     FV> results. Basically I am parsing the blast results pretty much
>     FV> the same way as in "Biojava in Anger", the only difference
>     FV> being that I use the setModeLazy() of the BlastLikeSAXParser,
>     FV> since I am using NCBI Blast version 2.2.4 and that version is
>     FV> not recognised by the parser yet.
> 
> Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only
> minor whitespace changes in the format.
> 
>     FV> Besides that the only difference lies in the things I do with
>     FV> the data.
> 
> This is likely to be the cause of the problem. See below.
> 
>     FV> The problem is that when I parse a blast result that is a few
>     FV> hundred MB, for example 300MB, the java application is
>     FV> ballooning up to around 1.6GB of memory. Sometimes the
>     FV> application even crashes because I only have got 2GB to play
>     FV> with.
> 
> The parser uses an event driven framework which is designed to handle
> very big data - it will handle multi-GB reports. However, if you
> create many fine-grained objects for every element of every report you
> will quickly run out of resources.
> 
>     FV> Does anyone know what's causing this? Is it because I set the
>     FV> lazy mode?  Is there any way to work around it?
> 
> Either you need to think about which elements of the report you are
> interested in and build a filter which captures those events,
> discarding the rest. See the demos/ssbind package for an example by
> Matthew. Or if you really need all those objects then you should look
> at allowing them to be garbage-collected as soon as possible.
> 
> It is possible that there is a bug somewhere, but without seeing
> any code it isn't possible to say much more. If you need more help,
> post a short (working) piece of code illustrating the problem and we
> will do our best.
> 
> hth
> 
> Keith
> 
> --
> 
> - Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
> - The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -
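
The filtering idea suggested above (capture the events of interest,
discard the rest) can be sketched with the standard SAX classes alone.
This is only an illustration, not the code from the demos/ssbind
package: it assumes the event source can be used as an
org.xml.sax.XMLReader, and the element names ("report", "header",
"hit", "score") are invented for the example rather than taken from the
BlastLikeSAXParser output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class ElementFilterSketch {
    /** Forwards only subtrees rooted at the wanted element; drops all other events. */
    static final class KeepOnly extends XMLFilterImpl {
        private final String wanted;
        private int depth = 0; // greater than 0 while inside a wanted element

        KeepOnly(XMLReader parent, String wanted) {
            super(parent);
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || qName.equals(wanted)) {
                depth++;
                super.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (depth > 0) {
                super.endElement(uri, local, qName);
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (depth > 0) {
                super.characters(ch, start, length);
            }
        }
    }

    /** Downstream handler that just counts the start-element events it receives. */
    static final class Counter extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            elements++;
        }
    }

    /** Parses xml through the filter and returns how many elements got through. */
    static int countForwarded(String xml, String wanted) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            KeepOnly filter = new KeepOnly(reader, wanted);
            Counter counter = new Counter();
            filter.setContentHandler(counter);
            filter.parse(new InputSource(new StringReader(xml)));
            return counter.elements;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<report><header>x</header><hit>a</hit>"
                + "<hit><score>1</score></hit></report>";
        System.out.println(countForwarded(xml, "hit") + " elements forwarded");
    }
}
```

Because the filter sits between the parser and the content handler, no
objects are ever built for the parts of the report that are dropped.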