[Biojava-l] BLAST parsing explodes in size

Matthew Pocock matthew_pocock at yahoo.co.uk
Wed Nov 12 05:25:36 EST 2003


Morning,

I think the problem is that you are populating the results List with 
/all/ of the blast data, which means the data from the complete report 
must sit in memory in that List. A better approach is to write your own 
SearchContentHandler to pass to adapter.setSearchContentHandler() in 
place of builder - one that does all its processing as the data streams 
in from the parser. This will keep memory consumption down to the bare 
minimum.
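
Something along these lines should do it. This is a completely untested 
sketch - do check the method names against the SearchContentHandler 
interface in your checkout, since it has moved around a little between 
versions, and the property keys ("subjectId" and friends) are my 
guesses, so print out what actually arrives on a first run:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.Writer;
    import java.util.HashMap;
    import java.util.Map;

    import org.biojava.bio.search.SearchContentHandler;

    /**
     * Writes one tab-delimited line per sub-hit as the events stream
     * past, so nothing from the report accumulates in memory.
     */
    public class StreamingHitWriter implements SearchContentHandler {
        private BufferedWriter out;
        private Map hitProps = new HashMap();
        private Map subHitProps = new HashMap();
        private boolean moreSearches = true;

        public StreamingHitWriter(Writer w) {
            out = new BufferedWriter(w);
        }

        public void startHit() { hitProps.clear(); }
        public void startSubHit() { subHitProps.clear(); }

        public void addHitProperty(Object key, Object value) {
            hitProps.put(key, value);
        }

        public void addSubHitProperty(Object key, Object value) {
            subHitProps.put(key, value);
        }

        public void endSubHit() {
            try {
                // Adjust the keys to whatever the adapter really sends
                out.write(hitProps.get("subjectId") + "\t"
                          + subHitProps.get("score") + "\t"
                          + subHitProps.get("expectValue"));
                out.newLine();
            } catch (IOException ioe) {
                throw new RuntimeException("Write failed: " + ioe.getMessage());
            }
        }

        // Not part of the interface - call this when parsing is done
        public void close() throws IOException { out.close(); }

        // Events this kind of flat report doesn't need
        public void startSearch() {}
        public void endSearch() {}
        public void startHeader() {}
        public void endHeader() {}
        public void endHit() {}
        public void addSearchProperty(Object key, Object value) {}
        public void setQuerySeq(String id) {}   // setQueryID in newer source
        public void setSubjectDB(String id) {}  // setDatabaseID in newer source
        public boolean getMoreSearches() { return moreSearches; }
        public void setMoreSearches(boolean b) { moreSearches = b; }
    }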

There is some code that does this sort of thing in demos/ssbind, and it 
may be worth scanning the code for BlastLikeSearchBuilder for ideas.
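
Wiring the replacement in is then just a matter of swapping it in for 
the builder (class and method names as in your own code below):

    BlastLikeSAXParser parser = new BlastLikeSAXParser();
    parser.setModeLazy();
    SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
    parser.setContentHandler(adapter);
    // the only change: stream hits to disk instead of collecting a List
    StreamingHitWriter writer = new StreamingHitWriter(new FileWriter(outFile));
    adapter.setSearchContentHandler(writer);
    parser.parse(new InputSource(new FileInputStream(file)));
    writer.close();

In the meantime, if you just need the job to finish, a bigger heap 
(java -Xmx1600m ...) buys some headroom, but the streaming handler 
keeps the footprint flat however big the report gets.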

Best,

Matthew

VERHOEF Frans wrote:

>Hi Keith,
>
>Thanks for your response. I have pasted the method that does the
>parsing below. I also just ran this method, trying to parse a blast
>output file of approximately 350 MB. The output generated is this:
>
>Before parsing: 402280
>After parsing: 1043162496
>
>The numbers are the JVM's used memory in bytes. That means that
>during the parsing (which is all BioJava code) the usage explodes
>from a mere 402 KB to 1 GB. After that the size doesn't change much
>anymore.
>
>For your information, I am using the following:
>- NCBI Blast 2.2.4
>- Java 1.4.2_01
>- Linux 
>- Biojava from CVS, last updated on the 21st of October
>
>Hopefully you will now tell me I am doing something stupid ;-)
>
>
>private void parseBlastOutput(File file) throws Exception {
>   Runtime r = Runtime.getRuntime();
>   System.out.println("Before parsing: " + (r.totalMemory() - r.freeMemory()));
>
>   InputStream is = new FileInputStream(file);
>   BlastLikeSAXParser parser = new BlastLikeSAXParser();
>   parser.setModeLazy();
>   SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
>   parser.setContentHandler(adapter);
>   List results = new ArrayList();
>   SearchContentHandler builder = new BlastLikeSearchBuilder(results,
>         new DummySequenceDB("queries"), new DummySequenceDBInstallation());
>   adapter.setSearchContentHandler(builder);
>   parser.parse(new InputSource(is));
>
>   for (Iterator i = results.iterator(); i.hasNext(); ) {
>      System.out.println("Iterating: " + (r.totalMemory() - r.freeMemory()));
>      SeqSimilaritySearchResult result = (SeqSimilaritySearchResult) i.next();
>
>      org.biojava.bio.Annotation anno = result.getAnnotation();
>      String queryID = (String) anno.getProperty("queryId");
>      String database =
>            this.parseNameFromDBPath((String) anno.getProperty("databaseId"));
>      String lib = this.parseIDForLibrary(queryID);
>      BlastSetting bsetting = null;
>      if (lib != null && database != null)
>         bsetting = adaptor.fetchSetting(lib, database);
>      if (lib == null || database == null || bsetting == null) {
>         // means no blast setting can be found for this library and database
>         System.out.println("HELP!!!!!");
>         throw new Exception("Cannot find Blast Setting in database for library "
>               + lib + " and blastdatabase " + database);
>      }
>
>      File outFile = new File(destDir, queryID + ".out");
>      BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
>      out.write("queryID\tqueryStart\tqueryEnd\tdatabase\tsubjectID\t" +
>            "subjectStart\tsubjectEnd\tscore\teValue\tDescription\n");
>
>      List hits = result.getHits();
>      //System.out.println("Start writing with " + hits.size() + " hits.");
>      for (int j = 0; j < hits.size(); j++) {
>         SeqSimilaritySearchHit hit = (SeqSimilaritySearchHit) hits.get(j);
>         if (hit.getEValue() > bsetting.getMaxEValue()) {
>            break;
>         }
>         //System.out.println("HIT!!!");
>         org.biojava.bio.Annotation hitAnno = hit.getAnnotation();
>         String description = hitAnno.containsProperty("subjectDescription")
>               ? (String) hitAnno.getProperty("subjectDescription")
>               : "No Description";
>
>         out.write(queryID + "\t");
>         out.write(hit.getQueryStart() + "\t");
>         out.write(hit.getQueryEnd() + "\t");
>         out.write(database + "\t");
>         out.write(hit.getSubjectID() + "\t");
>         out.write(hit.getSubjectStart() + "\t");
>         out.write(hit.getSubjectEnd() + "\t");
>         out.write(hit.getScore() + "\t");
>         out.write(hit.getEValue() + "\t");
>         out.write(description + "\n");
>         out.flush();
>         hitAnno = null; description = null; hit = null;
>         System.gc();
>      }
>      out.close();
>      hits = null; out = null; outFile = null; bsetting = null; lib = null;
>      database = null; queryID = null; anno = null; result = null;
>      System.gc();
>   }
>
>   file.delete();
>}
>
>
>>-----Original Message-----
>>From: Keith James [mailto:kdj at sanger.ac.uk]
>>Sent: Wednesday, November 12, 2003 12:25 AM
>>To: VERHOEF Frans
>>Cc: biojava-l at biojava.org
>>Subject: Re: [Biojava-l] BLAST parsing explodes in size
>>
>>>>>>>"FV" == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:
>>    FV> Hi, I am having a problem parsing huge blast
>>    FV> results. Basically I am parsing the blast results pretty much
>>    FV> the same way as in "Biojava in Anger", the only difference
>>    FV> being that I use setModeLazy() on the BlastLikeSAXParser, since
>>    FV> I am using NCBI Blast version 2.2.4 and that version is not
>>    FV> recognised by the parser yet.
>>
>>Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only
>>minor whitespace changes in the format.
>>
>>    FV> Besides that, the only difference lies in the things I do
>>    FV> with the data.
>>
>>This is likely to be the cause of the problem. See below.
>>
>>    FV> The problem is that when I parse a blast result that is a few
>>    FV> hundred MB, for example 300 MB, the Java application balloons
>>    FV> up to around 1.6 GB of memory. Sometimes the application even
>>    FV> crashes, because I have only got 2 GB to play with.
>>
>>The parser uses an event-driven framework which is designed to handle
>>very big data - it will cope with multi-GB reports. However, if you
>>create many fine-grained objects for every element of every report,
>>you will quickly run out of resources.
>>
>>    FV> Does anyone know what's causing this? Is it because I set the
>>    FV> lazy mode?  Is there any way to work around it?
>>
>>Either think about which elements of the report you are interested
>>in and build a filter which captures those events, discarding the
>>rest - see the demos/ssbind package for an example by Matthew. Or,
>>if you really need all those objects, look at allowing them to be
>>garbage-collected as soon as possible.
>>
>>It is possible that there is a bug somewhere, but without seeing any
>>code it isn't possible to say much more. If you need more help, post
>>a short (working) piece of code illustrating the problem and we will
>>do our best.
>>
>>hth
>>
>>Keith
>>
>>--
>>
>>- Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
>>- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -
>
>_______________________________________________
>Biojava-l mailing list  -  Biojava-l at biojava.org
>http://biojava.org/mailman/listinfo/biojava-l
>



