[Biojava-l] Stop condition for blast parser

Marcel Huntemann marcel.huntemann at gmail.com
Thu Mar 12 15:40:21 UTC 2009


OK, thanks heaps for your help, Mark!

mark.schreiber at novartis.com wrote:
> 
> Hi Marcel -
> 
> One possible solution would be to customise the handler and the parser
> so they can talk to each other and the handler can make callbacks to
> the parser.
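
One common way to let a handler stop a SAX parse is to throw a dedicated exception from the callback and catch it around the parse call. The sketch below shows that idea with the plain JDK SAX parser on BLAST XML, not with BioJava's BlastLikeSAXParser. The class names (StopParsing, OneQueryHandler, EarlyStopDemo) are made up for illustration, and the sample string is a heavily trimmed stand-in for real BLAST XML output. Real files typically carry a DOCTYPE declaration that the parser may try to resolve.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical marker exception: thrown by the handler to abort the parse. */
class StopParsing extends SAXException {
    StopParsing() { super("target query finished, stopping early"); }
}

/** Collects the Hit_id values of one query, then aborts the parse. */
class OneQueryHandler extends DefaultHandler {
    private final String target;
    private final List<String> hitIds = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    OneQueryHandler(String target) { this.target = target; }

    List<String> hitIds() { return hitIds; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        String value = text.toString().trim();
        if (qName.equals("Iteration_query-def")) {
            inTarget = value.equals(target);   // entering the query of interest?
        } else if (qName.equals("Hit_id") && inTarget) {
            hitIds.add(value);
        } else if (qName.equals("Iteration") && inTarget) {
            throw new StopParsing();           // done: skip the rest of the file
        }
    }
}

final class EarlyStopDemo {
    // Trimmed, made-up sample using BLAST XML element names.
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>contig1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>gene1</Hit_id></Hit><Hit><Hit_id>gene2</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>gene3</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    static List<String> hitsFor(InputSource in, String query) throws Exception {
        OneQueryHandler handler = new OneQueryHandler(query);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (StopParsing expected) {
            // Normal early exit requested by the handler, not an error.
        }
        return handler.hitIds();
    }

    public static void main(String[] args) throws Exception {
        // prints [gene1, gene2]
        System.out.println(hitsFor(new InputSource(new StringReader(SAMPLE)), "contig1"));
    }
}
```

The calling program keeps running after the early exit, so nothing like System.exit() is needed. Each lookup still starts again from the top of the input, so this only saves the part of the file after the contig of interest.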
> 
> However, there is a fundamental problem with the BlastLikeSAXParser.
> Because it is a SAX parser it is not at all suited to bouncing around
> the file it is parsing because SAX parsing is event based. Therefore I
> think you need a different paradigm.  If you have lots of memory you
> could go with something that is more like a DOM parser and reads the
> whole file into memory (or uses Java NIO to pretend to) and use
> something like XQuery to find what you want.  If you are using BLAST XML
> output you could also build an object tree with JAXB and navigate that.
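
As a rough illustration of the in-memory approach: the JDK does not ship an XQuery engine, so the sketch below substitutes XPath (javax.xml.xpath) over a DOM tree. The file is parsed once and can then be queried per contig as often as needed. The class name InMemoryBlastDemo and the sample string are invented for the example, and a DOM of a multi-gigabyte file needs a correspondingly large heap, so this only fits files that really are "memory sized".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class InMemoryBlastDemo {
    // Trimmed, made-up sample using BLAST XML element names.
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>contig1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>gene1</Hit_id></Hit><Hit><Hit_id>gene2</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>gene3</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    /** Parses the whole input into a DOM tree (done once). */
    static Document load(InputSource in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }

    /** Returns the Hit_id values recorded for one query, in document order. */
    static List<String> hitIds(Document doc, String query) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Assumes the query name contains no single quote.
        String expr = "//Iteration[Iteration_query-def='" + query + "']//Hit_id";
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        Document doc = load(new InputSource(new StringReader(SAMPLE)));
        // The same tree answers any number of lookups without re-reading the file.
        System.out.println(hitIds(doc, "contig2"));
        System.out.println(hitIds(doc, "contig1"));
    }
}
```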
> 
> You can also combine SAX and DOM to read memory sized chunks in one go
> but this can be clunky.
> 
> Note, I am assuming you will use BLAST XML. If you are not, I would
> strongly encourage it for the task you describe. It will also make your
> parsers much more robust to BLAST version changes.
> 
> Sorry the standard BioJava model can't really help here, but please
> consider posting your solution or adding it as a recipe in the
> cookbook, as others are sure to have similar problems soon.
> 
> - Mark
> 
> biojava-l-bounces at lists.open-bio.org wrote on 03/12/2009 11:00:38 AM:
> 
>> Hi Mark!
>>
>> The blast etc. is parallelized. The contigs are split into groups of 1000
>> and I also modified my program so that it now works with all those
>> separate files. But nevertheless I also have a program that works on the
>> concatenated blast output. The parser with my customized handler is always
>> looking for the results of a certain contig and then compares these
>> results to something else and also does some other stuff in-between to
>> calculate some statistics and then creates a new parser again to get the
>> results for the next contig. So a System.exit() is not an option, since it
>> would stop my whole program (in which I am using the parser). I also don't
>> wanna start working with threads here. I was just hoping that there would
>> be a way to tell the handler that, when a certain condition is met, it
>> should give the parser a signal to stop parsing (and maybe even to reset
>> itself to the first line). But I guess there's no way to do it in the
>> customized handler...
>>
>> Thanks,
>> Marcel
>>
>>
>> mark.schreiber at novartis.com wrote:
>> >
>> > Hi -
>> >
>> > There are many ways to stop the parsing but it really depends on how you
>> > have set the program up.  Notably there is no way for the Blast parsing
>> > system of BioJava to shut itself down but control probably shouldn't
>> > happen at that level.
>> >
>> > A crude but effective procedure is to write out the results when you
>> > find the hit of interest and then simply call System.exit().
>> >
>> > Another approach would be to spawn Tasks to parse each record and then
>> > have them signal to the main thread when they are complete to shut them
>> > down.  If you are using Java 1.4 or earlier then you would need to do
>> > this with Threads. If you have Java 5 or later you can use the
>> > java.util.concurrent package, which is much nicer to deal with.
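
A minimal sketch of the task-based idea using java.util.concurrent: one task per result chunk, with the main thread shutting the pool down as soon as one task reports the target. To keep it self-contained, each "chunk" here is just an in-memory list of contig names standing in for one parsed result file. ParallelSearchDemo, findChunk and the pool size of 4 are illustrative choices, not anything from BioJava.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelSearchDemo {
    /** Returns the index of a chunk that contains the target, or -1 if none does. */
    static int findChunk(List<List<String>> chunks, String target)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> finished = new ExecutorCompletionService<>(pool);
        try {
            for (int i = 0; i < chunks.size(); i++) {
                final int index = i;
                // In a real program this task would parse one result file.
                finished.submit(() -> chunks.get(index).contains(target) ? index : -1);
            }
            for (int i = 0; i < chunks.size(); i++) {
                int result;
                try {
                    result = finished.take().get(); // next task to complete
                } catch (ExecutionException failed) {
                    continue; // one chunk failed, keep waiting for the others
                }
                if (result >= 0) {
                    return result; // found: no need to wait for the remaining tasks
                }
            }
            return -1;
        } finally {
            pool.shutdownNow(); // interrupts any tasks that are still running
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("contig1", "contig2"),
            Arrays.asList("contig3"),
            Arrays.asList("contig4", "contig5"));
        System.out.println(findChunk(chunks, "contig4"));
    }
}
```

Note that shutdownNow() only interrupts the worker threads. A long-running parse task would have to check Thread.interrupted() itself to actually stop early.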
>> >
>> > One thing I don't understand is why you don't blast each contig
>> > separately; in that case the results would only contain your hit of
>> > interest.  That means 90K separate blasts but there are versions of
>> > blast that run on clusters and the database (3 million genes) is not
>> > huge so it should be an embarrassingly parallel problem?
>> >
>> > - Mark
>> >
>> > biojava-l-bounces at lists.open-bio.org wrote on 03/10/2009 03:00:36 AM:
>> >
>> >> Hi Mark!
>> >>
>> >> Mark Schreiber wrote:
>> >> > You could just customize BlastEcho to pass on the events of interest,
>> >> > ignore those that are not interesting.
>> >> That's what I am doing right now. But I don't know how to tell my
>> >> customized BlastEcho to stop when a certain condition is met during a
>> >> particular event call. What's the command for stopping there?
>> >>
>> >> > It could also exit if a certain
>> >> > event occurs.
>> >> How?
>> >>
>> >> > Remember it cost almost nothing to read the file so you
>> >> > save time by only sending interesting events for parsing.
>> >> Hmm, I am not sure it's really almost nothing when I have about 90,000
>> >> contigs that were blasted against a database with about 3,000,000
>> >> genes. The blast output that I am parsing is about 13 GB, and in every
>> >> cycle I am looking for the results of one particular contig of these
>> >> 90,000 contigs. So I definitely experienced that the time adds up a lot
>> >> when it runs over the whole file in each of these 90,000 cycles,
>> >> although the contig I am looking for was already at the beginning of
>> >> the file.
>> >>
>> >>
>> >> Cheers,
>> >> Marcel


