[Biojava-l] Stop condition for blast parser

mark.schreiber at novartis.com mark.schreiber at novartis.com
Thu Mar 12 03:49:54 UTC 2009


Hi Marcel -

One possible solution would be to customise the handler and the parser so 
they can talk to each other and the handler can make callbacks to the 
parser.
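
A common way to let a SAX handler signal "stop" is to throw a marker
exception from a callback and catch it where the parse was started. The
sketch below illustrates that with the JDK's own SAX parser on BLAST XML,
not with BlastLikeSAXParser; whether the BioJava parser passes the exception
through in the same way is an assumption that would need checking. The class
names (StopEarly, StopParsing, OneQueryHandler) and the sample data are
hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StopEarly {
    // Tiny made-up stand-in for BLAST XML output: three queries (contigs).
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>contig1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneA</Hit_def></Hit></Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneB</Hit_def></Hit><Hit><Hit_def>geneC</Hit_def></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig3</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneD</Hit_def></Hit></Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    /** Hypothetical marker exception: means "I have what I need", not an error. */
    static class StopParsing extends SAXException {
        StopParsing() { super("target query fully read"); }
    }

    /** Collects the Hit_def values of one query, then aborts the parse. */
    static class OneQueryHandler extends DefaultHandler {
        final String target;
        final List<String> hits = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        boolean inTarget = false;
        int iterationsSeen = 0;

        OneQueryHandler(String target) { this.target = target; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0);                        // start collecting fresh text
            if (qName.equals("Iteration")) iterationsSeen++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            String value = text.toString().trim();
            if (qName.equals("Iteration_query-def")) {
                inTarget = value.equals(target);
            } else if (qName.equals("Hit_def") && inTarget) {
                hits.add(value);
            } else if (qName.equals("Iteration") && inTarget) {
                throw new StopParsing();              // the signal to stop parsing
            }
        }
    }

    /** Parses until the target query has been read; the rest of the input is skipped. */
    static OneQueryHandler parse(String xml, String target) {
        OneQueryHandler handler = new OneQueryHandler(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (StopParsing expected) {
            // Normal early exit: the handler asked the parser to stop.
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        OneQueryHandler h = parse(SAMPLE, "contig2");
        System.out.println(h.hits + " after " + h.iterationsSeen + " iterations");
    }
}
```

The calling program keeps running after the early exit, so this avoids both
System.exit() and threads. It does not reset the input; each new search
still starts reading from the top of the file.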

However, there is a fundamental problem with the BlastLikeSAXParser. 
Because it is a SAX parser, and SAX parsing is event based, it is not at 
all suited to bouncing around the file it is parsing. Therefore I think 
you need a different paradigm.  If you have lots of memory you could go 
with something that is more like a DOM parser and reads the whole file 
into memory (or uses Java NIO to pretend to) and use something like XQuery 
to find what you want.  If you are using BLAST XML output you could also 
build an object tree with JAXB and navigate that.
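
As a rough sketch of the in-memory approach: the JDK ships no XQuery engine,
so this example swaps in XPath, which covers this kind of lookup. It loads
the whole document into a DOM once and then asks for the hits of any contig
in any order. The class name QueryByXPath, the helper names and the sample
data are hypothetical; the element names are the ones BLAST XML output uses.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class QueryByXPath {
    // Tiny made-up stand-in for BLAST XML output: three queries (contigs).
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>contig1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneA</Hit_def></Hit></Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneB</Hit_def></Hit><Hit><Hit_def>geneC</Hit_def></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig3</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneD</Hit_def></Hit></Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    /** Reads the whole document into memory; this is the expensive step. */
    static Document load(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM", e);
        }
    }

    /** Returns the Hit_def values recorded for one query, in document order. */
    static List<String> hitsFor(Document doc, String queryDef) {
        // Assumption: query names contain no single quote, so the name can be
        // placed directly inside the XPath string literal.
        if (queryDef.indexOf('\'') >= 0) {
            throw new IllegalArgumentException("unsupported character in query name");
        }
        String expr = "//Iteration[normalize-space(Iteration_query-def)='"
            + queryDef + "']//Hit_def";
        List<String> hits = new ArrayList<>();
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                hits.add(nodes.item(i).getTextContent().trim());
            }
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        Document doc = load(SAMPLE);                  // parse once
        System.out.println(hitsFor(doc, "contig3"));  // then look up in any order
        System.out.println(hitsFor(doc, "contig1"));
    }
}
```

The file is parsed once and every later lookup works on the tree, which is
what makes jumping between contigs cheap. The cost is memory: a DOM is
usually several times larger than the file it was built from.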

You can also combine SAX and DOM to read memory-sized chunks in one go but 
this can be clunky.
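
One way to read that suggestion, sketched under the assumption that one
chunk is one <Iteration> record: a SAX handler streams through the file and
builds a small DOM tree for each record, hands it to a callback, and then
drops it. ChunkedBlastReader, ChunkHandler and the sample data are
hypothetical names for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedBlastReader {
    // Tiny made-up stand-in for BLAST XML output: three queries (contigs).
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>contig1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneA</Hit_def></Hit></Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneB</Hit_def></Hit><Hit><Hit_def>geneC</Hit_def></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>contig3</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_def>geneD</Hit_def></Hit></Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    /** Turns SAX events into one small DOM tree per Iteration element. */
    static class ChunkHandler extends DefaultHandler {
        private final DocumentBuilder builder;
        private final Consumer<Element> sink;
        private final Deque<Node> open = new ArrayDeque<>();
        private Document doc;                         // null while outside a record

        ChunkHandler(DocumentBuilder builder, Consumer<Element> sink) {
            this.builder = builder;
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (doc == null) {
                if (!qName.equals("Iteration")) return;   // ignore the outer wrapper
                doc = builder.newDocument();
                open.push(doc);
            }
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            open.peek().appendChild(e);
            open.push(e);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) {
                open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (doc == null) return;
            open.pop();
            if (open.peek() == doc) {                 // the record's root just closed
                sink.accept(doc.getDocumentElement());
                open.clear();
                doc = null;                           // let the chunk be collected
            }
        }
    }

    /** Streams the input and passes each Iteration subtree to the sink. */
    static void readChunks(String xml, Consumer<Element> sink) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new ChunkHandler(b, sink));
        } catch (Exception e) {
            throw new IllegalStateException("chunked parse failed", e);
        }
    }

    /** Example consumer: "query:hitCount" for every record in the input. */
    static List<String> summarize(String xml) {
        List<String> out = new ArrayList<>();
        readChunks(xml, it -> out.add(
            it.getElementsByTagName("Iteration_query-def").item(0).getTextContent().trim()
            + ":" + it.getElementsByTagName("Hit").getLength()));
        return out;
    }
}
```

Only one record is held in memory at a time, so this scales to large files,
but the bookkeeping in the handler is the clunky part mentioned above.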

Note, I am assuming you will use BLAST XML. If you are not, I would 
strongly encourage it for the task you describe. It will also make your 
parsers much more robust to BLAST version changes.

Sorry the standard BioJava model can't really help here, but please 
consider posting your solution or adding it as a recipe in the cookbook 
as others are sure to have similar problems soon.

- Mark

biojava-l-bounces at lists.open-bio.org wrote on 03/12/2009 11:00:38 AM:

> Hi Mark!
> 
> The blast etc. is parallelized. The contigs are split into groups of 1000
> and I also modified my program in the way that it works now with all those
> separate files. But nevertheless I also have a program that works on the
> concatenated blast output. The parser with my customized handler is always
> looking for the results of a certain contig and then compares these
> results to something else and also does some other stuff in-between to
> calculate some statistics and then creates a new parser again to get the
> results for the next contig. So a System.exit() is not an option, since it
> would stop my whole program (in which I am using the parser). I also don't
> wanna start working with threads here. I was just hoping that there would
> be a way to tell the handler that, when a certain condition is met, it
> should give the parser a signal to stop parsing (and maybe even to reset
> itself to the first line). But I guess there's no way to do it in the
> customized handler...
> 
> Thanks,
> Marcel
> 
> 
> mark.schreiber at novartis.com wrote:
> > 
> > Hi -
> > 
> > There are many ways to stop the parsing but it really depends on how you
> > have set the program up.  Notably there is no way for the Blast parsing
> > system of BioJava to shut itself down but control probably shouldn't
> > happen at that level.
> > 
> > A crude but effective procedure is to write out the results when you
> > find the hit of interest and then simply call System.exit()
> > 
> > Another approach would be to spawn Tasks to parse each record and then
> > have them signal to the main thread when they are complete to shut them
> > down.  If you are using Java 1.5 or earlier then you would need to do
> > this with Threads. If you have a later version you can use the
> > concurrent packages which are much nicer to deal with.
> > 
> > One thing I don't understand is why you don't blast each contig
> > separately, in that case the results would only contain your hit of
> > interest.  That means 90K separate blasts but there are versions of
> > blast that run on clusters and the database (3 million genes) is not
> > huge so it should be an embarrassingly parallel problem?
> > 
> > - Mark
> > 
> > biojava-l-bounces at lists.open-bio.org wrote on 03/10/2009 03:00:36 AM:
> > 
> >> Hi Mark!
> >>
> >> Mark Schreiber wrote:
> >> > You could just customize BlastEcho to pass on the events of interest,
> >> > ignore those that are not interesting.
> >> That's what I am doing right now. But I don't know, how to tell my
> >> customized BlastEcho to stop, when a certain condition is met during a
> >> particular event call. What's the command for stopping there?
> >>
> >> > It could also exit if a certain
> >> > event occurs.
> >> How?
> >>
> >> > Remember it cost almost nothing to read the file so you
> >> > save time by only sending interesting events for parsing.
> >> Hmm, I am not sure, if it's really almost nothing, when I've about 90,000
> >> contigs that were blasted against a database with about maybe 3,000,000
> >> genes. The blast output that I am parsing is about 13Gig big and every
> >> cycle I am looking for the results of one particular contig of these
> >> 90,000 contigs. So I definitely experienced that the time sums up a lot,
> >> when it's running in each of these 90,000 cycles over the whole file,
> >> although the contig I am looking for was already at the beginning
> >> of the file.
> >>
> >>
> >> Cheers,
> >> Marcel
