[Bioperl-l] limit the number of blast output per query

Frank Schwach fs5 at sanger.ac.uk
Fri Jul 15 08:26:17 UTC 2011


So how many HSPs do you get? I noticed that your database file is called
"genomes", is that a collection of many genomes then? Could you break
that up? I guess you are mapping short sequences, possibly repetitive
ones too, so the question is whether you would even gain anything from
limiting the output number if you then still couldn't say where in the
genome your sequence belongs to? Try one of the many short-read
alignment programs out there, Something like SSAHA, SMALT, BWA or bowtie
(there are many more...) and make use of bam files for storing large
amounts of alignments in compressed form. There are the "samtools" to
work with these files and there is a BioPerl wrapper for that called
BIo::DB::SAM. It should not be too difficutl to adapt your scripts and
you can always ask for help here. 

Cheers,

Frank



On Fri, 2011-07-15 at 07:05 +0800, Ross KK Leung wrote:
> blastall -p blastn -F F -b 50 -v 50 -i query.fna -d genomes.fna -e 1e-30 -o
> output.tab -m 8 -a 8
> 
> as what Hans-Rudolf Hotz commented, I guess there are just too many HSPs for
> hitting human genome and all my downstream analysis programs rely on
> blastall m8 output, now I just don't know how to adapt to other new formats
> shortly.
> 
> -----Original Message-----
> From: Frank Schwach [mailto:fs5 at sanger.ac.uk] 
> Sent: 2011年7月14日 23:09
> To: Hans-Rudolf Hotz
> Cc: Ross KK Leung; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] limit the number of blast output per query
> 
> -b limits the number of output alignments, not sure why it isn't working
> for you. How many HSPs do you actually get? Is there a chance that the
> output is just so large because of a large number of queries and you are
> not seeing 10k+ HSPs but results? 
> Can you post the entire blastall command? The large number of hits looks
> like you are blasting short sequences, there are better algorithms for
> short sequence alignments, so if that's what you are doing I could give
> you some hints for alternative software to try.
> Frank
> 
> 
> 
> On Thu, 2011-07-07 at 11:02 +0200, Hans-Rudolf Hotz wrote:
> > Hi
> > 
> > just double checking: are you really talking abut "10,000+ hits"? or do 
> > you mean "10,000+ HSPs" ('high-scoring segment pairs')?
> > 
> > I don't know how your genome database looks like, but assuming you have 
> > one sequence per chromosome, then you will get just 24 hits (ie each 
> > chromosome) and then depending on your query each hit will have a lot of 
> > HSPs.
> > 
> > As far as as I know, there is no way to limit the number of HSPs (you 
> > might try playing with the E value).
> > 
> > You can try using the tabular output format (this will reduce the file 
> > size) - or may be BLAST is not the right search tool for your task?
> > 
> > 
> > Regards, Hans
> > 
> > 
> > On 07/07/2011 04:43 AM, Ross KK Leung wrote:
> > > I know this question should submit to BLAST help but it seems they have
> > > already been overwhelmed by incoming emails. I wonder any bioperl users
> > > happen to know how to limit the number of blast output per query. For
> > > example, for human genome as a database to blast against, a single query
> can
> > > generate 10,000+ hits. I have already supplied -b 30 -v 30 flags but
> > > obviously the blastall from blast2.2.22 does not "obey" my instruction.
> > >
> > > The output files generated are usually larger than 100G+ but indeed the
> > > final ones that I want usually are only of 10M-. Is there any way to
> help
> > > save our Earth (Not exaggerated, energy is WASTED in a meaningless
> manner)?
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 




More information about the Bioperl-l mailing list