[Bioperl-l] parsing a BLAST output
Angshu Kar
angshu96 at gmail.com
Wed Dec 14 17:19:06 EST 2005
Hi Andrew,
I don't have a test dataset to test the code! That's my main problem! :(
That's why I'm seeking experieced help like yours...Could you please
comment just seeing the code?
It'll be very helpful for me.
Thanks a lot,
Angshu
On 12/14/05, Andrew Walsh <walsh at cenix-bioscience.com> wrote:
> If you're not sure what your code is doing, you should write a test for
> it. Create different alignments that meet and don't meet your
> requirements and test that your code does the correct thing.
>
>
> Angshu Kar wrote:
> > Hi Jason,
> >
> > So again, what I want is the following:
> >
> > query sequence
> > ----------------------------------
> > | alignment |
> > --------------------------------
> > hit sequence
> >
> > The length of the alignment should be at least 50% of the
> > length of the query sequence, and at least 50% of the length
> > of the hit sequence.
> >
> > Now I try to convert it to bioperl language:
> >
> >
> > if($hit->length_aln()/hit->length()>=0.5 and
> > $hit->length_aln()/result->query_length()>=0.5)
> > {
> > print $line{$result->query_accession} . "\t" . $line{$hit->name} .
> > "\t" . $hit->hsp-evalue . "\n";
> > }
> >
> > Is the statement after "if" correctly represent what I mean?
> >
> > Appreciate your guidance.
> >
> > Thanks,
> > Angshu
> >
> >
> >
> > On 12/13/05, Angshu Kar <angshu96 at gmail.com> wrote:
> >
> >>Thanks Jason...
> >>
> >>
> >>
> >>On 12/13/05, Jason Stajich <jason.stajich at duke.edu> wrote:
> >>
> >>>http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/GenericHit.html#POD30
> >>>
> >>>
> >>>http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/GenericHit.html#POD31
> >>>
> >>>On Dec 13, 2005, at 10:41 AM, Angshu Kar wrote:
> >>>
> >>>
> >>>Hi Jason,
> >>>
> >>>Could you please send in the link where I can find something about the HitI interface? I couldn't find those two functions there.
> >>>
> >>>Thanks,
> >>>Angshu
> >>>
> >>>
> >>>On 12/4/05, Jason Stajich <jason.stajich at duke.edu > wrote:
> >>>
> >>>>frac_identical gives the fraction of bases in the HSP that are
> >>>>identical.
> >>>>
> >>>>overlap would be calculated from (length of the query aligned) /
> >>>>(length of query) or (length of hit aligned) / (length of hit).
> >>>>
> >>>>So for an HSP you can calculate this
> >>>>$fracqaligned = $hsp->query->length / $result->query_length;
> >>>>$frachaligned = $hsp->hit->length / $hit->length;
> >>>>
> >>>>But remember there may be multiple HSPs so you may need to merge the
> >>>>HSP information to get the total length aligned. If there is a
> >>>>repeated domain or multiple high scoring suboptimal alignments this
> >>>>can cause things to get a little tricky.
> >>>>
> >>>>There are two methods provided in the HitI interface called
> >>>>frac_aligned_query() and frac_aligned_hit() do try and take all of
> >>>>this into account for you, but I admit this is less well tested
> >>>>code. But do give them a try:
> >>>>$fracqaligned = $hit->frac_aligned_query();
> >>>>$frachaligned = $hit->frac_aligned_hit();
> >>>>
> >>>>
> >>>>If you are using WU-BLASTP add the -postsw option to get a refined
> >>>>alignment which will merge HSPs where appropriate so you should use
> >>>>that.
> >>>>
> >>>>You can also use the -links option to get WU-BLAST to get the logical
> >>>>ordering and a consistent path through the HSPs.
> >>>>
> >>>>
> >>>>On Dec 4, 2005, at 8:32 PM, Angshu Kar wrote:
> >>>>
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>
> >>>> > To begin with, I'm new to Bioperl.
> >>>>
> >>>>>Now, I've written the following simple piece of code to parse a WU-
> >>>>>Blast
> >>>>>output which filters data *for a given e-value and >50% overlap*.
> >>>>>
> >>>>>I'm writing the main algorithm here:
> >>>>>
> >>>>>my $blast_report = $ARG[1];
> >>>>>my $threshold_evalue = $ARG[2];
> >>>>>
> >>>>>my $in = new Bio::SearchIO(-format => 'blast', -file =>
> >>>>>$blast_report);
> >>>>>
> >>>>>while (my $result = $in -> next_result)
> >>>>> {
> >>>>> while(my $hit = $result->next_hit)
> >>>>> {
> >>>>> if(($line{$hit->name} == $line{$result->query_accession}))
> >>>>> {
> >>>>> next;
> >>>>> }
> >>>>> if($hit->hsp->evalue <= $threshold_evalue)
> >>>>> {
> >>>>> if($hit->hsp->frac_indentical>=0.5)
> >>>>> {
> >>>>> print $line{$result->query_accession} . "\t" .
> >>>>>$line{$hit->name} . "\t" . $hit->hsp-evalue . "\n";
> >>>>> }
> >>>>> }
> >>>>> }
> >>>>>}
> >>>>>
> >>>>>My questions are:
> >>>>>
> >>>>>1. does the frac_identical gives the measure of % overlap? Or, are
> >>>>>there any
> >>>>>other methods?
> >>>>>2. now, i don't have any blast data sets to test my code upon.could
> >>>>>any of
> >>>>>the experienced users let me know whether the algorithm is fine?any
> >>>>>tip-offs on any point (from optimization to syntactical errors) are
> >>>>>heartily
> >>>>>welcome.
> >>>>>3. could any one please let me know if i can find sample wu-blast
> >>>>>outputs to
> >>>>>test my script upon?
> >>>>
> >>>>http://fungal.genome.duke.edu/~jes12/BGT203.2005/sample_reports/
> >>>>
> >>>>Also checkout the biodata repository from bioperl and look in the
> >>>>DB_Searching directory, we had started a project cataloging example
> >>>>reports in all the different formats. This sort of fizzled out, but
> >>>>could still use some volunteers to better organize things and
> >>>>incorporate more examples.
> >>>>
> >>>> http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/biodata/
> >>>>DB_Searching/
> >>>>
> >>>>
> >>>>
> >>>>>Appreciate your guidance.
> >>>>>
> >>>>>Thanks,
> >>>>>Angshu
> >>>>>
> >>>>>_______________________________________________
> >>>>>Bioperl-l mailing list
> >>>>>Bioperl-l at portal.open-bio.org
> >>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>>--
> >>>>Jason Stajich
> >>>>Duke University
> >>>>http://www.duke.edu/~jes12
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>--
> >>>Jason Stajich
> >>>Duke University
> >>>http://www.duke.edu/~jes12
> >>>
> >>>
> >>
> >>
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
>
>
> --
> ------------------------------------------------------------------
> Andrew Walsh, M.Sc.
> Senior Bioinformatics Software Engineer
> IT Unit
> Cenix BioScience GmbH
> Tatzberg 47
> 01307 Dresden
> Germany
> Tel. +49-351-4173 137
> Fax +49-351-4173 109
>
> public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
> ------------------------------------------------------------------
>
>
More information about the Bioperl-l
mailing list