[Bioperl-l] parsing a BLAST output
Andrew Walsh
walsh at cenix-bioscience.com
Wed Dec 14 17:12:06 EST 2005
If you're not sure what your code is doing, you should write a test for
it. Create different alignments that meet and don't meet your
requirements and test that your code does the correct thing.
Angshu Kar wrote:
> Hi Jason,
>
> So again, what I want is the following:
>
> query sequence
> ----------------------------------
> | alignment |
> --------------------------------
> hit sequence
>
> The length of the alignment should be at least 50% of the
> length of the query sequence, and at least 50% of the length
> of the hit sequence.
>
> Now I try to convert it to bioperl language:
>
>
> if($hit->length_aln()/hit->length()>=0.5 and
> $hit->length_aln()/result->query_length()>=0.5)
> {
> print $line{$result->query_accession} . "\t" . $line{$hit->name} .
> "\t" . $hit->hsp-evalue . "\n";
> }
>
> Is the statement after "if" correctly represent what I mean?
>
> Appreciate your guidance.
>
> Thanks,
> Angshu
>
>
>
> On 12/13/05, Angshu Kar <angshu96 at gmail.com> wrote:
>
>>Thanks Jason...
>>
>>
>>
>>On 12/13/05, Jason Stajich <jason.stajich at duke.edu> wrote:
>>
>>>http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/GenericHit.html#POD30
>>>
>>>
>>>http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/GenericHit.html#POD31
>>>
>>>On Dec 13, 2005, at 10:41 AM, Angshu Kar wrote:
>>>
>>>
>>>Hi Jason,
>>>
>>>Could you please send in the link where I can find something about the HitI interface? I couldn't find those two functions there.
>>>
>>>Thanks,
>>>Angshu
>>>
>>>
>>>On 12/4/05, Jason Stajich <jason.stajich at duke.edu > wrote:
>>>
>>>>frac_identical gives the fraction of bases in the HSP that are
>>>>identical.
>>>>
>>>>overlap would be calculated from (length of the query aligned) /
>>>>(length of query) or (length of hit aligned) / (length of hit).
>>>>
>>>>So for an HSP you can calculate this
>>>>$fracqaligned = $hsp->query->length / $result->query_length;
>>>>$frachaligned = $hsp->hit->length / $hit->length;
>>>>
>>>>But remember there may be multiple HSPs so you may need to merge the
>>>>HSP information to get the total length aligned. If there is a
>>>>repeated domain or multiple high scoring suboptimal alignments this
>>>>can cause things to get a little tricky.
>>>>
>>>>There are two methods provided in the HitI interface called
>>>>frac_aligned_query() and frac_aligned_hit() do try and take all of
>>>>this into account for you, but I admit this is less well tested
>>>>code. But do give them a try:
>>>>$fracqaligned = $hit->frac_aligned_query();
>>>>$frachaligned = $hit->frac_aligned_hit();
>>>>
>>>>
>>>>If you are using WU-BLASTP add the -postsw option to get a refined
>>>>alignment which will merge HSPs where appropriate so you should use
>>>>that.
>>>>
>>>>You can also use the -links option to get WU-BLAST to get the logical
>>>>ordering and a consistent path through the HSPs.
>>>>
>>>>
>>>>On Dec 4, 2005, at 8:32 PM, Angshu Kar wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>
>>>> > To begin with, I'm new to Bioperl.
>>>>
>>>>>Now, I've written the following simple piece of code to parse a WU-
>>>>>Blast
>>>>>output which filters data *for a given e-value and >50% overlap*.
>>>>>
>>>>>I'm writing the main algorithm here:
>>>>>
>>>>>my $blast_report = $ARG[1];
>>>>>my $threshold_evalue = $ARG[2];
>>>>>
>>>>>my $in = new Bio::SearchIO(-format => 'blast', -file =>
>>>>>$blast_report);
>>>>>
>>>>>while (my $result = $in -> next_result)
>>>>> {
>>>>> while(my $hit = $result->next_hit)
>>>>> {
>>>>> if(($line{$hit->name} == $line{$result->query_accession}))
>>>>> {
>>>>> next;
>>>>> }
>>>>> if($hit->hsp->evalue <= $threshold_evalue)
>>>>> {
>>>>> if($hit->hsp->frac_indentical>=0.5)
>>>>> {
>>>>> print $line{$result->query_accession} . "\t" .
>>>>>$line{$hit->name} . "\t" . $hit->hsp-evalue . "\n";
>>>>> }
>>>>> }
>>>>> }
>>>>>}
>>>>>
>>>>>My questions are:
>>>>>
>>>>>1. does the frac_identical gives the measure of % overlap? Or, are
>>>>>there any
>>>>>other methods?
>>>>>2. now, i don't have any blast data sets to test my code upon.could
>>>>>any of
>>>>>the experienced users let me know whether the algorithm is fine?any
>>>>>tip-offs on any point (from optimization to syntactical errors) are
>>>>>heartily
>>>>>welcome.
>>>>>3. could any one please let me know if i can find sample wu-blast
>>>>>outputs to
>>>>>test my script upon?
>>>>
>>>>http://fungal.genome.duke.edu/~jes12/BGT203.2005/sample_reports/
>>>>
>>>>Also checkout the biodata repository from bioperl and look in the
>>>>DB_Searching directory, we had started a project cataloging example
>>>>reports in all the different formats. This sort of fizzled out, but
>>>>could still use some volunteers to better organize things and
>>>>incorporate more examples.
>>>>
>>>> http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/biodata/
>>>>DB_Searching/
>>>>
>>>>
>>>>
>>>>>Appreciate your guidance.
>>>>>
>>>>>Thanks,
>>>>>Angshu
>>>>>
>>>>>_______________________________________________
>>>>>Bioperl-l mailing list
>>>>>Bioperl-l at portal.open-bio.org
>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>--
>>>>Jason Stajich
>>>>Duke University
>>>>http://www.duke.edu/~jes12
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>--
>>>Jason Stajich
>>>Duke University
>>>http://www.duke.edu/~jes12
>>>
>>>
>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
--
------------------------------------------------------------------
Andrew Walsh, M.Sc.
Senior Bioinformatics Software Engineer
IT Unit
Cenix BioScience GmbH
Tatzberg 47
01307 Dresden
Germany
Tel. +49-351-4173 137
Fax +49-351-4173 109
public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
------------------------------------------------------------------
More information about the Bioperl-l
mailing list