[Bioperl-l] cigar string in GenericHSP
Marc Logghe
Marc.Logghe at devgen.com
Wed Mar 12 11:12:13 EST 2003
Hi,
I was only wondering, if you really would like to reproduce the original
blast alignment, is the original homology string included in the cigar
string ? If I recall well, the homology sequence is not included in the aln
object and if you want to add it explicitely, bioperl complains about the
bad sequence.
Also, I was missing a writer method for blast alignments. I have written my
own , but did not yet try to plug it in Bio::SearchIO::blast. If someone
else would like to give it a try, I'll include the code I currently have
($LINE_LEN is global and should be set to 60):
sub write_result
{
my ( $self, $hsp ) = @_;
my $aln;
my $len = $hsp->length;
my $l = length($len);
my $p = ' ' x ( $l + 8 );
my $cnt = 0;
while ( ( $cnt * 60 ) < $len )
{
my ( $chunk, $start, $end );
($start) = $hsp->range('query');
( $chunk, $start, $end ) =
$self->_get_chunk( $hsp->query_string, $start, $cnt );
$aln .= sprintf( "Query: %-${l}s %s %d\n", $start, $chunk, $end );
$aln .= sprintf( "$p%s\n",
substr( $hsp->homology_string, $cnt * $LINE_LEN, $LINE_LEN
) );
($start) = $hsp->range('hit');
( $chunk, $start, $end ) =
$self->_get_chunk( $hsp->hit_string, $start, $cnt );
$aln .= sprintf( "Sbjct: %-${l}s %s %d\n", $start, $chunk, $end );
$aln .= "\n";
$cnt++;
}
return $aln;
}
sub _get_chunk
{
my ( undef, $g, $start, $cnt ) = @_;
my $u = substr( $g, 0, $cnt * $LINE_LEN );
my $gchunk = my $uchunk = substr( $g, $cnt * $LINE_LEN, $LINE_LEN );
$uchunk =~ s/-//g;
$u =~ s/-//g;
my $start_in_u = length($u) + $start;
my $stop_in_u = $start_in_u + length($uchunk) - 1;
return ( $gchunk, $start_in_u, $stop_in_u );
}
Regards,
Marc
> -----Original Message-----
> From: Jason Stajich [mailto:jason at cgt.mc.duke.edu]
> Sent: Wednesday, March 12, 2003 3:48 AM
> To: Juguang Xiao
> Cc: Bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] cigar string in GenericHSP
>
>
> okay -
>
> This is also possible via:
>
> $hsp->get_aln->cigar_line();
>
> But I am fine adding it to the HSP if it is faster.
>
> -jason
> On Wed, 12 Mar 2003, Juguang Xiao wrote:
>
> > Hi all,
> >
> > I added one method in Bio::Search::HSP::GenericHSP, named
> cigar_string.
> > The Cigar string issue raises when we try to annotate
> genome and store
> > into ensembl 9 and above database. I attach the concept of
> cigar string
> > at the end of this email.
> >
> > Now you can have a very simple script to get cigar string from hsp,
> > which works for all favors of blast.
> >
> > my $factory = new Bio::SearchIO( -format => 'blast', -file =>
> > 't/data/blast.report');
> > my $hsp = $factory->next_result->next_hit->next_hsp; #
> supposed to be
> > GenericHSP
> > my $cigar_string = $hsp->cigar_string;
> >
> > Beside this, I also wrote a static method to
> generate_cigar_string from
> > 2 equal-length seqence, and you can use it more directly if
> you have a
> > alignment sequence.
> >
> > my $qstr = 'tacgcta--tacgcta--cactg-c';
> > my $hstr = 'tac---tacgt----ctacgca---cc';
> > my $cigar_string =
> Bio::Search::HSP::GenericHSP::generate_cigar_string
> > ($qstr, $hstr);
> >
> > t/cigarstring.t is serving to test.
> >
> > Suggestions or questions? Thanks
> >
> > Juguang
> >
> > ----------
> > Copied from ensembl doc.
> >
> > Sequence alignment hits were previously stored within the
> core database
> > as
> > ungapped alignments. This imposed 2 major constraints on alignments:
> >
> > a) alignments for a single hit record would require
> multiple rows in the
> > database, and
> > b) it was not possible to accurately retrieve the exact original
> > alignment.
> >
> > Therefore, in the new branch sequence alignments are now stored as
> > ungapped
> > alignments in the cigar line format (where CIGAR stands for Concise
> > Idiosyncratic Gapped Alignment Report).
> >
> > In the cigar line format alignments are sotred as follows:
> >
> > M: Match
> > D: Deletino
> > I: Insertion
> >
> > An example of an alignment for a hypthetical protein match is shown
> > below:
> >
> >
> > Query: 42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
> > PG P G GP R PLGP
> > Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...
> >
> >
> > protein_align_feature table as the following cigar line:
> >
> > 7M4D12M2I2MD7M
>
More information about the Bioperl-l
mailing list