[Bioperl-l] cigar string in GenericHSP
Juguang Xiao
juguang at fugu-sg.org
Wed Mar 12 10:28:15 EST 2003
Hi all,
I added a method named cigar_string to Bio::Search::HSP::GenericHSP.
The cigar string issue arises when we try to annotate a genome and store
the results in an ensembl database, version 9 and above. I attach the
concept of the cigar string at the end of this email.
Now a very simple script can get the cigar string from an hsp, and it
works for all flavors of blast.
my $factory = Bio::SearchIO->new( -format => 'blast',
                                  -file   => 't/data/blast.report' );
# the returned hsp is supposed to be a GenericHSP
my $hsp = $factory->next_result->next_hit->next_hsp;
my $cigar_string = $hsp->cigar_string;
Besides this, I also wrote a static method, generate_cigar_string, that
builds the cigar string from two equal-length sequences. You can call it
directly if you already have the aligned sequences.
my $qstr = 'tacgcta--tacgcta--cactg-c';
my $hstr = 'tac---tacgt----ctacgca---cc';
my $cigar_string =
    Bio::Search::HSP::GenericHSP::generate_cigar_string($qstr, $hstr);
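To make the idea concrete, here is a minimal sketch of how a cigar string can be derived from two gapped, equal-length strings. It is not the code in GenericHSP; sketch_cigar is a hypothetical name. It assumes the convention of the ensembl example quoted at the end of this mail: a gap in the query counts as D, a gap in the hit counts as I, and a run length of 1 is written without the number.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only, not the GenericHSP method.
# Assumed convention: gap in query => D, gap in hit => I, otherwise M.
sub sketch_cigar {
    my ($qstr, $hstr) = @_;
    die "sequences must have equal length\n"
        unless length($qstr) == length($hstr);

    my @runs;    # each element is [ op, count ]
    for my $i (0 .. length($qstr) - 1) {
        my $q = substr($qstr, $i, 1);
        my $h = substr($hstr, $i, 1);
        next if $q eq '-' && $h eq '-';    # a gap on both sides carries no information

        my $op = $q eq '-' ? 'D'
               : $h eq '-' ? 'I'
               :             'M';

        if (@runs && $runs[-1][0] eq $op) {
            $runs[-1][1]++;                # extend the current run
        } else {
            push @runs, [ $op, 1 ];        # start a new run
        }
    }

    # A count of 1 is omitted, as in "2MD7M".
    return join '', map { ($_->[1] > 1 ? $_->[1] : '') . $_->[0] } @runs;
}

my $q = 'PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL';
my $h = 'PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD';
print sketch_cigar($q, $h), "\n";    # prints 7M4D12M2I2MD7M
```

The two strings are the query and subject rows of the ensembl example, so the output can be compared with the cigar line given there.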
t/cigarstring.t serves as the test.
Suggestions or questions? Thanks
Juguang
----------
Copied from ensembl doc.
Sequence alignment hits were previously stored within the core database
as ungapped alignments. This imposed 2 major constraints on alignments:
a) alignments for a single hit record would require multiple rows in the
database, and
b) it was not possible to accurately retrieve the exact original
alignment.
Therefore, in the new branch sequence alignments are now stored as
gapped alignments in the cigar line format (where CIGAR stands for
Concise Idiosyncratic Gapped Alignment Report).
In the cigar line format alignments are stored as follows:
M: Match
D: Deletion
I: Insertion
An example of an alignment for a hypothetical protein match is shown
below:
Query:   42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
            PG P G GP R PLGP
Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...
This alignment would be stored in the protein_align_feature table as the
following cigar line:
7M4D12M2I2MD7M
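Reading the line back is straightforward. The sketch below splits a cigar line into runs and counts how many residues each sequence contributes. The helper names parse_cigar and aligned_lengths are hypothetical, and it assumes the convention visible in the example above: M covers both sequences, D is a gap in the query (hit residues only), I is a gap in the hit (query residues only), and a missing number means 1.

```perl
use strict;
use warnings;

# Hypothetical helper: split a cigar line into [ op, count ] pairs.
sub parse_cigar {
    my ($cigar) = @_;
    die "malformed cigar line: $cigar\n"
        unless $cigar =~ /\A(?:\d*[MDI])*\z/;

    my @runs;
    while ($cigar =~ /(\d*)([MDI])/g) {
        push @runs, [ $2, $1 eq '' ? 1 : $1 ];    # missing count means 1
    }
    return @runs;
}

# Hypothetical helper: number of query and hit residues covered.
sub aligned_lengths {
    my ($cigar) = @_;
    my ($query_len, $hit_len) = (0, 0);
    for my $run (parse_cigar($cigar)) {
        my ($op, $n) = @$run;
        $query_len += $n if $op eq 'M' || $op eq 'I';    # I: residues only in the query
        $hit_len   += $n if $op eq 'M' || $op eq 'D';    # D: residues only in the hit
    }
    return ($query_len, $hit_len);
}

my ($qlen, $hlen) = aligned_lengths('7M4D12M2I2MD7M');
print "query residues: $qlen, hit residues: $hlen\n";
# prints: query residues: 30, hit residues: 33
```

The totals agree with the alignment shown: the query row has 30 residues and the subject row has 33.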