[Bioperl-l] cigar string in GenericHSP

Juguang Xiao juguang at fugu-sg.org
Wed Mar 12 10:28:15 EST 2003


Hi all,

I added a method named cigar_string to Bio::Search::HSP::GenericHSP.
The cigar string issue arises when we annotate a genome and store the
results in an Ensembl database, version 9 or above. I have attached a
description of the cigar string at the end of this email.

Now a very simple script can get the cigar string from an HSP, and it
works for all flavors of BLAST.

my $factory = Bio::SearchIO->new(-format => 'blast',
                                 -file   => 't/data/blast.report');
# the HSP is expected to be a Bio::Search::HSP::GenericHSP
my $hsp = $factory->next_result->next_hit->next_hsp;
my $cigar_string = $hsp->cigar_string;

Besides this, I also wrote a static function, generate_cigar_string,
that builds the cigar string from two equal-length gapped sequences.
You can call it directly if you already have the aligned sequences.

my $qstr = 'tacgcta--tacgcta--cactg-c';
my $hstr = 'tac---tacgt----ctacgca---cc';
my $cigar_string =
    Bio::Search::HSP::GenericHSP::generate_cigar_string($qstr, $hstr);
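
A rough sketch of the idea, for readers who want to see the logic
without BioPerl installed. This is not the committed code:
sketch_cigar_string is a hypothetical name, and the conventions are
assumptions taken from the Ensembl example at the end of this email
(a gap in the query is D, a gap in the hit is I, and a count of 1 is
left out). The committed method may differ in these details.

```perl
use strict;
use warnings;

# Hypothetical helper, not the BioPerl implementation.
# Assumed conventions: gap in query => D, gap in hit => I,
# both residues present => M, and a run of length 1 omits its count.
sub sketch_cigar_string {
    my ($qstr, $hstr) = @_;
    die "aligned strings differ in length\n"
        unless length($qstr) == length($hstr);

    my ($cigar, $prev, $count) = ('', '', 0);
    for my $i (0 .. length($qstr) - 1) {
        my $q = substr($qstr, $i, 1);
        my $h = substr($hstr, $i, 1);
        next if $q eq '-' && $h eq '-';   # skip columns that are all gap
        my $op = $q eq '-' ? 'D' : $h eq '-' ? 'I' : 'M';
        if ($op eq $prev) {
            $count++;                     # extend the current run
        }
        else {
            # flush the finished run, then start a new one
            $cigar .= ($count > 1 ? $count : '') . $prev if $count;
            ($prev, $count) = ($op, 1);
        }
    }
    $cigar .= ($count > 1 ? $count : '') . $prev if $count;  # last run
    return $cigar;
}

# The two gapped rows of the Ensembl example, without the trailing "..."
my $query = 'PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL';
my $hit   = 'PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD';
print sketch_cigar_string($query, $hit), "\n";   # prints 7M4D12M2I2MD7M
```

The length check comes first because the column-by-column walk only
makes sense when both strings come from the same alignment.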

The tests are in t/cigarstring.t.

Suggestions or questions? Thanks

Juguang

----------
Copied from ensembl doc.

Sequence alignment hits were previously stored within the core database
as ungapped alignments. This imposed 2 major constraints on alignments:

a) alignments for a single hit record would require multiple rows in the
database, and
b) it was not possible to accurately retrieve the exact original
alignment.

Therefore, in the new branch sequence alignments are now stored as
gapped alignments in the cigar line format (where CIGAR stands for
Concise Idiosyncratic Gapped Alignment Report).

In the cigar line format alignments are stored as follows:

M: Match
D: Deletion
I: Insertion

An example of an alignment for a hypothetical protein match is shown
below:


Query:   42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
             PG    P    G     GP   R      PLGP
Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...


This alignment would be stored in the protein_align_feature table as
the following cigar line:

7M4D12M2I2MD7M
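
To read the example: the 4 gap columns in the query become 4D, the 2
gap columns in the subject become 2I, and the lone query gap becomes a
bare D with the count of 1 left out. The sketch below checks the
arithmetic. parse_cigar_line is a hypothetical helper written for this
illustration, not part of Ensembl or BioPerl, and it assumes only the
M, D and I codes listed above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a cigar line into [count, op] pairs.
# A missing count is taken to mean 1.
sub parse_cigar_line {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /\G(\d*)([MDI])/gc) {
        push @ops, [ $1 eq '' ? 1 : $1, $2 ];
    }
    # with /c, pos() stays at the last match, so anything left is junk
    die "unparsable cigar line: $cigar\n"
        unless defined pos($cigar) && pos($cigar) == length $cigar;
    return @ops;
}

my %total = (M => 0, D => 0, I => 0);
$total{ $_->[1] } += $_->[0] for parse_cigar_line('7M4D12M2I2MD7M');

# In this example M and I columns hold a query residue,
# and M and D columns hold a subject residue.
my $query_residues = $total{M} + $total{I};
my $hit_residues   = $total{M} + $total{D};
my $columns        = $total{M} + $total{D} + $total{I};

print "query residues: $query_residues\n";   # 30
print "hit residues:   $hit_residues\n";     # 33
print "columns:        $columns\n";          # 35
```

The totals agree with the alignment shown: 35 columns, of which 5 are
gaps in the query row and 2 are gaps in the subject row.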


