[Bioperl-l] cigar string in GenericHSP

Juguang Xiao juguang at fugu-sg.org
Wed Mar 12 10:28:15 EST 2003

Hi all,

I added a method named cigar_string to Bio::Search::HSP::GenericHSP.
The need for a cigar string arises when we annotate a genome and store
the results in an ensembl database, version 9 or above. I have attached
a description of the cigar string concept at the end of this email.

Now a very simple script can get the cigar string from an hsp, and it
works for all flavors of blast.

my $factory = new Bio::SearchIO( -format => 'blast', -file => 
my $hsp = $factory->next_result->next_hit->next_hsp; # supposed to be 
my $cigar_string = $hsp->cigar_string;

Besides this, I also wrote a static method, generate_cigar_string, that
builds the cigar string from 2 equal-length sequences. You can call it
directly if you already have the aligned sequences.

my $qstr = 'tacgcta--tacgcta--cactg-c';
my $hstr = 'tac---tacgt----ctacgca---cc';
my $cigar_string = Bio::Search::HSP::GenericHSP::generate_cigar_string 
($qstr, $hstr);
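
To illustrate the idea of walking two aligned strings column by column,
here is a minimal standalone sketch. It is not the GenericHSP code: the
name sketch_cigar is hypothetical, and I am assuming that '-' marks a
gap, that a gap in the query is reported as D and a gap in the hit as I,
and that every run is written as a count followed by its letter.

```perl
use strict;
use warnings;

# Hypothetical helper, not the GenericHSP implementation.
# Assumptions: '-' marks a gap; a gap in the query counts as D,
# a gap in the hit counts as I; each run is written <count><letter>.
sub sketch_cigar {
    my ($qstr, $hstr) = @_;
    die "aligned strings differ in length\n"
        unless length($qstr) == length($hstr);

    my ($cigar, $state, $count) = ('', '', 0);
    for my $i (0 .. length($qstr) - 1) {
        my $q = substr($qstr, $i, 1);
        my $h = substr($hstr, $i, 1);
        die "gap aligned to gap at column $i\n"
            if $q eq '-' && $h eq '-';

        # Classify this column, then extend or close the current run.
        my $op = $q eq '-' ? 'D' : $h eq '-' ? 'I' : 'M';
        if ($op eq $state) {
            $count++;
        } else {
            $cigar .= $count . $state if $count;
            ($state, $count) = ($op, 1);
        }
    }
    $cigar .= $count . $state if $count;   # flush the last run
    return $cigar;
}

print sketch_cigar('tacg--ta', 'ta-gcata'), "\n";   # prints 2M1I1M2D2M
```

The length check comes first because the column-by-column walk only
makes sense when both strings come from the same alignment.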

t/cigarstring.t tests both methods.

Suggestions or questions? Thanks


Copied from the ensembl doc:

Sequence alignment hits were previously stored within the core database
as ungapped alignments. This imposed 2 major constraints on alignments:

a) alignments for a single hit record would require multiple rows in the
database, and
b) it was not possible to accurately retrieve the exact original
alignment.

Therefore, in the new branch sequence alignments are now stored as 
alignments in the cigar line format (where CIGAR stands for Concise
Idiosyncratic Gapped Alignment Report).

In the cigar line format alignments are stored as follows:

M: Match
D: Deletion
I: Insertion
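
(Not part of the ensembl doc.) For reading such a line back, here is a
rough sketch that splits a cigar line into runs. The name parse_cigar is
hypothetical, and I am assuming that each letter may be preceded by a
run length and that a missing run length means 1.

```perl
use strict;
use warnings;

# Hypothetical helper: split a cigar line into [count, op] pairs.
# Assumption: an op letter without a leading number has a count of 1.
sub parse_cigar {
    my ($cigar) = @_;
    die "unexpected characters in cigar line '$cigar'\n"
        unless $cigar =~ /\A(?:\d*[MDI])*\z/;

    my @runs;
    while ($cigar =~ /(\d*)([MDI])/g) {
        push @runs, [ length($1) ? $1 : 1, $2 ];
    }
    return @runs;
}

my @runs = parse_cigar('2M1I1M2D2M');
print join(' ', map { "$_->[0]$_->[1]" } @runs), "\n";   # prints 2M 1I 1M 2D 2M
```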

An example of an alignment for a hypothetical protein match is shown 

             PG    P    G     GP   R      PLGP

protein_align_feature table as the following cigar line:

