[Bioperl-l] cigar_line

Sat Aug 25 00:56:04 UTC 2012

Hi there,

I'm playing around with alignment formats, and saw the function cigar_line for SimpleAlign objects.  I have a couple of questions/suggestions:

1. It looks like the cigar string is being generated with respect to the consensus sequence.  That's fine, but it would also be really useful to be able to generate it with respect to the reference (first) sequence.  Would that be easy to implement? Could you consider that as a feature request?

2. Is there any commonly accepted definition of CIGAR format? and/or has it changed in recent years? 
The definition I've seen is from the SAM format (http://samtools.sourceforge.net/SAM1.pd) and these cigar strings don't look like they're in that format.  The SAM definition carries a lot of useful information that this cigar string doesn't.

3. the 100% threshold used for generating the consensus from which cigar strings are made is very stringent (and counter-intuitive to the biologist: when I hear "consensus" I don't think 100% conserved). Also different to the default for consensus_string.  Any thoughts on changing that threshold, or maybe just making the documentation a little clearer on that?

4. deletions with respect to consensus sequence don't seem to be reported in the cigar string (see seq4 in my toy example script below). Is this a bug?

thanks for listening!

Janet

------------------------------------------------------------------- 

Dr. Janet Young 

Tapscott and Malik labs

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., C3-168, 
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 1471 fax: (206) 667 6524
email: jayoung  ...at...  fhcrc.org

------------------------------------------------------------------- 

#!/usr/bin/perl

use warnings;
use strict;
use Bio::AlignIO;

my $alignString = ">seq1\n
AGTGAGGTGATCGGTAGCTGATGCTAGTT\n
>seq2\n
AGTGAGGTGATCGGTAGCTGATGCTAGTT\n
>seq3\n
AGTGAGGTGATCGGTAGCTGATGCTAGTT\n
>seq4\n
AG-GAGGAGATCGGTAGCTGTTGCTAGTT";

my $stringfh;
open($stringfh, "<", \$alignString);
my $in  = Bio::AlignIO->new(-fh     => $stringfh,
                            -format => "fasta");

while (my $aln = $in->next_aln()) {
    my $consString3 = $aln->consensus_string(100);
    print "\nconsensus100          $consString3\n";
    my %cigars = $aln->cigar_line();
    foreach my $seqname (sort keys %cigars) {
        my $shortseqname = (split /\//, $seqname)[0];
        my $seq = $aln->get_seq_by_id($shortseqname)->seq();
        print "seqname $seqname seq $seq cigar1 $cigars{$seqname}\n";
    }
} 

##### script output:
# consensus100          AG?GAGG?GATCGGTAGCTG?TGCTAGTT
# seqname seq1/1-29 seq AGTGAGGTGATCGGTAGCTGATGCTAGTT cigar1 1,2:4,7:9,20:22,29
# seqname seq2/1-29 seq AGTGAGGTGATCGGTAGCTGATGCTAGTT cigar1 1,2:4,7:9,20:22,29
# seqname seq3/1-29 seq AGTGAGGTGATCGGTAGCTGATGCTAGTT cigar1 1,2:4,7:9,20:22,29
# seqname seq4/1-28 seq AG-GAGGAGATCGGTAGCTGTTGCTAGTT cigar1 1,6:8,19:21,28