[Bioperl-l] Tricky pairwise sequence alignment for mtDNA

Mon Jun 12 19:52:45 UTC 2006

Hello all,

I am doing a project relating to some forensic analysis of mitochondrial
DNA. 

I would like to write a script that will take a reference sequence, in
this case the Anderson sequence which is the standard mitochondrial
sequence which sample sequences are compared to, and compare it to an
unknown sequence.

I have been using this script:

use Bio::SearchIO;
use strict;
my $fh;
my @nomatches;
open($fh, "bl2seq -i refseqs/andhv2.fa -j refseqs/testhv2.fa -p blastn |") || die $!;

my $parser = Bio::SearchIO->new(-format => 'blast',fh => $fh);

if( my $result = $parser->next_result ) { 
     if( my $hit = $result->next_hit ) {   
     if( my $hsp = $hit->next_hsp ) { 
         my ( @qmismatches) = $hsp->seq_inds('query', 'nomatch');
	 my ( @hitbases) = $hsp->hit_string;
	 my ( @querybases) = $hsp->query_string;
	 my $seq_string = join("", at querybases);
	 my $seq_string1 = join("", at hitbases);
         for my $base (  @qmismatches ) {
            print "base $base of the hit sequence is a mismatch: ";
	    print substr $seq_string, $base-1, 1;
	    print "->";
            print substr $seq_string1, $base-1, 1;
            print "\n";
        }

     }
     }
}

The problem is, that some mitochondrial sequences from individuals have
insertions, deletion etc, that cause them to be offset from the
reference sequence, this then offsets the numbering system.

To provide an example:

>Anderson Reference Sequence|HV2
ATTTGGT...
1234567

>Sample|HV2....
ATTTG|C|GT
12345,5.1,67

The |C| denote an insertion, and traditionally in the forensics community
this would be called position 5.1G, but the program reads it as position 6.

So basically I need to figure out how to modify a perl script in order to recognize 
that 5.1G is an insertion, and that it is not position 6, position 6 is actually 
the G to the right of it, followed by position 7-T.

Any ideas and suggestions would be greatly helpful, I know this could be very tricky,
or very easy - I just have come to the point where the idea flow has stopped and would 
love to gather some outside input.

Thanks
Colin Erdman
colin.erdman at du.edu
Undergraduate Research Associate
Institute For Forensic Genetic
University of Denver