[Bioperl-l] Tricky pairwise sequence alignment for mtDNA

Tue Jun 13 16:05:30 UTC 2006

I actually have found EMBOSS DiffSeq to work quite well for detecting
the insertions and SNPs in the "sample sequence" as compared to the
"reference sequence". 

If I get this all figured out and integrated, I will post a method; I
imagine it would prove useful to others as well.
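
In the meantime, here is a minimal sketch of the renumbering idea from my
original message (quoted below). It is not the DiffSeq approach and not
anything BioPerl provides: forensic_diffs is a hypothetical helper, and the
"del" label for deletions is just a notation I picked for illustration. It
assumes you already have the two gapped alignment strings, reference first,
with '-' for gaps (for example from $hsp->query_string and $hsp->hit_string
when the reference is the query), plus the reference position at which the
alignment starts.

```perl
use strict;
use warnings;

# Walk a gapped pairwise alignment and report differences using
# reference-based numbering. Hypothetical helper, not part of BioPerl.
sub forensic_diffs {
    my ($ref_aln, $sample_aln, $ref_start) = @_;
    $ref_start = 1 unless defined $ref_start;
    die "aligned strings differ in length\n"
        unless length($ref_aln) == length($sample_aln);

    my $ref_pos   = $ref_start - 1;  # last reference base consumed so far
    my $ins_count = 0;               # bases inserted since that reference base
    my @diffs;

    for my $i (0 .. length($ref_aln) - 1) {
        my $r = uc substr($ref_aln,    $i, 1);
        my $s = uc substr($sample_aln, $i, 1);

        if ($r eq '-') {
            next if $s eq '-';       # gap in both strings: nothing to report
            # Insertion: the reference position does not advance, so the
            # inserted base is numbered <previous position>.<n>.
            $ins_count++;
            push @diffs, "$ref_pos.$ins_count$s";
            next;
        }

        $ref_pos++;                  # a real reference base was consumed
        $ins_count = 0;
        if    ($s eq '-') { push @diffs, "${ref_pos}del"; }  # deletion
        elsif ($s ne $r)  { push @diffs, "$ref_pos$s"; }     # substitution
    }
    return @diffs;
}

# The example from the quoted message: a C inserted after reference base 5.
my @diffs = forensic_diffs('ATTTG-GT', 'ATTTGCGT');
print join(', ', @diffs), "\n";      # prints "5.1C"
```

The point of the design is that only columns holding a reference base advance
the position counter, so bases to the right of an insertion keep their
reference numbers (the G after the inserted C is still position 6).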

Thanks all,
Colin

On Tue, 2006-06-13 at 08:19 -0400, aaron.j.mackey at gsk.com wrote:
> See Bio::LocatableSeq
> 
> -Aaron
> 
> bioperl-l-bounces at lists.open-bio.org wrote on 06/12/2006 03:52:45 PM:
> 
> > Hello all,
> > 
> > I am doing a project relating to some forensic analysis of mitochondrial
> > DNA. 
> > 
> > I would like to write a script that takes a reference sequence, in this
> > case the Anderson sequence (the standard mitochondrial sequence against
> > which sample sequences are compared), and compares it to an unknown
> > sequence.
> > 
> > I have been using this script:
> > 
> > use strict;
> > use warnings;
> > use Bio::SearchIO;
> > 
> > # Align the reference (query, -i) against the sample (hit, -j) with bl2seq.
> > open(my $fh, "bl2seq -i refseqs/andhv2.fa -j refseqs/testhv2.fa -p blastn |")
> >     || die $!;
> > 
> > my $parser = Bio::SearchIO->new(-format => 'blast', -fh => $fh);
> > 
> > if ( my $result = $parser->next_result ) {
> >     if ( my $hit = $result->next_hit ) {
> >         if ( my $hsp = $hit->next_hsp ) {
> >             # Query positions that do not match the hit.
> >             my @qmismatches = $hsp->seq_inds('query', 'nomatch');
> >             my $seq_string  = $hsp->query_string;
> >             my $seq_string1 = $hsp->hit_string;
> >             # Indexing the aligned strings by sequence position only
> >             # lines up while the HSP starts at base 1 and has no gaps
> >             # before $base.
> >             for my $base (@qmismatches) {
> >                 print "base $base of the query sequence is a mismatch: ";
> >                 print substr $seq_string, $base - 1, 1;
> >                 print "->";
> >                 print substr $seq_string1, $base - 1, 1;
> >                 print "\n";
> >             }
> >         }
> >     }
> > }
> > 
> > 
> > The problem is that some individuals' mitochondrial sequences have
> > insertions, deletions, etc. that offset them from the reference
> > sequence, which in turn offsets the numbering system.
> > 
> > To provide an example:
> > 
> > >Anderson Reference Sequence|HV2
> > ATTTGGT...
> > 1234567
> > 
> > >Sample|HV2....
> > ATTTG|C|GT
> > 12345,5.1,67
> > 
> > The |C| denotes an insertion; traditionally in the forensics community
> > this would be called position 5.1C, but the program reads it as
> > position 6.
> > 
> > So basically I need to figure out how to modify a Perl script to
> > recognize that 5.1C is an insertion and not position 6; position 6 is
> > actually the G to the right of it, followed by position 7, T.
> > 
> > Any ideas and suggestions would be greatly helpful. I know this could
> > be very tricky or very easy; I have just come to the point where the
> > idea flow has stopped and would love to gather some outside input.
> > 
> > Thanks
> > Colin Erdman
> > colin.erdman at du.edu
> > Undergraduate Research Associate
> > Institute For Forensic Genetic
> > University of Denver 