[Bioperl-l] reformating dna sequence containing ambiguity symbols

Smithies, Russell Russell.Smithies at agresearch.co.nz
Mon Mar 16 03:30:24 UTC 2009


I've just been reformatting some fasta from dbSNP that contains ambiguity symbols and had to come up with a non-standard solution as I needed to turn validation off in Bio::Seq but couldn't see an obvious way to do it.

dbSNP format the fasta for their rs* SNPs like this:

>gnl|dbSNP|rs29025902 rs=29025902|pos=251|len=501|taxid=9913|mol="genomic"|class=1|alleles="A/G"|build=125
AGGCTACCAA TAGGACATCA CTGACTGTGA GGCTGGGAAG AAAGACCGAG AAGCACCCCA GGACACAGAA TCTCCTTCAC
ATACAGAGGC AGTGGACACA TAGAGTACAG GCAGCGGTAA AATGGAGTAA AAAATTAGAT AGTGGTGAGT TTTGAGTATG
AATTGCCTTT GTTTTTAAAT TAGTTCTAAG TTTATAAGAC AAGTTTTATT TTTTTATTTT ATTTATTTGC CACATGGCTT
GTGGGTTTGC 
R
GGATCTTAGT TCCCTGACCA GGGATTGAAC CTGTGCCCTC AGCAGTGAAA ACATGGAGTC CTAACAACTG GACCACCAGG
GAATTCCCTA TATGACTTAA TTTTTAATAA TATTTGTAGC TAACAATTGA CATGCAGAGT TCCTGAAGAT TTAAGCATGG
GCTCCCATGA ACCAGTATGA ACCAGCTCCA GCACAGCACA GGTTTTGTTT TACTTTTGGA GGGGAGGTTT GATTGTGTCT
ACATGCTAAT

But I needed it (for the Sequenom) with the ambiguity symbol on brackets like this:

>gnl|dbSNP|rs29025902
AGGCTACCAATAGGACATCACTGACTGTGAGGCTGGGAAGAAAGACCGAGAAGCACCCCAGGACACAGAATCTCCTTCACATACAGAGGCAGTGGACACATAGAGTACAGGCAGCGGTAAAATGGAGTAAAAAATTAGATAGTGGTGAGTTTTGAGTATGAATTGCCTTTGTTTTTAAATTAGTTCTAAGTTTATAAGACAAGTTTTATTTTTTTATTTTATTTATTTGCCACATGGCTTGTGGGTTTGC[A/G]GGATCTTAGTTCCCTGACCAGGGATTGAACCTGTGCCCTCAGCAGTGAAAACATGGAGTCCTAACAACTGGACCACCAGGGAATTCCCTATATGACTTAATTTTTAATAATATTTGTAGCTAACAATTGACATGCAGAGTTCCTGAAGATTTAAGCATGGGCTCCCATGAACCAGTATGAACCAGCTCCAGCACAGCACAGGTTTTGTTTTACTTTTGGAGGGGAGGTTTGATTGTGTCTACATGCTAAT

I came up with a 50% BioPerl solution using Bio::SeqIO and Bio::Tools::IUPAC  but the final printing of the fasta is dome 'manually'.
It's a bit hacky but I'm particularly proud of the obscurity I managed in my switch statement  :-)

###############################
#!perl -w

use Bio::SeqIO;
use Bio::Tools::IUPAC;
use Switch;

my $seq_in = Bio::SeqIO->new(-file=>$ARGV[0], -format=>"fasta" ) or die $!;

while (my $seqobj = $seq_in->next_seq) {
	my $seq = sprintf ">%s\n", $seqobj->display_id;
	my $iupac_seq  = new Bio::Tools::IUPAC(-seq => $seqobj);
		foreach (@{$iupac_seq->{_alpha}}){
			switch($#{@{$_}}){
				case 0{$seq.= @{$_}[0]}
				case 1{$seq .= sprintf "[%s]",join("/",@{$_})}
				else  {$seq .= 'N'}
			}
		}
	print "$seq\n";
}
#######################

:-)

--Russell 


Russell Smithies 
Bioinformatics Applications Developer 
T +64 3 489 9085 
E  russell.smithies at agresearch.co.nz 
Invermay  Research Centre 
Puddle Alley, 
Mosgiel, 
New Zealand 
T  +64 3 489 3809   
F  +64 3 489 9174  
www.agresearch.co.nz 

Toitu te whenua, Toitu te tangata 
Sustain the land, Sustain the people 






=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================




More information about the Bioperl-l mailing list