[Bioperl-l] reformating dna sequence containing ambiguity symbols

Mon Mar 16 19:29:10 UTC 2009

Thanx Bruno,
I hadn't looked at Bio::Tools::SeqPattern before but will take a closer look.

--Rusell

From: Bruno Vecchi [mailto:vecchi.b at gmail.com]
Sent: Monday, 16 March 2009 11:05 p.m.
To: Smithies, Russell
Subject: Re: [Bioperl-l] reformating dna sequence containing ambiguity symbols

Please correct me if I didn't understand your problem correctly, but I think that the solution to your problem is Bio::Tools::SeqPattern. Here's how you would do away with ambiguity codons:

   use Bio::Tools::SeqPattern;
   my $seq = 'GTNATAARCC';

   my $pattern = Bio::Tools::SeqPattern->new(
         -seq => $seq
         -TYPE => 'Dna'
   )

   $pattern->expand;
   # outputs: GT[ATCG]ATAA[AG]CC

2009/3/16 Smithies, Russell <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>:
> Typical Microsoft Outlook changed my formatting.
> This is what dbSNP fasta looks like:
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=snp&report=fasta&mode=text&id=29011166
>
> --Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org<mailto:bioperl-l-bounces at lists.open-bio.org> [mailto:bioperl-l-<mailto:bioperl-l->
>> bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On Behalf Of Smithies, Russell
>> Sent: Monday, 16 March 2009 4:30 p.m.
>> To: 'Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>'
>> Subject: [Bioperl-l] reformating dna sequence containing ambiguity symbols
>>
>> I've just been reformatting some fasta from dbSNP that contains ambiguity
>> symbols and had to come up with a non-standard solution as I needed to turn
>> validation off in Bio::Seq but couldn't see an obvious way to do it.
>>
>> dbSNP format the fasta for their rs* SNPs like this:
>>
>> >gnl|dbSNP|rs29025902
>> rs=29025902|pos=251|len=501|taxid=9913|mol="genomic"|class=1|alleles="A/G"|bui
>> ld=125
>> AGGCTACCAA TAGGACATCA CTGACTGTGA GGCTGGGAAG AAAGACCGAG AAGCACCCCA GGACACAGAA
>> TCTCCTTCAC
>> ATACAGAGGC AGTGGACACA TAGAGTACAG GCAGCGGTAA AATGGAGTAA AAAATTAGAT AGTGGTGAGT
>> TTTGAGTATG
>> AATTGCCTTT GTTTTTAAAT TAGTTCTAAG TTTATAAGAC AAGTTTTATT TTTTTATTTT ATTTATTTGC
>> CACATGGCTT
>> GTGGGTTTGC
>> R
>> GGATCTTAGT TCCCTGACCA GGGATTGAAC CTGTGCCCTC AGCAGTGAAA ACATGGAGTC CTAACAACTG
>> GACCACCAGG
>> GAATTCCCTA TATGACTTAA TTTTTAATAA TATTTGTAGC TAACAATTGA CATGCAGAGT TCCTGAAGAT
>> TTAAGCATGG
>> GCTCCCATGA ACCAGTATGA ACCAGCTCCA GCACAGCACA GGTTTTGTTT TACTTTTGGA GGGGAGGTTT
>> GATTGTGTCT
>> ACATGCTAAT
>>
>> But I needed it (for the Sequenom) with the ambiguity symbol on brackets like
>> this:
>>
>> >gnl|dbSNP|rs29025902
>> AGGCTACCAATAGGACATCACTGACTGTGAGGCTGGGAAGAAAGACCGAGAAGCACCCCAGGACACAGAATCTCCTTC
>> ACATACAGAGGCAGTGGACACATAGAGTACAGGCAGCGGTAAAATGGAGTAAAAAATTAGATAGTGGTGAGTTTTGAG
>> TATGAATTGCCTTTGTTTTTAAATTAGTTCTAAGTTTATAAGACAAGTTTTATTTTTTTATTTTATTTATTTGCCACA
>> TGGCTTGTGGGTTTGC[A/G]GGATCTTAGTTCCCTGACCAGGGATTGAACCTGTGCCCTCAGCAGTGAAAACATGGA
>> GTCCTAACAACTGGACCACCAGGGAATTCCCTATATGACTTAATTTTTAATAATATTTGTAGCTAACAATTGACATGC
>> AGAGTTCCTGAAGATTTAAGCATGGGCTCCCATGAACCAGTATGAACCAGCTCCAGCACAGCACAGGTTTTGTTTTAC
>> TTTTGGAGGGGAGGTTTGATTGTGTCTACATGCTAAT
>>
>> I came up with a 50% BioPerl solution using Bio::SeqIO and Bio::Tools::IUPAC
>> but the final printing of the fasta is dome 'manually'.
>> It's a bit hacky but I'm particularly proud of the obscurity I managed in my
>> switch statement  :-)
>>
>> ###############################
>> #!perl -w
>>
>> use Bio::SeqIO;
>> use Bio::Tools::IUPAC;
>> use Switch;
>>
>> my $seq_in = Bio::SeqIO->new(-file=>$ARGV[0], -format=>"fasta" ) or die $!;
>>
>> while (my $seqobj = $seq_in->next_seq) {
>>       my $seq = sprintf ">%s\n", $seqobj->display_id;
>>       my $iupac_seq  = new Bio::Tools::IUPAC(-seq => $seqobj);
>>               foreach (@{$iupac_seq->{_alpha}}){
>>                       switch($#{@{$_}}){
>>                               case 0{$seq.= @{$_}[0]}
>>                               case 1{$seq .= sprintf "[%s]",join("/",@{$_})}
>>                               else  {$seq .= 'N'}
>>                       }
>>               }
>>       print "$seq\n";
>> }
>> #######################
>>
>> :-)
>>
>> --Russell
>>
>>
>> Russell Smithies
>> Bioinformatics Applications Developer
>> T +64 3 489 9085
>> E  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz>
>> Invermay  Research Centre
>> Puddle Alley,
>> Mosgiel,
>> New Zealand
>> T  +64 3 489 3809
>> F  +64 3 489 9174
>> www.agresearch.co.nz<http://www.agresearch.co.nz>
>>
>> Toitu te whenua, Toitu te tangata
>> Sustain the land, Sustain the people
>>
>>
>>
>>
>>
>>
>> =======================================================================
>> Attention: The information contained in this message and/or attachments
>> from AgResearch Limited is intended only for the persons or entities
>> to which it is addressed and may contain confidential and/or privileged
>> material. Any review, retransmission, dissemination or other use of, or
>> taking of any action in reliance upon, this information by persons or
>> entities other than the intended recipients is prohibited by AgResearch
>> Limited. If you have received this message in error, please notify the
>> sender immediately.
>> =======================================================================
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>