[Bioperl-l] reformating dna sequence containing ambiguity symbols

Chris Fields cjfields at illinois.edu
Mon Mar 16 19:50:29 UTC 2009


That's probably your best bet.  You could follow it up with a s/\w{60}/ 
$1\n/g; to add in line breaks every 60 bases if you need it.

chris

On Mar 16, 2009, at 2:29 PM, Smithies, Russell wrote:

> Thanx Bruno,
> I hadn't looked at Bio::Tools::SeqPattern before but will take a  
> closer look.
>
> --Rusell
>
> From: Bruno Vecchi [mailto:vecchi.b at gmail.com]
> Sent: Monday, 16 March 2009 11:05 p.m.
> To: Smithies, Russell
> Subject: Re: [Bioperl-l] reformating dna sequence containing  
> ambiguity symbols
>
> Please correct me if I didn't understand your problem correctly, but  
> I think that the solution to your problem is Bio::Tools::SeqPattern.  
> Here's how you would do away with ambiguity codons:
>
>   use Bio::Tools::SeqPattern;
>   my $seq = 'GTNATAARCC';
>
>   my $pattern = Bio::Tools::SeqPattern->new(
>         -seq => $seq
>         -TYPE => 'Dna'
>   )
>
>   $pattern->expand;
>   # outputs: GT[ATCG]ATAA[AG]CC
>
>
> 2009/3/16 Smithies, Russell <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz 
> >>:
>> Typical Microsoft Outlook changed my formatting.
>> This is what dbSNP fasta looks like:
>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=snp&report=fasta&mode=text&id=29011166
>>
>> --Russell
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org<mailto:bioperl-l-bounces at lists.open-bio.org 
>>> > [mailto:bioperl-l-<mailto:bioperl-l->
>>> bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On  
>>> Behalf Of Smithies, Russell
>>> Sent: Monday, 16 March 2009 4:30 p.m.
>>> To: 'Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org 
>>> >'
>>> Subject: [Bioperl-l] reformating dna sequence containing ambiguity  
>>> symbols
>>>
>>> I've just been reformatting some fasta from dbSNP that contains  
>>> ambiguity
>>> symbols and had to come up with a non-standard solution as I  
>>> needed to turn
>>> validation off in Bio::Seq but couldn't see an obvious way to do it.
>>>
>>> dbSNP format the fasta for their rs* SNPs like this:
>>>
>>>> gnl|dbSNP|rs29025902
>>> rs=29025902|pos=251|len=501|taxid=9913|mol="genomic"|class=1| 
>>> alleles="A/G"|bui
>>> ld=125
>>> AGGCTACCAA TAGGACATCA CTGACTGTGA GGCTGGGAAG AAAGACCGAG AAGCACCCCA  
>>> GGACACAGAA
>>> TCTCCTTCAC
>>> ATACAGAGGC AGTGGACACA TAGAGTACAG GCAGCGGTAA AATGGAGTAA AAAATTAGAT  
>>> AGTGGTGAGT
>>> TTTGAGTATG
>>> AATTGCCTTT GTTTTTAAAT TAGTTCTAAG TTTATAAGAC AAGTTTTATT TTTTTATTTT  
>>> ATTTATTTGC
>>> CACATGGCTT
>>> GTGGGTTTGC
>>> R
>>> GGATCTTAGT TCCCTGACCA GGGATTGAAC CTGTGCCCTC AGCAGTGAAA ACATGGAGTC  
>>> CTAACAACTG
>>> GACCACCAGG
>>> GAATTCCCTA TATGACTTAA TTTTTAATAA TATTTGTAGC TAACAATTGA CATGCAGAGT  
>>> TCCTGAAGAT
>>> TTAAGCATGG
>>> GCTCCCATGA ACCAGTATGA ACCAGCTCCA GCACAGCACA GGTTTTGTTT TACTTTTGGA  
>>> GGGGAGGTTT
>>> GATTGTGTCT
>>> ACATGCTAAT
>>>
>>> But I needed it (for the Sequenom) with the ambiguity symbol on  
>>> brackets like
>>> this:
>>>
>>>> gnl|dbSNP|rs29025902
>>> AGGCTACCAATAGGACATCACTGACTGTGAGGCTGGGAAGAAAGACCGAGAAGCACCCCAGGACACAGAATCTCCTTC
>>> ACATACAGAGGCAGTGGACACATAGAGTACAGGCAGCGGTAAAATGGAGTAAAAAATTAGATAGTGGTGAGTTTTGAG
>>> TATGAATTGCCTTTGTTTTTAAATTAGTTCTAAGTTTATAAGACAAGTTTTATTTTTTTATTTTATTTATTTGCCACA
>>> TGGCTTGTGGGTTTGC[A/ 
>>> G]GGATCTTAGTTCCCTGACCAGGGATTGAACCTGTGCCCTCAGCAGTGAAAACATGGA
>>> GTCCTAACAACTGGACCACCAGGGAATTCCCTATATGACTTAATTTTTAATAATATTTGTAGCTAACAATTGACATGC
>>> AGAGTTCCTGAAGATTTAAGCATGGGCTCCCATGAACCAGTATGAACCAGCTCCAGCACAGCACAGGTTTTGTTTTAC
>>> TTTTGGAGGGGAGGTTTGATTGTGTCTACATGCTAAT
>>>
>>> I came up with a 50% BioPerl solution using Bio::SeqIO and  
>>> Bio::Tools::IUPAC
>>> but the final printing of the fasta is dome 'manually'.
>>> It's a bit hacky but I'm particularly proud of the obscurity I  
>>> managed in my
>>> switch statement  :-)
>>>
>>> ###############################
>>> #!perl -w
>>>
>>> use Bio::SeqIO;
>>> use Bio::Tools::IUPAC;
>>> use Switch;
>>>
>>> my $seq_in = Bio::SeqIO->new(-file=>$ARGV[0], -format=>"fasta" )  
>>> or die $!;
>>>
>>> while (my $seqobj = $seq_in->next_seq) {
>>>      my $seq = sprintf ">%s\n", $seqobj->display_id;
>>>      my $iupac_seq  = new Bio::Tools::IUPAC(-seq => $seqobj);
>>>              foreach (@{$iupac_seq->{_alpha}}){
>>>                      switch($#{@{$_}}){
>>>                              case 0{$seq.= @{$_}[0]}
>>>                              case 1{$seq .= sprintf  
>>> "[%s]",join("/",@{$_})}
>>>                              else  {$seq .= 'N'}
>>>                      }
>>>              }
>>>      print "$seq\n";
>>> }
>>> #######################
>>>
>>> :-)
>>>
>>> --Russell
>>>
>>>
>>> Russell Smithies
>>> Bioinformatics Applications Developer
>>> T +64 3 489 9085
>>> E  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz 
>>> >
>>> Invermay  Research Centre
>>> Puddle Alley,
>>> Mosgiel,
>>> New Zealand
>>> T  +64 3 489 3809
>>> F  +64 3 489 9174
>>> www.agresearch.co.nz<http://www.agresearch.co.nz>
>>>
>>> Toitu te whenua, Toitu te tangata
>>> Sustain the land, Sustain the people
>>>
>>>
>>>
>>>
>>>
>>>
>>> = 
>>> = 
>>> = 
>>> ====================================================================
>>> Attention: The information contained in this message and/or  
>>> attachments
>>> from AgResearch Limited is intended only for the persons or entities
>>> to which it is addressed and may contain confidential and/or  
>>> privileged
>>> material. Any review, retransmission, dissemination or other use  
>>> of, or
>>> taking of any action in reliance upon, this information by persons  
>>> or
>>> entities other than the intended recipients is prohibited by  
>>> AgResearch
>>> Limited. If you have received this message in error, please notify  
>>> the
>>> sender immediately.
>>> = 
>>> = 
>>> = 
>>> ====================================================================
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list