[Bioperl-l] reformating dna sequence containing ambiguity symbols
Chris Fields
cjfields at illinois.edu
Mon Mar 16 19:50:29 UTC 2009
That's probably your best bet. You could follow it up with a s/\w{60}/
$1\n/g; to add in line breaks every 60 bases if you need it.
chris
On Mar 16, 2009, at 2:29 PM, Smithies, Russell wrote:
> Thanx Bruno,
> I hadn't looked at Bio::Tools::SeqPattern before but will take a
> closer look.
>
> --Rusell
>
> From: Bruno Vecchi [mailto:vecchi.b at gmail.com]
> Sent: Monday, 16 March 2009 11:05 p.m.
> To: Smithies, Russell
> Subject: Re: [Bioperl-l] reformating dna sequence containing
> ambiguity symbols
>
> Please correct me if I didn't understand your problem correctly, but
> I think that the solution to your problem is Bio::Tools::SeqPattern.
> Here's how you would do away with ambiguity codons:
>
> use Bio::Tools::SeqPattern;
> my $seq = 'GTNATAARCC';
>
> my $pattern = Bio::Tools::SeqPattern->new(
> -seq => $seq
> -TYPE => 'Dna'
> )
>
> $pattern->expand;
> # outputs: GT[ATCG]ATAA[AG]CC
>
>
> 2009/3/16 Smithies, Russell <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz
> >>:
>> Typical Microsoft Outlook changed my formatting.
>> This is what dbSNP fasta looks like:
>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=snp&report=fasta&mode=text&id=29011166
>>
>> --Russell
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org<mailto:bioperl-l-bounces at lists.open-bio.org
>>> > [mailto:bioperl-l-<mailto:bioperl-l->
>>> bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On
>>> Behalf Of Smithies, Russell
>>> Sent: Monday, 16 March 2009 4:30 p.m.
>>> To: 'Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org
>>> >'
>>> Subject: [Bioperl-l] reformating dna sequence containing ambiguity
>>> symbols
>>>
>>> I've just been reformatting some fasta from dbSNP that contains
>>> ambiguity
>>> symbols and had to come up with a non-standard solution as I
>>> needed to turn
>>> validation off in Bio::Seq but couldn't see an obvious way to do it.
>>>
>>> dbSNP format the fasta for their rs* SNPs like this:
>>>
>>>> gnl|dbSNP|rs29025902
>>> rs=29025902|pos=251|len=501|taxid=9913|mol="genomic"|class=1|
>>> alleles="A/G"|bui
>>> ld=125
>>> AGGCTACCAA TAGGACATCA CTGACTGTGA GGCTGGGAAG AAAGACCGAG AAGCACCCCA
>>> GGACACAGAA
>>> TCTCCTTCAC
>>> ATACAGAGGC AGTGGACACA TAGAGTACAG GCAGCGGTAA AATGGAGTAA AAAATTAGAT
>>> AGTGGTGAGT
>>> TTTGAGTATG
>>> AATTGCCTTT GTTTTTAAAT TAGTTCTAAG TTTATAAGAC AAGTTTTATT TTTTTATTTT
>>> ATTTATTTGC
>>> CACATGGCTT
>>> GTGGGTTTGC
>>> R
>>> GGATCTTAGT TCCCTGACCA GGGATTGAAC CTGTGCCCTC AGCAGTGAAA ACATGGAGTC
>>> CTAACAACTG
>>> GACCACCAGG
>>> GAATTCCCTA TATGACTTAA TTTTTAATAA TATTTGTAGC TAACAATTGA CATGCAGAGT
>>> TCCTGAAGAT
>>> TTAAGCATGG
>>> GCTCCCATGA ACCAGTATGA ACCAGCTCCA GCACAGCACA GGTTTTGTTT TACTTTTGGA
>>> GGGGAGGTTT
>>> GATTGTGTCT
>>> ACATGCTAAT
>>>
>>> But I needed it (for the Sequenom) with the ambiguity symbol on
>>> brackets like
>>> this:
>>>
>>>> gnl|dbSNP|rs29025902
>>> AGGCTACCAATAGGACATCACTGACTGTGAGGCTGGGAAGAAAGACCGAGAAGCACCCCAGGACACAGAATCTCCTTC
>>> ACATACAGAGGCAGTGGACACATAGAGTACAGGCAGCGGTAAAATGGAGTAAAAAATTAGATAGTGGTGAGTTTTGAG
>>> TATGAATTGCCTTTGTTTTTAAATTAGTTCTAAGTTTATAAGACAAGTTTTATTTTTTTATTTTATTTATTTGCCACA
>>> TGGCTTGTGGGTTTGC[A/
>>> G]GGATCTTAGTTCCCTGACCAGGGATTGAACCTGTGCCCTCAGCAGTGAAAACATGGA
>>> GTCCTAACAACTGGACCACCAGGGAATTCCCTATATGACTTAATTTTTAATAATATTTGTAGCTAACAATTGACATGC
>>> AGAGTTCCTGAAGATTTAAGCATGGGCTCCCATGAACCAGTATGAACCAGCTCCAGCACAGCACAGGTTTTGTTTTAC
>>> TTTTGGAGGGGAGGTTTGATTGTGTCTACATGCTAAT
>>>
>>> I came up with a 50% BioPerl solution using Bio::SeqIO and
>>> Bio::Tools::IUPAC
>>> but the final printing of the fasta is dome 'manually'.
>>> It's a bit hacky but I'm particularly proud of the obscurity I
>>> managed in my
>>> switch statement :-)
>>>
>>> ###############################
>>> #!perl -w
>>>
>>> use Bio::SeqIO;
>>> use Bio::Tools::IUPAC;
>>> use Switch;
>>>
>>> my $seq_in = Bio::SeqIO->new(-file=>$ARGV[0], -format=>"fasta" )
>>> or die $!;
>>>
>>> while (my $seqobj = $seq_in->next_seq) {
>>> my $seq = sprintf ">%s\n", $seqobj->display_id;
>>> my $iupac_seq = new Bio::Tools::IUPAC(-seq => $seqobj);
>>> foreach (@{$iupac_seq->{_alpha}}){
>>> switch($#{@{$_}}){
>>> case 0{$seq.= @{$_}[0]}
>>> case 1{$seq .= sprintf
>>> "[%s]",join("/",@{$_})}
>>> else {$seq .= 'N'}
>>> }
>>> }
>>> print "$seq\n";
>>> }
>>> #######################
>>>
>>> :-)
>>>
>>> --Russell
>>>
>>>
>>> Russell Smithies
>>> Bioinformatics Applications Developer
>>> T +64 3 489 9085
>>> E russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz
>>> >
>>> Invermay Research Centre
>>> Puddle Alley,
>>> Mosgiel,
>>> New Zealand
>>> T +64 3 489 3809
>>> F +64 3 489 9174
>>> www.agresearch.co.nz<http://www.agresearch.co.nz>
>>>
>>> Toitu te whenua, Toitu te tangata
>>> Sustain the land, Sustain the people
>>>
>>>
>>>
>>>
>>>
>>>
>>> =
>>> =
>>> =
>>> ====================================================================
>>> Attention: The information contained in this message and/or
>>> attachments
>>> from AgResearch Limited is intended only for the persons or entities
>>> to which it is addressed and may contain confidential and/or
>>> privileged
>>> material. Any review, retransmission, dissemination or other use
>>> of, or
>>> taking of any action in reliance upon, this information by persons
>>> or
>>> entities other than the intended recipients is prohibited by
>>> AgResearch
>>> Limited. If you have received this message in error, please notify
>>> the
>>> sender immediately.
>>> =
>>> =
>>> =
>>> ====================================================================
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list