<div dir="ltr"><p class="MsoNormal" style="font-size:12.8px">Hello All,</p><p class="MsoNormal" style="font-size:12.8px"><br></p><p class="MsoNormal" style="font-size:12.8px"> I have a FASTA dataset in a single file containing multiple paired-end reads, stored as pairs of forward and reverse sequences (the reverse sequence is already in the correct orientation). I am pretty sure this is the real-world example requested in section 6.1.3 of the Biopython Cookbook :). Within this dataset all of the record information is the same (i.e. ID, Name, Number of features); the only exceptions are the descriptions and sequences. For example:</p><p class="MsoNormal" style="font-size:12.8px"><br></p><p class="MsoNormal" style="font-size:12.8px">>UAR Kaktovik 11-004 F L15774b(M13F)</p><p class="MsoNormal" style="font-size:12.8px">GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA</p><p class="MsoNormal" style="font-size:12.8px">GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC</p><p class="MsoNormal" style="font-size:12.8px">TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT</p><p class="MsoNormal" style="font-size:12.8px">TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC</p><p class="MsoNormal" style="font-size:12.8px">ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA</p><p class="MsoNormal" style="font-size:12.8px">CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC</p><p class="MsoNormal" style="font-size:12.8px">GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC</p><p class="MsoNormal" style="font-size:12.8px">TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">>UAR Kaktovik 11-004 R CSBCH(M13R)</p><p class="MsoNormal" style="font-size:12.8px">TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT</p><p class="MsoNormal"
style="font-size:12.8px">TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA</p><p class="MsoNormal" style="font-size:12.8px">GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG</p><p class="MsoNormal" style="font-size:12.8px">GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT</p><p class="MsoNormal" style="font-size:12.8px">GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT</p><p class="MsoNormal" style="font-size:12.8px">GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG</p><p class="MsoNormal" style="font-size:12.8px">TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT</p><p class="MsoNormal" style="font-size:12.8px">CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT</p><p class="MsoNormal" style="font-size:12.8px">AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG</p><p class="MsoNormal" style="font-size:12.8px">GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">My end goal is to align the paired ends of the sequences that have the same description and save the aligned sequence to another file for further analyses. I have a few problems:</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">1) The descriptions of each sequence are not identical so I need to delete all but the first three parts and include the associated sequence. I.e. remove F L15774b(M13F) and R CSBCH(M13R) above. The script below is what I have to make a new dictionary in this format. 
Is this the best way to proceed in order to align the sequences in the next step?</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">from Bio import SeqIO</p><p class="MsoNormal" style="font-size:12.8px"><br></p><p class="MsoNormal" style="font-size:12.8px">handle = open("pairedend2.txt", "r")</p><p class="MsoNormal" style="font-size:12.8px">output_handle = open("AlignDict.txt", "a")</p><p class="MsoNormal" style="font-size:12.8px"><br></p><p class="MsoNormal" style="font-size:12.8px">for seq_record in SeqIO.parse(handle, "fasta"):</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;# Keep only the first three words of the description as the key</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;parts = seq_record.description.split(" ")</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;key = " ".join(parts[:3])</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;# One-entry dictionary: trimmed description -> sequence</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;desc2 = {key: str(seq_record.seq)}</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;print("\n" + str(desc2))</p><p class="MsoNormal" style="font-size:12.8px">&nbsp;&nbsp;&nbsp;&nbsp;output_handle.write(str(desc2) + "\n")</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">handle.close()</p><p class="MsoNormal" style="font-size:12.8px">output_handle.close()</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">2) My second issue is figuring out how to do the alignment. I thought I would do a pairwise alignment using something like needle (or is there a better way?), but the script examples I have seen so far use two files with a single sequence in each, and I have one file with multiple sequences. 
There is no easy way to split these into individual sequences in separate files, as the data sets are quite large.</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">Any help or ideas would be greatly appreciated.</p><p class="MsoNormal" style="font-size:12.8px"> </p><p class="MsoNormal" style="font-size:12.8px">Thank you</p><p class="MsoNormal" style="font-size:12.8px"><br></p><p class="MsoNormal" style="font-size:12.8px"> Damian</p><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><font face="arial, helvetica, sans-serif" size="2">Damian Menning, Ph.D.</font></div></div></div></div></div></div></div></div></div>
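<p class="MsoNormal" style="font-size:12.8px">For reference, a minimal sketch of the grouping step from problem 1, so that the forward and reverse reads of one sample end up under one key in memory rather than in separate files. The helper names sample_key and group_reads are hypothetical, the sequences are truncated stand-ins for the reads above, and it assumes the first three words of the description identify a sample.</p>

```python
from collections import defaultdict


def sample_key(description, n_words=3):
    """Return the first n_words words of a FASTA description (hypothetical helper)."""
    return " ".join(description.split()[:n_words])


def group_reads(records, n_words=3):
    """Group (description, sequence) tuples by their trimmed description."""
    groups = defaultdict(list)
    for description, sequence in records:
        # Reads sharing the same trimmed description land in the same list
        groups[sample_key(description, n_words)].append(str(sequence))
    return dict(groups)


# Truncated stand-ins for the forward and reverse reads shown above
records = [
    ("UAR Kaktovik 11-004 F L15774b(M13F)", "GTAGTATAGC"),
    ("UAR Kaktovik 11-004 R CSBCH(M13R)", "TCCCTTCATT"),
]

pairs = group_reads(records)
for key, reads in pairs.items():
    print(key, len(reads))
```

<p class="MsoNormal" style="font-size:12.8px">With SeqIO, the tuples could presumably be built as (seq_record.description, seq_record.seq) for each parsed record, and each resulting pair could then be handed to whichever pairwise aligner is chosen.</p>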
</div>