[Biopython] Local alignment using a single fasta file with multiple paired end reads

Wed Sep 16 21:25:22 UTC 2015

Hello All,

  I have a fasta dataset in a single file with multiple paired end reads in
paired sets of forward and reverse sequences (the reverse sequence is in
the correct orientation).  I am pretty sure this is the real world example
requested in 6.1.3 of the Biopython Cookbook J.  Within this dataset all of
the information is the same i.e. ID:, Name:, Number of features:. The only
exceptions are the descriptions and sequences.  Ex.

>UAR Kaktovik 11-004 F L15774b(M13F)

GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA

GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC

TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT

TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC

ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA

CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC

GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC

TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG

>UAR Kaktovik 11-004 R CSBCH(M13R)

TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT

TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA

GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG

GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT

GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT

GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG

TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT

CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT

AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG

GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG

My end goal is to align the paired ends of the sequences that have the same
description and save the aligned sequence to another file for further
analyses.  I have a few problems:

1) The descriptions of each sequence are not identical so I need to delete
all but the first three parts and include the associated sequence. I.e.
remove F L15774b(M13F) and  R CSBCH(M13R) above. The script below is what I
have to make a new dictionary in this format.  Is this the best way to
proceed in order to align the sequences in the next step?

handle = open("pairedend2.txt", 'r')

output_handle = open("AlignDict.txt", "a")

desc2=dict()

from Bio import SeqIO

for seq_record in SeqIO.parse(handle, "fasta"):

    parts = seq_record.description.split(" ")

    des = [str(parts[0] + ' ' + parts[1] + ' ' + parts[2] + ':' +
seq_record.seq)]

    desc2=(dict(v.split(':') for v in des))

    print ('\n' + str(desc2))

    output_handle.write(str(desc2) + '\n')

output_handle.close()

2) My second issue is figuring out how to do the alignment.  I thought I
would do a local alignment using something like needle (or is there a
better way?) but the script examples I have seen so far use two files with
a single sequence in each and I have one file with multiple sequences.
There is no easy way to separate these out into individual sequences into
different files as the data sets are quite large.

Any help/ideas would be greatly appreciated.

Thank you

  Damian

-- 
Damian Menning, Ph.D.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20150916/493fa86e/attachment.html>