[Biopython] Parsing through multiple DNA sequences one nucleotide at a time

Fomightez fomightez at gmail.com
Fri Dec 18 02:21:58 UTC 2015


Hi Damian,
As discussed in our preliminary communications working out an acceptable
solution, we have the pieces put together correctly now. Instead of
following your implementation, I tried to make it more efficient by
limiting it to one dictionary and them use string methods to compare the
sequence to see if all match 100%.

I am posting the solution here in case it is of use to anyone else.
The code is posted in a gist at
https://gist.github.com/fomightez/abd953068a675f416ed3 .

Next, we can make a version with a setting for the minimum requirement for
the length of consensus block necesary to get printed to the ouput file
next and post it there too.

The main part of my approach follows:

def make_dict_of_alignments(fasta_seq_file):
    '''
    takes a file of aligned sequence records in fasta format and reads each
one
    placing itin a dictionary for easy access later using the id as a key.
    returns the dictionary it produced.
    also returns a string of the first sequence identified to use as the
measure
    of length and for establishing a default consensus sequence for
comparison.
    '''
    # initialize needed variables
    aligned_seq_dict = {}
    example_sequence = "NotDefined"
    for seq_record in SeqIO.parse(fasta_seq_file, "fasta"):
        # extract the needed info and make dictionary
        name, sequence = str(seq_record.id), str(seq_record.seq)
        aligned_seq_dict[name] = sequence
        # grab first aligned sequence encountered as representative
        if example_sequence == "NotDefined":
            example_sequence = str(seq_record.seq)
    return aligned_seq_dict, example_sequence

# Need dictionary of the aligned sequences for accessing and will use
# a representatice sequence from there to extract details such as length and
# default consensus.
alignments_dict, representative_aligned_seq =
make_dict_of_alignments(file_name)

# initialize a string for storing the consensus. This is optional in
addition to
# the outout file being made
consensus_string = ""

# Iterate over the representative sequence string using each character for
# comparison. Idx holds the current index position and will make calling
each
# position in each sequence easy.
# NOTE that the `else` is indented correctly and belongs to the `for` loop.
It
# will only be performed if the for loop completes without a `break` as
# described at https://docs.python.org/2/tutorial/controlflow.html
for idx, extracted_residue in enumerate(representative_aligned_seq):
    for each_key in alignments_dict:
        if alignments_dict[each_key][idx] != extracted_residue:
            # A blank line is written to the line output file
            output_file_handler.write("\n")
            # consensus_string gets a blank character added to it
            consensus_string += " "
            break
    else:
        #if matches in all sequences, write the residue
        output_file_handler.write(extracted_residue)
        consensus_string += extracted_residue



Wayne



On Tue, Dec 15, 2015 at 5:34 AM, <biopython-request at mailman.open-bio.org>
wrote:
>
> ------------------------------
>
> Message: 2
> Date: Mon, 14 Dec 2015 16:45:22 -0900
> From: Damian Menning <dmenning at mail.usf.edu>
> To: Biopython Mailing List <biopython at mailman.open-bio.org>
> Subject: [Biopython] Parsing through multiple DNA sequences one
>         nucleotide      at a time
> Message-ID:
>         <CAA_KZtXdU07B1ro873i6dit4WE2q+at--Y8-WEQEsC=O6=
> RAeg at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello everyone,
>
>   I have multiple fasta sequences that have been aligned and cut to the
> same length.  I want to go through each nucleotide position one at a time
> to see if it is the same for all of the sequences.  I have written a script
> that can do one nucleotide position at a time but I can't figure out how to
> get it to loop through the entire length of the sequence.
>
> name = "test.fas"
> handle = open(name, 'r')
> conout = open("consensus.txt", 'a')
>
> val = []
> from Bio import SeqIO
> pos = 45                                     #test position
> seqlength = 1542
>
> for seq_record in SeqIO.parse(handle, "fasta"): #parses fasta file
> #    seqlength = len(seq_record.seq)
>     val.append(seq_record.seq[pos])             #creates dictionary with
> key = sequence#, value = nucleotide at position (pos)
> length = len(val)                               #determines the total
> number of key/value pairs
>
> y=0
> z=1
>
> for x in val:                                   #parses through position
> values
>     if val[y] == val[z]:                        #checks to see if adjacent
> values are equal
> #        print val[z]
>         z=z+1
>         if z == length:                         #if all values are the
> same, writes value (nucleotide) to file
>             print "position " + str(pos+1) + " equals " + val[y]
>             conout.write(val[y])
>             break
>     else:                                       #if all values are not the
> same, writes newline to file
>         print "position " + str(pos+1) + " does not have a common
> nucleotide"
>         conout.write('\n')
>         break
>
>   The way I have written the script, if all of the nucleotides in the same
> position are the same it will write the nucleotide to a file.  If they are
> not it will write a newline.  What I want is for the script to go through
> the length of the DNA sequence (1542 bp) and write this information to a
> text file so that I will end up with essentially all of the consensus
> sequences that I can then check for potential primer locations.
>
> When I try to put in a for or while loop I end up getting the first
> nucleotide repeated for each position.  I think I just need to clear/reset
> the val dictionary after each run but it doesn't seem to work.
>
>   Any help would be greatly appreciated.
>
>   Damian
>
> --
> Damian Menning, Ph.D.
>
> "There are two types of academics. Those who use the Oxford comma, those
> who don't and those who should."
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151217/c9e14429/attachment.html>


More information about the Biopython mailing list