[Biopython] matching sequences from fasta files

Wed Mar 10 10:31:17 UTC 2010

On Wed, Mar 10, 2010 at 3:46 AM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Let me first say that I am new to biopython and dna/fasta files. I have been
> trying to use blastall to get the results I need but I am doing most of my
> work in python so why use blastall if I can get the results using python.
>
> I need to check if any/all the sequence from one fasta file are in another.
> Looking through the docs I think I could do this.
>
> I then want to find "close matches" and for me this means they differ by 1
> snp and I need to know the location of this differing snp. How would I do
> this?

If you want "close matches", then using a command line tool like
BLAST (or FASTA, or needle etc) may be the fastest option. You can call
these tools from a Python script, and parse their output within the script.
(This is probably what you are already doing.)

If you want to, you can do pairwise sequence alignment from within
Biopython with the Bio.pairwise2 module (which uses C for speed).
This isn't covered in the tutorial, so read the module documentation:
http://www.biopython.org/DIST/docs/api/Bio.pairwise2-module.html

For the special case of looking for perfect matches, you would be fine
with just Python - depending on your data files, you may be able to
match on the record identifiers or simply do string comparisons of the
sequences.
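
As a rough sketch of the perfect-match case, assuming both files fit in
memory and that "match" means the whole sequences are identical: the
helper names read_fasta and exact_matches are hypothetical, and
read_fasta is only a minimal stand-in for a real FASTA parser such as
Bio.SeqIO, used here to keep the example self-contained.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle.

    Hypothetical minimal parser, for illustration only.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word after ">" as the record identifier.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def exact_matches(query_handle, target_handle):
    """Map each query identifier to the target identifiers whose
    sequence is identical (plain string comparison)."""
    by_sequence = {}
    for target_id, seq in read_fasta(target_handle):
        by_sequence.setdefault(seq, []).append(target_id)
    return {
        query_id: by_sequence.get(seq, [])
        for query_id, seq in read_fasta(query_handle)
    }


# Tiny in-memory example; real use would pass open file handles.
queries = io.StringIO(">q1\nACGT\nACGT\n>q2\nTTTT\n")
targets = io.StringIO(">t1\nACGTACGT\n>t2\nGGGG\n")
result = exact_matches(queries, targets)
print(result)
```

Indexing the target sequences in a dictionary means each query costs
one lookup rather than a scan of the whole target file. If a query only
needs to occur inside a longer target sequence, Python's "in" operator
on the two strings would be the equivalent check.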

If you know in advance the pattern of SNPs, then you would be able
to efficiently search for them using a regular expression. However, it
sounds like you are doing SNP discovery. Here too there should be
existing command line tools designed for just this task (and described
in the literature).
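
For the known-SNP case, a regular expression search might look like the
following sketch. It assumes you already know the reference sequence
and the 0-based position of the variable base; the function names
snp_pattern and find_snp_hits are made up for this example.

```python
import re


def snp_pattern(reference, position, alleles="ACGT"):
    """Compile a regex matching `reference` with any base from
    `alleles` allowed at the 0-based `position`."""
    return re.compile(
        re.escape(reference[:position])
        + "([" + "".join(alleles) + "])"
        + re.escape(reference[position + 1:])
    )


def find_snp_hits(pattern, sequence):
    """Return (match_start, snp_position, base) for each
    non-overlapping hit; positions are 0-based offsets in `sequence`."""
    return [
        (m.start(), m.start(1), m.group(1))
        for m in pattern.finditer(sequence)
    ]


# Reference ACGTAC with the fourth base (index 3) allowed to vary.
pattern = snp_pattern("ACGTAC", 3)
hits = find_snp_hits(pattern, "TTACGAACTTTACGTACTT")
print(hits)
```

The capturing group around the variable base is what gives the location
of the differing base: m.start(1) is its offset in the searched
sequence and m.group(1) is the base found there. Note that finditer
reports non-overlapping matches only, and that this approach does not
help with SNP discovery, where the variable position is not known in
advance.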

Regards,

Peter