[Biopython] Biopython local blastn query

Tue Jul 30 08:12:09 UTC 2013

On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Hello all,
>
>    I goofed up on curating accession numbers for part of my PhD project.
> But I have the sequences in a big fasta file! I wrote a quick script that
> read in one sequence at a time from the file, blasted it and then filtered
> it based on 0 gaps and 100% id match. I did this for just the first 6
> sequences as to not anger the NCBI. This worked great! But it's slow
> (really slow) and I can't submit the whole file.
>
>  I installed a local blast db and wrote this script.(attached as
> meta_data_local.py and the query file, clear_genus_level.fasta ):
>
> ########################################################################################
> #I want to read in one sequence at a time from a fasta file and blast it
> against a local
> #blast db.
>
> from Bio.Blast.Applications import NcbiblastnCommandline
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
> from Bio import Seq
> from Bio.SeqRecord import SeqRecord
>
> nt = "/Users/arakooser/blast/db/nt.00"
> #Where the database is located at
> file_out = open("metadata_genus.level.csv","w+")
>
> #Contains all the data my boss wants on the sequences
> file_in = open("clear_genus_level.fasta")
>
> #The main fasta file that needs to be blasted
>
> fas_rec = SeqIO.parse(file_in,"fasta")
> #Parses the main fasta file
>
> for first_seq in fas_rec:
> #Hopefully grabs the first sequence
> #Takes that sequence from standard in and sumbits it to the blast
> commandline and spits
> #out an xml
>     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
>                                    out="temp.xml")

You could ask BLAST itself to apply the percentage
identity threshold, blastn has a -perc_identity option.

>     stdout, stderr = result(stdin=first_seq.format("fasta"))
>
> #Reading in the xml file.
> #
>
>     record = open("temp.xml")
>     ...

You never close this file handle, perhaps that is
causing problems reusing the filename?

It might be safer to use a different temporary
file each time (there are standard functions to
generate these names in Python)?

Peter