[Biopython] Biopython local blastn query
Peter Cock
p.j.a.cock at googlemail.com
Tue Jul 30 08:12:09 UTC 2013
On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Hello all,
>
> I goofed up on curating accession numbers for part of my PhD project.
> But I have the sequences in a big fasta file! I wrote a quick script that
> read in one sequence at a time from the file, blasted it and then filtered
> it based on 0 gaps and 100% id match. I did this for just the first 6
> sequences as to not anger the NCBI. This worked great! But it's slow
> (really slow) and I can't submit the whole file.
>
> I installed a local blast db and wrote this script.(attached as
> meta_data_local.py and the query file, clear_genus_level.fasta ):
>
> ########################################################################################
> #I want to read in one sequence at a time from a fasta file and blast it
> against a local
> #blast db.
>
> from Bio.Blast.Applications import NcbiblastnCommandline
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
> from Bio import Seq
> from Bio.SeqRecord import SeqRecord
>
> nt = "/Users/arakooser/blast/db/nt.00"
> #Where the database is located at
> file_out = open("metadata_genus.level.csv","w+")
>
> #Contains all the data my boss wants on the sequences
> file_in = open("clear_genus_level.fasta")
>
> #The main fasta file that needs to be blasted
>
> fas_rec = SeqIO.parse(file_in,"fasta")
> #Parses the main fasta file
>
> for first_seq in fas_rec:
> #Hopefully grabs the first sequence
> #Takes that sequence from standard in and sumbits it to the blast
> commandline and spits
> #out an xml
> result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
> out="temp.xml")
You could ask BLAST itself to apply the percentage
identity threshold, blastn has a -perc_identity option.
> stdout, stderr = result(stdin=first_seq.format("fasta"))
>
> #Reading in the xml file.
> #
>
> record = open("temp.xml")
> ...
You never close this file handle, perhaps that is
causing problems reusing the filename?
It might be safer to use a different temporary
file each time (there are standard functions to
generate these names in Python)?
Peter
More information about the Biopython
mailing list