[Biopython] Filter Blast results
Justin Gibbons
jgibbons1 at mail.usf.edu
Tue Feb 26 16:45:03 UTC 2013
I know that there is already a script in the Cookbook for filtering out
blast queries with no hits, but it involves holding all of the sequence
objects in memory, which isn't good if you have to work with a lot of
sequences. I came up with the following function, which works, but I would
appreciate any input for how to improve it. In particular I don't like that
I am appending the sequence objects to file and would like to know of any
alternatives.
The main function is:
def filter_no_hits(blast_xml_results, source_fasta, file_format,
no_hit_file, hit_file):
"""Scans Blast XML results and if the query sequence has no hits prints
the sequence
record in the no_hit_file, otherwise in the hit_file. The
source_fasta is the file
that was used to perform the blast search and is used to retrieve
the sequence record"""
result_handle=open(blast_xml_results) #open the xml file
blast_records=NCBIXML.parse(result_handle) #create the generator object
indexed_fasta=create_indexed_fasta(source_fasta, file_format) #create
the indexed file object
for record in blast_records:
hit_def_list=blast_xml_hit_def(record) #returns list of hit_def
results
record_id=get_id_str_from_desc(record.query) #get the record ID to
search the indexed file
record_object=indexed_fasta.get_raw(record_id) #Use the sequence ID
to get the sequence record
if is_list_null(hit_def_list): #if no hits
append_to_file(no_hit_file, record_object)
else: #if hits
append_to_file(hit_file, record_object)
result_handle.close()
Sub-functions:
def create_indexed_fasta(path, file_format):
"""Makes a fasta file searchable like a dictionary with the sequence Id
as the key"""
return SeqIO.index(path, file_format)
def blast_xml_hit_def(record):
"""Returns a list of hit_def for a record from a NCBI blast XML
report"""
hit_def_list=[]
for alignment in record.alignments:
hit_def_list.append(alignment.hit_def)
return hit_def_list
def get_id_str_from_desc(desc):
"""Returns the Id from a fasta record description"""
parts=desc.split(" ")
return parts[0]
def is_list_null(lst):
"""Returns True if list is empty and False otherwise"""
if len(lst)==0:
return True
else:
return False
def append_to_file(path, string):
with open(path, 'a') as f:
f.write(string)
def record_counter(path, file_format):
"""Input a file path and the format of the file and it returns the
number of records in the file"""
counter=0
for seq_record in SeqIO.parse(path, file_format):
counter+=1
print "%s contains %i records" %(path, counter)
return counter
Thank you
Justin Gibbons
More information about the Biopython
mailing list