[Biopython] Problem finding an accession number in a tab-delimited file
Fernando
fpiston at gmail.com
Mon Dec 10 20:07:44 UTC 2012
Hello everybody,
I'm trying to perform a GO annotation using the SIMAP database, which is
annotated with Blast2GO. Everything works except one step: looking up
the accession number in the file where entry numbers are
associated with their GO terms. The script does not find
the number in the input file even though it is really there. I tried several things
without success (re.match, inserting the lines into a list and then extracting the element, etc.).
The file where the GO terms are associated with entry numbers has this structure (accession number, GO term, Blast2GO score):
1f0ba1d119f52ff28e907d2b5ea450db GO:0007154 79
1f0ba1d119f52ff28e907d2b5ea450db GO:0005605 99
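As a rough illustration of the three-column layout above, the table could be loaded once into a dictionary keyed by accession number. This is only a sketch under assumptions: the function name `load_go_table` is hypothetical, the columns are assumed to be whitespace-separated, the score is assumed to be an integer, and the code is written for Python 3 with an in-memory sample instead of the real file.

```python
import io
from collections import defaultdict


def load_go_table(handle):
    """Map each accession number to a list of (GO term, score) pairs.

    Hypothetical helper: assumes three whitespace-separated columns
    and an integer score, as in the two sample lines.
    """
    table = defaultdict(list)
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession, go_term, score = fields
        table[accession].append((go_term, int(score)))
    return dict(table)


# In-memory stand-in for the annotation file, using the two sample lines.
sample = io.StringIO(
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0007154\t79\n"
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0005605\t99\n"
)
go_table = load_go_table(sample)
print(go_table["1f0ba1d119f52ff28e907d2b5ea450db"])
```

With a structure like this, checking whether an accession is present becomes a single dictionary lookup rather than a scan of the file.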
The Python code:
#!/usr/bin/env python
import re
from Bio.Blast import NCBIXML
from Bio import SeqIO

input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')
fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')

q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
blast_records = NCBIXML.parse(result_handle)
hits = []
for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('%s\t' % list2[1])
        else:
            save_file.write('%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                h = alignment.hit_def  # at this point all is right
                for l in fh:  # here, 'h' is not found in 'fh'
                    ls = l.split()
                    if h in ls:
                        print h
                        print 'ok'
                        save_file.write('%s\t' % ls[1])
        save_file.write('\n')
        hits.append(blast_record.query.split()[0])
misses = set(q_dict.keys()) - set(hits)
for i in misses:
    list = i.split("|")
    if len(list) > 1:
        save_file.write('%s\t' % list[1])
    else:
        save_file.write('%s\t' % list)
    save_file.write('%s\n' % 'no_match')
save_file.close()
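One thing that may explain the behaviour of the inner loop: a file handle is an iterator, so `for l in fh` consumes it, and any later pass over the same handle yields nothing unless the handle is rewound. The sketch below reproduces this with an in-memory handle. `io.StringIO` stands in for the real file, and the second accession `aaaa` is made up for illustration. It is written for Python 3 and is a possible explanation, not a confirmed diagnosis of the script above.

```python
import io

# In-memory stand-in for the annotation file; "aaaa" is a made-up accession.
fh = io.StringIO(
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0007154\t79\n"
    "aaaa\tGO:0005605\t99\n"
)
queries = ["aaaa", "1f0ba1d119f52ff28e907d2b5ea450db"]

# Naive version: the handle is exhausted after the first query,
# so the second query never sees any lines.
naive = []
for h in queries:
    for line in fh:
        fields = line.split()
        if h in fields:
            naive.append((h, fields[1]))

# Rewinding version: seek(0) moves back to the start before every pass.
rewound = []
for h in queries:
    fh.seek(0)
    for line in fh:
        fields = line.split()
        if h in fields:
            rewound.append((h, fields[1]))

print(naive)
print(rewound)
```

Here `naive` holds only the match for the first query, while `rewound` holds a match for both. Reading the file once into a list or dictionary before the loops would also avoid rescanning it for every hit.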
Fernando