[Biopython] Extracting data genpept files
Ara Kooser
akooser at unm.edu
Tue Nov 23 03:52:15 UTC 2010
Hello all,
I think Peter pointed me to part of this code (shown below) for
extracting data out of a genpept file. I am trying to get a handle on
the formatting end of things. My question is this: when the taxonomic
data grabbed by tax_records = gb_record.annotations["taxonomy"] is
missing, instead of leaving the space blank the program fills it in
with the next piece of data, usually the date. This throws off the
whole spreadsheet when I import the output as a CSV file.
Is there a way to have the program write in white space when it
encounters missing data instead of the date?
Thanks,
Ara
PS: As soon as the formatting is sorted out and folks created for input
and such, I will post the code up here.
from Bio import SeqIO


def index_genbank_features(gb_record, feature_type, qualifier):
    """Map each qualifier value to the index of the feature that carries it."""
    answer = dict()
    for (index, feature) in enumerate(gb_record.features):
        if feature.type == feature_type:
            if qualifier in feature.qualifiers:
                for value in feature.qualifiers[qualifier]:
                    if value in answer:
                        print("WARNING - Duplicate key %s for %s features %i and %i"
                              % (value, feature_type, answer[value], index))
                    else:
                        answer[value] = index
    return answer


gg = open("raw_genbank.txt", "w")
gb_file = "sequence.gp.txt"
for gb_record in SeqIO.parse(open(gb_file, "r"), "genbank"):
    gb_feature = gb_record.features[2]  # not used below

    # Look up features by qualifier value
    locus_tag_cds_index = index_genbank_features(gb_record, "CDS", "locus_tag")
    coded_by_cds_index = index_genbank_features(gb_record, "CDS", "coded_by")
    name_by_source_index = index_genbank_features(gb_record, "source", "organism")
    protein_id_cds_index = index_genbank_features(gb_record, "CDS", "protein_id")

    # Record-level annotations
    gb_annotations = gb_record.annotations  # not used below
    tax_records = gb_record.annotations["taxonomy"]
    accession = gb_record.annotations["accessions"]
    date = gb_record.annotations["date"]
    function = gb_record.description

    # One line per record: the repr of a Python list
    gg.write(str([accession, locus_tag_cds_index, coded_by_cds_index,
                  name_by_source_index, tax_records, date, function]))
    gg.write("\n")
gg.close()
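One possible way to keep the columns aligned is sketched here. It assumes the shifting comes from writing the taxonomy list with str(): the list then takes up a different number of comma-separated cells in each record, and an empty list leaves the date sitting where taxonomy normally starts. The sketch collapses each list into a single cell and writes the row with Python's csv module, so an empty taxonomy becomes a blank cell. The helper names annotations_to_row and write_table are hypothetical, and the dictionaries are made-up stand-ins for gb_record.annotations, so the example runs without Biopython.

```python
import csv
import io


def annotations_to_row(annotations, description):
    """Build one row with a fixed number of cells (hypothetical helper)."""
    # .get() with a default keeps the cell in place even if the key is absent
    accessions = annotations.get("accessions", [])
    taxonomy = annotations.get("taxonomy", [])
    date = annotations.get("date", "")
    return [
        "; ".join(accessions),
        "; ".join(taxonomy),  # an empty list becomes "", i.e. a blank cell
        date,
        description,
    ]


def write_table(handle, rows):
    """Write a header line and the rows as CSV (hypothetical helper)."""
    writer = csv.writer(handle)
    writer.writerow(["accession", "taxonomy", "date", "description"])
    writer.writerows(rows)


# Made-up values standing in for gb_record.annotations and gb_record.description
example_rows = [
    annotations_to_row(
        {"accessions": ["ABC12345"],
         "taxonomy": ["Bacteria", "Firmicutes"],
         "date": "01-JAN-2010"},
        "hypothetical protein",
    ),
    annotations_to_row(
        {"accessions": ["XYZ67890"], "taxonomy": [], "date": "02-FEB-2010"},
        "another hypothetical protein",
    ),
]

buffer = io.StringIO()
write_table(buffer, example_rows)
print(buffer.getvalue())
```

In the loop above, the same idea would mean passing gb_record.annotations and gb_record.description to the helper and writing to the output file instead of a StringIO buffer. The feature index dictionaries would need a similar treatment, for example joining their keys into one cell, before they go into the row.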